Distributed Algorithms for Learning Balanced Partitions in Heterogeneous … · 2006. 8. 28. ·...

POLITECNICO DI MILANO

Corso di Laurea in Ingegneria Informatica

Dipartimento di Elettronica e Informazione

Distributed Algorithms for Learning

Balanced Partitions in

Heterogeneous Multiagent Systems

AI & R Lab

Artificial Intelligence

and Robotics Laboratory of the Politecnico di Milano

Advisor:

Prof. Andrea Bonarini

Co-Advisors:

Eng. Marcello Restelli

Eng. Enrique Munoz de Cote

Thesis Dissertation of:

Maurizio Lattuada, matricola 666971

Academic Year 2005-2006


To mom, for everything she has given and taught me.

To dad, because he would have been very proud of me and proud to see "ul me fioeu diventa ingegne" (my son become an engineer).

Summary

Nowadays, Reinforcement Learning (RL) techniques are widely applied in many real environments as a plain trial-and-error framework. This peculiar characteristic makes RL techniques very tricky to apply without any model of the environment dynamics. RL has also been applied in multiagent systems, where we can formalize the different behaviors of the players as well as the interactions among them.

The aim of this thesis is to study how and why different types of agents form coalitions (and thus cooperate) in order to satisfy a goal that has certain characteristics. In this dissertation we formalize a new type of game in order to study such social behaviors, and we study how to formalize an environment in which such agents act so as to reach an optimal configuration. We obtained encouraging results with this framework. We also demonstrate the existence of social cooperation among coalitions of different types of agents, and that this kind of cooperation is needed to reach a goal.


Extended Summary (Sommario)

In this thesis we studied Reinforcement Learning (RL) techniques in a multiagent setting. RL aims to imitate living beings, in particular the way they acquire skills in an environment unknown to them. Through learning, an organism learns to be autonomous and to interact optimally with the environment in which it operates (see [17]). This field of Artificial Intelligence (AI) models this learning activity through an agent that operates in an environment and, from this interaction, formalized as a reinforcement signal, refines its action policy.

We have just mentioned the term "agent": but what is an agent? We can model an agent as an entity that perceives the environment in which it operates through sensors and acts on that environment through actuators. Consequently, we can regard as an agent a robot moving in a room or a program that suitably balances the load in a computer network.

Another fundamental characteristic that agents must have in RL concerns the rationality of their actions. An agent is said to be rational if its actions allow it to reach a goal, which is usually formalized as the optimization (maximization or minimization) of a utility function.

RL has been widely studied in the single-agent case and can nowadays be considered a mature discipline. The learning phase is completely autonomous and is in no way subject to an initial supervision in which the agent would be instructed with known examples. Instead, here we deal with a pure trial-and-error paradigm, in which the agent learns how to behave based on its own past experience.

A clear and immediate extension of this setting involves several agents operating in the same environment (multiagent systems, RL-MAS). The knowledge gained in the single-agent case has also been applied to this extension, with suitable modifications. Obviously, this new formalization is harder than the single-agent case, since one must understand the interactions between the agents and the environment and among the agents themselves. Consequently, we deal with a non-stationary environment, because it is influenced by the policy of every agent.

Moreover, the interactions among the agents can be modeled as cooperative or competitive in order to satisfy the goal to be reached (usually a Nash equilibrium, NE). In this thesis we focus on cooperative behaviors, where the goal requires some form of cooperation among the agents in order to be satisfied. This cooperative behavior can be induced through suitable immediate reinforcement signals (rewards) assigned to the agents.

At this point it is natural to consider a further extension of the multiagent case by introducing the possibility of forming coalitions among agents. Clearly, in this case the interactions among coalitions of agents must also be taken into account. The reinforcement now takes on a global character, since it is assigned to each coalition. Mechanisms are therefore needed to divide this reinforcement among all the agents of the coalition (the best known are the core and the Shapley value, see [16]); with these mechanisms, different importance and/or priority can be given to particular agents.

The contribution of this thesis is twofold. First, we analyzed several methods for coordinating a set of agents, identifying their pros and cons in particular environments known in the literature. Then, we proposed a new type of game (task allocation via coalition formation games) concerning resource allocation problems in the presence of heterogeneous agents.

The coordination among agents that we analyzed is based on the COIN approach (COllective INtelligence, [23]) by Wolpert et al., which aims to induce forms of cooperation among agents without requiring a model of the world dynamics and without the agents being able to communicate with each other, except through their interaction with the environment (naturally, through the immediate reinforcement signals). This methodology was analyzed and verified on problems known in the literature that require cooperation among agents to reach a goal (in particular, the classic grid world and Brian Arthur's Bar Problem, [24] and [6]). We highlighted the peculiarities of this approach and, at the same time, analyzed its shortcomings in particular non-stationary environments. A major point in favor of this approach, besides not requiring any model of the dynamics, is the availability of a global utility function (world utility), which is used to evaluate the global behavior of the world that emerges from the individual behaviors of the agents. With this global utility function, the immediate reinforcement signals to be distributed to the agents are suitably computed so as to induce a form of implicit cooperation. Analyzing this approach, we identified some weaknesses: some concern the performance in non-stationary environments (which is still good, but not optimal as reported in the literature), others concern the way the immediate reinforcement signals are computed (they can be symmetric, so the convergence speed may be suboptimal).

At this point we decided to make the problem more complex by introducing a differentiation among agents, that is, several agents with different roles. These different types of agents have a twofold purpose: they must find a form of coalition and, with it, they must try to reach a predefined goal. In the literature, the field that studies the creation of coalitions (coalition formation) has been studied and analyzed in its various aspects ([16]); what we propose is a further extension that introduces coalitions into games dealing with resource allocation. We defined a reward assignment methodology that pays particular attention to computational requirements, so that each agent is able to discern the best coalition structure to form in order to then reach the goal. If the goal has not been reached, the agents are able, autonomously (that is, without any possibility of communication, hence without any form of explicit coordination), to change the coalition structure and try again to reach the predefined goal.

The results obtained with this approach are particularly encouraging, especially on the problem analyzed (an extension of the Bar Problem, [25]), for which several configurations were considered in order to verify the extension of the problem that we propose.

This study opens new directions for future research. It is interesting to study the influence of the state space on these particular types of problems: here an approach similar to LEAP ([2]) could be useful, in which one moves from the state space to a more compact feature space, or a different characterization of the state space carried out by each agent in relation to the coalitions formed so far.

Another aspect to study concerns the definition of marginal contribution that we propose. Since this function is closely tied to the definition of the characteristic function of a coalition, different behaviors can be formalized through the latter, and it is therefore necessary to study the resulting differences in performance. Moreover, the marginal contribution function is mainly concerned with assigning a reinforcement to the coalition of agents, not with how it is then divided among them. The marginal contribution is at the heart of the Shapley value. Since the latter is computationally expensive to compute, it would be interesting to find a relation between that value and the marginal contribution function, so as to reconstruct an approximation, or an expected future Shapley value, that could then be used to actually divide the reinforcement among all the agents of a coalition.


Acknowledgments (Ringraziamenti)

First of all I would like to thank my advisor, Prof. Andrea Bonarini, for introducing me to this work and motivating me, and for his great availability.

A huge thank-you goes to Marcello, Enrique and Alessandro for all the time they devoted to me, for the countless discussions (and laughs) and above all for standing by me during particular "extra-thesis" periods. Without their help, and above all without their friendship, I would not have reached this milestone. Thank you from the bottom of my heart.

I also thank the friends who accompanied me in this adventure called "Poli" (Gabriele, Emanuele, Massimo, Bedda, il Guso, il Vince, Ciusipp, ...) and all those with whom I shared time at AIRLab (Simone, Daniel and his sangria, Mario, il Lazza, Matteo, ...), exchanging technical advice on how to take apart the PRLT (to Alessandro's delight). Of course I do not forget other friends either: Homer, Centu, Mini, Lindi, Monfro, Passe, those of "LZD", Lele, Dino, ...

A more than due thank-you also goes to Jessica, who had the patience and the strength to put up with me during this thesis year, and for many other things that I will not write here.

Finally, an enormous thank-you goes to mom and dad for ... for ... for everything! To mom, so that she may be proud of me. To dad, because he managed to read a very first draft of this thesis and because he so much wished to be here ... even though he will now be in Some other place.


Contents

Summary
Extended Summary (Sommario)
Acknowledgments (Ringraziamenti)
List of Figures
List of Tables

1 Introduction
  1.1 Overview
  1.2 Main Contributions
  1.3 Outline of the Thesis

I State of the Art

2 Reinforcement Learning
  2.1 Learning from Interaction
    2.1.1 TD
    2.1.2 Q-learning
  2.2 Multi-Agent Learning
    2.2.1 Change or Learn Fast
    2.2.2 Change & Keep
    2.2.3 Minimax Q-learning
    2.2.4 Nash Q-learning and Friend or Foe Q-learning

3 COIN: COllective INtelligence
  3.1 Preamble
  3.2 Background
    3.2.1 Artificial Intelligence and Machine Learning
    3.2.2 Social Science-Inspired Systems

4 A Framework Designed for COINs
  4.1 Problems with a Model-Based Approach
  4.2 Nomenclature
    4.2.1 Preliminary Definitions and Terminology
    4.2.2 Intelligence
    4.2.3 Learnability
  4.3 A Descriptive Framework for COINs
    4.3.1 Candidate Salient Characteristics
    4.3.2 Factoredness
    4.3.3 Wonderful Life Utility
    4.3.4 How to Induce these Salient Characteristics?

5 Experimental Applications
  5.1 Packet Routing in a Network
    5.1.1 COIN for Network Routing
    5.1.2 Experimental Results
  5.2 Learning Sequences Of Actions
    5.2.1 COIN Solution
    5.2.2 Results
  5.3 Bar Problem
    5.3.1 Results

II Innovation

6 Theoretical Considerations
  6.1 Class of Games
    6.1.1 Matrix Games
    6.1.2 Stochastic Games
    6.1.3 Differences between Grid world and Bar Problem
  6.2 Delayed Reward
  6.3 Reward Function of the Bar Problem
  6.4 Q-learning Dynamics
  6.5 Introduction to Coalition Formation
    6.5.1 Coalition Structure Generation
    6.5.2 Optimization within a Coalition
    6.5.3 Payoff Division

7 Task Allocation via Coalition Formation
  7.1 Game Outline
    7.1.1 Curse of the State Space Size
    7.1.2 Fuzzy Games and Groups of Agents
  7.2 Utility Functions of the Game
    7.2.1 Reward Distribution among Agents
    7.2.2 Characteristic and Reward Functions
  7.3 Testbed Problem: Cooking Teams
    7.3.1 Configurations
    7.3.2 Reward Functions
    7.3.3 State Space

8 Results
  8.1 Grid world
    8.1.1 First Grid
    8.1.2 Second Grid
  8.2 Bar Problem and its Reward Functions
    8.2.1 First Bar Configuration
    8.2.2 Second Bar Configuration
    8.2.3 Q-learning Dynamics of the Bar Problem
  8.3 Cooking Teams Problem
    8.3.1 Nonempty State Space
    8.3.2 Empty State Space

9 Conclusions and Future Works
  9.1 Conclusions
  9.2 Future Works

Bibliography

List of Figures

5.1 Network architectures (from [22])
5.2 Overall delay of the networks (from [22])
5.3 System performance with 10 agents on a 10×10 grid (from [24])
5.4 System performance with 100 agents on a 32×32 grid (from [24])
5.5 Average performance of the Bar Problem (from [6])
6.1 Exponential functions of the Bar Problem
6.2 WLU rewards
8.1 Untypical grid
8.2 Results of the untypical grid
8.3 Grid proposed by 't Hoen and Bohte (from [18])
8.4 Results of the grid proposed by 't Hoen and Bohte
8.5 Results of the first bar configuration with the exponential reward functions
8.6 Results of the first bar configuration with the Gaussian reward functions
8.7 Moving average of the WLU functions relative entropy
8.8 Moving average of the TG and UD utility functions relative entropy
8.9 Attendance of the first bar configuration
8.10 Results of the second bar configuration with the exponential reward functions
8.11 Results of the second bar configuration with the Gaussian reward functions
8.12 Moving average of the WLU functions relative entropy
8.13 Moving average of the TG and UD utility functions relative entropy
8.14 Attendance of the second bar configuration
8.15 Bar dynamics with 8 agents
8.16 τ for the bar dynamics with 8 agents
8.17 Bar dynamics with 8 agents and τ = 0.14
8.18 Uniform policies obtained with Π_u
8.19 Policies of agents obtained with Π_c
8.20 Bar-1 and Bar-4 with α = 0.5, ε = 0.1; the standard characteristic function of Equation (7.5) is used to compute both the world utility and the quality of a coalition of agents attending the bar
8.21 Bar-1 and Bar-4 with α = 0.5, ε = 0.1; the characteristic function of Equation (7.5) is used to compute the world utility, while the characteristic functions of Equations (8.6) and (8.7) are used to evaluate the quality of a coalition of agents attending the bar
8.22 Q-table visits for cooks and helpers in Bar-4 with α = 0.5, ε = 0.1; the characteristic function of Equation (7.5) is used to compute the world utility, while the characteristic functions of Equations (8.6) and (8.7) are used to evaluate the quality of a coalition of agents attending the bar
8.23 Bar-4 with α = 0.5, ε = 0.3 over 150,000 weeks (here we used the characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are averaged over 10 different runs
8.24 Comparison between the performance of Bar-4 (SU function) obtained using different values of ε (0.1, 0.3, 0.5, 0.7, 0.9, 1.0), with Equation (7.5) used for the world utility and Equations (8.6) and (8.7) used for the characteristic function. Each experiment is the mean of 5 different runs and we plot one world utility value every 100 values (that is, this experiment was executed over 500,000 weeks)
8.25 Comparison between the performance of Bar-4 (SU function) obtained using ε = 0.3 and high q-values, with Equation (7.5) used for the world utility and Equations (8.6) and (8.7) used for the characteristic function
8.26 Bar-1 and Bar-4 with α_S = 0.5, α_NS = 0.1, ε = 0.1 over 150,000 weeks (here we used the characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are averaged over 10 different runs. Each agent runs the CoLF algorithm
8.27 Bar-1 and Bar-2 with α = 0.5 and ε = 0.1 over 100,000 weeks (here we used the characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are averaged over 10 different runs
8.28 Bar-3 and Bar-4 with α = 0.5 and ε = 0.1 over 100,000 weeks (here we used the characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are averaged over 10 different runs

List of Tables

5.1 Source–destination pairings for the three traffic loads
7.1 Optimal coalition structures for the four bar problems
7.2 Another admissible coalition structure for the Bar-2

Chapter 1 Introduction

Joyful the sound, the world goes around

From father to son, to son. . .

“Father to Son” – Queen

Contents

1.1 Overview
1.2 Main Contributions
1.3 Outline of the Thesis


1.1 Overview

In this thesis we study Reinforcement Learning (RL) techniques in a multiagent setting. RL aims to mimic living beings, in particular the way they learn in an uncertain environment: an organism learns to be autonomous and to interact optimally with the environment in which it acts. This field of Artificial Intelligence (AI) models the learning activity of an agent acting in an environment; through that interaction the agent hones its action policy.

First, what is an agent? There is no unique definition: we can think of an agent as an entity that perceives the environment in which it acts through sensors and acts upon that environment through actuators. Under this general definition, a robot moving in a room or a program that controls the load in a computer network both qualify as agents.

Another important characteristic of agents is their rationality. An agent is said to be rational if its actions can achieve one of its goals. To achieve that goal, it usually maximizes a utility function.

We have mentioned the learning phase and said that an agent learns how to interact with the environment it perceives. According to [13], a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. So, what does an agent learn? In an RL problem, an agent learns a policy, i.e., which action to perform in each state in order to reach a goal by maximizing some measure of the long-term expected payoff.
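As an illustration, one widely used measure of long-term payoff is the expected discounted return. The text above does not commit to a specific measure, so the discount factor $\gamma$ and the reward symbol $r_{t+1}$ below are assumptions introduced only for this sketch:

```latex
\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \right],
\qquad 0 \le \gamma < 1 .
```

Here $\pi$ maps states to actions, and a smaller $\gamma$ weights immediate rewards more heavily than distant ones.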

RL has been widely studied in the single-agent case, which may nowadays be considered mature. The learning phase is autonomous and is not the result of supervised learning, in which an agent is taught how to behave through known examples. Hence we are dealing with a plain trial-and-error framework, in which an agent learns how to act from its past experience alone, without any external direction. At the outset, however, a designer must still supervise the agent's internals by carefully tuning its parameters. Moreover, compared with the natural learning of living beings, this framework lacks grounding knowledge (reuse of past knowledge in similar domains, composition of complex actions from simple ones, ...).
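The trial-and-error loop described above can be sketched with tabular Q-learning, one of the algorithms reviewed in Chapter 2. The corridor environment, the function names and the parameter values are hypothetical choices made for this sketch; they are not the experimental setup of this thesis.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    States are 0..n_states-1; action 0 moves left, action 1 moves right.
    Reaching the right-most state yields reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Epsilon-greedy action selection: explore with probability epsilon.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                best = max(q[state])
                action = rng.choice([a for a in range(2) if q[state][a] == best])
            # Deterministic corridor dynamics (the agent never sees this model).
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Q-learning update: move the estimate toward the bootstrapped target.
            future = 0.0 if next_state == goal else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
            if state == goal:
                break
    return q


def greedy_policy(q):
    """Return the greedy action for each state (ties resolve to action 0)."""
    return [max(range(2), key=lambda a: row[a]) for row in q]
```

The agent is never told that moving right is correct; it only observes rewards, which is the sense in which the learning is unsupervised by examples. The step size `alpha` and the exploration rate `epsilon` are the kind of internal parameters that a designer still has to tune.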

A natural extension of this framework considers more than one agent acting in an environment. This extension of RL is usually known as RL MultiAgent System (RL-MAS) and is relatively new compared with the single-agent case. The extension is natural, since it lets us tackle large and complex problems involving distributed reasoning, knowledge and data management. It is also clearly more complex, since we must understand the interactions between agents and environment as well as among the agents themselves. As a consequence, agents face a non-stationary environment, because it is influenced by each agent's policy, and the learning phase becomes considerably more intricate.

In addition, agents' behavior may be formalized as cooperative or selfish, and that behavior must of course be compatible with the agents' main goal (which is usually a Nash Equilibrium). In this thesis we mainly focus on cooperative behavior: we have a goal that, in order to be fully satisfied, must be reached cooperatively by all agents. If we allow selfish behavior, we may run into two main situations: the Tragedy of the Commons (TOC) or a liquidity trap ([23]). In the former, the greed of each individual works to lower the world utility, i.e., the measure by which the overall emergent behavior of the agents is rated. The latter occurs when a behavior of a subset of agents, if adopted by all agents, results in lower values of the world utility.

Even in this case the RL approach remains valid, since it focuses on a functional treatment of goal-oriented problems. The goal is formalized as a reward signal that the environment assigns to each agent and that depends on the agents' behavior. Hence, RL rests mainly on the reward signal and on how agents use it. Through the definition of the reward signal we can model different behaviors, especially in the multiagent case; through its use we can impose on an agent different ways of learning an optimal action policy. An innovative approach is COllective INtelligence (COIN), proposed by D. Wolpert and K. Tumer (see [6]), which computes the reward to be assigned to each agent in an intuitive way (see Chapter 4). This reward computation is compared with other simple approaches (team game, selfish utility, ...; see [6] and [24]) and is analyzed using the Q-learning dynamics (see [20]) in order to show how COIN can induce cooperative behavior among agents while obtaining good performance.
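To make the difference between reward schemes concrete, the following sketch contrasts a team-game reward with a difference reward in the style of the Wonderful Life Utility discussed in Chapter 4. It assumes a Bar-Problem-like world utility in which each night contributes x·exp(−x/c) for attendance x and capacity c. This functional form, the function names and the numbers are assumptions made for illustration; the definitions actually used in this thesis are those given in the later chapters.

```python
import math


def night_utility(attendance, capacity):
    """Assumed per-night utility: grows with attendance, then decays when crowded."""
    return attendance * math.exp(-attendance / capacity)


def world_utility(attendances, capacity):
    """World utility as the sum of the per-night utilities."""
    return sum(night_utility(x, capacity) for x in attendances)


def team_game_reward(attendances, capacity, night):
    """Team game: every agent receives the full world utility, whatever it did."""
    return world_utility(attendances, capacity)


def difference_reward(attendances, capacity, night):
    """Difference reward: world utility minus the world utility obtained
    when the agent that attended `night` is removed from the system."""
    without_agent = list(attendances)
    without_agent[night] -= 1
    return world_utility(attendances, capacity) - world_utility(without_agent, capacity)


# Hypothetical week with two nights: night 0 is overcrowded, night 1 is not.
attendances = [10, 2]
capacity = 3
crowded = difference_reward(attendances, capacity, night=0)
quiet = difference_reward(attendances, capacity, night=1)
```

With these numbers the team-game reward is identical for both nights, so it tells an agent nothing about its own choice, whereas the difference reward is negative for an agent on the overcrowded night and positive for an agent on the quiet one.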

Furthermore, there is another extension of this framework: we allow a subset of agents to form a coalition so that agents can reach a goal with peculiar characteristics ([16]). Once again, the RL approach is still useful in this new MAS setting. Besides formalizing the interactions between environment and agents, as stated above, here we must deal with the interactions among coalitions of agents in order to find a suitable learning policy to reach a goal. Another critical point is the use of the reward, which the environment assigns to each coalition of agents. The way we split the reward signal among agents is crucial for assigning different priorities and/or importance to the agents belonging to a coalition. The two most important techniques for splitting the reward among agents are the Shapley value and the core ([16]). The former relies on the order in which agents join a coalition, so the reward of each agent depends on that joining order as well as on the joint action. The latter focuses on the stability of a coalition: no further change of coalition occurs, because no agent can achieve more by changing its policy.
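The two payoff-division concepts can be illustrated on a toy game. The sketch below uses the textbook definitions: the Shapley value averages each agent's marginal contribution over all joining orders, and a payoff vector is in the core when it distributes the full value of the grand coalition and no sub-coalition could obtain more on its own. The three-agent game, its characteristic function `kitchen_value` and all names are hypothetical and serve only as an example.

```python
from itertools import combinations, permutations


def shapley_values(agents, value):
    """Average each agent's marginal contribution over every joining order."""
    totals = {agent: 0.0 for agent in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition = frozenset()
        for agent in order:
            joined = coalition | {agent}
            totals[agent] += value(joined) - value(coalition)
            coalition = joined
    return {agent: total / len(orders) for agent, total in totals.items()}


def in_core(agents, value, payoff, tol=1e-9):
    """Check efficiency and that no proper sub-coalition is paid less than its own value."""
    if abs(sum(payoff[a] for a in agents) - value(frozenset(agents))) > tol:
        return False
    for size in range(1, len(agents)):
        for subset in combinations(agents, size):
            if sum(payoff[a] for a in subset) < value(frozenset(subset)) - tol:
                return False
    return True


def kitchen_value(coalition):
    """Hypothetical characteristic function: a team is worth 1 only if it
    contains the cook and at least one helper."""
    has_cook = "cook" in coalition
    has_helper = any(name.startswith("helper") for name in coalition)
    return 1.0 if has_cook and has_helper else 0.0


AGENTS = ["cook", "helper_1", "helper_2"]
shapley = shapley_values(AGENTS, kitchen_value)
cook_takes_all = {"cook": 1.0, "helper_1": 0.0, "helper_2": 0.0}
```

In this toy game the Shapley value gives 2/3 to the cook and 1/6 to each helper, yet that division is not in the core, because the cook and one helper could form a team worth 1 while jointly receiving only 5/6. Enumerating every joining order also shows why the Shapley value becomes expensive as the number of agents grows: the number of orders is factorial in the number of agents.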

1.2 Main Contributions

The aim of this thesis is twofold. First, we analyze different methods used to coordinate a set of agents and study the pros and cons of these methods in particular environments known in the literature. Next, we propose a new type of game (task allocation via coalition formation games), which involves task allocation problems with a set of heterogeneous agents. Plain task allocation problems focus on how to assign a number of tasks to a (large) number of agents through a suitable partition ([6] and [19]). Coalition formation games ([16]) focus on games with a set of coalitions of agents, where each coalition may be seen as a super-agent acting in an environment. Each coalition gains a reward based on the order in which agents joined it as well as on the joint action; the reward must then be distributed among the agents belonging to the coalition. The new class of games proposed here takes the most significant parts of coalition formation games and task allocation games in order to formalize different real situations, in which different types of agents face a set of tasks that must be executed with a given balance of agent types. The way the agents learn to coordinate themselves induces the overall behavior and, given that behavior, we can evaluate how the agents have acted.

By adopting this new kind of game we have tackled well-known problems from the literature (the Cooking Teams Problem, [25]), and we show how these games formalize agents' behavior under different configurations of the environment and of the agents. We found interesting results about the curse of the state space, and we explain why this game does not work well with any kind of state information across different configurations of agents and environment.

1.3 Outline of the Thesis

This thesis is organized as follows:

Chapter 2: we give a brief introduction to RL, from both the single-agent and the multiagent viewpoints. We describe the most popular algorithms and show how they have been adapted to the multiagent case. Furthermore, we describe how some concepts of Game Theory are used in a multiagent environment, together with the pros and cons of the solutions found.

Chapter 3 : we give an overview of the COllective INtelligence theory (COIN) and of its theoretical basis in different scientific fields, in order to understand how it is structured and what its key underlying idea is.

Chapter 4 : we analyze COIN in depth, together with the kind of problems it is
designed to solve. A key feature of this theory is that it avoids building a model of the
dynamics of the environment we want to understand in order to find a suitable solution,


that is, how to induce agents acting in that environment to behave so as to satisfy a goal. Instead, with this backward theory, given an environment, a set of agents and a goal, we are able to find a convenient solution to the learning problem without needing to build a dynamic model of the different interactions among agents. First, we show some useful functions used to measure the quality of learning; then we describe the characteristics such functions should have in order to obtain good, inexpensive and reusable solutions to the learning phase.

Chapter 5 : we briefly describe different environments where COIN has been

applied, and we show how it has been applied, the results obtained and the

key factors of that theory.

Chapter 6 : we introduce different challenging problems of RL related to COIN

and to the examples presented in Chapter 5. We motivate these difficulties

using the Q-learning dynamics. Finally, we introduce the coalition formation
approach that will be extended in Chapter 7.

Chapter 7 : we introduce a new kind of game involving both coalition formation
games and task allocation ones. This new class of games takes the most

significant features from task allocation and coalition formation games in

order to formalize different real environments, where we have different types

of agents undertaking tasks.

Chapter 8 : we explain the results obtained with the experiments proposed in

Chapter 5 and how COIN was changed in order to improve agents’ performance. Furthermore, we present the results obtained with the new class of games on a well-known problem in the literature (the Cooking Teams problem, [25]).

Chapter 9 : we discuss the work developed in this dissertation and we present some future directions that can be studied further starting from this thesis.


Part I

State of the Art


Chapter 2 Reinforcement Learning

Reinforcement learning is learning what to do

R. Sutton, A. Barto

Contents

2.1 Learning from Interaction
2.1.1 TD
2.1.2 Q-learning
2.2 Multi-Agent Learning
2.2.1 Change or Learn Fast
2.2.2 Change & Keep
2.2.3 Minimax Q-learning
2.2.4 Nash Q-learning and Friend or Foe Q-learning


As reported in [17], with reinforcement learning (RL for short) a generic agent (hardware or software) discovers by trial and error which actions to execute. It is interesting to note that in most cases the actions taken may affect not only the immediate reward, but also the future ones. These characteristics (trial-and-error and delayed reward) are the most significant of the reinforcement learning problem formulation.

Reinforcement Learning has become a standard framework for studying how agents can learn a course of actions when they act in an unknown or merely uncertain environment. Notice that reinforcement learning is completely different from supervised learning: in the latter case a supervisor provides a set of pre-classified examples, which the learner uses to build an approximate relationship between the elements and the results associated with them in that set.

In the next sections we briefly introduce these facets of Artificial Intelligence

explaining some methodologies applied in this thesis. A deeper presentation may

be found in [15] and [17].

2.1 Learning from Interaction

In many problems we may not have deep knowledge of the environment where an agent acts. The aim of reinforcement learning is to build an action policy based on one of the following two methodologies:

model based : with these methods agents use their past experience to build a model of the environment where they act, which allows them to construct an approximation of the state transition and reward functions.

model free : these methods try to learn value functions directly from the reward signal without building any kind of model of the environment.

Usually, model-free methods aim to estimate the expected value of the reward signal using a formula like

$$X_k = X_{k-1} + \alpha_k \, (x_k - X_{k-1}), \tag{2.1}$$

where $X_k$ is the new estimated value, which includes information from the $k$ samples $x_1, \ldots, x_k$ of a random variable, while $\alpha_k \in [0, 1]$ is the learning rate parameter. Given


some particular values of $\alpha_k$ we obtain different update rules from Equation (2.1):

• with $\alpha_k = \frac{1}{k}$ we have the sample mean of the instances $\{x_1, x_2, \ldots, x_k\}$;

• with $\alpha_k = \alpha$, $0 < \alpha < 1$, we have a weighted mean that emphasizes recent values of $x_k$.
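The two choices of learning rate in Equation (2.1) can be compared with a minimal Python sketch. The function name `running_estimate` and the starting value $X_0 = 0$ are assumptions made for illustration; they do not come from the thesis.

```python
def running_estimate(samples, alpha=None):
    """Apply X_k = X_{k-1} + alpha_k * (x_k - X_{k-1}) to each sample in turn.

    alpha=None selects alpha_k = 1/k (sample mean); a float in (0, 1)
    selects a constant rate, which weights recent samples more heavily.
    The starting value X_0 = 0.0 is an assumption of this sketch.
    """
    estimate = 0.0
    for k, x in enumerate(samples, start=1):
        alpha_k = 1.0 / k if alpha is None else alpha
        estimate += alpha_k * (x - estimate)
    return estimate


samples = [2.0, 4.0, 6.0, 8.0]
mean_estimate = running_estimate(samples)          # alpha_k = 1/k
recent_estimate = running_estimate(samples, 0.5)   # constant alpha = 0.5
```

With these samples the first estimate coincides with the arithmetic mean (5.0), while the constant-rate estimate (6.125) is pulled towards the most recent, larger samples.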

Many RL algorithms exist, and they work in different ways. Some of the most used belong to the class of Temporal Difference (TD) algorithms. They are classified as on-policy and off-policy: the former evaluate and try to improve the same policy that is used both in the learning stage and in the control stage (e.g. SARSA), while the latter typically use two different policies for the learning step and the control step (e.g. Q-learning). Many other methodologies in this wide area of artificial intelligence may be found in [15] and [17].

2.1.1 TD

These methods learn directly from past experience without any kind of model of the environment (model-free methods). The simplest method is TD(0), which updates the state value function $V(s)$ with the following rule:

$$V(s_t) = V(s_t) + \alpha \left[ r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t) \right], \tag{2.2}$$

where $r_{t+1}$ is the reward obtained, $s_t$ is the current state at time step $t$, $\alpha$ is the learning rate and $\gamma$ is the discount factor. The agent chooses an action $a_t$ based on its policy $\pi(s_t)$ and, after executing it, finds itself in state $s_{t+1}$; it then updates the value of $V(\cdot)$ as stated in Equation (2.2).
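A single TD(0) step from Equation (2.2) can be sketched in Python as follows. The tabular representation (a dictionary with default value 0), the state labels and the numeric values of the learning rate and discount factor are assumptions chosen only for illustration.

```python
from collections import defaultdict


def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: V(s) moves towards r_next + gamma * V(s_next)."""
    td_error = r_next + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V[s]


# Tabular value function; unseen states start at 0 (an assumption).
V = defaultdict(float)
td0_update(V, "A", 1.0, "B")   # transition A -> B with reward 1
td0_update(V, "B", 0.0, "A")   # transition B -> A with reward 0
```

After the first call the value of state "A" has moved one tenth of the way towards the observed reward; the second call propagates part of that value back to state "B" through the discounted term.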

2.1.2 Q-learning

One of the most important and most used methods in RL is the Q-learning algorithm: given an action $a_t$ chosen according to the policy $\pi(s_t)$ and a reward $r_{t+1}$ obtained in $s_{t+1}$ after executing that action, this algorithm updates the function $Q(\text{state}, \text{action})$ as follows:

$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \tag{2.3}$$
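The update of Equation (2.3) might be written as the following Python sketch. The tabular dictionary, the action labels and the parameter values are illustrative assumptions, not part of the thesis.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: bootstrap on the best action in the next state."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r_next + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


ACTIONS = ("left", "right")       # hypothetical action set
Q = defaultdict(float)            # unseen pairs start at 0 (an assumption)
Q[("B", "right")] = 2.0           # pretend this pair was already learned
q_learning_update(Q, "A", "left", 1.0, "B", ACTIONS)
```

The maximization over the next actions is what makes the method off-policy: the target uses the greedy action in the next state, regardless of which action the agent will actually execute there.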


2.2 Multi-Agent Learning

The algorithms sketched above are widely used in systems where only one agent operates. The Q-learning algorithm has also been applied in environments where many agents act, with some underlying changes that allow them to cooperate rather than operate in a self-interested fashion.

First of all it is important to note that multiagent environments are intrinsi-

cally non-stationary, because the agents are learning, so their policies are changing.

These non-stationary changes cannot be foreseen by other agents and related pay-

offs may be misleading, thus negatively affecting cooperation.

Some learning algorithms for multiagent systems focus on each single agent’s behavior, to find admissible policies leading it to an equilibrium (typically a Nash equilibrium). We point out that these algorithms may impose strong conditions for convergence and, under some other conditions (e.g. when there is more than one Nash equilibrium), they need coordination mechanisms that are treated by theories other than reinforcement learning. Some of the most widely studied and applied algorithms are minimax Q-learning (minimax-Q) in [7], Nash Q-learning (Nash-Q) in [8] and [12], Friend or Foe Q-learning (FoF-Q) and Correlated Equilibrium Q-learning (CE-Q) in [9].

Other algorithms focus on maximizing the reward obtained by an agent acting in an environment, supposing that its actions have no side effect on other agents. As a consequence, an agent learns a policy that fits the actions of the other agents. Some of these algorithms are Infinitesimal Gradient Ascent (IGA), Win or Learn Fast Infinitesimal Gradient Ascent (WoLF-IGA) and Win or Learn Fast Policy Hill Climbing (WoLF-PHC), all discussed in [4].

Many other algorithms learn an optimal policy for an agent by focusing on cooperation with the other agents acting in the same environment. These algorithms may require more or less strong conditions to hold, such as knowledge about the actions taken by other agents. In this class we include Change or Learn Fast (CoLF) and Change & Keep (CK) (see [1]), Independent Learner (IL) and Joint Action Learner (JAL) in [5], and Distributed Q-learning in [11].

In the following sections we briefly explain some of the interesting algorithms mentioned above.

2.2.1 Change or Learn Fast

The CoLF algorithm [1] builds on the idea of a variable learning rate, used to learn quickly while the agent is losing and slowly while the agent is winning. Here the different learning rates are meant to foster cooperation: if an agent receives a payoff that has changed unexpectedly, it learns slowly; otherwise it learns fast.

For each pair 〈state, action〉 (which is the argument of the function $Q(\cdot, \cdot)$) we compute P-values and S-values. P-values are exponential averages of the collected payoffs with weight factor $\lambda$ ($\lambda \in (0, 1)$), while S-values are exponential averages of the absolute differences between the current payoff and the respective P-value. This algorithm uses two different learning rates, $\alpha_{NS}$ (for payoffs with rapid variations) and $\alpha_S$ (for agents with nearly stationary policies), such that $\alpha_{NS} < \alpha_S$. The choice of which learning rate is used in the update of $Q(s, a)$ depends on whether the absolute difference between the current payoff and the respective P-value is greater than the respective S-value.

P-values are an estimate of the expected payoffs and S-values are an estimate of their variability. If the actual payoff is close to the expected one, we assume that the environment is sufficiently stationary and we update the Q-values with a high learning rate ($\alpha_S$). On the other hand, when the current payoff differs greatly from the associated P-value, we use $\alpha_{NS}$ to update the Q-values in order to reduce the non-stationary effects of the environment.
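The learning-rate selection described above can be sketched in Python. This is only an illustration under stated assumptions, not the implementation from [1]: the class name `CoLFLearner`, the parameter values, the zero initialization of the tables, the use of $\lambda$ as the weight of the newest sample, and the choice to compare the payoff with the P-value and S-value before refreshing them are all assumptions.

```python
from collections import defaultdict


class CoLFLearner:
    """Q-learner that picks its learning rate from payoff stationarity."""

    def __init__(self, actions, alpha_s=0.5, alpha_ns=0.05, lam=0.1, gamma=0.9):
        assert alpha_ns < alpha_s
        self.actions = actions
        self.alpha_s, self.alpha_ns = alpha_s, alpha_ns
        self.lam, self.gamma = lam, gamma
        self.Q = defaultdict(float)
        self.P = defaultdict(float)  # exponential average of payoffs
        self.S = defaultdict(float)  # exponential average of |payoff - P|

    def update(self, s, a, payoff, s_next):
        key = (s, a)
        deviation = abs(payoff - self.P[key])
        # Assumption: the test uses the P- and S-values held before this sample.
        alpha = self.alpha_ns if deviation > self.S[key] else self.alpha_s
        self.P[key] += self.lam * (payoff - self.P[key])
        self.S[key] += self.lam * (deviation - self.S[key])
        best_next = max(self.Q[(s_next, b)] for b in self.actions)
        self.Q[key] += alpha * (payoff + self.gamma * best_next - self.Q[key])
        return alpha  # returned only to make the choice observable


learner = CoLFLearner(actions=("c", "d"))
first = learner.update("s", "c", 1.0, "s")    # unexpected payoff: slow rate
second = learner.update("s", "c", 0.1, "s")   # payoff equals P-value: fast rate
```

In this run the first payoff deviates from the (initially zero) P-value by more than the S-value, so the slow rate is used; the second payoff matches the updated P-value, so the fast rate is used.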

2.2.2 Change & Keep

The CK algorithm (see [1]) is based on the following plain observation: when an agent chooses a different action due to either learning or exploration, it typically obtains an uninformative payoff. Such action changes cannot be foreseen by the other agents and consequently the resulting payoffs may be deceptive, thus negatively affecting cooperation.

The proposed idea is to discard the payoff received immediately after a change in the selected action, so the update of the Q-value is suspended. The agent repeats the same action, and only then is the related Q-value updated. This temporary suspension of the update phase gives the other agents time to react to the new action, thus yielding a more informative payoff for the update of the related Q-value.

This behavior can be described by a simple finite state machine: starting from state $s_C$, as long as the agent selects the same action it remains in that state, where it updates the associated Q-value and chooses a new action according to its strategy. When the selected action changes, the agent moves to state $s_K$, where it suspends the update phase and selects the same action again; it then updates the related Q-value and returns to $s_C$.
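The two-state machine described above might be sketched in Python as follows. The class name `ChangeAndKeep`, the state labels "C" and "K", and the interface (a `step` method that receives the action proposed by the agent's strategy and returns the action to execute together with a flag saying whether the resulting payoff should be used) are assumptions of this sketch. It also assumes that the very first action does not count as a change and that the proposal made while in the keep state is ignored.

```python
class ChangeAndKeep:
    """Decides which payoffs may be used to update the Q-values."""

    def __init__(self):
        self.state = "C"
        self.last_action = None

    def step(self, proposed_action):
        """Return (action_to_execute, use_payoff_for_update)."""
        if self.state == "K":
            # Keep: repeat the changed action; its payoff is now informative.
            self.state = "C"
            return self.last_action, True
        changed = (self.last_action is not None
                   and proposed_action != self.last_action)
        self.last_action = proposed_action
        if changed:
            # Change: execute the new action but discard its first payoff.
            self.state = "K"
            return proposed_action, False
        return proposed_action, True


ck = ChangeAndKeep()
trace = [ck.step(a) for a in ["a", "a", "b", "a", "b"]]
```

In the trace, the switch from "a" to "b" produces one step whose payoff is discarded, followed by a forced repetition of "b" whose payoff is used for the update.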

2.2.3 Minimax Q-learning

The Q-learning algorithm converges for any learning rate schedule $\alpha$ that satisfies the usual stochastic approximation conditions, but the values chosen for that factor influence the learning speed. The algorithm converges to the true value of the pair 〈state, action〉 through continuous refinement of the estimates, based on the payoffs received and on the current estimates.

In zero-sum games with two players (usually identified as max and min), the payoff associated with a state $s_t$ is related to the best action chosen by the other player in state $s_{t+1}$.¹ The estimate of the Q function is computed in a different way depending on the player: the max player computes $\min_{a'} Q(s_{t+1}, a')$ in the update equation of the Q function, while the min player computes $\max_{a'} Q(s_{t+1}, a')$ in the same equation.

When the two players execute their actions at the same time, the Q function takes both actions as arguments and the value of a state is defined as

$$V(s) = \max_{a \in A} \min_{o \in O} Q(s, a, o), \tag{2.4}$$

where $a$ indicates the action chosen by the agent, while $o$ is the action chosen by the opponent.
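Equation (2.4) can be illustrated for a single state with a short Python sketch. The nested-dictionary layout, the action labels and the numeric Q-values are made up for the example; the sketch covers only the pure-action form written in Equation (2.4).

```python
def maximin_value(q_s):
    """V(s) = max_a min_o Q(s, a, o) for one state.

    q_s maps each own action a to a dict {opponent action o: Q(s, a, o)}.
    Returns the state value and the action that attains it.
    """
    worst_case = {a: min(row.values()) for a, row in q_s.items()}
    best_action = max(worst_case, key=worst_case.get)
    return worst_case[best_action], best_action


# Hypothetical Q-values for one state of a two-player zero-sum game.
q_s = {
    "up":   {"left": 3.0, "right": -1.0},
    "down": {"left": 1.0, "right": 2.0},
}
value, action = maximin_value(q_s)
```

Here "up" can yield the largest payoff, but its worst case is -1, whereas "down" guarantees at least 1; the maximin rule therefore selects "down".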

¹ This is due to the term $\max_{a'} Q(s_{t+1}, a')$ of Equation (2.3), since that player chooses in turn the action to execute.


2.2.4 Nash Q-learning and Friend or Foe Q-learning

The Nash-Q algorithm is closely related to the previous one, but now each agent keeps a copy of the Q functions of the other agents: as a consequence, the Q function is defined as $Q(s, a^1, a^2, \ldots, a^n)$, where $a^i$ indicates the action of the $i$-th agent. There must be one and only one Nash equilibrium in order to guarantee the convergence of the algorithm: this is a strong condition, since in non-zero-sum games there may exist more than one Nash equilibrium. In that case some coordination mechanism must exist so that all agents reach the same equilibrium.

In [8] the authors give the conditions that must hold in order to have convergence in Markovian non-zero-sum games:

• each state must be visited an infinite number of times;

• the learning rate $\alpha$ must satisfy the following hypotheses:

– $0 \le \alpha_t(s, a^1, \ldots, a^n) \le 1$, $\sum_{t=0}^{\infty} \alpha_t(s, a^1, \ldots, a^n) = \infty$, $\sum_{t=0}^{\infty} \left[\alpha_t(s, a^1, \ldots, a^n)\right]^2 < \infty$;

– $\alpha_t(s, a^1, \ldots, a^n) = 0$ if $(s, a^1, \ldots, a^n) \neq (s_t, a^1_t, \ldots, a^n_t)$.
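As a hedged illustration of the learning-rate hypotheses above (the schedule is an assumption chosen for the example and is not taken from [8]), consider a rate that decays with the number of visits: on the $k$-th visit of a given state and joint action the rate is $1/k$, and it is zero at every step in which that tuple is not the one being visited. If the tuple is visited infinitely often, both sums behave as required, whereas a constant rate fails the second one.

```latex
% Assumed schedule: \alpha = 1/k on the k-th visit of a given tuple,
% \alpha = 0 at every step in which that tuple is not visited.
\sum_{k=1}^{\infty} \frac{1}{k} = \infty,
\qquad
\sum_{k=1}^{\infty} \frac{1}{k^{2}} = \frac{\pi^{2}}{6} < \infty .
% A constant rate \alpha = c > 0 violates the square-summability condition:
\sum_{k=1}^{\infty} c^{2} = \infty .
```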

Bowling studied this kind of game [3] and refuted the previous theorem by giving a counterexample; he added some stronger conditions that must hold on the initial values of the Q function to guarantee the convergence of this algorithm.

The Friend or Foe Q-learning algorithm proposed by Littman overcomes some limitations of the previous one. The convergence of the Nash-Q algorithm towards a Nash equilibrium is guaranteed if a Nash equilibrium exists for the opponents and a coordination equilibrium exists, both defined for all the games associated with the Q functions seen in the learning phase. This condition implies a priori knowledge of which equilibrium an agent may reach: if an agent is friendly (hence the word “friend” in the name of this algorithm), it reaches a coordination equilibrium (and it applies the classic Q-learning algorithm); otherwise (“foe” case) it reaches an equilibrium of the opponents (and it applies the minimax-Q algorithm).


Chapter 3 COIN: COllective INtelligence

The rich behavior of social insect colonies arises not from

the sophistication of any individual entity in the colony,

but from the interactions among those entities

D.H. Wolpert, K. Tumer

Contents

3.1 Preamble
3.2 Background
3.2.1 Artificial Intelligence and Machine Learning
3.2.2 Social Science-Inspired Systems


3.1 Preamble

In the last decade, two particular fields of computer science have been deeply studied, and their intersection may be very fruitful. The first concerns ways to adaptively control distributed systems that have little (if any) centralized communication, with minimal knowledge of their dynamic behavior. The second is Reinforcement Learning (RL), a branch of machine learning concerned with an agent acting in an environment from which it receives rewards evaluating its behavior (see Chapter 2, [15] and [17]). The goal of an RL algorithm is to define how, using those rewards, an agent may update its policy to maximize its utility.

We might hope that RL can be used in the control scenario introduced above, since RL is adaptive and is not restricted to a particular domain. However, already in the single-agent case, an agent acting in a generic environment must face computational limitations, since the policy space may be too large. We might introduce more agents (recall that each agent has its own utility function), each controlling only part of the system. Unfortunately, in doing so we implicitly introduce a new global reward function (from now on we refer to it as the world utility) evaluating the overall behavior emerging from the environment. All agents have their private utility: how can we map the world utility into the private utilities of the agents? Moreover, can we find a criterion for choosing the utility function of each agent such that each agent can maximize its own utility function while the overall behavior increases the world utility?

With the term COllective INtelligence (COIN) we refer to any pair consisting of a large, distributed collection of interacting computational processes, among which there is little to no centralized communication or control, and a world utility function that rates the possible dynamic histories of the collection. If each process uses an RL algorithm, we are interested in studying how to set the utility function of each RL algorithm so as to achieve high values of the world utility without any prior knowledge of the dynamics or any model of the system.


3.2 Background

There are many computationally distributed systems where restrictions on central-

ized communication or on the central controller may exist. Moreover, the controller

may be uncertain about what kind of algorithm it may use for the control phase.

Just a few of the potential examples include:

• vehicular traffic control;

• control system for routing over a communication network;

• control system for a team of planetary exploration rovers.

These systems may be controlled with an artificial COIN, although COINs reach beyond these engineered fields. The COIN design problem (that is, how to induce a cooperative behavior among a set of self-interested agents) is an inverse problem, whereas most of the related scientific fields are concerned with systems that are best characterized as a “forward problem”: given the dynamic laws of each single part of a system, determine its overall behavior. We wish instead for a way to configure each dynamic law so as to induce an expected global behavior.

As an example we may consider a generic country with a capitalist economy, where the world utility is a mean of the gross domestic product (GDP), while the reward functions of the agents may be related to the achievement of their private goals.

In general, to achieve high world utility values in a COIN, agents that do not have common goals (self-interested agents) should not work at cross-purposes. When they do, the system may suffer from the economic phenomenon known as the Tragedy of the Commons (TOC), where the avarice of each individual works to lower the world utility. Another undesired phenomenon is the liquidity trap, in which a behavior helps the world utility when adopted by a subset of agents but, if employed by all agents, results in lower values of the world utility.

To have a clear viewpoint, this is what we mean by COIN:


1. there are many processes running concurrently, performing actions which affect one another;

2. there is little (if any) centralized communication among processes, but we do not prohibit a broadcast communication started by a single process;

3. there is little to no centralized personalized control, but we do not prohibit the communication of a single control signal to all other processes;

4. there is a well specified task in the form of extremizing a utility function that is related to the behavior of the overall distributed system.

The following elements distinguish the typical approach to COIN:

• the approach for tackling (4) is scalable to a very large number of processes;

• the approach for tackling (4) is widely applicable, since it works with little (if any) broadcast communication, as stated in (2) and (3). Moreover, it is adaptive and robust, and it does not need deep knowledge about how the system is formalized;

• each individual process is implemented as an RL algorithm (although this is not a necessity).

3.2.1 Artificial Intelligence and Machine Learning

There is an extensive body of work in AI and Machine Learning related to COIN: in the following subsections we explain how it can be applied to approach COIN.

Reinforcement Learning

In RL (see Chapter 2) we find some interesting features suited to any distributed environment where there is neither a primary controller nor a model of the system that agents can use to learn strategies. Rather, an agent must successfully learn strategies based on the rewards it receives from the environment. Typical RL algorithms such as TD(λ) (which uses value functions) and Q-learning (which uses an evaluation of the 〈state, action〉 pair) have been investigated and applied in real environments.

These features may appear suitable for COIN. Unfortunately, a single RL algorithm will not, in general, perform well on large distributed heterogeneous problems, because the policy-action space is very large. Usually one should try many RL algorithms rather than one, checking their performance in order to choose the appropriate one.

Distributed Artificial Intelligence

This field is essentially a natural extension of AI in which tasks have migrated towards parallel implementation, so we have different modules, each with a different task, working concurrently towards a common goal. To do this we have to guarantee that the task to accomplish is well modularized, to improve convergence. As a consequence, we need a central controller scheduling the various sub-tasks and processing the associated results.

Despite this evolution, distributed artificial intelligence focuses on the traditional AI ideas (reasoning, understanding, planning, learning, . . . ) rather than on their cumulative character.

Multi-Agent Systems

This field is concerned with interactions among members of a set of agents as well

as the way they act. The design of a multi-agent system with a central coordinator

may involve:

• decomposing a global task into tractable sub-tasks for each agent;

• establishing communication channels that provide a minimal amount of information to each agent to enable the execution of its sub-tasks;

• coordinating agents in such a way to guarantee cooperation towards the

global task avoiding any kind of conflicting strategies.

In practice, agents act selfishly (each one may have many utility functions) and we need to provide incentives to each agent to improve cooperation in order to avoid the TOC. To this end we may use coordination, negotiation, coalition formation or contracting. Unfortunately, these approaches tend to neglect the optimization of the overall system, at the expense of scalability and reliability.

3.2.2 Social Science-Inspired Systems

Some human economies provide examples of naturally occurring systems that may be characterized as COINs. They consider the extremization of a constrained world utility where there are strong conditions on agents and their interactions.

In this section we summarize two economic concepts related to COIN, in that they deal with how a large number of agents can cooperate.

Mechanism Design

Mechanism design is concerned with the incentives that must be given to a set of agents interacting with each other. Usually these incentives induce Pareto optimal (PO) joint actions, where no agent can do better without hurting another agent. One important scheme used as an incentive is the auction, which is applied when there are many agents interacting in an environment and exchanging goods. All auctions pursue the same goal: matching supply and demand of goods. A mechanism, such as an auction, that induces PO joint actions does not necessarily extremize the world utility function.
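The last remark can be made concrete with a small Python sketch that enumerates the Pareto optimal joint actions of a two-agent game. The payoff numbers (a prisoner's-dilemma-like table) and the choice of the sum of the agents' utilities as world utility are assumptions made only for illustration.

```python
def pareto_optimal(payoffs):
    """Return the joint actions that are not Pareto-dominated.

    payoffs maps each joint action to a tuple with one utility per agent.
    """
    def dominates(u, v):
        # u dominates v: nobody is worse off and somebody is strictly better off.
        return (all(x >= y for x, y in zip(u, v))
                and any(x > y for x, y in zip(u, v)))

    return [joint for joint, u in payoffs.items()
            if not any(dominates(v, u) for v in payoffs.values())]


# Hypothetical payoffs (agent 1, agent 2) for each joint action.
payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
po = pareto_optimal(payoffs)

# Assumed world utility: the sum of the agents' utilities.
world_utility = {joint: sum(u) for joint, u in payoffs.items()}
best_joint = max(world_utility, key=world_utility.get)
```

Three of the four joint actions are Pareto optimal, but only ("C", "C") maximizes the assumed world utility; ("C", "D") and ("D", "C") are Pareto optimal and yet yield a lower sum.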

Perfect Rationality Noncooperative Game Theory

The simplest form of a game involves two or more agents, each of which has a set of possible actions it can perform and a utility function (also known as payoff matrix for finite games) mapping any joint action chosen to an associated utility value for agent i, i.e.: Ai → R.

There are many variants, concerning both the action selection phase and the strategy selection phase. The former is related to extensive form games (each agent in turn selects its action to be executed) and the latter concerns the action that must be chosen given the state of the environment (the choice may be deterministic or stochastic).

A solution of a game (also called an equilibrium) is a profile in which every agent behaves rationally. With a Nash equilibrium (NE) we have a configuration where each agent chooses the best strategy given the strategies of the other agents. As a consequence, once all agents have reached a NE, none of them has any incentive to deviate unilaterally from that equilibrium. A game may have zero, exactly one or many NE in the pure-strategy space, while in the mixed-strategy space (where each agent plays according to a probability distribution over its strategies) Nash’s theorem guarantees the existence of at least one NE for finite games.

In cooperative game theory all agents are able to enter binding contracts with each other, so they can coordinate their strategies. In this way, agents avoid NE that are not PO.
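The notions above can be checked on small two-agent games with a Python sketch that enumerates the pure-strategy Nash equilibria. The three payoff tables (a matching-pennies-like game, a prisoner's-dilemma-like game and a coordination game) use made-up numbers and are assumptions of the example.

```python
from itertools import product


def pure_nash_equilibria(payoffs, actions_1, actions_2):
    """List the joint actions from which no agent gains by deviating alone."""
    equilibria = []
    for a1, a2 in product(actions_1, actions_2):
        u1, u2 = payoffs[(a1, a2)]
        best_1 = all(payoffs[(b, a2)][0] <= u1 for b in actions_1)
        best_2 = all(payoffs[(a1, b)][1] <= u2 for b in actions_2)
        if best_1 and best_2:
            equilibria.append((a1, a2))
    return equilibria


matching_pennies = {("H", "H"): (1, -1), ("H", "T"): (-1, 1),
                    ("T", "H"): (-1, 1), ("T", "T"): (1, -1)}
dilemma = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
coordination = {("A", "A"): (2, 2), ("A", "B"): (0, 0),
                ("B", "A"): (0, 0), ("B", "B"): (1, 1)}

ne_pennies = pure_nash_equilibria(matching_pennies, "HT", "HT")
ne_dilemma = pure_nash_equilibria(dilemma, "CD", "CD")
ne_coordination = pure_nash_equilibria(coordination, "AB", "AB")
```

The three games have zero, one and two pure-strategy NE respectively. In the dilemma the only NE is ("D", "D"), which is not PO because both agents would be better off in ("C", "C"); this is the kind of equilibrium that binding contracts allow agents to avoid.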


Chapter 4 A Framework Designed for COINs

Do not worry about your difficulties in Mathematics.

I can assure you mine are still greater.

Albert Einstein (1879-1955)

Contents

4.1 Problems with a Model-Based Approach
4.2 Nomenclature
4.2.1 Preliminary Definitions and Terminology
4.2.2 Intelligence
4.2.3 Learnability
4.3 A Descriptive Framework for COINs
4.3.1 Candidate Salient Characteristics
4.3.2 Factoredness
4.3.3 Wonderful Life Utility
4.3.4 How to Induce these Salient Characteristics?


In Chapter 3 we saw that changing an existing scientific field to encompass

systems meeting all of the requirements of COINs can be very hard. In this chapter

we motivate a new framework designed to analyze COINs, illustrating both the

nomenclature used and the basic mathematical theory (see [23]).

4.1 Problems with a Model-Based Approach

The most natural approach to building a COIN involves the following steps:

1. we build a stochastic model of the COIN’s dynamics, parametrized by a vector ϑ (it can contain the parameters used by the microlearners, the world and local utilities, . . .);

2. we solve for the function f(ϑ), which maps the parameters of that model to the resulting stochastic dynamics;

3. we wish to have a high expected value of a generic world utility wu;

4. finally, we aim to solve the inverse problem, i.e. we have to search for the ϑ that, via f(·), results in a high value of E(wu|ϑ).

We now examine some of the challenges of this approach:

• we are mainly interested in very large, complex, noisy systems which often operate in a non-stationary environment composed of many microlearners running simultaneously. Building a detailed model of such a system will be very difficult; moreover, even if we had such a model, it would be extremely complicated and hard to use;

• even for a simpler model, difficulties may arise in solving for the function f(·);

• even if we have the function f(·), the inverse problem may be impossible to solve in practice;

• in addition to these difficulties, we wish to have a high-level model that allows us to change the microlearning algorithm of each agent without having to redo the entire model each time.


There is an alternative approach that avoids these difficulties by setting up a small, higher-level model that has little to do with the detailed dynamics: it identifies salient characteristics such that, if a COIN exhibits them, its world utility will benefit. Of course, these salient characteristics must be easy to induce in a COIN.

4.2 Nomenclature

In this section we concentrate on the four salient characteristics of COIN:

intelligence : it quantifies how well a microlearner learns and performs;

learnability : it is a characteristic of a utility function that we would expect

to be well-correlated with how well a microlearning algorithm can learn to

optimize it;

factoredness : a utility function is factored if whenever its value increases, then

the overall system will benefit;

Wonderful Life Utility (WLU) : it is an example of a utility function that is

both factored and learnable.

4.2.1 Preliminary Definitions and Terminology

1. With the term microlearning the authors refer to a single RL algorithm used

by each agent of the system to modify its behavior. With COIN initialization

the authors refer to the initial construction of a COIN potentially based upon

salient characteristics. With the term macrolearning the authors refer to

any imposed run-time modifications to COIN which are based on statistical

inference concerning salient characteristics.

2. For convenience we suppose time $t \in \mathbb{Z}$. During the initialization phase we suppose $t \le 0$.

3. All variables affecting a COIN are identified as components of Euclidean-vector-valued states of various discrete nodes. The authors define $\zeta_{\eta,t}$ to be a vector in the Euclidean vector space $Z_{\eta,t}$, where $\zeta_{\eta,t}$ indicates the state of the node $\eta$ at time $t$; the $i$-th component of such a vector is indicated by $\zeta_{\eta,t;i}$.

4. For convenience we indicate with $\zeta_{,t} \in Z_{,t}$ the vector of the states of all nodes at time $t$; $\zeta_{-\eta,t} \in Z_{-\eta,t}$ refers to the vector of the states of all nodes except $\eta$ at time $t$; finally, $\zeta$ refers to the global vector of the states of all nodes at all times. Moreover we will use a shorthand notation for the gradient, e.g.: $\partial_{\zeta_{,t}} F(\zeta)$ indicates the vector of the partial derivatives of $F(\zeta)$ with respect to the elements of $\zeta_{,t}$. With $\zeta_{,t<t'}$ we will refer to all the components $\zeta_{,t}$ with $t < t'$.

5. The binary operator • is used to indicate the vector formed by concatenating

the components of two vectors, e.g.: α • β refers to the vector formed by

concatenating the components of α with the components of β.

6. The universe where a COIN operates is assumed to be completely deterministic, since the real world obeys deterministic physics. A COIN, being embedded in such a world, must be deterministic too, as must any learning algorithm acting in it. A deterministic COIN may still be based on stochastic concepts at a higher level of the working environment: there is a high level where the feasible policies are chosen (which might be stochastic), while the COIN level is deterministic because there is no uncertainty about the action to be executed.

7. To consider the determinism we bundle all variables we are not directly con-

sidering (but important for the dynamics of the system) in an environment

node.

8. The dynamics of the system is expressed by writing $\zeta_{,t'>t} = C(\zeta_{,t})$; C also denotes the subset of the ζ ∈ Z that are consistent with the deterministic laws governing the COIN. Although C(·) is defined for any argument of the form $\zeta_{,t} \in Z_{,t}$ for any t, generally not all the $\zeta_{,t} \in Z_{,t}$ lie in $C_{,t}$.

9. The authors do not impose particular boundaries either on what we mean by “COIN”, whose dynamics is given by C(·), or on what we call “macrolearning” (perturbations instigated from outside). Macrolearning goes beyond the definition of C and it may refer, for example, to any statistical inference process modifying the private utilities at runtime to induce the salient characteristics (see Section 4.2). We must take care to discern what in the system is due to C and what was changed from outside. Whatever boundary is used to distinguish C from the macrolearning, the mathematical formalization of a COIN is restricted to a system evolving according to C, irrespective of the macrolearning.

10. We are provided with some world utility G : Z → R that ranks the various conceivable worldlines of a given COIN. Since the environment node is never observed, we implicitly assume that G is not a function of its state; moreover, G does not depend on time t. Furthermore we are provided with personal utilities $g_{\eta,t} : Z \otimes Z \to R$, which are considered as “virtual” private functions typically used to analyze the behavior of the system.

11. As mentioned above, the state of any node may contain variables which represent the utility function that the associated learning algorithm (microlearning) is trying to extremize. These local variables are members of ζ and they represent the private utility functions. We recall that the personal utilities $\{g_\eta\}$ do not exist in the COIN and are not specified by any element of ζ: these functions are just used to formalize the private utilities mathematically.

4.2.2 Intelligence

Given a system and a world utility function G we will need to evaluate the performance of the system in terms of that utility function. The evaluation needs a mapping from an arbitrary worldline ζ, a utility function and an arbitrary dynamic law C into R. Such a measure will also allow us to quantify how the behavior of each microlearning algorithm is reflected in the value assumed by the personal utility evaluated at a given ζ.

We would prefer a less model-dependent approach that uses only an arbitrary utility function, a state ζ and C. This performance measure must not be a raw utility value like $g_\eta(\zeta)$, since that is not invariant with respect to monotonic transformations of $g_\eta$; moreover, it must not penalize a microlearner for failing to achieve a prefixed result when that result is impossible to achieve due to C and/or to the actions of other nodes.

A first natural approach is to generalize the game theoretic concept of “best response strategy” and consider the problem of how well η performs given the actions of the other nodes. In particular we might compare the utility of the present worldline ζ to that of the set of other worldlines ζ′ where $\zeta_{-\eta,t} = \zeta'_{-\eta,t}$, and use those comparisons to quantify the performance of the node η.

To compare the various worldlines we concentrate only on the future contributions due to replacing ζ with ζ′; if we allowed arbitrary $\zeta'_{,t<0}$, then the differences between the past components of ζ′ and ζ could modify the value of the utility regardless of the effects of any difference in the future components. For many COINs of interest we may restrict attention to those ζ′ where $\zeta'_{,t<0}$ differs from $\zeta_{,t<0}$ only in the internal parameters of the microlearner of η, differences that manifest themselves in the utility function only at times t > 0. Since these changes do not affect the t < 0 components and since we are only interested in changes of ζ that affect the utility, we impose that $\zeta_{\eta,t<0}$ be left unchanged.

In quantifying the performance of η for the behavior given by ζ we compare ζ with a set of ζ′ restricted to those ζ′ sharing the past of ζ (i.e.: $\zeta'_{,t<0} = \zeta_{,t<0}$ and $\zeta'_{-\eta,0} = \zeta_{-\eta,0}$) and with $\zeta'_{,t>0} \in C_{,t>0}$. Since $\zeta'_{\eta,0}$ is free to vary while $\zeta'_{,t<0}$ is not, ζ′ ∉ C in general. Considering these dynamically impossible ζ′ is equivalent to considering a restricted set of ζ′ with the internal parameters modified, all of which belong to C.

In a more formal way, given C and a generic measure $d\mu(\zeta_{,0})$ demarcating which points in $Z_{\eta,0}$ we are interested in, in [23] the authors define the intelligence for node η of a point ζ with respect to a utility U as follows:

$$\epsilon_{\eta,U}(\zeta) \equiv \int d\mu(\zeta'_{,0}) \cdot \Theta\left[ U(\zeta) - U\left(\zeta_{,t<0} \bullet C(\zeta'_{,0})\right) \right] \cdot \delta(\zeta'_{-\eta,0} - \zeta_{-\eta,0}), \tag{4.1}$$

where Θ(·) is the Heaviside theta function (it equals 0 if its argument is less than 0, and 1 otherwise), while δ(·) is the Dirac delta function, and we assume $\int d\mu(\zeta'_{-\eta,0}) = 1$.

Intuitively $\epsilon_{\eta,U}(\zeta)$ measures the fraction of alternative states of η (it follows that $0 \le \epsilon_{\eta,U}(\zeta) \le 1$) in which the performance of η does not improve. For example, $\epsilon_{\eta,U}(\zeta) = 0.5$ means that in 50% of the alternative states η does not improve its performance; as a particular case, note that with $\epsilon_{\eta,U}(\zeta) = 1$ the node η is fully rational.
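For a finite set of alternative states and a uniform measure, Equation (4.1) reduces to counting. The following sketch is a toy illustration under those assumptions, not the formulation of [23]: the name `intelligence`, the uniform measure and the example utility `toy_utility` are all hypothetical, and the states of the other nodes are taken to be fixed inside the utility.

```python
from fractions import Fraction


def intelligence(utility, actual, alternatives):
    """Discrete analogue of Eq. (4.1) under a uniform measure (assumption).

    Returns the fraction of alternative states of node eta whose utility
    does not exceed that of `actual`: the Heaviside term is 1 when
    U(actual) - U(alternative) >= 0. The other nodes' states are assumed
    to be held fixed inside `utility`.
    """
    not_better = sum(1 for alt in alternatives if utility(actual) - utility(alt) >= 0)
    return Fraction(not_better, len(alternatives))


# Hypothetical example: eta picks an integer in 0..9; the utility peaks at 7.
states = list(range(10))


def toy_utility(state):
    return -(state - 7) ** 2


best_response = intelligence(toy_utility, 7, states)  # no alternative does better
mediocre = intelligence(toy_utility, 5, states)       # states 6, 7 and 8 do better
print(best_response, mediocre)
```

In this toy case the best response has intelligence 1, while the choice 5 is beaten by three of the ten alternatives.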

We expect a learning algorithm of node η that is trying to improve U to have intelligence close to 1: good algorithms should have high values of intelligence. Given a particular $\zeta_{-\eta,0}$, the conditional probability that $\zeta_{\eta,0} = p$ is a monotonically increasing function of $\epsilon_{\eta,g_\eta}(\zeta_{,t<0} \bullet C(p \bullet \zeta_{-\eta,0}))$. Since for a given $\zeta_{-\eta,0}$ the intelligence $\epsilon_{\eta,g_\eta}$ is a monotonically increasing function of $g_\eta$, the probability that $\zeta_{\eta,0} = p$ is a monotonically increasing function of $g_\eta(\zeta_{,t<0} \bullet C(p \bullet \zeta_{-\eta,0}))$. It follows that the better the microlearning algorithm, the more tightly peaked the associated probability distribution over intelligence values is.

At any point ζ which is a Nash equilibrium (NE) in the set of the personal utilities $\{g_\eta\}$, the intelligence of all nodes η must equal 1. Since this is the maximum value of the intelligence, every point that is a NE in $\{g_\eta\}$ is also Pareto optimal (PO) in the associated intelligences (no deviation from such a ζ can raise any of the intelligences). If there exists at least one NE in the set $\{g_\eta\}$, then there is no PO point in the set $\{\epsilon_{\eta,g_\eta}(\zeta)\}$ that is not a NE.

4.2.3 Learnability

Intelligence can be a difficult quantity to work with. For example, fix η and consider a region centered on any ζ, with whatever utility U, where ζ is not a local maximum of U. By increasing the values U takes in that region, the intelligence $\epsilon_{\eta,U}(\zeta)$ will increase. Necessarily, the values of the intelligence at points outside that region will decrease. So intelligence has a non-local character: we cannot directly modify it to ensure that it is simultaneously high for any and all ζ.

A second general problem of intelligence regards the specification of the details of a microlearner: if these details are not available, then it can be extremely difficult to predict which of two private utilities the microlearner will be better able to learn. Moreover, even with the details, the prediction can be nearly impossible. It can therefore be difficult to determine which values of the intelligence of a private utility will accrue to various choices of those private utilities. As a consequence, macrolearning that involves modifying private utilities to directly increase intelligence with respect to those utilities can be fairly difficult.

In a team game we have gη = G for all η: using those {gη} as private utilities of

microlearners (maybe in a COIN with many agents) results in a very bad signal-to-

noise ratio, since it may be hard for any agent η to discern the effects of its actions.

As a consequence the effects of its actions upon its utility function (so upon G)

can be undetectable because there are many other processes (players) going into

determining values assumed by G. So agent η will find it difficult to decide how to

act best once the learning phase has completed, since there is nothing η can do to

achieve high intelligence.

We want a measure on U capturing these effects, but without depending on any kind of function maximization (or, generally speaking, extremization) nor on any other aspect of how the node determines its actions. Given a measure $d\mu(\zeta'_{,0})$ restricted to C, we define the utility learnability of a utility U for a node η at ζ at t = 0 as follows:

$$\Lambda_{\eta,U}(\zeta) \equiv \frac{\int d\mu(\zeta'_{,0}) \cdot \left| U\left(\zeta_{,t<0} \bullet C(\zeta_{-\eta,0} \bullet \zeta'_{\eta,0})\right) - U(\zeta) \right|}{\int d\mu(\zeta'_{,0}) \cdot \left| U\left(\zeta_{,t<0} \bullet C(\zeta'_{-\eta,0} \bullet \zeta_{\eta,0})\right) - U(\zeta) \right|} \tag{4.2}$$

The intelligence learnability is defined in the same way as Equation (4.2), replacing U(·) with $\epsilon_{\eta,U}(\cdot)$.

Equation (4.2) may be interpreted as a signal-to-noise ratio. The integrand in the numerator reflects how much of the change in U that results from replacing $\zeta_{,0}$ with $\zeta'_{,0}$ (see the term $\zeta'_{\eta,0}$) is due to the change at t = 0 of the state of node η (this is the “signal”). The integrand in the denominator reflects how much of the change in U that results from replacing ζ with ζ′ (see the term $\zeta'_{-\eta,0}$) is due to the change at t = 0 of the states of the nodes other than η (this is the “noise”). So learnability quantifies how easy it is for a microlearner to discern the consequences of its behavior in the utility function U. We presume that a microlearning algorithm will achieve higher intelligence values if provided with a more learnable private utility.
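The signal-to-noise reading of Equation (4.2) can be made concrete on finite sets of alternatives. The sketch below is only an illustration under assumptions introduced here: uniform measures, a single time step with identity dynamics, the other nodes summarized by one number, and two hypothetical utilities (`team_utility` and a `local_utility` that weights the other nodes less).

```python
def learnability(utility, own, others, own_alts, others_alts):
    """Discrete analogue of Eq. (4.2) with uniform measures (assumption).

    Numerator ("signal"): mean change in utility when only the node's own
    state is replaced. Denominator ("noise"): mean change in utility when
    only the state of the other nodes is replaced.
    """
    base = utility(own, others)
    signal = sum(abs(utility(a, others) - base) for a in own_alts) / len(own_alts)
    noise = sum(abs(utility(own, o) - base) for o in others_alts) / len(others_alts)
    return signal / noise


def team_utility(own, others):
    # Hypothetical team-game utility: everybody's contribution counts equally.
    return own + others


def local_utility(own, others):
    # Hypothetical private utility that is less sensitive to the other nodes.
    return own + 0.1 * others


own_alts = [0, 1]
others_alts = list(range(11))
team = learnability(team_utility, 0, 5, own_alts, others_alts)
local = learnability(local_utility, 0, 5, own_alts, others_alts)
print(team, local)
```

With these numbers the signal is the same for both utilities, while the noise of the second one is ten times smaller, so its learnability is ten times larger.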

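The differential learnability of a utility function U at ζ is the learnability with dµ restricted to an infinitesimal n-ball about ζ:

$$\lambda_{\eta,U}(\zeta) = \frac{\left\| \partial_{\zeta_{\eta,0}} U\left(\zeta_{,t<0} \bullet C(\zeta_{,0})\right) \right\|}{\left\| \partial_{\zeta_{-\eta,0}} U\left(\zeta_{,t<0} \bullet C(\zeta_{,0})\right) \right\|} \tag{4.3}$$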

By itself the value given by Equation (4.3) has no significance; we are interested in the ratio of differential learnabilities for different U’s at the same ζ, which gives a criterion for selecting a particular utility function U. Another significant feature is that it does not depend on the choice of some measure dµ(·). Usually, with this kind of learnability, we consider an expected value over a region with lower intelligence, since at those ζ with higher intelligence we have $\lambda_{\eta,U}(\zeta) = 0$.¹

In this form, learnabilities are not meant to capture all factors that will affect

how high an intelligence value a particular microlearner will achieve. These factors

are typically incorporated in the microlearners, so this measure may be preferably

used as a guide to improve performance.

4.3 A Descriptive Framework for COINs

In this section we present a descriptive framework for COIN, in particular the

salient characteristics and the relationship between these characteristics and per-

sonal utilities.

4.3.1 Candidate Salient Characteristics

In a framework like this one it is useful to identify certain characteristics that we expect to be associated with a COIN having large world utility. These characteristics formalize the intuition that we want COINs for which private utilities, if well initialized, will result in large values of the world utility without any bottleneck, TOC (see Section 3.2) or the like.

¹ If ζ is a maximum, then U(ζ) is a maximum too, so its derivative is 0.

One candidate for such a characteristic, related to PO, is weak triviality. Consider two worldlines ζ and ζ′, both consistent with the dynamics C of the system, such that $g_\eta(\zeta) > g_\eta(\zeta')$ for every node η. The system is weakly trivial if, for any such pair of worldlines where one Pareto dominates the other, it is necessarily true that G(ζ) > G(ζ′). In these systems, if the microlearners collectively modify ζ in a way that ends up helping all of them, then the world utility also rises. As a consequence the maxima of G are PO points for the personal utilities (note that the reverse may not hold).

Weakly trivial systems can evolve to a world utility minimum. For example, let us consider automobile traffic in the absence of any traffic control system; let each node be a driver whose private utility $g_\eta(\cdot)$ quantifies how quickly he gets to his destination ($g_\eta(\zeta)$ is large if driver η gets to his destination in a short amount of time), while G is the sum of all private utilities. This system is clearly weakly trivial (for every pair ζ and ζ′, if $g_\eta(\zeta) > g_\eta(\zeta')$ for all η then G(ζ) > G(ζ′)). Yet if there is a traffic jam (rush hour, accidents and the like) and each driver tries to get to his destination as fast as he can, the system does not result in acceptable throughput as a whole (in fact G will be low).² However, this kind of system is used in some cases, since each agent, regardless of how the others behave, can guarantee that its private utility is greater than a certain level. If we assume each agent has a large enough set of actions to guarantee such a behavior, then a weakly trivial system guarantees that the world utility is not too low. In the extreme case where each agent knows its utility for every one of its actions, the PO points are NE, so the point maximizing G is a NE too.

The main problem emerging from the weak triviality is the fact that the in-

dividual microlearners are greedy: in a COIN there is not an incentive to replace

ζ with a different worldline ζ ′ that would improve personal utility of each agent

as stated in the definition of weak triviality. Rather, the incentives applied to

each microlearner motivate the learners to behave in a way that may hurt some

of them. So, from these considerations weak triviality is not an optimal choice as

salient characteristic of a COIN.

² Obviously here we do not allow any change to private utilities.

We can assume that if the microlearners are well designed, then each one will be doing close to as well as it can given the behavior of the other nodes. So the system is more likely to be in ζ than in ζ′ if for all η we have $\epsilon_{\eta,g_\eta}(\zeta) > \epsilon_{\eta,g_\eta}(\zeta')$. A system is called coordinated if, for any pair ζ, ζ′ ∈ C such that $\epsilon_{\eta,g_\eta}(\zeta) > \epsilon_{\eta,g_\eta}(\zeta')$ for all η, we have G(ζ) > G(ζ′).

4.3.2 Factoredness

In this section, we discuss a third candidate characteristic which does not suffer from the negative aspects of weak triviality. In this case we do not replace the personal utilities $\{g_\eta\}$ with the intelligences $\{\epsilon_{\eta,g_\eta}\}$ as coordination does, but rather we consider different worldlines whose differences at time 0 involve a single node (this is more closely related to the concept of NE than to that of PO).

Say that the worldline of our COIN is ζ, while ζ′ is another worldline for which $\zeta_{,t<0} = \zeta'_{,t<0}$ and $\zeta'_{,t>0} \in C_{,t>0}$; let us restrict our attention to those ζ′ that at t = 0 differ from ζ only for node η. If for all such ζ′ we have

$$\mathrm{sgn}\left[ g_\eta(\zeta) - g_\eta\left(\zeta_{,t<0} \bullet C(\zeta'_{,0})\right) \right] = \mathrm{sgn}\left[ G(\zeta) - G\left(\zeta_{,t<0} \bullet C(\zeta'_{,0})\right) \right] \tag{4.4}$$

and if this is true for all nodes η, then that COIN is factored for the utilities $\{g_\eta\}$ at ζ at t = 0 with world utility G. Equation (4.4) states that, for any node η, given the rest of the system, if the state of that node at t = 0 changes in a way that improves the utility of that node, then it necessarily improves the world utility. So the better a microlearner performs, the larger the values of G.
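On a finite set of alternative states, the sign condition of Equation (4.4) can be checked directly. The following sketch is a toy check under assumptions made here (a single time step, identity dynamics, the other nodes summarized by one number); the function `is_factored_at` and the three example utilities are hypothetical.

```python
def sign(x):
    return (x > 0) - (x < 0)


def is_factored_at(g, G, own, others, own_alts):
    """Check the sign condition of Eq. (4.4) for one node at one point.

    For every alternative own-state, the change in the private utility g
    must have the same sign as the change in the world utility G.
    """
    return all(
        sign(g(own, others) - g(alt, others)) == sign(G(own, others) - G(alt, others))
        for alt in own_alts
    )


def world(own, others):
    # Hypothetical world utility: the total activity should be exactly 10.
    return -(own + others - 10) ** 2


def selfish(own, others):
    # Hypothetical private utility that ignores the rest of the system.
    return own


def rescaled(own, others):
    # A monotonically increasing transformation of the world utility.
    return 2 * world(own, others) + 3


alts = range(11)
team_ok = is_factored_at(world, world, 6, 6, alts)
rescaled_ok = is_factored_at(rescaled, world, 6, 6, alts)
selfish_ok = is_factored_at(selfish, world, 6, 6, alts)
print(team_ok, rescaled_ok, selfish_ok)
```

The team game utility and its monotone rescaling pass the check, while the selfish utility fails it: lowering the own state from 6 to 5 hurts the selfish utility but raises the world utility.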

For a factored system we have

$$\epsilon_{\eta,g_\eta}(\zeta) = \epsilon_{\eta,G}(\zeta) \quad \forall \eta \tag{4.5}$$

and the NE are local maxima of world utility.

It is important to note that having a factored system does not mean that a change to $\zeta_{\eta,0}$ improving $g_\eta(\zeta)$ cannot also hurt $g_{\eta'}(\zeta)$ for some η′ ≠ η: the side effects on the rest of the system due to the increase of the utility of η do not end up decreasing world utility, but they may have arbitrary effects on other private utilities.

Another fact to consider is that if $g_{\eta,t'}$ is factored with respect to G, then a change at $\zeta_{\eta,t'}$ improving $g_{\eta,t'}(\zeta_{,t<t'}, C(\zeta_{,t'}))$ improves $G(\zeta_{,t<t'}, C(\zeta_{,t'}))$, but it may hurt some $g_{\eta,t'' \neq t'}(\zeta_{,t<t'}, C(\zeta_{,t'}))$ and/or $\epsilon_{(\eta,t''),g_{\eta,t''}}(\zeta_{,t<t'}, C(\zeta_{,t'}))$.

In general we cannot have both perfect learnability and perfect factoredness. Let us suppose that $\forall t,\ Z_{\eta,t} = Z_{-\eta,t} = R$ and that the dynamics is the identity operator ($\forall t,\ C(\zeta_{,0})_{,t} = \zeta_{,0}$). If $G(\zeta_{,0}) = \zeta_{\eta,0} \cdot \zeta_{-\eta,0}$ and we assume the system is perfectly learnable (so that $g_\eta$ depends on $\zeta_{\eta,0}$ alone), then it can never be perfectly factored: any change to $\zeta_{\eta,0}$ improving $g_\eta$ may help or hurt G depending on the sign of $\zeta_{-\eta,0}$. From these considerations, we prefer a system that is as factored as possible, to keep it closer to a NE (this will be the goal of macrolearning).
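The counterexample can be checked numerically. In this sketch the world utility is taken to be the product of the two real-valued states and the private utility of η is taken to be its own state; both choices, and all function names, are assumptions made for illustration.

```python
def world_utility(own, others):
    # G as the product of the state of eta and the state of the other nodes.
    return own * others


def private_utility(own):
    # A perfectly learnable private utility: it depends on eta's state only.
    return own


def change_in_world_utility(own_before, own_after, others):
    return world_utility(own_after, others) - world_utility(own_before, others)


# Raising eta's state from 1 to 2 always improves its private utility ...
private_gain = private_utility(2.0) - private_utility(1.0)
# ... but the effect on G depends on the sign of the other nodes' state.
helps = change_in_world_utility(1.0, 2.0, others=3.0)
hurts = change_in_world_utility(1.0, 2.0, others=-3.0)
print(private_gain, helps, hurts)
```

The same move that improves the private utility raises G when the other state is positive and lowers it when the other state is negative, so the sign condition of Equation (4.4) cannot hold everywhere.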

If a system is factored for some utilities $\{g_\eta\}$, then it will be factored for any utilities $\{g'_\eta\}$ where, for all η, $g'_\eta$ is a monotonically increasing function of $g_\eta$.

Theorem 4.1. A system is factored at all ζ ∈ C if and only if for all those ζ and for all η we can write:

$$g_\eta(\zeta) = \Psi_\eta\left(\zeta_{,t<0}, \zeta_{-\eta,0}, G(\zeta)\right) \tag{4.6}$$

for some function $\Psi_\eta(\cdot, \cdot, \cdot)$ such that $\partial_G \Psi_\eta(\zeta_{,t<0}, \zeta_{-\eta,0}, G(\zeta)) > 0$ for all ζ ∈ C.

Theorem 4.1 (the proof is in [23, Section 4.3.2]) lets us guarantee that a system is factored without any concern for C. As an example, consider a team game (see Section 4.2.3) where $g_\eta = G\ \forall \eta$: these COINs are obviously factored regardless of C; in fact, if $g_\eta$ increases then G necessarily increases too.

4.3.3 Wonderful Life Utility

In practice, team game utilities are often poor choices for personal utilities, due both to their low learnability and to the fact that they require centralized communication. Let us define, for t = 0, the effect set $C^{eff}_\eta(\zeta)$ of node η at ζ as the set of all components $\zeta_{\eta',t}$ for which $\partial_{\zeta_{\eta,0}} (C(\zeta_{,0}))_{\eta',t} \neq \vec{0}$: this is the set of all components $\zeta_{\eta',t}$ which would be affected by a change in the state of node η at t = 0. Moreover we define $C^{eff}_\eta$, without the dependence on ζ, as $\bigcup_{\zeta \in C} C^{eff}_\eta(\zeta)$, and $\neg C^{eff}_\eta$ as the set of the components of the space Z which are not in $C^{eff}_\eta$.


For any set σ of components (η′, t) define $CL_\sigma(\zeta)$ as the vector formed by clamping the components of σ in ζ to a prefixed arbitrary value (here $\vec{0}$ for all the components of σ). Given a wonderful life set σ, the value of the wonderful life utility (WLU for short) for the set σ at ζ is defined as follows:

$$WLU_\sigma(\zeta) = G(\zeta) - G(CL_\sigma(\zeta)) \tag{4.7}$$

The WLU for the effect set of node η is $G(\zeta) - G(CL_{C^{eff}_\eta}(\zeta))$, which for ζ ∈ C can be written as $G(\zeta_{,t<0} \bullet C(\zeta_{,0})) - G(CL_{C^{eff}_\eta}(\zeta_{,t<0} \bullet C(\zeta_{,0})))$.

The WLU for the effect set of node η can be viewed as the change in world utility that would have arisen if node η had never existed. The CL operation produces a new worldline without any concern for the dynamics C of the system (so the clamped worldline may not lie in C): this independence from the dynamics is a crucial strength of the WLU, since to evaluate it we do not need to infer how the world would have evolved from t = 0 after setting the state of η to $\vec{0}$.
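The clamping operation and Equation (4.7) can be sketched in a few lines. The representation chosen here is an assumption for illustration: a worldline is a dictionary keyed by (node, time) pairs, σ is a set of such keys, and `congestion` is a hypothetical world utility.

```python
def clamp(worldline, sigma, value=0.0):
    """CL_sigma: copy of the worldline with the components in sigma set to `value`.

    No dynamics is re-run: the other components are left exactly as they are.
    """
    return {key: (value if key in sigma else state) for key, state in worldline.items()}


def wlu(world_utility, worldline, sigma, value=0.0):
    """Wonderful life utility of Eq. (4.7): G(zeta) - G(CL_sigma(zeta))."""
    return world_utility(worldline) - world_utility(clamp(worldline, sigma, value))


def congestion(worldline):
    # Hypothetical world utility: negative squared total activity.
    total = sum(worldline.values())
    return -(total ** 2)


zeta = {("a", 0): 2.0, ("b", 0): 3.0, ("c", 0): 1.0}
wlu_a = wlu(congestion, zeta, {("a", 0)})
wlu_c = wlu(congestion, zeta, {("c", 0)})
print(wlu_a, wlu_c)
```

Node a, which contributes more to the congestion, receives a more negative WLU than node c, and the original worldline is left untouched by the clamping.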

If the set of all nodes is partitioned in subworlds such that all nodes belonging

to the same subworld ω share the same effect set, then those nodes will have

essentially the same personal utilities. If they have large intelligence values, this

utility sharing means that all the nodes of subworld ω behave in a coordinated

way.

Theorem 4.2.

1. A system is factored for all ζ ∈ C if and only if for all ζ and for all η we can write

$$g_\eta(\zeta) = \Psi_\eta\left(\zeta_{\neg C^{eff}_\eta}, G(\zeta)\right) \tag{4.8}$$

for some function $\Psi_\eta(\cdot, \cdot)$ such that $\partial_G \Psi_\eta(\zeta_{\neg C^{eff}_\eta}, G) > 0\ \forall \zeta \in C$.

2. A COIN is factored for the set of personal utilities equal to the associated effect set WLU.

As a generalization of point (2) of Theorem 4.2 we note that a system is factored if the personal utility of each node η is the WLU of a set $\sigma_\eta$ containing $C^{eff}_\eta$ (the proof of Theorem 4.2 is in [23, Section 4.3.3]).

To keep the presentation clear, for the remainder of this section we omit the argument $\zeta_{,t<0}$.


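Theorem 4.3. Let σ be a set containing $C^{eff}_\eta$, then:

$$\frac{\lambda_{\eta,WLU_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)} = \frac{\left\| \partial_{\zeta_{-\eta,0}} G(C(\zeta_{,0})) \right\|}{\left\| \partial_{\zeta_{-\eta,0}} G(C(\zeta_{,0})) - \partial_{\zeta_{-\eta,0}} G(CL_\sigma(C(\zeta_{,0}))) \right\|} \tag{4.9}$$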

When the ratio of gradient magnitudes in Equation (4.9) is large, the effect set WLU has much higher learnability than the team game utility. For example, suppose we have a large COIN in which η represents only a minimal part of the system; given the predominance of the nodes η′ ≠ η, the change of G due to $\zeta_{\eta',}$ is essentially independent of $\zeta_{\eta,0}$. In such circumstances Theorem 4.3 (the proof is given in [23, Section 4.3.2]) tells us that the effect set WLU for η will have a larger learnability than the world utility.

For a fixed σ, if we redefine the CL function (i.e., we clamp to another fixed value rather than $\vec{0}$) then we change the function mapping $\zeta_{,0}$ to $CL_\sigma(C(\zeta_{,0}))$ and, as a consequence, also the mapping $(\zeta_{\eta,0}, \zeta_{-\eta,0}) \to G(CL_\sigma(C(\zeta_{,0})))$. Such a change of the clamping operation can affect $\partial_{\zeta_{-\eta,0}} G(CL_\sigma(C(\zeta_{,0})))$; therefore, by Theorem 4.3, it changes $\lambda_{\eta,WLU_\sigma}(\zeta)$. Consequently, for any choice of σ we should set the CL function in such a way as to maximize learnability.

Now, consider the case where for some node η we can write G(ζ) as $G_1(\zeta_{C^{eff}_\eta}) + G_2(\zeta_{,t<0} \bullet \zeta_{\neg C^{eff}_\eta})$ and it is also true that the effect set of η ($C^{eff}_\eta$) has few elements. Then we expect the values of G(·) to be much larger than those of $G_1(\cdot)$, and likewise the partial derivatives of G(·) to be greater than those of $G_1(\cdot)$. As a consequence, the effect set WLU is more learnable than the world utility, due to the following results.

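Theorem 4.4. If ∃ η, σ : $C^{eff}_\eta \subseteq \sigma$ and ∃ $G_1(\zeta_\sigma \in Z_\sigma)$, $G_2(\zeta_{-\sigma} \in Z_{-\sigma})$ : $G(\zeta) = G_1(\zeta_\sigma) + G_2(\zeta_{-\sigma})$, then

$$\frac{\lambda_{\eta,WLU_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)} = \frac{\left\| \partial_{\zeta_{-\eta,0}} G(C(\zeta_{,0})) \right\|}{\left\| \partial_{\zeta_{-\eta,0}} G(CL_{-\sigma}(C(\zeta_{,0}))) \right\|} \tag{4.10}$$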

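A special case of Theorem 4.4 (proofs are presented in [23, Section 4.3.2]) is the following:

Corollary 4.1. If for some node η we can write

1. $G(\zeta) = G_1(\zeta_\sigma) + G_2(\zeta_{-\sigma,t>0}) + G_3(\zeta_{,t<0})$

for some set σ containing $C^{eff}_\eta$, and if

2. $\left\| \partial_{\zeta_{-\eta,0}} G(C(\zeta_{,0})) \right\| \gg \left\| \partial_{\zeta_{-\eta,0}} G_1(C_\sigma(\zeta_{,0})) \right\|$

then

$$\lambda_{\eta,WLU_\sigma}(\zeta) \gg \lambda_{\eta,G}(\zeta) \tag{4.11}$$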

In practice, to assure that condition (1) of this corollary is met might require that σ be a proper superset of $C^{eff}_\eta$. Conversely, to assure that condition (2) is met will usually force us to keep σ as small as possible.

More generally, if there is a set $\sigma' \subseteq C^{eff}_\eta$ such that for each component (η; 0; i) the chain rule term $\sum_{(\eta',t) \in \sigma'} \left[ \partial_{\zeta_{\eta',t}} G(\zeta) \right] \cdot \left[ \partial_{\zeta_{\eta,0;i}} [C(\zeta_{,0})]_{\eta',t} \right] = 0$, then the effects on G of changes to $\zeta_{\eta,0}$ that are mediated by the members of σ′ cancel each other out. In this case we can usually remove the elements of σ′ from $C^{eff}_\eta$ with no ill effects.

4.3.4 How to Induce these Salient Characteristics?

As described above, the framework offers theorems relating the salient characteristics of a COIN to its general properties: we want the COIN to be in a global state ζ∗ for which there is a set $\{g_\eta\}$ such that the system is factored for the utilities $\{g_\eta\}$ at ζ∗, and the intelligence $\epsilon_{\eta,g_\eta}(\zeta^*)$ is as large as possible for all η.

A first approach is to have each microlearner explicitly try to lead the worldline towards such a point ζ∗. Initialization of the COIN (i.e., setting $\zeta_{,0}$) implies setting the algorithm controlling η, so we choose $\zeta_{,0}$ so as to have some special $g_\eta$ for which $C(\zeta_{,0})$ is factored with respect to $g_\eta$, with large values of $\epsilon_{\eta,g_\eta}(C(\zeta_{,0}))$. The main problem is to find such a $g_\eta$: this requires careful modeling of the system, where that is possible at all, in clear contrast with the observations stated in Section 4.1.

Other possible approaches are related both to COIN initialization and to macrolearning. In this case we use $\{g_\eta\}$ as private utilities at some t < 0, inducing a factored COIN to be as intelligent as possible. Since we deploy private utilities we can use learnability rather than intelligence, so we choose some $\{g_\eta\}$ which are as learnable as possible while still being factored. The authors usually use inference in COIN initialization, e.g.: the effect set $C^{eff}$ of a node is estimated as the set of those $\zeta_{\eta',t>0}$ which have non-zero correlation with $\zeta_{\eta,0}$. Theorem 4.2 guarantees that the system is factored for effect set WLU personal utilities and, by Corollary 4.1, for small effect sets the effect set WLU has a large differential learnability with respect to G (see Equation (4.11)). This scenario therefore suggests using WL private utilities based on the associated effect sets rather than the team game private utilities.

When doing macrolearning, the authors initialize the system with an initial estimate of the effect set of η (the initial guessed effect set) and impose the association between private utilities and WLU. Next, they watch the system run, observe the correlations among the components of ζ, and then change the components of ζ belonging to the effect set of η (so changing the personal utility of η accordingly).

Chapter 5

Experimental Applications

Who neglects learning in his youth,

Loses the past and is dead for the future

“Phrixus” – Euripides (484 BC, 406 BC)

Contents

5.1 Packet Routing in a Network
5.1.1 COIN for Network Routing
5.1.2 Experimental Results
5.2 Learning Sequences Of Actions
5.2.1 COIN Solution
5.2.2 Results
5.3 Bar Problem
5.3.1 Results

It is possible to formalize many real systems as a COIN (see Section 3.2). This framework is widely applied, e.g. to vehicular traffic control, learning sequences of actions, and packet routing in a network. In general this distributed control methodology is useful where many processes (packets, rovers, agents) interact in a way that is hard to formalize, and where a centralized controller faces both planning difficulties (how to build the controller) and computational difficulties (choosing the best actions may take a long time, depending on the domain to be controlled).

In the following sections we explain real cases from the literature in which a world where many agents act has been formalized and controlled by a COIN:

packet routing in a network : we need to route packets in a network composed of routers and computers, both with the well-known SPA (Shortest Path Algorithm) and with the new approach introduced by COIN (see [22]);

learning sequences of actions : we have a grid world where agents act trying to take as many tokens as possible (see [24]);

attendance model of El Farol Bar Problem : in this well-known case we have many agents that must attend the bar during a week, avoiding both overcrowded days and days with few agents (see [6]).

The environments mentioned above are extremely different from each other: if we try to formalize them just with a simple RL algorithm we may face the problem of how to assign rewards to all the agents acting in these environments; this may lead us to introduce significant approximations that necessarily impact the system behavior at runtime. Accordingly, the authors explain how COIN was applied to known cases in the literature in order to induce a cooperative behavior among a set of agents, and they compare the performance of COIN with that of other utility functions in order to verify its usefulness.

5.1 Packet Routing in a Network

We face the problem of how to route information packets in a network so that they reach their destinations according to an appropriate metric on the chosen path. A well-known approach is SPA, where each packet is routed on a link such that it can reach its destination in as few steps as possible. In this situation, the microlearner sets the internal parameters of its router running the SPA. Microlearners of this kind do not address the problem of guaranteeing that private utilities do not induce the learning algorithms to work at cross-purposes with respect to the main goal of the routing.

In this case, we use three algorithms: SPA and COIN, both with full knowledge (FK) of the true rewards obtained following a specified path (the reward being the time taken by a packet to be routed), and memory-based COIN (MB), which has just local knowledge.

5.1.1 COIN for Network Routing

In this example we concentrate on the two networks depicted in Figure 5.1, where traffic originated by the routers represented by white boxes has only the routers represented by dark boxes as ultimate destinations (note that in both networks router 2 is a bottleneck).

Figure 5.1: Network architectures (from [22]). (a) Network A; (b) Network B.

By the standard definition, traffic at a router is a pair 〈r, d〉, where r is a real number (source router) and d is the destination tag (e.g.: a computer) to be reached. At each timestep each router sums all the traffic received from upstream routers to obtain the amount of traffic, then it chooses where to send that load.

We keep a running average of the total load of each router over the L previous timesteps; this average is used to compute W(x), the total delay accumulated at this timestep by all the packets traversing this router. Each router has a different definition of W(x) (according to the hardware, queue length, . . . ): in this testbed routers 1 and 3 have $W(x) = x^3$, while router 2 (bottleneck) has $W(x) = \log(x + 1)$. Obviously, the overall goal is to minimize the total delay encountered by all traffic.
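The delay computation just described can be sketched as follows. This is a minimal illustration, not the testbed of [22]: the class `Router`, the dictionary `DELAY` and the use of a fixed-length deque for the running average over the last L timesteps are assumptions introduced here.

```python
import math
from collections import deque

# Delay functions of the testbed as stated in the text: routers 1 and 3 are
# cubic, router 2 (the bottleneck) is logarithmic.
DELAY = {
    1: lambda x: x ** 3,
    2: lambda x: math.log(x + 1),
    3: lambda x: x ** 3,
}


class Router:
    def __init__(self, router_id, window):
        self.delay_fn = DELAY[router_id]
        self.loads = deque(maxlen=window)  # loads of the last L timesteps

    def step(self, load):
        """Record the load of this timestep and return the delay W(average load)."""
        self.loads.append(load)
        average = sum(self.loads) / len(self.loads)
        return self.delay_fn(average)


router = Router(1, window=2)
delays = [router.step(load) for load in (2.0, 4.0, 0.0)]
total_delay = sum(delays)  # contribution of this router to the total delay
print(delays, total_delay)
```

Summing such delays over all routers and timesteps gives the total delay that the routing algorithms try to minimize.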

In COIN, with η we identify the pair 〈router, destination〉, so $\zeta_{\eta,t}$ is the vector of the traffic sent at time t along all the links exiting from the router of η and bound for the destination of η. As a subworld we identify each set of routers whose packets share the same ultimate destination.

In the classic SPA each node η tries to set $\zeta_{\eta,t}$ so as to minimize the sum of the delays accumulated by its traffic on the way to its ultimate destination. In COIN we use a complementary approach, i.e., η tries to set $\zeta_{\eta,t}$ so as to optimize $g_\omega$ for the subworld ω containing η. With the term “full knowledge” we mean that at time t all routers know the average loads of all routers at time t − 1 and assume that those values will be the same at time t (this can be a good assumption for large values of L), so they can make accurate estimates of how best to route their traffic according to their respective criteria.

With limited knowledge, the MB COIN can only estimate the WLU value resulting from each routing decision. More precisely, for each pair 〈router, destination〉 the microlearner estimates the mapping from the load on all outgoing links to the WLU-based reward; then each router sends packets along the path with the best estimated reward. In this case, we use a more conservative method, i.e. we randomly choose between the path with the best estimated reward and the path chosen by FK SPA.

The load of a router r at time t is given by ζ, while $W_{r,t}(\zeta)$ is the function W(x) of node r at time t. The world utility function is given by the total delay, i.e. $G(\zeta) = \sum_{r,t} W_{r,t}(\zeta)$. Using WLU to set the local utility of each microlearner we have $g_\omega(\zeta) = \sum_{r,t} \Delta_{\omega,r,t}(\zeta)$, where $\Delta_{\omega,r,t}(\zeta) = W_{r,t}(\zeta) - W_{r,t}(CL_\omega(\zeta))$.

At each time step the MB COIN uses $\sum_r \Delta_{\omega,r,t}(\zeta)$ as the reward signal for trying to optimize this full WLU. This reward is computed in a decentralized way: every packet carries a header containing a running sum of the $\Delta$s encountered in all the routers it has traversed so far. Each destination node sums the values of the headers and sends this sum back to all the routers that routed to it.
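The decentralized computation described above can be sketched as follows. This is only an illustration under stated assumptions: the clamping operator is assumed to remove the subworld's own traffic from a router, and the names `delta`, `route_packet`, `loads` and `own_traffic` are hypothetical, not taken from [22].

```python
import math

# Per-router delay functions W(x), as in the testbed above.
W = {
    1: lambda x: x ** 3,
    2: lambda x: math.log(x + 1),
    3: lambda x: x ** 3,
}

def delta(router, load, own_traffic):
    """WLU term for one router: delay with the subworld's traffic minus
    the delay with that traffic clamped to zero (assumed clamping)."""
    return W[router](load) - W[router](load - own_traffic)

def route_packet(path, loads, own_traffic):
    """Accumulate the running sum of deltas kept in the packet header."""
    header = 0.0
    for r in path:
        header += delta(r, loads[r], own_traffic[r])
    return header

loads = {1: 2.0, 2: 3.0, 3: 1.0}         # hypothetical total load per router
own_traffic = {1: 1.0, 2: 1.0, 3: 1.0}   # hypothetical share of the subworld
header = route_packet([1, 2], loads, own_traffic)
```

The destination node would then add up the `header` values of the packets it receives and send the total back to the routers involved.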

5.1.2 Experimental Results

The networks discussed above (Figure 5.1) were tested under light, medium and

heavy traffic loads as depicted in Table 5.1; moreover, from each source router a

new packet was fed at each time step.

Network  Source  Dest. (light)  Dest. (medium)  Dest. (heavy)
A        4       6              6; 7            6; 7
A        5       7              7               6; 7
B        4       7; 8           7; 8; 9         6; 7; 8; 9
B        5       6; 9           6; 7; 9         6; 7; 8; 9

Table 5.1: Source–destination pairings for the three traffic loads

Figure 5.2: Overall delay of the networks (from [22]); (a) Network A, (b) Network B

As shown in Figure 5.2 (results are averaged over 50 different executions with a window size L = 50), FK COIN outperforms FK SPA. With COIN the system therefore operates in a way that reduces the average total delay for all packets, rather than in a greedy fashion like SPA. Moreover, MB COIN performs better than FK SPA, which suggests that COIN can outperform algorithms that estimate the shortest path.

5.2 Learning Sequences Of Actions

Another typical application of RL is the Grid world, where an agent navigates a two-dimensional grid and at each time step receives a reward related to the action chosen. In the episodic version of the Grid world an agent moves for a fixed number of time steps and is then returned to its initial position; in this setting we need a learner that optimizes the sum of the rewards obtained (Q-learning and SARSA are typically used for this problem).

In this problem many agents navigate the grid simultaneously and affect each other’s rewards. These interactions are modeled through tokens with different values laid on the grid: each token has a value between 0 and 1 and each cell holds at most one token. When an agent moves into a cell with a token it receives a reward equal to the value of the token and removes it (so that reward is no longer available to agents entering the cell later). At the end of the episode all the tokens are reset and each agent is returned to its initial position. The main goal is to collect the highest total token value in a fixed number of time steps.

Interactions of this kind among agents are a useful formalization for examining coordination and selfish behavior, and hence problems such as the TOC and the like.

5.2.1 COIN Solution

Here we pose this problem in the form of COIN and we define:

• $L_{\eta,t}$ is the matrix representing the location of agent $\eta$ at time $t$: if the agent is in location $(x, y)$ then $L_{\eta,t,x,y} = 1$, otherwise $L_{\eta,t,x,y} = 0$. With $\{L_{\eta,t}\}$ we denote the set of location matrices.

• $L^a_{\eta,t}$ is the location matrix that agent $\eta$ would have had at time $t$ had it taken action $a$ at time step $t - 1$.

• $L_\eta$ is the location matrix of agent $\eta$ across all time ($L_\eta = \sum_t L_{\eta,t}$).

• $L_{\eta,<t}$ is the location matrix of agent $\eta$ across all time before $t$ ($L_{\eta,<t} = \sum_{t'<t} L_{\eta,t'}$).

• $L$ is the location matrix of all agents across all time ($L = \sum_t \sum_\eta L_{\eta,t}$).

• $L_{<t}$ is the location matrix of all agents across all time before $t$ ($L_{<t} = \sum_{t'<t} \sum_\eta L_{\eta,t'}$).

• $L_{-\eta}$ is the location matrix of all agents but $\eta$ across all time ($L_{-\eta} = L - L_\eta$).

• $L_{-\eta,<t}$ is the location matrix of all agents other than $\eta$ across all time before $t$ ($L_{-\eta,<t} = L_{<t} - L_{\eta,<t}$).

• $\Theta$ stores the initial values and locations of all tokens.

The space $Z$ is composed of $\Theta$ and the set $\{L_{\eta,t}\}$ of all location matrices, while a worldline $\zeta$ is a point in that space. We define the function $V(L, \Theta)$, which returns the total value of the tokens collected given a location matrix, as follows:

$$V(L, \Theta) = \sum_{x,y} \Theta_{x,y} \cdot \min(1, L_{x,y}) \tag{5.1}$$

The world utility function G(ζ) is given by the sum of all the tokens taken

during an episode:

G(ζ) = V (L, Θ) (5.2)

To formulate the WLU in this problem let us suppose that the operator $CL_\eta$ sets the state of $\eta$ to the null vector, so we have the WLU where the agent is removed from the worldline:

$$WLU^{\vec{0}}_\eta(\zeta) = G(\zeta) - V(L_{-\eta}, \Theta) \tag{5.3}$$

The utility stated in Equation (5.3) differs from the one that sums the values of the tokens present in the locations visited by the agent itself (that function is known as Selfish Utility, SU for short). $WLU^{\vec{0}}_\eta$ returns the values of the tokens in the locations visited by $\eta$ and not by other agents, i.e. the values of the tokens that would not have been taken if agent $\eta$ had not been in the system.


These utility functions are based on the performance over a full episode: to learn an optimal sequence of actions we introduce a reward related to a single time step. To that end, let us decompose an arbitrary utility function $U$ as follows:

$$U(L) = \sum_t \left[ U(L_{<t+1}) - U(L_{<t}) \right]$$

The reward for a single time step is given by:

$$R_t(L) = U(L_{<t+1}) - U(L_{<t}) \tag{5.4}$$

As a consequence, for the two utilities described above (the global and the WL ones) we introduce the associated single-time-step utilities:

$$GR_t(\zeta) = V(L_{<t+1}, \Theta) - V(L_{<t}, \Theta) \tag{5.5}$$

$$WLUR^{\vec{0}}_\eta(\zeta) = GR_t(\zeta) - \left( V(L_{-\eta,<t+1}, \Theta) - V(L_{-\eta,<t}, \Theta) \right) \tag{5.6}$$
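Equations (5.1), (5.5) and (5.6) can be sketched in a few lines. The representation below is an assumption made for illustration: location matrices are stored as counters of visited cells, `paths[eta][t]` is the cell occupied by agent `eta` at time `t`, and the toy episode is hypothetical.

```python
from collections import Counter

def V(L, theta):
    """Equation (5.1): total value of the tokens in visited cells."""
    return sum(value * min(1, L[cell]) for cell, value in theta.items())

def location_matrix(paths, agents, before):
    """Sum of L_{eta,t'} over the given agents and all t' < before."""
    L = Counter()
    for eta in agents:
        for cell in paths[eta][:before]:
            L[cell] += 1
    return L

def global_reward(paths, theta, t):
    """Equation (5.5)."""
    agents = list(paths)
    return (V(location_matrix(paths, agents, t + 1), theta)
            - V(location_matrix(paths, agents, t), theta))

def wl_reward(paths, theta, eta, t):
    """Equation (5.6)."""
    others = [a for a in paths if a != eta]
    without = (V(location_matrix(paths, others, t + 1), theta)
               - V(location_matrix(paths, others, t), theta))
    return global_reward(paths, theta, t) - without

# Hypothetical toy episode: two agents, two tokens, two time steps.
theta = {(0, 1): 0.5, (1, 1): 0.8}
paths = {"eta1": [(0, 0), (0, 1)], "eta2": [(1, 0), (1, 1)]}
total = sum(global_reward(paths, theta, t) for t in range(2))
```

Summing the single-time-step global rewards over the episode gives back the world utility of Equation (5.2), which is the telescoping property behind Equation (5.4).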

5.2.2 Results

In the experiments we used 3 different utility functions (suitably converted to single-time-step rewards as stated in Equation (5.4)):

• SU (Selfish Utility): each agent receives the discounted sum of the values of

tokens that it alone collected;

• TG (Team Game utility): each agent receives the full world utility;

• WLU (Wonderful Life Utility): the contribution given by an agent to the token collection, i.e. the difference in the total token collection with and without that agent.

Each agent was controlled by a Q-learner: the input space of each learner consists of its location in the grid, while the action space is given by the 4 directions an agent can choose. The discount parameter $\gamma$ is set to 0.95 and actions are chosen stochastically based on the Q-values, so the probability that an agent chooses action $a_i$ in state $s$ is given by

$$P^s_{a_i} = \frac{k^{Q(s,a_i)}}{\sum_j k^{Q(s,a_j)}}, \qquad k = 50$$
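The action-selection rule above can be sketched as follows. Since $k^{Q} = \exp(Q \ln k)$, this is a Boltzmann distribution with temperature $1/\ln k$. The function names and the example Q-values are hypothetical.

```python
import random

K = 50  # base of the exponential, as in the formula above

def action_probabilities(q_values, k=K):
    """P(a_i | s) = k^Q(s,a_i) / sum_j k^Q(s,a_j)."""
    weights = [k ** q for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def choose_action(q_values, rng=random):
    """Sample an action index according to the probabilities above."""
    probs = action_probabilities(q_values)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

# Hypothetical Q-values for the 4 directions in some state s.
probs = action_probabilities([0.0, 0.0, 1.0, 0.0])
```

With these values the third action is chosen with probability 50/53, so a difference of 1 between Q-values already makes the choice almost greedy.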


Figure 5.3 clearly shows that the SU function produces poor results, worse than random actions, because each agent tries to collect as many tokens as possible, thus competing with the others. The TG utility performs considerably better than SU, but its learning time is extremely long, because each agent receives a noisy reward (it cannot clearly perceive the consequences of its own actions alone). This problem does not occur with WLU, because an agent can clearly discern how its actions affected the world reward (the Aristocrat Utility of [24] is not examined here).

Figure 5.3: System performance with 10 agents on a 10×10 grid (from [24])

Figure 5.4 shows results qualitatively similar to those of Figure 5.3. The TG utility has a harder time learning than on the small grid, because in this case the payoff is even noisier. With WLU, instead, the agents are able to cooperate because they can discern the consequences of their actions from the rewards they obtain.

Figure 5.4: System performance with 100 agents on a 32×32 grid (from [24])

5.3 Bar Problem

Here we apply COIN to a problem that is well known and widely examined in the literature (see [6, Section 4]); it is a dispersion game, in which there are n agents and k tasks to be assigned to those agents. We have n agents, each of whom picks one of seven nights to attend a bar the following week (the process is repeated every week), trying to avoid both overcrowded nights and boring ones (nights on which few agents attend the bar). Each week every agent uses its own RL algorithm to choose the night on which to attend the bar so as to maximize its utility.

The world utility function is given by:

$$G(\zeta) = \sum_t \sum_{k=1}^{7} \gamma_k(x_k(\zeta, t)) \tag{5.7}$$

where $x_k(\zeta, t)$ is the $k$-th component of the vector $x(\zeta, t)$, i.e. the number of agents attending on night $k$ in week $t$, while $\gamma_k(y) \equiv \alpha_k \cdot y \cdot \exp(-y/c)$ ($c$ and $\{\alpha_k\}$ are real values, where $c$ is the given optimal number of agents attending the bar and the $\{\alpha_k\}$ are weight factors). This world utility is the sum of the rewards for each night in each week.

This G is chosen to reflect the attendance at the bar in the different night configurations: with few or with many agents G returns a small (or even negative) reward (note that $\gamma_k(u)$ has its maximum at $u = c$).

The vector $\alpha$ weights the different nights to give them more or less importance. In [6] the authors choose two different vectors to demonstrate the effectiveness of COIN: $\alpha_1 = [1\ 1\ 1\ 1\ 1\ 1\ 1]$ and $\alpha_2 = [0\ 0\ 0\ 7\ 0\ 0\ 0]$.

In these experiments the authors set c = 6 and use a number n of agents that is 4 times the number needed to have c agents attending the bar on each of the seven nights (n = 7 · 6 · 4 = 168 agents).

Each agent is configured with one of the following reward functions¹:

• Uniform Division reward (UD for short):

$$UD(d_\omega(t), \zeta, t) \equiv \gamma_{d_\omega}(x_{d_\omega}(\zeta, t)) / x_{d_\omega}(\zeta, t)$$

• Global Reward (GR for short):

$$GR(d_\omega(t), \zeta, t) \equiv \sum_{k=1}^{7} \gamma_k(x_k(\zeta, t))$$

• Wonderful Life reward (WL for short):

$$WL(d_\omega(t), \zeta, t) \equiv \sum_{k=1}^{7} \gamma_k(x_k(\zeta, t)) - \sum_{k=1}^{7} \gamma_k(x_k(CL_\omega(\zeta), t)) = \gamma_{d_\omega}(x_{d_\omega}(\zeta, t)) - \gamma_{d_\omega}(x_{d_\omega}(CL_\omega(\zeta), t))$$

where $d_\omega$ is the night chosen by subworld $\omega$ (to keep the problem simple we assume that each subworld $\omega$ is composed of only one agent).

¹ Be careful not to confuse reward functions with utility functions: the former reflect the state of the system at a given time step (and are potentially observable), while the latter formalize the main goal of each agent (and depend on the past behavior of each agent).

With the GR utility each agent receives the same reward as the others: the system is obviously factored, but evaluating this function requires centralized communication. This characteristic (to be avoided) is not present in WL, since each agent only needs to know the total attendance on the night it attended.

Figure 5.5: Average performance of the Bar Problem (from [6]); (a) performance with α1, (b) performance with α2
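The three reward functions of this section can be sketched as follows, for single-agent subworlds. The clamping operator is assumed here to remove the agent from the night it chose, and the attendance vector is a hypothetical example, not data from [6].

```python
import math

C = 6  # optimal attendance, as in the experiments above

def gamma(y, alpha=1.0, c=C):
    """gamma_k(y) = alpha_k * y * exp(-y / c)."""
    return alpha * y * math.exp(-y / c)

def ud_reward(attendance, alphas, night):
    """Uniform Division: the night's reward split among its attendees."""
    x = attendance[night]
    return gamma(x, alphas[night]) / x

def gr_reward(attendance, alphas):
    """Global Reward: sum of the rewards of all nights."""
    return sum(gamma(x, a) for x, a in zip(attendance, alphas))

def wl_reward(attendance, alphas, night):
    """Wonderful Life: reward of the chosen night with and without
    the agent (assumed clamping to the null action)."""
    x = attendance[night]
    return gamma(x, alphas[night]) - gamma(x - 1, alphas[night])

attendance = [24] * 7            # hypothetical week: 168 agents spread evenly
alpha1 = [1, 1, 1, 1, 1, 1, 1]
wl = wl_reward(attendance, alpha1, 3)
```

Note that `wl_reward` only needs the attendance of the chosen night, whereas `gr_reward` needs the attendance of all seven nights.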


5.3.1 Results

Figure 5.5 depicts the performance averaged over 50 separate runs; in both Figure 5.5(a) and Figure 5.5(b) the top curve is WL, the middle one is GR and the bottom one is UD. With WL the system converges to near-optimal performance, so we deduce that the Bar Problem is well suited to inducing cooperation.

Note the convergence time of WL compared with that of GR in Figure 5.5(b): WL is about 4 times faster. With α1 (Figure 5.5(a)) the convergence time of WL is 30 times lower than that of GR. In both cases the UD utility performs very poorly and worsens in later weeks, a behavior typically due to a low signal-to-noise ratio (see team games in Section 4.2.3).

Part II

Innovation

Chapter 6

Theoretical Considerations

In mathematics you don’t understand things. You just get

used to them.

Johann von Neumann (1903–1957)

Contents

6.1 Class of Games
    6.1.1 Matrix Games
    6.1.2 Stochastic Games
    6.1.3 Differences between Grid world and Bar Problem
6.2 Delayed Reward
6.3 Reward Function of the Bar Problem
6.4 Q-learning Dynamics
6.5 Introduction to Coalition Formation
    6.5.1 Coalition Structure Generation
    6.5.2 Optimization within a Coalition
    6.5.3 Payoff Division

In this chapter we discuss coordination problems in dispersion games. Coordination problems are a key issue in RL when we have many agents and want to induce them to cooperate. We first introduce the theoretical differences between the Grid world and the Bar Problem and, through an intuitive example, the troublesome problem of the delayed reward that is often present in RL; then we analyze the Bar Problem, in particular its reward function and its behavior (the latter through the Q-learning dynamics, following the approach proposed in [20]).

The analysis of the Q-learning dynamics lets us understand how the policy of each agent evolves during the learning phase under different reward functions.

Finally, we introduce the theoretical grounds of Coalition Formation theory, which will be extended in Chapter 7.

6.1 Class of Games

In the following sections we discuss two main classes of games, how their structure influences the evolution of an environment formalized with such games, and how they can avoid or bring out some troublesome aspects of RL. After a brief theoretical introduction, we analyze the Bar Problem and the Grid world in the light of these definitions.

6.1.1 Matrix Games

Definition 6.1. A matrix game is defined as a tuple $\langle N, A_{1 \ldots N}, R_{1 \ldots N} \rangle$, where $N$ is a collection of $n$ agents, $A_i$ is the set of actions available to agent $i$ (let $A$ be the joint action space, i.e. $A_1 \times A_2 \times \ldots \times A_n$) and $R_i : A \to \mathbb{R}$ is the reward function of agent $i$.

In such a game each agent chooses actions so as to maximize its own reward function (which can be viewed as an n-dimensional matrix), which depends on the actions chosen by the other agents.


What does it mean to “solve such a game”? Solving the game means finding the agent’s best-response policy, i.e. the policy that lets the agent collect the highest reward given the other agents’ policies. A stationary strategy can be evaluated only if the strategies of the other agents are known (as, for instance, in the prisoner’s dilemma or in the matching pennies game). Agents can also play mixed strategies, i.e. select actions according to a probability distribution. This leads us to define an opponent-dependent solution (i.e., the best response to the joint action of the other players), and hence to the definition of Nash equilibrium.
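As a small illustration of a best response in a matrix game, the sketch below uses the standard matching pennies payoffs for the row player. The function names and the opponent's mixed strategies are hypothetical.

```python
# Row player's payoff matrix for matching pennies (rows: own action,
# columns: opponent's action); the column player receives the negative.
R_ROW = [[1, -1],
         [-1, 1]]

def expected_payoffs(payoff, opponent_mix):
    """Expected reward of each pure action against a mixed strategy."""
    return [sum(p * r for p, r in zip(opponent_mix, row)) for row in payoff]

def best_response(payoff, opponent_mix):
    """Index of a pure action that maximizes the expected reward."""
    values = expected_payoffs(payoff, opponent_mix)
    return max(range(len(values)), key=values.__getitem__)

# Against an opponent who plays the first action 70% of the time,
# the best response is to match it.
br = best_response(R_ROW, [0.7, 0.3])

# Against the uniform mix every action has the same expected payoff,
# which is why (1/2, 1/2) for both players is a Nash equilibrium.
uniform_values = expected_payoffs(R_ROW, [0.5, 0.5])
```

The example shows why the solution is opponent-dependent: the best response changes as soon as the opponent's distribution changes.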

These games can be purely collaborative (the agents share the same reward function; these are common-payoff, or team, games) or purely competitive (each agent’s reward function counteracts that of the others, as in zero-sum games); both are special cases of general-sum games. They are usually played for an undefined number of iterations. In particular, we consider games in which each agent can perceive the actions of the others but knows neither their intentions nor their reward functions (for such a game see [1, Section 3]).

6.1.2 Stochastic Games

Definition 6.2. A stochastic game is defined as a tuple $\langle N, S, A_{1 \ldots N}, T, R_{1 \ldots N} \rangle$, where $N$ is a collection of $n$ agents, $S$ is the set of all the possible states of the game, $A_i$ is the set of actions available to agent $i$ (let $A$ be the joint action space, i.e. $A_1 \times A_2 \times \ldots \times A_n$), $T : S \times A \times S \to [0, 1]$ is the transition function and $R_i : S \times A \to \mathbb{R}$ is the reward function of agent $i$.

Definition 6.2 looks very similar to Definition 6.1: in fact, each state of a stochastic game can be viewed as a matrix game whose payoff for each joint action is determined by $R_i(s, a)$. After playing the matrix game and receiving their payoffs, the agents move to another state (associated with a new matrix game), drawn according to $T$ given the current state and their joint action.


6.1.3 Differences between Grid world and Bar Problem

The Grid world and the Bar Problem are formalized as different types of games: the former belongs to the class of stochastic games (Section 6.1.2), the latter to that of matrix games (Section 6.1.1). This classification is based on the structure of a game. Moreover, we can have episodic games (i.e., games played a certain number of times) as well as games in which the environment is dynamically changed by the agents’ behavior.

Following the notation introduced in Sections 6.1.1 and 6.1.2, in the Bar Problem (Section 5.3) $A_i$ is the set of nights available to agent $i$, and $R_i$ depends only on a function of the joint action (see the function $x_k(\cdot, \cdot)$ of Equation (5.7)). Furthermore, the state space is drastically reduced, since each agent knows only the number of agents attending the bar on the night it has chosen and not their identity¹, so the problem can be viewed as a single-agent one (with a reduced state space size too). With this particular configuration and, above all, since the Bar Problem belongs to the class of matrix games, we avoid the delayed reward issue: the reward of a joint action (i.e. the night on which each agent attends the bar) is immediately available at the next time step, so agents have a factual evaluation of their own actions.

The Grid world environment (Section 5.2) is a stochastic non-stationary game, since the environment in which the agents act changes dynamically according to the actions they choose (i.e. collected tokens are no longer present), and mainly because each single reward depends on both the joint action and the state of the system. In this kind of game we may face the delayed reward problem. As we will see in Section 6.2, the TG, SU and UD utility functions are immune to this problem, since the reward returned to each agent relies only on the joint action executed at the previous time step (it obviously depends on the environment configuration, but this does not involve a delayed reward). With the WLU function these considerations do not hold, as shown by the example in Section 6.2.

¹ So having agents $\eta_1$ and $\eta_2$ or agents $\eta_{41}$ and $\eta_{79}$ makes no difference.


6.2 Delayed Reward

RL is based on the concept of reward: an agent (or more than one) perceives a state $s_t$ of the environment at time $t$, chooses an action $a_t$ according to its policy $\pi$ (thus reaching another state $s'_t$) and then receives a reward signal $r_{t+1}$ (which usually ranges over $\mathbb{R}$). Given that reward, the agent can change its policy following, for instance, Q-learning or SARSA (or any other RL algorithm). It is desirable that this reward signal be immediately available, in particular if it is fundamental for evaluating the action $a_t$ chosen in state $s_t$: in this case the agent can understand the consequences of its action (“it is a good action” or “it is not a good action”) because the reward immediately obtained is directly related to that action. If the reward signal is given to the agent with a certain delay $\tau$ ($\tau \in \mathbb{N}$, $\tau > 0$) and the agent is not aware of this delay, it cannot understand that the reward is related to the action $a_t$ rather than to the current action $a_{t+\tau}$. In this case the agent changes its policy according to that reward, but refers it to the wrong situation (i.e. to the pair $\langle s_{t+\tau}, a_{t+\tau} \rangle$ rather than $\langle s_t, a_t \rangle$).

This situation is known in the literature as delayed reward (see [10] and [17, Chapter 7]); next we see a simple example.

Recall the Grid world presented in Section 5.2, where two or more agents move in a grid collecting tokens in such a way as to avoid gathering tokens that would be taken by other agents (this models an incentive for them to cooperate). Let us consider Equation (5.6): the first term computes the sum of the token values collected at time $t$, while the second one computes the sum of the token values collected at time $t$ without agent $\eta$. Suppose that the following sequence of events occurs:

• agent $\eta_1$ takes the token $k_1$, with value 2.5, at time $t$;

• from time $t + 1$ to time $t + \tau - 1$ ($\tau \in \mathbb{N}$, $\tau > 1$) no tokens are collected;

• at time $t + \tau$ agent $\eta_2$ is on the cell where the token $k_1$ was at time $t$, while agent $\eta_1$ is far away from agent $\eta_2$.

In such a situation the first term of Equation (5.6) is zero, since no tokens were found in time step $t + \tau$, while the second term for agent $\eta_1$ equals the sum of the token values that would have been collected in this time step had $\eta_1$ never existed, that is the value of $k_1$. So agent $\eta_1$ receives −2.5 as reward, which penalizes it for having taken the token $k_1$ even though it is far from that token’s cell at time $t + \tau$.
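The sequence of events above can be reproduced with a short sketch of Equation (5.6). The grid layout, the paths and the choice $\tau = 3$ are hypothetical; location matrices are represented as counters of visited cells, which is an implementation assumption.

```python
from collections import Counter

def V(L, theta):
    """Equation (5.1): total value of the tokens in visited cells."""
    return sum(v * min(1, L[cell]) for cell, v in theta.items())

def visits(paths, agents, before):
    """Location matrix of the given agents over all t' < before."""
    return Counter(c for a in agents for c in paths[a][:before])

def wl_step_reward(paths, theta, eta, t):
    """Equation (5.6) for agent eta at time step t."""
    def step(agents):
        return (V(visits(paths, agents, t + 1), theta)
                - V(visits(paths, agents, t), theta))
    others = [a for a in paths if a != eta]
    return step(list(paths)) - step(others)

# Token k1 (value 2.5) lies in cell (0, 0); here tau = 3.
theta = {(0, 0): 2.5}
paths = {
    "eta1": [(0, 0), (0, 1), (0, 2), (0, 3)],   # takes k1 at t = 0
    "eta2": [(3, 0), (2, 0), (1, 0), (0, 0)],   # reaches the cell at t = 3
}
rewards = [wl_step_reward(paths, theta, "eta1", t) for t in range(4)]
```

Agent `eta1` is rewarded with 2.5 when it takes the token and penalized with −2.5 three steps later, when its current action has nothing to do with that token.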

Note that this problem affects only the WLU function in the Grid world, not the TG, SU and UD utility functions. This follows directly from the definitions of these utility functions (see Section 5.2.2): TG, SU and UD are related only to the tokens collected in the current time step, without considering tokens taken in the past by other agents.

6.3 Reward Function of the Bar Problem

Looking more closely at the Bar Problem, we can discuss its exponential world utility function $\gamma(\cdot)$ (Equation (5.7)). It is not symmetrical and it has a long right tail that grows as $c$ rises, as depicted in Figure 6.1.

Figure 6.1: Exponential functions of the Bar Problem (rewards versus number of agents, for c = 4, 5, 6, 7, 8)

The Bar Problem aims to induce an implicit coordination among many agents so as to avoid both overcrowded nights and boring ones. The exponential function proposed in [6] has its maximum at c = 6 (see the blue line of Figure 6.1): with n < c agents attending the bar the reward changes rapidly with n, while for n > c the differences are minimal, e.g. from n = 6 to n = 10 agents the reward decreases by only about 14.43%², so agents are induced to consider nights with 10 agents attending the bar as safe. This non-uniform reward might return unclear information to the agents, so that they may fail to focus on coordinating to obtain reward values as high as possible. This characteristic becomes more important with the WLU function, since it computes the difference between two bar configurations to obtain the reward to be returned.
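The decrease quoted above can be checked directly from the definition of the exponential reward in Equation (5.7), assuming a weight $\alpha_k = 1$ (the function name below is hypothetical):

```python
import math

def exponential_reward(n, c=6):
    """gamma(n) = n * exp(-n / c), with the weight alpha_k set to 1."""
    return n * math.exp(-n / c)

r6 = exponential_reward(6)     # 6 * exp(-1)
r10 = exponential_reward(10)   # 10 * exp(-5/3)
relative_drop = (r6 - r10) / r6

print(f"r6 = {r6:.2f}, r10 = {r10:.2f}, drop = {relative_drop:.2%}")
# prints: r6 = 2.21, r10 = 1.89, drop = 14.43%
```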

² With 6 agents the exponential reward function returns $r_6 = 6 \cdot \exp(-1) \simeq 2.21$, and with 10 agents it returns $r_{10} = 10 \cdot \exp(-5/3) \simeq 1.89$.

In such a case, it may be worth adopting a symmetrical function that returns rewards that are as uniform as possible, based on the absolute value of the difference between the optimal number of agents and the number of agents attending the bar. A well-known function of this class is the Gaussian function with mean $\mu$ equal to the optimal number $c$ of agents attending the bar and with an appropriate variance $\sigma^2$. The variance should not be too large, otherwise we fall back into the previous problem: the function still returns uniform rewards, but the difference between the reward obtained by an agent when there are n + 1 agents and when there are n agents can be very small. On the other hand, it cannot be too small, otherwise learning is very slow and the convergence time is large. For instance, suppose the agents use the WLU function. If we impose $\sigma^2 = 1$ we might think that with this small variance agents will accurately learn optimal actions. However, there is an important problem: if n = 6 agents attend the bar, they receive a reward r = 0.157, so they learn that the night they chose is good; on the contrary, if n = 7 agents attend the bar they receive a reward r = −0.157, so they learn that the night they chose is bad. With any other value of n, agents receive a reward r ≃ 0. This simple example shows that agents are inclined to pick overcrowded nights (r ≃ 0) or boring ones (r ≃ 0 because there are few agents), thus disregarding the main goal, because they prefer a reward r = 0 (or near 0) to a negative one. Therefore we must choose a suitable variance to obtain a high learning rate and a small convergence time, and the Gaussian function should not be wider than the exponential one (thus its variance must not be too high) in order to avoid poor (difference) rewards.
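The values r = ±0.157 quoted for $\sigma^2 = 1$ can be reproduced with the sketch below. It assumes that the reward is the unscaled normal density with mean c = 6 and that the WL reward of an attendee is the difference between the reward with n and with n − 1 agents; both are assumptions made for illustration.

```python
import math

def gaussian(n, mu=6.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    z = (n - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def wl_reward(n, reward=gaussian):
    """Difference reward of one attendee: reward(n) - reward(n - 1)."""
    return reward(n) - reward(n - 1)

at_optimum = wl_reward(6)      # about  0.157
one_too_many = wl_reward(7)    # about -0.157
overcrowded = wl_reward(14)    # essentially zero
```

Under these assumptions a heavily overcrowded night yields a reward that is indistinguishable from zero, which is the effect described above.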

Figure 6.2: WLU rewards (Gaussian and exponential WL reward functions; reward versus number of agents attending the bar; Gaussian curves for standard deviations 1.0, 1.5, 2.0, 2.5, 3.0 and 3.5)

Figure 6.2 depicts the shape of the exponential and of the Gaussian reward functions when the agents use the WLU function. We multiplied each Gaussian function by a positive real number k so that, with n = c agents attending the bar, it has approximately the same maximum value as the WL exponential utility function. It is easy to see that the WL exponential utility function returns good rewards when there are fewer than 6 agents, but its shape is too flat with 7 or more agents, where it returns rewards that are too similar: for instance, between n = 11 and n = 17 agents the rewards differ by only about 16.7%. With the WL Gaussian utility function the rewards are more informative. When fewer than 6 agents attend the bar on a specific night, they are induced to attend the bar on that night, but when slightly more than 6 agents attend, they are induced to choose another night. In the extreme case n ≫ c (e.g. n = 14 with σ = 2.5), agents receive an essentially zero reward and are induced to consider that night quite good.

In Chapter 8 we present the results that follow from the previous considerations: we compare the performance of the Bar Problem with the exponential reward function and with the Gaussian one, and we report the Q-learning dynamics with the latter as well as the performance of the Grid world.

6.4 Q-learning Dynamics

To understand how Q-learning works, we need to analyze its dynamics. Following [20] and [21] we draw on the Replicator Dynamics (RD) model from Evolutionary Game Theory (EGT). The concepts and techniques of EGT were initially formulated in the context of evolutionary biology; here the population is composed of the strategies of all agents, and these strategies evolve: analyzing the expected value of this process gives an approximation called RD. This evolutionary process usually combines selection and mutation: the former favors some varieties over others, the latter provides variety in the population. RD focuses mainly on the role of selection, describing how a system consisting of different strategies changes over time.

The general form of the replicator dynamics is the following:

$$\frac{dx_i}{dt} = \left[ (Ax)_i - x \cdot Ax \right] x_i \tag{6.1}$$

where $x_i$ represents the density of strategy $i$ in the population and $A$ is the payoff matrix, which describes the payoff each strategy receives when interacting with the others in the population. The state of the population is described by $x = (x_1, x_2, \ldots, x_J)$, the densities of all the different types of strategies in the population. As a consequence, $(Ax)_i$ is the payoff received by strategy $i$ and $x \cdot Ax$ is the average payoff in the population. The growth rate $\frac{dx_i/dt}{x_i}$ of the population using strategy $i$ equals the difference between the current payoff of that strategy and the average payoff in the population.
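Equation (6.1) can be integrated numerically with a simple explicit Euler scheme, as in the sketch below. The payoff matrix, the initial state and the step size are hypothetical and are not taken from the thesis.

```python
def replicator_step(x, A, dt=0.01):
    """One explicit Euler step of Equation (6.1)."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    average = sum(xi * p for xi, p in zip(x, Ax))      # x . Ax
    return [xi + dt * (p - average) * xi for xi, p in zip(x, Ax)]

# Hypothetical 2-strategy coordination game.
A = [[2.0, 0.0],
     [0.0, 1.0]]
x = [0.6, 0.4]
for _ in range(2000):
    x = replicator_step(x, A)
```

Starting from this state, the first strategy earns more than the population average, so its density grows until it takes over the population; the densities keep summing to 1 because the growth terms cancel out.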

Consider now 2 different populations $p$ and $q$ using Q-learning: we need a system of two differential equations, corresponding to the RD for asymmetric games. If $A = B^T$ then Equation (6.1) holds; in general we can write:

$$\frac{dp_i}{dt} = \left[ (Aq)_i - p \cdot Aq \right] p_i \qquad \frac{dq_i}{dt} = \left[ (Bp)_i - q \cdot Bp \right] q_i \tag{6.2}$$

The growth rate of the strategies in each population is now determined by the performance of the other population.

In this situation each agent has a probability vector $[x_1, x_2, \ldots, x_n]$ over its action set $\{a_1, a_2, \ldots, a_n\}$. The Boltzmann distribution is described by

$$x_i(t) = \frac{\exp\left(\frac{1}{\tau} \cdot Q_{a_i}(t)\right)}{\sum_{j=1}^{n} \exp\left(\frac{1}{\tau} \cdot Q_{a_j}(t)\right)} \tag{6.3}$$

where $x_i(t)$ is the probability of playing strategy $i$ at time step $t$, $\tau$ is the temperature³ ($\tau \in \mathbb{R}$, $\tau > 0$) and $Q_{a_i}(t)$ is the Q-value of action $a_i$ at time $t$. The temperature controls the necessary tradeoff between exploration (high temperature) and exploitation (low temperature), hence it is decreased over time. Given the payoff matrices $A$ and $B$ of 2 players, we can compute the continuous-time limit (it is well explained in [21]), obtaining for each player⁴:

$$\frac{dx_i}{dt} = x_i \alpha \frac{1}{\tau} \left[ (Ay)_i - x \cdot Ay \right] + x_i \alpha \sum_j x_j \ln\left(\frac{x_j}{x_i}\right) \tag{6.4}$$

$$\frac{dy_i}{dt} = y_i \alpha \frac{1}{\tau} \left[ (Bx)_i - y \cdot Bx \right] + y_i \alpha \sum_j y_j \ln\left(\frac{y_j}{y_i}\right) \tag{6.5}$$

³ Notice that this $\tau$ has nothing to do with the one used in Section 6.2.

⁴ In [21] the authors use $\tau$ instead of $\frac{1}{\tau}$; in the literature this term may appear either in the numerator or in the denominator.

The first term of Equations (6.4) and (6.5) is exactly the RD and thus takes care of the selection mechanism; the mutation term for Q-learning is given by the second term, which can be written as:

$$x_i \alpha \sum_j x_j \ln\left(\frac{x_j}{x_i}\right) = x_i \alpha \left( \sum_j x_j \ln(x_j) - \ln(x_i) \right) = \alpha \left( -x_i \ln(x_i) \right) - x_i \alpha \left( -\sum_j x_j \ln(x_j) \right) = \alpha \left( S_i - x_i S_n \right) \tag{6.6}$$

where $S_i = -x_i \ln(x_i)$ is the entropy of strategy $i$ (how much we know about that strategy) and $S_n = -\sum_j x_j \ln(x_j)$ is the entropy of the entire distribution.
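The selection and mutation terms of Equation (6.4), and the entropy form of Equation (6.6), can be checked numerically with the sketch below. The payoff matrix, the policies and the values of $\alpha$ and $\tau$ are hypothetical.

```python
import math

def q_learning_dynamics(x, y, A, alpha=0.1, tau=1.0):
    """Selection and mutation terms of Equation (6.4) for policy x."""
    Ay = [sum(a * yj for a, yj in zip(row, y)) for row in A]
    average = sum(xi * p for xi, p in zip(x, Ay))          # x . Ay
    selection = [xi * alpha / tau * (p - average) for xi, p in zip(x, Ay)]
    mutation = [xi * alpha * sum(xj * math.log(xj / xi) for xj in x)
                for xi in x]
    return selection, mutation

def entropy_form(x, alpha=0.1):
    """Mutation term rewritten as alpha * (S_i - x_i * S_n), Eq. (6.6)."""
    S_n = -sum(xj * math.log(xj) for xj in x)
    return [alpha * (-xi * math.log(xi) - xi * S_n) for xi in x]

# Hypothetical payoffs and interior policies (all probabilities > 0).
A = [[3.0, 0.0], [5.0, 1.0]]
x, y = [0.3, 0.7], [0.6, 0.4]
selection, mutation = q_learning_dynamics(x, y, A)
```

Both terms sum to zero over the strategies, so the policy remains a probability vector; the mutation term pushes the policy towards the uniform distribution, i.e. towards exploration.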

If we have more than 2 populations, we can extend Equations (6.4) and (6.5)

using the theoretical interpretation of Equation (6.1): the first term equals the

payoff of the strategy chosen and the second one equals the average payoff of the

population.

We now derive the Q-learning dynamics of the Bar Problem following the theory above. For ease of computation we analyze the Bar Problem with 8 agents, 2 days per week and the following policy matrix:

$$\Pi = \begin{bmatrix} a(t) & b(t) & \ldots & h(t) \\ 1 - a(t) & 1 - b(t) & \ldots & 1 - h(t) \end{bmatrix}^T \tag{6.7}$$

where $a(t)$ is the probability that the first agent attends the bar on Monday (and $1 - a(t)$ is the probability that it attends on Tuesday, and so on for the other agents), while the strategy $i$ analyzed for each player is “attend the bar on Monday”.

The two terms of Equations (6.4) and (6.5) depend on the utility function we use; for instance, with the TG utility function and the strategy “attend the bar on Monday”, the first term of those equations for strategy (agent) $i$ may be written as follows:

$$(Ay)_i = \Psi(1 + \text{agents}_{-i}(\text{Monday}), c) + \Psi(\text{agents}_{-i}(\text{Tuesday}), c) \tag{6.8}$$

$$x \cdot Ay = P_i[\text{Monday}] \cdot \left[ \Psi(1 + \text{agents}_{-i}(\text{Monday}), c) + \Psi(\text{agents}_{-i}(\text{Tuesday}), c) \right] + P_i[\text{Tuesday}] \cdot \left[ \Psi(1 + \text{agents}_{-i}(\text{Tuesday}), c) + \Psi(\text{agents}_{-i}(\text{Monday}), c) \right] \tag{6.9}$$

where:

• $\Psi(\cdot, \cdot)$ is the reward function (in [6] it corresponds to $\gamma_k(\cdot)$); it depends on the number of agents attending the bar and on the optimal number $c$ of agents;

• $\text{agents}_{-i}(d)$ gives the expected number of agents other than agent $i$ attending the bar on day $d$ (it can be easily computed from Equation (6.7));

• $P_i[d]$ gives the probability that agent $i$ attends the bar on day $d$ (see Equation (6.7)).

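Instead, if we use the WLU function with the same strategy as above we have:

$$
(Ay)_i = \Psi\left(1 + \text{agents}_{-i}(\text{Monday}), c\right) - \Psi\left(\text{agents}_{-i}(\text{Monday}), c\right) \qquad (6.10)
$$

$$
\begin{aligned}
x \cdot Ay ={} & P_i[\text{Monday}] \cdot \left[ \Psi\left(1 + \text{agents}_{-i}(\text{Monday}), c\right) - \Psi\left(\text{agents}_{-i}(\text{Monday}), c\right) \right] \\
& + P_i[\text{Tuesday}] \cdot \left[ \Psi\left(1 + \text{agents}_{-i}(\text{Tuesday}), c\right) - \Psi\left(\text{agents}_{-i}(\text{Tuesday}), c\right) \right]
\end{aligned}
\qquad (6.11)
$$

Finally, for the UD utility function with the same strategy as above we have:

$$
(Ay)_i = \Psi\left(1 + \text{agents}_{-i}(\text{Monday}), c\right) \qquad (6.12)
$$

$$
x \cdot Ay = P_i[\text{Monday}] \cdot \Psi\left(1 + \text{agents}_{-i}(\text{Monday}), c\right) + P_i[\text{Tuesday}] \cdot \Psi\left(1 + \text{agents}_{-i}(\text{Tuesday}), c\right) \qquad (6.13)
$$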

For instance, P1 [Monday] = a(t) and agents−1 (Monday) = b(t) + c(t) + d(t) +

e(t) + f(t) + g(t) + h(t) (see Equation (6.7)).

In Section 8.2.3 we will use this approach and we will show the results obtained.
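The quantities in Equations (6.8) to (6.13) can be sketched in Python as follows. This is only an illustration under stated assumptions: the shape of `psi`, the value `c = 4` and the entries of `policy` are hypothetical choices made for the example, and the function names are not part of the original formulation.

```python
import math

def psi(attendance, c):
    """Hypothetical reward function (an assumption, not taken from the text)."""
    return attendance * math.exp(-attendance / c)

def expected_others(p_monday, i):
    """Expected number of agents other than i on (Monday, Tuesday), from Eq. (6.7)."""
    monday = sum(p for j, p in enumerate(p_monday) if j != i)
    tuesday = sum(1.0 - p for j, p in enumerate(p_monday) if j != i)
    return monday, tuesday

def payoff(same_day, other_day, c, utility):
    """Payoff of attending a day on which `same_day` other agents are expected."""
    if utility == "TG":
        return psi(1 + same_day, c) + psi(other_day, c)
    if utility == "WLU":
        return psi(1 + same_day, c) - psi(same_day, c)
    if utility == "UD":
        return psi(1 + same_day, c)
    raise ValueError(f"unknown utility: {utility}")

def replicator_terms(p_monday, i, c, utility):
    """Return ((Ay)_i, x . Ay) for agent i and the strategy 'attend on Monday'."""
    mon, tue = expected_others(p_monday, i)
    first = payoff(mon, tue, c, utility)
    second = p_monday[i] * first + (1.0 - p_monday[i]) * payoff(tue, mon, c, utility)
    return first, second

# Hypothetical values of a(t), ..., h(t) at some time step.
policy = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
for name in ("TG", "WLU", "UD"):
    first, second = replicator_terms(policy, 0, c=4, utility=name)
    print(name, round(first, 4), round(second, 4))
```

When every agent attends on Monday with probability 0.5, the two days are symmetric and the two terms coincide, which is a convenient sanity check for any implementation of these formulas.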

6.5 Introduction to Coalition Formation

In many domains we may need to work with a large number of agents in order to reach a goal. In such cases, we can try to model their interactions with normal form games, but the resulting model is huge and unreliable, so it might be neither accurate nor useful. Another way to consider this problem is to study such games in a more abstract setting, from a cooperative game theory point of view, as characteristic function games. In such games, the value of each coalition of agents S is given by a characteristic function v(S). It may be interpreted as the value created when the members of S come together and interact. As a consequence, a cooperative game is a pair 〈N, v(·)〉, where N is the finite set of players and v(·) is a function mapping subsets of N to numbers.

Coalition formation involves three main activities:


Coalition structure generation : the formation of coalitions by the agents, such that agents within each coalition coordinate their activities, but agents in different coalitions do not. This means partitioning the set of agents N into exhaustive and disjoint coalitions; such a partition is called a coalition structure (CS). Notice the difference between a coalition and a CS: a coalition is any element of the powerset of N (which has $2^n$ elements, where |N| = n), while a CS must satisfy a set of constraints (e.g., there is no coalition structure in which a generic agent η belongs to two or more distinct coalitions C1 and C2).

Solving the optimization problem of each coalition : this means pooling the tasks and resources of the agents in the coalition and maximizing monetary value, that is, the money received from outside the system for accomplishing tasks minus the cost of using resources.

Dividing the value of the general solution among agents : the reward is

assigned to each coalition, but agents need that reward in order to update

their policies. As a consequence, that reward must be divided up among

them so they can understand the goodness of their actions.

These activities may overlap and they are not independent, e.g.: the coalition that

an agent wants to join depends on the portion of the value that the agent would

be allocated in each potential coalition.

6.5.1 Coalition Structure Generation

Let N be the set of all agents, and n = |N |, while S is a generic coalition. Here we

assume that each coalition’s value v = v(S) is nonnegative. In a coalition structure

CS each agent belongs to exactly one coalition and some agents may be alone in

their coalitions. We will call this set of coalition structures M . The value of a

coalition structure is given by:

$$
V(CS) = \sum_{S \in CS} v(S) \qquad (6.14)
$$

Usually, the goal is to maximize the social welfare of agents by finding a coalition structure CS∗ such that:

$$
CS^{*} = \arg\max_{CS \in M} V(CS) \qquad (6.15)
$$
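For a handful of agents, Equations (6.14) and (6.15) can be evaluated by brute force. The Python sketch below enumerates every coalition structure of a small agent set and keeps the one with the highest value. The characteristic function `toy_v` is a hypothetical example (singletons are worth 1, pairs 3, anything else 0), introduced here only for illustration.

```python
def partitions(agents):
    """Yield every coalition structure (set partition) of the list `agents`."""
    if not agents:
        yield []
        return
    first, rest = agents[0], agents[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing coalition in turn...
        for k, coalition in enumerate(smaller):
            yield smaller[:k] + [[first] + coalition] + smaller[k + 1:]
        # ...or give `first` a coalition of its own.
        yield [[first]] + smaller

def structure_value(cs, v):
    """V(CS): sum of the coalition values, Eq. (6.14)."""
    return sum(v(frozenset(s)) for s in cs)

def best_structure(agents, v):
    """CS*: a coalition structure maximizing V(CS), Eq. (6.15)."""
    return max(partitions(agents), key=lambda cs: structure_value(cs, v))

def toy_v(s):
    """Hypothetical characteristic function: singletons 1, pairs 3, others 0."""
    return {1: 1, 2: 3}.get(len(s), 0)

agents = [1, 2, 3, 4]
best = best_structure(agents, toy_v)
print(len(list(partitions(agents))))   # 15 coalition structures for 4 agents
print(structure_value(best, toy_v))    # 6: two pairs
```

Exhaustive enumeration is only viable for very small n, since the number of coalition structures grows extremely fast with the number of agents.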

The number of coalition structures is large ($O(n^n)$ and $\omega(n^{n/2})$), so not all the coalition structures can be enumerated unless the number of agents is small. The exact number of coalition structures is:

$$
\sum_{i=1}^{n} Z(n, i), \qquad (6.16)
$$

where Z(n, i) is the number of coalition structures with i coalitions. This quantity is also known as the Stirling number of the second kind and it is captured by the following recurrence:

$$
Z(n, i) = i \cdot Z(n - 1, i) + Z(n - 1, i - 1), \qquad (6.17)
$$

where Z(n, n) = Z(n, 1) = 1. The first term counts the number of coalition structures formed by adding the new agent to one of the existing coalitions (there are i choices because the existing coalition structure has i coalitions). The second term considers adding the new agent to a coalition of its own, and therefore existing coalition structures with only i − 1 coalitions are counted.
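The count can be reproduced with a few lines of Python. The sketch below uses the standard recurrence for Stirling numbers of the second kind, in which the first term is i · Z(n − 1, i), together with the boundary conditions Z(n, n) = Z(n, 1) = 1. The function names are illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def Z(n, i):
    """Number of coalition structures of n agents with exactly i coalitions."""
    if i < 1 or i > n:
        return 0
    if i == 1 or i == n:
        return 1
    # New agent joins one of the i existing coalitions, or forms its own.
    return i * Z(n - 1, i) + Z(n - 1, i - 1)

def num_coalition_structures(n):
    """Total number of coalition structures of n agents, Eq. (6.16)."""
    return sum(Z(n, i) for i in range(1, n + 1))

for n in (4, 5, 10):
    print(n, num_coalition_structures(n))  # 15, 52, 115975
```

Already with 10 agents there are more than one hundred thousand coalition structures, which illustrates why exhaustive enumeration quickly becomes impractical.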

Recall that if we have n agents, the number of possible coalitions (that is, the nonempty elements of the powerset of N) is $2^n - 1$. If we decide to exclude some coalitions a priori, this exclusion might cause the value of the best remaining coalition structure to be arbitrarily far from the optimum.

In the literature many researchers have mostly focused on superadditive games ($v(S \cup T) \geq v(S) + v(T)$ for all disjoint coalitions S, T ⊆ N): in such games coalition structure generation is trivial, since all agents form the grand coalition and operate together. However, many games are not superadditive, because there may be a cost to form a coalition or some constraints (coordination overhead, anti-trust penalties, limited time to carry out communications and computations). Such games may be subadditive ($v(S \cup T) < v(S) + v(T)$ for all disjoint coalitions S, T ⊆ N), where agents are best off operating alone, or they may be neither superadditive nor subadditive, where some coalitions are best off merging while others are not.
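For small games the two properties can be tested directly on all pairs of disjoint coalitions. The following Python sketch does so, using the weak inequality for superadditivity and the strict one for subadditivity; the three characteristic functions are hypothetical examples chosen to fall into the three classes.

```python
from itertools import combinations

def coalitions(agents):
    """Yield every nonempty coalition of `agents` as a frozenset."""
    for size in range(1, len(agents) + 1):
        for members in combinations(agents, size):
            yield frozenset(members)

def classify(agents, v):
    """Return 'superadditive', 'subadditive' or 'neither' for the game (agents, v)."""
    all_coalitions = list(coalitions(agents))
    disjoint_pairs = [(s, t) for s in all_coalitions for t in all_coalitions if not s & t]
    if all(v(s | t) >= v(s) + v(t) for s, t in disjoint_pairs):
        return "superadditive"
    if all(v(s | t) < v(s) + v(t) for s, t in disjoint_pairs):
        return "subadditive"
    return "neither"

agents = [1, 2, 3, 4]
print(classify(agents, lambda s: len(s) ** 2))                  # superadditive
print(classify(agents, lambda s: len(s) ** 0.5))                # subadditive
print(classify(agents, lambda s: {1: 1, 2: 3}.get(len(s), 0)))  # neither
```

In the third game pairs are worth more than two singletons, but larger coalitions are worthless, so some coalitions gain by merging while others do not.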



6.5.2 Optimization within a Coalition

Under unlimited and costless computation, each coalition would solve its optimization problem exactly. However, in many domains the problem cannot be solved exactly because of its combinatorial nature, so an approximate solution must be found. In such a case, self-interested agents would want to strike the optimal tradeoff between solution quality and the associated computation. This will affect the values of coalitions, which in turn will affect which coalition structure gives the highest welfare.

6.5.3 Payoff Division

Payoff division strives to divide the value gained by a coalition structure among

agents in a fair and stable way so that agents are motivated to stay with the

coalition structure rather than move out of it. Many payoff division methods have

been proposed in the literature. Here we discuss two of them: the core and the Shapley’s value.

The Core

The core of a coalition formation game is a set of payoff configurations $(\vec{x}, CS)$, where $\vec{x}$ is a vector of payoffs given to agents in such a way that no subgroup is motivated to depart from the coalition structure CS (it is analogous to the Nash equilibrium):

$$
Core = \left\{ (\vec{x}, CS) \;\middle|\; \forall S \subset N,\ \sum_{i \in S} x_i \geq v(S) \ \wedge\ \sum_{i \in N} x_i = \sum_{S \in CS} v(S) \right\} \qquad (6.18)
$$

The core is the strongest of the classical solution concepts in coalition formation. In many cases it may be empty, because there is no way to divide the social good so that the coalition structure becomes stable; in that case there will be an infinite sequence of steps from one payoff configuration to another. To avoid such problems, explicit mechanisms were proposed, like limits on negotiation rounds, contract costs or some social norms to limit the negotiation.

The opposite problem is to have multiple payoff vectors in the core, so all agents have to agree on one of them (such a vector is usually the nucleolus, that is the payoff vector that is in the center of the set of payoff vectors in the core).

A further problem related to the core is that the constraints in the definition become numerous as the number of agents increases (note the term ∀S ⊂ N in Equation (6.18)).
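The two conditions of Equation (6.18) translate directly into a membership test, sketched below in Python. The sketch reads the stability condition as ranging over every nonempty coalition, N included (an assumption about the intended meaning of ⊂), and the game `v` with |S|² as coalition value is a hypothetical example.

```python
from itertools import combinations

def in_core(x, cs, v, tol=1e-9):
    """Check the two conditions of Eq. (6.18) for payoffs `x` (dict agent -> payoff)."""
    agents = sorted(x)
    # Efficiency: the payoffs add up to the value of the coalition structure.
    if abs(sum(x.values()) - sum(v(frozenset(s)) for s in cs)) > tol:
        return False
    # Stability: no coalition S could earn more than its members receive.
    # Assumption: S ranges over all nonempty coalitions, N included.
    for size in range(1, len(agents) + 1):
        for members in combinations(agents, size):
            if sum(x[a] for a in members) < v(frozenset(members)) - tol:
                return False
    return True

def v(s):
    """Hypothetical superadditive characteristic function."""
    return len(s) ** 2

grand = [{1, 2, 3}]
print(in_core({1: 3, 2: 3, 3: 3}, grand, v))  # True
print(in_core({1: 7, 2: 1, 3: 1}, grand, v))  # False: {2, 3} receives 2 < v = 4
```

The inner loops visit 2^n − 1 coalitions, which makes the growth in the number of constraints explicit.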

The Shapley’s Value

The Shapley’s value is another policy for dividing payoffs in coalition formation games and it is defined axiomatically. Agent i is called a dummy if v(S ∪ {i}) − v(S) = v({i}) for every coalition S that does not include agent i. Agents i and j are called interchangeable if v((S \ {i}) ∪ {j}) = v(S) for every coalition S that includes agent i but not agent j. The three axioms of the Shapley’s value are:

Symmetry : if agents i and j are interchangeable then xi = xj.

Dummies : if agent i is a dummy then xi = v({i}).

Additivity : for any two games v and w, xi in v + w equals xi in v plus xi in w, where the game v + w is defined by (v + w)(S) = v(S) + w(S).

The Shapley’s value is the only payoff division scheme that satisfies the previous three axioms and it is defined as follows:

$$
x_i = \sum_{S \subseteq N} \frac{(|N| - |S|)! \, (|S| - 1)!}{|N|!} \left( v(S) - v(S - \{i\}) \right) \qquad (6.19)
$$

This payoff can be interpreted as the marginal contribution of agent i to the coalition structure, averaged over all the possible joining orders (it recalls the basic idea of COIN). Notice that the payoff must be computed over all the |N|! possible joining orders, thus it is computationally hard.
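A direct implementation of Equation (6.19) is sketched below in Python, using the standard weight (|N| − |S|)! (|S| − 1)! / |N|!. Terms with i ∉ S vanish because v(S) − v(S − {i}) = 0, so the sketch only visits coalitions that contain i. The three-agent game is a hypothetical example in which agents "a" and "b" are interchangeable and agent "c" is a dummy.

```python
from itertools import combinations
from math import factorial

def shapley_values(agents, v):
    """Shapley's value of every agent, Eq. (6.19); v must accept frozensets."""
    n = len(agents)
    values = {}
    for i in agents:
        others = [a for a in agents if a != i]
        total = 0.0
        for size in range(len(others) + 1):
            for members in combinations(others, size):
                s = frozenset(members) | {i}
                weight = factorial(n - len(s)) * factorial(len(s) - 1) / factorial(n)
                total += weight * (v(s) - v(s - {i}))
        values[i] = total
    return values

def v(s):
    """Hypothetical game: 'a' and 'b' together earn 10, 'c' always adds 2."""
    return (10 if {"a", "b"} <= s else 0) + (2 if "c" in s else 0)

print(shapley_values(["a", "b", "c"], v))
```

For this game the symmetry and dummy axioms predict payoffs of 5 for "a" and "b" and 2 for "c". The nested loops visit 2^(n−1) coalitions per agent, which is the computational burden mentioned above.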

It is interesting to note that the Shapley’s value, like the core, guarantees that

individual agents and the grand coalition are motivated to stay with the coalition

structure. However, unlike the core, it does not guarantee that all subgroups of

agents are better off in the coalition structure than by breaking off into a coalition

of their own.



In such games the joining order of agents matters, but in the real world there may exist situations that can be formalized with games where that joining order is irrelevant. In this case, the core and the Shapley’s value are unnecessarily hard to compute. In Chapter 7 we will analyze these games and we will propose new solutions to the reward distribution problem.

Chapter 7

Task Allocation via Coalition Formation

Computers are useless. They can only give you answers.

Pablo Picasso

Contents

7.1 Game Outline
    7.1.1 Curse of the State Space Size
    7.1.2 Fuzzy Games and Groups of Agents
7.2 Utility Functions of the Game
    7.2.1 Reward Distribution among Agents
    7.2.2 Characteristic and Reward Functions
7.3 Testbed Problem: Cooking Teams
    7.3.1 Configurations
    7.3.2 Reward Functions
    7.3.3 State Space

In Section 6.5 we briefly introduced coalition formation problems (recall that in this kind of problem there is a set of agents coordinating their activities, thus receiving rewards). A game is completely described by a set N of n agents and by a characteristic function v(S) that evaluates a given coalition S. The characteristic function of pure coalition formation games considers the joining order of agents belonging to a generic coalition. In this chapter we introduce a new kind of game that combines task allocation and coalition formation games.

7.1 Game Outline

As said above, in the real world there are different problems which can be formalized as coalition formation games. However, this formalization might involve some constraints that restrict its application to real problems. For example, a pure coalition formation game considers the joining order of agents in a coalition, which may be impossible to take into account while formalizing a problem.

In such situations all the previous payoff divisions described in Section 6.5.3 are

unnecessarily complicated and hard to apply (recall the Shapley’s value, Equation

(6.19)), so we need another way to evaluate a coalition and to split the payoff

among its agents.

To satisfy these requirements, here we introduce a new kind of game that combines task allocation games and coalition formation games. In such games, we have an environment with some tasks to be allocated to all agents, but each task must be executed by an a priori fixed number of agents.

Given a set N of n agents and a set T of t tasks, we define a Dispersion Game as a game where each agent has to decide which of the t tasks to undertake in order to achieve full world utility. A world utility function is provided and it is used to evaluate the overall behavior emerging from the environment. A well known game belonging to this class is the Bar Problem (Section 5.3), where the number of agents is large with respect to the number of tasks. In general these games are known as anti-coordination games, since each agent tends to choose a task in order to maximize its own reward, without regard for the other agents.


Definition 7.1 (Coalition formation games). Given a set N of n agents and a

coalition S of agents, we define Coalition Formation Games the games where the

value of each coalition S is given by a characteristic function v(S) (see Section

6.5). These coalition values may represent the quality of an optimal solution for

each coalition optimization problem. Moreover, in general they may depend on non-

members’ actions due to positive and negative externalities (interactions of agents’

solutions).

Dispersion games and coalition formation games focus on different kinds of problems. In the real world we may deal with problems that can be formalized as dispersion games, but this modeling might be neither complete nor accurate. We could use coalition formation games, but in these games we must necessarily face the hard problem of computing the reward division among agents. In order to deal with problems that share the basic characteristics of both games, we define a new kind of game that combines the main characteristics of dispersion games and coalition formation games.

Definition 7.2 (Task allocation via coalition formation games). These games in-

volve dispersion games and coalition formation ones (Definition 7.1), so we have

a set N of n agents and a set T of t tasks some of which can only be computed

by a prefixed group of agents. Furthermore, each agent η is identified with a type

kη (kη ∈ K = {1, 2, . . . , k}), thus a generic coalition S can be formed by different

types of agents. Since we have a finite number of tasks, not all coalition structures

in M are feasible, so we have to find an optimal coalition structure CS∗ ∈ M in

order to achieve full world utility (which is given a priori).

In order to describe a coalition, in games of Definition 7.2 (see [14]) we only need to specify how many agents of each type are participating. A coalition can be identified with a point $S \in \mathbb{R}^k$ such that $0 \leq S \leq Q$, where $Q \in \mathbb{R}^k_+$ specifies the total number of players of each type.

From this point of view, coalition formation is an ongoing, dynamic process, with payoffs generated when coalitions are created, regroup or dissolve. As a consequence, we only consider the process of coalition generation, where agents belonging to a set learn how to distribute themselves into exhaustive and disjoint coalitions. Under this new learning framework, a farsighted agent will move away from a certain coalition if and only if it expects such a deviation to increase its future payoff.

Unlike dispersion games, this class of games is subsumed by the class of goal

satisfaction problems, where we have a goal that cannot be satisfied by only one

agent.

7.1.1 Curse of the State Space Size

Now, let us consider the size of the state space, which is a well known issue in RL. It is usually useful to consider the state space size in order to foresee whether a problem can be formalized (that is, whether the state space is not too large) and, above all, whether we can find a suitable solution to the learning problem. This new kind of game involves coalition formation games, so we have to consider both the number of agents acting in the environment and the agent types; thus these games may have a huge state space. The number C of possible coalitions is

$$
C = \sum_{type=1}^{k} \ \sum_{B \in type\text{-}subset} \ \prod_{i \in B} |Q_i| \qquad (7.1)
$$

where the first sum runs over the number of agent types in a coalition (from 1 to k), type-subset is the collection of all sets B containing exactly type distinct agent types, i ranges over the types in B, and |Qi| is the number of agents of type i. For example, if we have Q = {Q1, Q2, Q3, Q4}, then

$$
\begin{aligned}
C ={} & [Q_1 + Q_2 + Q_3 + Q_4] \\
& + [Q_1 Q_2 + Q_1 Q_3 + Q_1 Q_4 + Q_2 Q_3 + Q_2 Q_4 + Q_3 Q_4] \\
& + [Q_1 Q_2 Q_3 + Q_1 Q_2 Q_4 + Q_1 Q_3 Q_4 + Q_2 Q_3 Q_4] \\
& + [Q_1 Q_2 Q_3 Q_4]
\end{aligned}
$$

In this situation we have C coalitions (Equation (7.1)), but in such games we must also consider all the coalition structures, thus we have a vector Υ such that $|\Upsilon| = 2^C$. As a consequence, the state representation used to describe the environment perceived by any player has size $2^C \cdot C$. This size may become very large, but we must consider that not all the possible coalition structures are admissible, thus an agent need not consider all the $2^C \cdot C$ coalition structures.

It is easy to see that even with a small number of agent types we can have a huge state space and the problem may become intractable. For example, let us suppose we have a population Q = {3, 2, 4, 5}, thus having C = 359 coalitions. The representation space is $2^C \cdot C \simeq 4.2156 \cdot 10^{110}$, which is obviously intractable. Furthermore, if we are dealing with a non-stationary environment the state space will grow even larger, and we must not ignore the well known problems of such games described in Section 6.1.2 and in Section 6.2.
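The figures of this example can be reproduced with the short Python sketch below, which evaluates Equation (7.1) by summing, over every nonempty subset of types, the product of the corresponding population sizes. The function name is illustrative. The cross-check against the product of (1 + Q_i) minus 1 is an observation added here, not part of the original derivation: expanding that product yields exactly one term per subset of types.

```python
from itertools import combinations
from math import prod

def count_coalitions(q):
    """Number C of possible coalitions for the population vector q, Eq. (7.1)."""
    k = len(q)
    return sum(
        prod(q[i] for i in subset)
        for size in range(1, k + 1)                   # number of types involved
        for subset in combinations(range(k), size)    # which types are involved
    )

q = [3, 2, 4, 5]
c = count_coalitions(q)
print(c)                                      # 359
assert c == prod(1 + qi for qi in q) - 1      # added cross-check (see lead-in)

representation_size = (2 ** c) * c
print(f"{representation_size:.4e}")           # order of magnitude 10^110
```

Python integers have arbitrary precision, so the exact value of 2^C · C can be computed before formatting it for display.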

As a consequence of the previous considerations, we do not use any state space information. In addition to its size, we must deal with the fact that, if we use the state space, the goodness of each state depends on all the types of agents belonging to the coalition visiting that state. Let us suppose that a coalition S1 = {0, 4} visits the state s1 and that the characteristic function v(·) rates this coalition (and thus the related state) as good. The four agents belonging to S1 believe that they implicitly¹ formed a good coalition in this state s1, related to a particular task t, thus they tend to choose this task in the future (let us ignore the exploration policy, if any). If at the next time step one agent of the first type joins S1 (thus we have a new coalition S2 = {1, 4}), all of them choose the same task t and the characteristic function rates this new coalition as bad, then the four agents of the second type do not understand that the bad payoff is not due to themselves but to the agent of the first type. As a consequence, they now rate the task t as not good (and thus choose another task), while it was previously rated as good. In this situation, all agents tend to pick tasks other than those picked in the past and to rate different states of the environment wrongly.

In such a situation, the state space introduces a further source of non-stationarity into the problem. To reach a (near) optimal state in this huge space we must ensure coordination among the different agents belonging to a coalition, and the coalitions themselves must be coordinated with one another. In the simple example described above the two coalitions are not coordinated, thus agents do not reach any optimal state.

¹ In these games we do not allow any centralized communication among agents.

7.1.2 Fuzzy Games and Groups of Agents

Let P(S) be the set of all possible coalitions. Any population defined by the vector $Q \in \mathbb{R}^k_+$ generates a characteristic function called fuzzy game:

Definition 7.3 (Fuzzy game). A Fuzzy game is a pair (Q, v) such that:

• v is the characteristic function and is a mapping $v : P \setminus \{\emptyset\} \mapsto \mathbb{R}$;

• $Q \in \mathbb{R}^k_+$.

The number of possible coalition structures is limited by the number of tasks t. Given such a number of tasks, no more than t coalitions can be formed, thus we introduce the following

Definition 7.4 (Coalition structure). A coalition structure $CS = \{S_1, S_2, \ldots, S_H\}$ (where $0 \leq H \leq |Q|$) is a partition of Q, that is $S_h \neq \emptyset$ for any $h \in \{1, 2, \ldots, H\}$, $\bigcup_{h=1}^{H} S_h = Q$ and $S_i \cap S_j = \emptyset$ for any $i, j \in \{1, 2, \ldots, H\}$ with $i \neq j$.

With this model we allow agents belonging to a set to organize themselves into a precise coalition in order to achieve more efficient individual payoffs and possibly to obtain large world utility values. As mentioned in Section 6.5.1, not all games are superadditive: large organizations could operate less efficiently than the sum of their constituent parts, so a grand coalition will not form.

Given such games, in this section we describe economies with a small group

of agents belonging to a finite number of types. The main goal is to build stable

coalitions that will end up in a stable and possibly meaningful coalition structure.

Here we need to redefine the concept of core used in Equation (6.18) because we

are dealing with a priori non superadditive characteristic functions.

Definition 7.5 (Core of a fuzzy game). The core of the fuzzy game (v, Q), that is $F_{core}(v, Q)$, is the set of vectors x = {x1, x2, . . . , xn} such that:

1. $xQ = \max_{C \in P(Q)} \sum_{S \in C} v(S)$, where P(Q) denotes the set of all possible coalition structures and C is a set of coalitions;

2. $xS \geq v(S)$ for any coalition S ∈ C.

7.2 Utility Functions of the Game

In the following sections we discuss different utility functions for this type of game, as well as how to find a useful way to distribute the reward obtained. These functions are used to evaluate both a coalition S and the resulting task allocation, so they formalize different aspects of this type of game.

7.2.1 Reward Distribution among Agents

One of the most interesting problems is finding a suitable way to distribute the reward among the agents of a coalition; in principle we could apply the Shapley’s value or the core as described in Section 6.5.3.

As discussed in Section 7.1, we are now facing a new kind of game where we do not deal with the joining order of agents in a coalition, but we must consider task allocation. In order to find the right method to split rewards, we should consider its usefulness as well as its computational cost. The Shapley’s value shares the basic idea of COIN (notice the similarity between the meaning of the difference terms of Equation (4.7) and Equation (6.19)) and it has some useful properties, as stated in Section 6.5.3. Unfortunately, this method may perform badly, since in Equation (6.19) we compute the reward to be assigned to each agent by examining all possible coalitions over a set N of n agents. In RL, at each time step any agent can, a priori, change the resulting coalition structure (and thus the coalitions) according to its policy (which obviously depends on the reward received). This means that we must recompute the Shapley’s value for each agent at every time step. Furthermore, the Shapley’s value considers the joining order of agents in each coalition, which is useless in this kind of game (besides being computationally expensive).


In order to avoid such undesirable features we propose marginal contribution,

a new payoff assignment method that shares the interesting characteristics of the

Shapley’s value, but avoids such a heavy computational burden.

Definition 7.6. Given a coalition S = {n1, n2, . . . , nk} we define the marginal contribution for agent type i as the following function:

$$
m_i = v(S) - v(S_{-i}), \qquad (7.2)
$$

where:

• v(S) is the characteristic function of the task allocation via coalition formation problem;

• $S_{-i}$ is the coalition S without one agent of type i;

• $n_i$ is the number of agents of type i.

Example 1. Let us suppose we have S = {5, 1}. We can compute the marginal contribution for the first agent type (m1) and for the second one (m2), obtaining:

• m1 = v (5, 1) − v (4, 1);

• m2 = v (5, 1) − v (5, 0).

Recall the definition of marginal contribution in Equation (7.2): the marginal contribution shares the basic idea of COIN (in particular the clamping function, that is CLσ(·) in Equation (4.7)), since it evaluates the goodness of each agent belonging to a given coalition by computing the difference between the characteristic function value obtained with the full coalition and the value obtained without one agent of a certain type.

It is interesting to note that the marginal contribution described in Definition 7.6 computes the payoff to be assigned within each coalition S in a symmetric way with respect to each agent of the same type. The second term of Equation (7.2) is computed on the clamped coalition S−i, and this is done by each agent of such coalition. Hence we consider all agents belonging to S equally important, because in this framework we are interested in each agent’s type rather than in the individual agent in order to satisfy a task, as described in Definition 7.2.


7.2.2 Characteristic and Reward Functions

As stated in Section 6.5, the characteristic function of a game affects the game itself, since it evaluates the goodness of a coalition and thus determines the structure of the game. In the literature, many researchers have focused on superadditive games (Section 6.5.1), where coalition structure generation is trivial. However, many games are not superadditive, so we aim for a characteristic function that is not superadditive: this prevents the formation of the grand coalition, so coalitions will obtain meaningful rewards.

It is useful to note that, during the formalization of a problem, we must use a feasible (and possibly robust) characteristic function to model the desired behavior. If we introduce a strict characteristic function (i.e., one that allows only one feasible coalition structure when there is a large number of tasks and agents) we might not reach an optimal (or near-optimal) coalition structure. This problem becomes more evident if we consider the state space, since in this case, besides the characteristic function, we must deal with a huge state space in order to avoid negative effects on the learning process.

Example 1 highlights a useful feature of the characteristic function and its association with the game. In this example each marginal contribution is the difference between two characteristic function values, but one of the two coalitions might be unfeasible. In such a situation, the characteristic function should be robust and return a value indicating that the coalition is unfeasible, so that the marginal contribution can still be computed; e.g., in Example 1, if the coalition S = {5, 0} is unfeasible, then we could impose v(S) = 0.
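The marginal contribution of Equation (7.2), together with a characteristic function that returns 0 for unfeasible coalitions, can be sketched in Python as follows. The table `FEASIBLE_VALUES` and the numbers in it are purely hypothetical: they are chosen so that the coalition (5, 0) of Example 1 is unfeasible, and they do not come from any game discussed in the text.

```python
# Hypothetical values of the feasible coalitions, keyed by (n_1, n_2).
FEASIBLE_VALUES = {
    (5, 1): 12.0,
    (4, 1): 9.0,
}

def v(coalition):
    """Robust characteristic function: unfeasible coalitions are worth 0."""
    return FEASIBLE_VALUES.get(tuple(coalition), 0.0)

def marginal_contribution(coalition, i, v):
    """m_i = v(S) - v(S_{-i}), where S_{-i} is S without one agent of type i."""
    if coalition[i] == 0:
        raise ValueError("the coalition contains no agent of this type")
    reduced = tuple(n - 1 if j == i else n for j, n in enumerate(coalition))
    return v(coalition) - v(reduced)

s = (5, 1)
print(marginal_contribution(s, 0, v))  # v(5, 1) - v(4, 1) = 3.0
print(marginal_contribution(s, 1, v))  # v(5, 1) - v(5, 0) = 12.0
```

Each call requires only two evaluations of the characteristic function, regardless of the number of agents in the coalition.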

RL considers that each agent, after the execution of an action selected with

respect to its policy, receives a reward based on a reward function. Until now,

here we have considered only characteristic functions which evaluate only coalition

goodness, but we need to introduce a reward function such that it can parcel out

the coalition value given by the characteristic function.

We can define the reward value gi for each agent of type i in task allocation via

coalition formation games.


Definition 7.7. The reward value gi for each agent of type i is given by

gi = g(mi), (7.3)

where g(mi) is the reward function that evaluates the marginal contribution mi

computed according to Equation (7.2).

With these two functions we aim to induce an implicit coordination mechanism among the different types of agents belonging to a coalition. The marginal contribution (Equation (7.2)) provides a measure to coordinate different agents within a coalition, while the reward function (Equation (7.3)) is used to coordinate different coalitions in order to reach an optimal configuration (that is, an equilibrium).

In these games we are dealing also with task allocation, so we need a world

utility function that evaluates the overall behavior of environment.

Definition 7.8. The world utility value G is given by

G = G (g (m1) , g (m2) , . . . , g (mk)) , (7.4)

where G(·) is the given world utility function of environment.

According to Definition 7.7 we can use the reward functions stated in Section 5.2

and in Section 5.3, so we can evaluate the goodness of an environment formalized

with RL.

As stated in Section 4.3.3, the COIN clamping function clamps the elements of state ζ pertaining to agent η to a predefined arbitrary value. In that situation we are free to prefer one specific clamping value over another (e.g., in the Bar Problem we can clamp to the null action or to a random one), but in any case the clamping function acts on the state space. In this new type of game we are dealing with a huge state space, thus in Equation (7.2) we proposed a clamping function that acts on a given coalition S and clamps to a fixed value related to the agents’ type in that coalition: it removes one agent of type i and then evaluates the resulting coalition S−i with the characteristic function v(·) of the game.

Let us focus on the WLU function of Equation (4.7): as said above, this utility function is mainly based on the state, but here we must deal with a huge state space, as mentioned in Section 7.1.1. As a consequence we must pay attention to which clamping function we use: if we clamp to a state other than the null state, each agent will receive a meaningful but useless reward, since it cannot reuse this information in the future to avoid or to choose a particular action, because it has no state information (recall that, as stated in Section 7.1.1, we do not use any state information). Even if we considered the state space, this reward might still be useless: the state space is huge and each agent would rarely visit the same state again (recall that such a visit depends on the joint action and not only on the agent’s own action).

This particular feature does not affect the other reward functions (selfish utility function, team game utility function and uniform division reward), because with such functions we do not compute any difference involving a clamping value, and thus any state information. This may be useful with this kind of game if we do not consider the state space. However, this feature might be affected by whether the state space exists. For example, following the definition of the TG utility function (see Section 4.2.3 and Section 5.2.2), in general we must compute the sum over all tasks of the reward function evaluated on each task. This could lead the learning process towards non-optimal values, since the reward value assigned to each agent belonging to a coalition can be influenced by other agents and coalitions related to other tasks.

7.3 Testbed Problem: Cooking Teams

Here, we introduce the cooks and helpers problem (see [25]), where we have 2 types

of players k = {cook, helper} and 3 different cooking teams:

• 1 cook and 2 helpers cook one cake;

• 4 cooks alone cook one cake (too many cooks encounter difficulties reaching

an agreement);

• 1 helper alone cooks one cookie.


Each cake is worth 10 and each cookie 1. Moreover, the number of available kitchens is constrained: we have U = 7 different kitchens (tasks), and each one can be selected by any coalition S = {fc, fh} of fc cooks and fh helpers.

Let us suppose that Su describes the coalition cooking in kitchen u (u ∈ {1, . . . , 7}). We define the following characteristic function (n ∈ N):

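\[
v(S) =
\begin{cases}
n \cdot 10 & \text{if we have a coalition } S = \{n, 2 \cdot n\}, \\
n \cdot 10 & \text{if we have a coalition } S = \{4 \cdot n, 0\}, \\
n & \text{if we have a coalition } S = \{0, n\}, \\
0 & \text{elsewhere.}
\end{cases}
\tag{7.5}
\]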

The world utility is defined as:

\[
G(Q) = \sum_{u=1}^{U} v(S_u) \tag{7.6}
\]

where Su is the coalition undertaking the u-th task.

The main goal of this game is to maximize Equation (7.6) with respect to the

U = 7 tasks, without communication among agents acting in this environment.

7.3.1 Configurations

In this problem we analyze five different possible cases in order to find an admissible coalition structure CS∗ that maximizes G (Q).

Case 1 : the population Q = {fc, fh} consists of fc > 0 cooks and fh = 0 helpers. If fc ≤ 4 the core is nonempty (recall that the core focuses on the stability of an admissible coalition, and in this case any solution is stable because there are no leftover cooks creating instability). If fc > 4, the core is nonempty if and only if fc is an integer multiple of 4 (so they cook fc / 4 cakes). In any other case, there are leftover cooks in the population who create instability.

Case 2 : the population Q = {fc, fh} consists of fh > 0 helpers and fc = 0 cooks,

thus the core is always nonempty and it assigns 1 to each helper.


Case 3 : the population Q = {fc, fh} is composed by r1 + r2 coalitions (both of

them greater than 0), where r1 is the number of coalitions (1, 2) (1 cook, 2

helpers) and r2 is the number of coalitions (0, 1). In this case there are many

helpers relative to the number required for teams with composition (1, 2).

Competition among helpers exists and it keeps the price of a helper down.

The core assigns 1 to each helper and 8 to each cook.

Case 4 : the population Q = {fc, fh} is composed of r1 + r2 coalitions, where r1 is the number of coalitions (1, 2) and r2 is the number of coalitions (4, 0) (both of them greater than 0). This is the opposite of the previous case: here we have many cooks with respect to the number of helpers. Now the competition exists among cooks in order to be in a coalition with 2 helpers. In this case the core assigns 15/4 to each helper and only 5/2 to each cook.

Case 5 : the population Q = {fc, fh} is composed by r1 > 0 coalitions of type

(1, 2). Here the core contains a continuum of points and the extremes are

described by the core of Case 3 and Case 4.
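The payoffs quoted in Cases 3 and 4 can be checked against Equation (7.5). The following is a minimal Python sketch, not part of the original experiments; the names `TEAMS` and `is_stable` are hypothetical. It only verifies two necessary core conditions on the basic team types: every team that is actually formed shares out exactly its value, and no basic team type could earn more than its members already receive.

```python
from fractions import Fraction

# Basic team types of Equation (7.5) with n = 1: (cooks, helpers) -> value.
TEAMS = {(1, 2): 10, (4, 0): 10, (0, 1): 1}

def is_stable(cook_pay, helper_pay, formed):
    """Check per-type payoffs against the basic team types.

    formed: the team types that appear in the coalition structure.
    """
    for (fc, fh), value in TEAMS.items():
        paid = fc * cook_pay + fh * helper_pay
        if (fc, fh) in formed and paid != value:
            return False  # a formed team must share out exactly its value
        if paid < value:
            return False  # this team type could deviate and earn more
    return True

# Case 3: teams (1, 2) and (0, 1); each cook gets 8, each helper gets 1.
case3 = is_stable(Fraction(8), Fraction(1), formed={(1, 2), (0, 1)})
# Case 4: teams (1, 2) and (4, 0); each cook gets 5/2, each helper 15/4.
case4 = is_stable(Fraction(5, 2), Fraction(15, 4), formed={(1, 2), (4, 0)})
print(case3, case4)
```

Exact fractions are used so that the equality test on formed teams is not affected by floating-point rounding.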

Let us consider the characteristic function of Equation (7.5): at first sight it is a strict function that rates only known coalitions. Obviously, if we change the characteristic function, the problem changes too. For example, the characteristic function of Equation (7.5) does not formalize the fact that there may exist only one oven. In order to allow the cooking of only one cake or only one cookie we must impose n = 1 in Equation (7.5): i.e., both coalitions S1 = {1, 2} and S2 = {4, 8} can cook only one cake. Moreover, we can formalize a moderator that organizes the agents belonging to a coalition in such a way as to maximize the welfare: e.g., if we have the coalition S = {2, 5}, the players could cook two cakes and one cookie, while with Equation (7.5) they cannot cook anything, since v(S) = 0. With these simple considerations we show the importance of having a good and robust characteristic function with respect to the problem we aim to formalize.

To test the quality of the characteristic function proposed in Equation (7.5), we propose four different configurations Q = {cooks, helpers}, according to the five different cases discussed above:


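Bar-1 : we have 5 cooks and 20 helpers;

Bar-2 : we have 11 cooks and 14 helpers (the totals of the Bar-2 column of Table 7.1);

Bar-3 : we have 21 cooks and 14 helpers (the totals of the Bar-3 column of Table 7.1);

Bar-4 : we have 21 cooks and 5 helpers.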

Using Equation (7.5) we can obtain the optimal coalition structures described in Table 7.1. Obviously, according to Equation (7.5), any admissible permutation of the coalition structures of Table 7.1 is also considered optimal. Other coalition structures with the same world utility are optimal as well: e.g., another admissible optimal coalition structure for Bar-2 is reported in Table 7.2.

Task             Bar-1          Bar-2          Bar-3          Bar-4
1 (Monday)       {1, 2}: 10     {7, 14}: 70    {7, 14}: 70    {12, 0}: 30
2 (Tuesday)      {1, 2}: 10     {4, 0}: 10     {4, 0}: 10     {1, 2}: 10
3 (Wednesday)    {1, 2}: 10     {0, 0}: 0      {4, 0}: 10     {0, 0}: 0
4 (Thursday)     {1, 2}: 10     {0, 0}: 0      {4, 0}: 10     {0, 0}: 0
5 (Friday)       {1, 2}: 10     {0, 0}: 0      {2, 0}: 0      {4, 0}: 10
6 (Saturday)     {0, 10}: 10    {0, 0}: 0      {0, 0}: 0      {4, 0}: 10
7 (Sunday)       {0, 0}: 0      {0, 0}: 0      {0, 0}: 0      {0, 3}: 3
World utility    60             80             100            63

Table 7.1: Optimal coalition structures for the four bar problems (each cell reports the coalition S = {cooks, helpers} followed by its payoff)
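Equations (7.5) and (7.6) can be checked mechanically against the entries of Table 7.1. The following Python sketch is only an illustration: the function names `v` and `world_utility` and the dictionary `TABLE_7_1` are hypothetical, and n is assumed to be a positive integer.

```python
def v(fc, fh):
    """Characteristic function of Equation (7.5) for fc cooks, fh helpers."""
    if fc > 0 and fh == 2 * fc:
        return 10 * fc          # n teams of 1 cook and 2 helpers
    if fc > 0 and fc % 4 == 0 and fh == 0:
        return 10 * (fc // 4)   # n teams of 4 cooks
    if fc == 0 and fh > 0:
        return fh               # n helpers, one cookie each
    return 0                    # any other composition is worth nothing

def world_utility(structure):
    """World utility of Equation (7.6): sum of v over all kitchens."""
    return sum(v(fc, fh) for fc, fh in structure)

# Coalitions (cooks, helpers) per kitchen, copied from Table 7.1.
TABLE_7_1 = {
    "Bar-1": [(1, 2)] * 5 + [(0, 10), (0, 0)],
    "Bar-2": [(7, 14), (4, 0)] + [(0, 0)] * 5,
    "Bar-3": [(7, 14), (4, 0), (4, 0), (4, 0), (2, 0), (0, 0), (0, 0)],
    "Bar-4": [(12, 0), (1, 2), (0, 0), (0, 0), (4, 0), (4, 0), (0, 3)],
}

totals = {bar: world_utility(cs) for bar, cs in TABLE_7_1.items()}
print(totals)
```

The computed totals reproduce the world utility row of the table (60, 80, 100 and 63).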


Tasks Coalitions Payoffs

1 (Monday) S = {1, 2} 10

2 (Tuesday) S = {1, 2} 10

3 (Wednesday) S = {1, 2} 10

4 (Thursday) S = {1, 2} 10

5 (Friday) S = {2, 4} 20

6 (Saturday) S = {1, 2} 10

7 (Sunday) S = {4, 0} 10

Table 7.2: Another admissible coalition structure for the Bar-2

7.3.2 Reward Functions

In order to value the different coalitions, we evaluate them with the marginal contribution proposed in Section 7.2.1. Furthermore, in order to split the payoff obtained with the marginal contribution among the agents of a coalition, we propose four different reward functions.

Selfish utility function Each agent η of type kη = i (where

i ∈ K = {cook, helper}) obtains a payoff given by

gi = mi (7.7)

where mi is the marginal contribution computed according to Equation (7.2).

COIN utility function Each agent η of type kη = i (where i ∈ K = {cook, helper}) obtains a payoff given by

\[
\begin{aligned}
g_c ={} & f_c \cdot \left[ v(f_c, f_h) - v(f_c - 1, f_h) \right] + f_h \cdot \left[ v(f_c, f_h) - v(f_c, f_h - 1) \right] \\
& - \left[ (f^r_c + 1) \cdot \left( v(f^r_c + 1, f^r_h) - v(f^r_c, f^r_h) \right) + f^r_h \cdot \left( v(f^r_c + 1, f^r_h) - v(f^r_c + 1, f^r_h - 1) \right) \right]
\end{aligned}
\tag{7.8}
\]

\[
\begin{aligned}
g_h ={} & f_c \cdot \left[ v(f_c, f_h) - v(f_c - 1, f_h) \right] + f_h \cdot \left[ v(f_c, f_h) - v(f_c, f_h - 1) \right] \\
& - \left[ f^r_c \cdot \left( v(f^r_c, f^r_h + 1) - v(f^r_c - 1, f^r_h + 1) \right) + (f^r_h + 1) \cdot \left( v(f^r_c, f^r_h + 1) - v(f^r_c, f^r_h) \right) \right]
\end{aligned}
\tag{7.9}
\]

where as usual v(·, ·) is the characteristic function, $f_c$ and $f_h$ are the numbers of cooks and helpers on the day attended by agent η, while $f^r_c$ and $f^r_h$ are the numbers of cooks and helpers on a different random day.
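Both Equation (7.8) and Equation (7.9) are differences of the same weighted sum of marginal contributions, evaluated once on the attended day and once on the random day after the agent has been moved there. A minimal Python sketch under that reading follows; the names `v`, `weighted_marginals`, `coin_cook` and `coin_helper` are hypothetical, and `v` is the characteristic function of Equation (7.5) with n assumed to be a positive integer.

```python
def v(fc, fh):
    """Characteristic function of Equation (7.5)."""
    if fc > 0 and fh == 2 * fc:
        return 10 * fc
    if fc > 0 and fc % 4 == 0 and fh == 0:
        return 10 * (fc // 4)
    if fc == 0 and fh > 0:
        return fh
    return 0

def weighted_marginals(fc, fh):
    """fc * [v(fc, fh) - v(fc-1, fh)] + fh * [v(fc, fh) - v(fc, fh-1)]."""
    return (fc * (v(fc, fh) - v(fc - 1, fh))
            + fh * (v(fc, fh) - v(fc, fh - 1)))

def coin_cook(fc, fh, frc, frh):
    """Equation (7.8): the cook is clamped to a random day (frc, frh)."""
    return weighted_marginals(fc, fh) - weighted_marginals(frc + 1, frh)

def coin_helper(fc, fh, frc, frh):
    """Equation (7.9): the helper is clamped to a random day (frc, frh)."""
    return weighted_marginals(fc, fh) - weighted_marginals(frc, frh + 1)

# A cook in a (1, 2) team compared with an empty random day.
print(coin_cook(1, 2, 0, 0))
# The same cook compared with a random day that already hosts 2 helpers.
print(coin_cook(1, 2, 0, 2))
```

In the second call the clamped day becomes a (1, 2) team as well, so the two terms cancel and the reward is zero.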

Team Game utility function Each agent η of type kη = i (where i ∈ K = {cook, helper}) obtains a payoff given by

\[
g_i = \frac{1}{n} \cdot \sum_{d=1}^{D} v(f^d_c, f^d_h) \tag{7.10}
\]

where, as in the previous case, $f^d_k$ is the number of agents of type k on day d (obviously D is the number of days per week) and n is the number of agents.

Uniform Division utility function Each agent η of type kη = i (where i ∈ K = {cook, helper}) obtains a payoff given by

\[
g_i = \frac{1}{f_c + f_h} \cdot \sum_{i \in K} f_i \cdot m_i \tag{7.11}
\]

where $f_i$ is the number of agents of type i on the day attended by agent η and $m_i$ is the marginal contribution of an agent of type i.
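The Team Game and Uniform Division rewards of Equations (7.10) and (7.11) can be sketched in a few lines of Python. This is only an illustration with hypothetical names. In particular, the marginal contribution is assumed here to be the difference v(fc, fh) − v(fc − 1, fh) for a cook and v(fc, fh) − v(fc, fh − 1) for a helper, which mirrors the bracketed terms of Equation (7.8); Equation (7.2) should be consulted for the actual definition.

```python
def v(fc, fh):
    """Characteristic function of Equation (7.5)."""
    if fc > 0 and fh == 2 * fc:
        return 10 * fc
    if fc > 0 and fc % 4 == 0 and fh == 0:
        return 10 * (fc // 4)
    if fc == 0 and fh > 0:
        return fh
    return 0

def team_game(days, n_agents):
    """Equation (7.10): world utility shared equally by all n agents."""
    return sum(v(fc, fh) for fc, fh in days) / n_agents

def uniform_division(fc, fh):
    """Equation (7.11) for one day with fc cooks and fh helpers.

    Assumes at least one agent attends and that the marginal
    contributions are the one-agent-removed differences (an assumption).
    """
    m_cook = v(fc, fh) - v(fc - 1, fh)
    m_helper = v(fc, fh) - v(fc, fh - 1)
    return (fc * m_cook + fh * m_helper) / (fc + fh)

# Bar-1 optimal structure of Table 7.1: 25 agents, world utility 60.
bar1 = [(1, 2)] * 5 + [(0, 10), (0, 0)]
tg = team_game(bar1, n_agents=25)
ud = uniform_division(1, 2)
print(tg, ud)
```

The Team Game reward is identical for every agent regardless of its day, whereas the Uniform Division reward depends only on the composition of the attended day.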

7.3.3 State Space

We will analyze this problem with and without any state information in order to

see how agents behave in this environment. If we use the state space, it will be

huge, so we expect to have worse performance than without any state information.

As a consequence, we introduce the concept of difficulty of a problem, which is related to the number of optimal states in the joint state space. The four bar configurations described above formalize this idea. Bar-1 looks "easier" than Bar-4: the former has many helpers with respect to the number of cooks (thus each cook tends to cook one cake with two helpers without any difficulty), while in the latter all cooks tend to take two helpers in order to obtain better payoffs (helpers are shared resources). In these situations the joint state spaces contain different amounts of optimal states, which depend on the characteristic function used. The proposed characteristic function rates only coalitions with a predefined number of agents, as described in Equation (7.5); thus the good part of the joint state space is formed only by the single states in which an admissible coalition is present. For example, let us analyze Bar-4: in this case the proposed characteristic function identifies only a few optimal states, so we have a huge state space with a very small number of good states.

An easy way to deal with this a priori huge state space is to reduce it using an appropriate state space representation. Following the idea proposed in Section 5.3, we analyze this problem with a reduced state space. Each state s represents the number of cooks and helpers attending the bar on the day chosen by the agent that is evaluating its policy. Hence, the state space for each bar is drastically reduced as follows (the original state space size, computed according to Equation (7.1), is reported in parentheses):

Bar-1 : the new state space has 126 states ($2^{125} \cdot 125 \simeq 5.3169 \cdot 10^{39}$).
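The two figures quoted for Bar-1 can be reproduced with a short Python check. The reduced count assumes that a state is a pair (cooks present, helpers present), with each count ranging from 0 to the whole population of that type; this reading is an assumption that happens to match the 126 states reported. The original size is taken exactly as quoted, without re-deriving Equation (7.1).

```python
cooks, helpers = 5, 20  # Bar-1 population

# Assumption: one reduced state per (cooks present, helpers present) pair.
reduced_states = (cooks + 1) * (helpers + 1)

# Original joint state space size, as quoted in the text for Bar-1.
original_states = 2**125 * 125

print(reduced_states)
print(f"{original_states:.4e}")
```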




As described above, the state space size is heavily decreased, but we must bear in mind that the quality of each state depends on the number of cooks and helpers attending the bar. Hence a state can be rated in different ways by cooks and helpers: if a cook rates a state s as bad, a helper can rate it as good. As a consequence, they do not agree on the evaluation of that state, so cooks and helpers tend to visit the same set of states and the world utility will assume lower values.

In Section 8.3 we present the results obtained with these different bar configurations, with and without state, in order to analyze the overall behavior of the coalitions.

Chapter 8

Results

The only real valuable thing is intuition. The intellect has

little to do on the road to discovery.


Contents

8.1 Grid world

8.1.1 First Grid

8.1.2 Second Grid

8.2 Bar Problem and its Reward Functions

8.2.1 First Bar Configuration

8.2.2 Second Bar Configuration

8.2.3 Q-learning Dynamics of the Bar Problem

8.3 Cooking Teams Problem

8.3.1 Nonempty State Space

8.3.2 Empty State Space


In the previous chapters COIN has been described and analyzed, with its theoretical grounds and its applications to real problems. In this chapter we show some characteristics emerging from real applications to particular kinds of problems, like the Grid world and the Bar Problem (see Section 5.2 and Section 5.3 respectively). In the following we show the results achieved with particular environment configurations; then we analyze some unfavorable characteristics, like the delayed reward, that in most problems managed by RL can be ignored (for instance matrix games, see Section 6.1.1 and [1, Section 3] for further details). On the contrary, with COIN in non-stationary environments this problem should be considered (if possible) to obtain the best possible solutions.

Furthermore, we analyze the different reward functions used to induce cooperation among agents from another viewpoint (in this case we analyze a reduced version of the Bar Problem), using the Q-learning dynamics (see Section 6.4). Thanks to this analysis, we can understand how the policy of each agent evolves during the learning phase when different reward functions are used. Thus, we can understand how the different reward values drive agents towards different solutions (we will see some solutions where agents act randomly).

Finally, we show the results obtained in the Cooking Teams Problem (see Section 7.3) with different configurations of the agents and of the environment. In this case the problem still involves many agents, but they must cooperate and regroup in coalitions to reach a predefined goal. These results are twofold: first, they show how the problem difficulty drastically increases if we introduce some constraints (that is, the need for different coalitions of agents to reach a goal); second, they are useful to understand how the different configurations proposed can influence the dynamic behavior of each agent and of each coalition of agents. For each configuration, we propose different experiments that depict the overall behavior emerging from the environment and how this behavior can be greatly modified by acting on the configuration of both the agents and the environment.


8.1 Grid world

In the Grid world presented in [24] we have a good introduction to the theoretical aspects of COIN, but in practice the problem is not sufficiently described, because we do not have any information about the arrangement of agents and tokens.

A careful inspection of the results proposed in that paper (see Figure 5.3 and Figure 5.4) raises some doubts. In particular, from the results depicted in Figure 5.3 we see that with the WLU function all agents achieve excellent performance, higher than that achieved with the TG and the SU functions: the latter results suggest that all agents receive rewards that are not significant enough to learn the (near) optimal sequence of actions that leads to a complete collection of tokens (notice that in Figure 5.3 and Figure 5.4 the y-axis reports the world utility function, which is equivalent to the sum of the token values collected up to the current time step). This result may be questionable, and we discuss it in the following.

8.1.1 First Grid

We start our analysis with the grid depicted in Figure 8.1 (tokens are represented

with letter T and agents with letter A). In this situation it is plausible to find

results different from those shown in [24]. We expect to obtain increasing utility

function values, since all agents, even acting with a random policy, are able to

collect an increasing amount of tokens laid on the grid.

Figure 8.1: Untypical grid

We executed the experiments with the following configuration:

• 4 agents, each one using Q-learning with learning factor α = 0.5 and discount factor γ = 0.95, while an ε-greedy policy is used for the exploration with exploration factor ε = 0.1 decreasing over time;

• 12 tokens with random values between 0.1 and 2 (50 possible values);

• each agent can execute 4 different actions (up, down, left, right) with zero probability of making a mistake (a mistake being that the agent believes it is in position p in the grid after executing action a, while it actually ends up in position p′), and it is only able to perceive its position on the grid;

• results are averaged over 10 different runs, each one composed of 10,000 trials of 10 steps;

• we introduce the uniform division utility function (UD): it is similar to the TG one, but here all agents receive a reward averaged over the number of agents (i.e. the sum of the token values collected up to now by all agents divided by the number of agents).
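The learner configured in the list above can be sketched as a tabular Q-learning agent with an ε-greedy policy. The Python code below is a minimal illustration and not the implementation used for the experiments: the function names are hypothetical, and the geometric decay of ε is an assumed schedule, since the text only states that ε decreases over time.

```python
import random

ALPHA, GAMMA = 0.5, 0.95           # learning factor and discount factor
ACTIONS = ("up", "down", "left", "right")

def epsilon(trial, eps0=0.1, decay=0.999):
    """Exploration factor; the geometric decay is an assumed schedule."""
    return eps0 * decay**trial

def choose_action(q, state, eps, rng):
    """Epsilon-greedy choice over the four moves."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

def q_update(q, state, action, reward, next_state):
    """One Q-learning step: Q += alpha * (r + gamma * max Q' - Q)."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)

q = {}                              # Q-table keyed by (position, action)
rng = random.Random(0)
q_update(q, (0, 0), "right", 1.0, (0, 1))
first = q[((0, 0), "right")]
q_update(q, (0, 0), "right", 1.0, (0, 1))
second = q[((0, 0), "right")]
print(first, second)
```

The state is the agent's grid position only, which matches the statement that each agent perceives nothing but its own position; the reward passed to `q_update` would be the WLU, SU, TG or UD value, depending on the experiment.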

In Figure 8.2 we notice the good performance of the agents using the WLU function: they collect 82% of the available world utility value (notice that all the graphs reported are normalized with respect to the sum of all the token values), with a high convergence speed already in the first 2,000 trials (by which they collect about 80% of the available world utility value).

The agents using the SU and the TG utility functions tend to show a similar convergence speed, and both of them have a lower convergence value than the WLU one. The convergence speed is due to the greedy behavior of these agents. In the earliest trials they receive rewards propelling them to collect the same tokens in subsequent trials: in such a case they focus on tokens near them and/or on tokens with higher value, ignoring the others.

Figure 8.2: Results of the untypical grid (normalized world utility versus trials for the WLU, SU, TG and UD utility functions)

The agents using the UD utility function show a uniform behavior, in fact their

policy results in a slow convergence speed and in a low convergence value: this is

particularly due to the awful signal-to-noise ratio analyzed in Section 4.2.3.

With the WLU and the SU functions the agents have better performance than

using the TG and UD ones, since with the latter they receive the same reward and

then they encounter difficulties discerning the effects of their actions, while with

the former each agent collects a reward related to its actions and then it is led

towards better understanding the effects of its actions.

8.1.2 Second Grid

To test the effective reliability of the WLU function we analyzed the third grid presented in [18, Figure 3b] (Figure 8.3), ignoring the ability of each agent to move diagonally and the negative rewards returned to agents which tend towards the same path in each single trial (see footnote 1).

Figure 8.3: Grid proposed by ’t Hoen and Bohte (from [18])

The experiments were executed with the following configuration:

• 8 agents, each one using Q-learning with learning factor α = 0.5 and discount factor γ = 0.95, while an ε-greedy policy is used for the exploration with exploration factor ε = 0.1 decreasing over time;

• 8 tokens: tokens 1 and 2 have value equal to 1.6, tokens from 3 to 6 have value equal to 1.0, and finally tokens 7 and 8 have value equal to 1.2;

• each agent can execute 4 different actions (up, down, left, right) with zero probability of making an error (an error being that the agent believes it is in position p in the grid after executing action a, while it is actually in position p′), and it is only able to perceive its position on the grid;

• the results are averaged over 10 different runs, each one composed of 10,000 trials of 15 steps;

• as in Section 8.1.1 we also evaluate the UD utility function.

1 This particular agent behavior is mainly due to the agents' preference for following a path that guarantees a non-negative reward. This means that an agent can continuously follow the same path even if there are no tokens on it, because this gives a reward equal to 0, instead of a possible negative reward due to tokens shared between 2 or more agents.

In this situation we aim to have coordination among agents to collect as many

tokens as possible. So, we expect agents will not collect a significant number of

tokens with the TG and UD utility functions, since all these agents receive a reward

related to the joint action and not to each single action; on the contrary with the

SU function agents are led towards tokens with higher value.

Figure 8.4: Results of the grid proposed by ’t Hoen and Bohte (normalized world utility versus trials for the WLU, SU, TG and UD utility functions)

The results depicted in Figure 8.4 are consistent with our expectation: we clearly note the poor performance of the agents using the TG and the UD utility functions, which after about 3,000 trials tend towards about 25% of the overall available world utility value. In contrast, the agents using the SU and the WLU functions initially behave in a similar way. Although the agents using the SU function show a growth rate similar to that of the agents using the WLU function in the first 1,000 trials, the SU convergence speed then decreases, and these agents tend towards about 65% of the available world utility value, while the agents using the WLU function keep collecting tokens (they reach 87% of the available world utility). Already after 5,000 trials the agents using the WLU function collect 80% of the available world utility value, while those using the SU function remain constant at 65%.

In this situation, what was said above still holds: with the WLU and the SU functions all agents receive rewards related to their own actions (with good consequences on the signal-to-noise ratio, because agents keep improving their own policy), while with the TG and UD functions the agents cannot clearly discern the consequences of their actions.

8.2 Bar Problem and its Reward Functions

The Bar Problem is significantly different from the Grid world (the former is classified as a dispersion game, [19, Section 3]). While the Grid world is a non-stationary (due to the agents' changing policies) and stochastic environment, this environment is stochastic and stationary. Each agent can perceive this problem as a single-agent, single-state environment (known as the multi-armed bandit problem): each agent chooses a night to attend the bar and receives a reward that is not related to which agents are attending the bar, but rather to the number of agents on that night. Having n agents, each agent ηi considers agents {η1, η2, . . . , ηi−1, ηi+1, . . . , ηn} not as opponents, but just as entities related to the environment. With this expedient the size of the problem is greatly reduced: we need to consider an environment with a number of states equal to the number of agents acting in that environment plus one (see footnote 2), i.e. n + 1.

In the following sections, we show the results obtained with two different bar configurations, both with agents configured with an ε-greedy policy (ε = 0.1 decreasing over time), learning rate α = 0.5 and discount factor γ = 0.95. Furthermore, for these two configurations we compare the performance obtained with the exponential reward function (see γ(·) of Equation (5.7)) and the Gaussian one (see Section 6.3).

2 We must also consider the case where no agent attends the bar on a given night.

8.2.1 First Bar Configuration

In this environment, we have 30 agents that must choose a night to attend the bar in a week of 5 days, where the optimal number of agents attending the bar is 2. This experiment consists of 15,000 weeks and the results are averaged over 10 different runs.

We compared the world utility trend obtained with the WLU, the TG and the UD utility functions, where the TG and the UD values are divided, respectively, by the total number of agents and by the number of agents attending the bar on the night attended by a specific agent. Furthermore, these comparisons depend on the mathematical form of the reward function; we show the results obtained with such utility functions when we have an exponential function (as stated in [6, Section 4] and in Equation (5.7)) and a Gaussian one (see Section 6.3).

The maximum achievable value is given by having 2 agents attending the bar on 4 days of the week and 22 agents on the remaining one (moreover, the latter give an insignificant contribution). This is easy to verify, since we have the following nonlinear problem to solve:

\[
\arg\max_{x_1, \ldots, x_5} \sum_{i=1}^{5} \Psi(x_i, c)
\quad \text{subject to} \quad
\sum_{i=1}^{5} x_i = 30, \qquad 0 \le x_i \le 30 \quad \forall i = 1, \ldots, 5
\]

where $x_i$ is the number of agents attending the bar on night i, c is the desired number of agents and Ψ(·, ·) is the reward function:

1. $\Psi(x_i, c) \triangleq x_i \cdot \exp(-x_i / c)$ if we use the exponential reward function;

2. $\Psi(x_i, c) \triangleq k \cdot \mathcal{N}(c, \sigma^2)$ (where $k \in \mathbb{R}$ and $k > 0$) if we use the Gaussian reward function.
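For the exponential reward function, the claimed optimum can be verified by exhaustive search over integer attendances. The following Python sketch is only a check of the statement above, with hypothetical function names. Since the objective is symmetric in the days, it enumerates attendance vectors in non-increasing order only.

```python
import math

def psi_exp(x, c):
    """Exponential reward: x * exp(-x / c)."""
    return x * math.exp(-x / c)

def attendances(total, days, largest=None):
    """Yield non-increasing attendance tuples that sum to total."""
    if largest is None:
        largest = total
    if days == 1:
        if total <= largest:
            yield (total,)
        return
    for first in range(min(total, largest), -1, -1):
        for rest in attendances(total - first, days - 1, first):
            yield (first,) + rest

def best_attendance(total, days, c):
    """Attendance tuple that maximizes the summed exponential reward."""
    return max(attendances(total, days),
               key=lambda xs: sum(psi_exp(x, c) for x in xs))

best = best_attendance(total=30, days=5, c=2)
print(best)  # (22, 2, 2, 2, 2)
```

The overcrowded day contributes only 22 · exp(−11) to the total, which is why its contribution is described as insignificant.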

In Figure 8.5 we see the normalized results (see footnote 3) achieved with the exponential reward function. The agents using the TG and the UD utility functions behave in the same way: they show a good convergence speed (similar to that of the agents using the WLU function), but after 4,000 weeks they remain stuck at 55% of the available world utility. The WLU function is more interesting: the agents using this function, even though they start from the same value as TG and UD, rapidly increase the world utility already in the first 2,000 weeks, and after week 4,000 they have a constant growth rate that tends to stabilize after week 12,000 at about 90% of the available world utility. From these considerations, we infer that the WLU function gives the agents meaningful rewards, so they can clearly and rapidly understand the effective consequences of their own actions.

Figure 8.5: Results of the first bar configuration with the exponential reward functions (normalized world utility versus weeks for the WLU, TG and UD utility functions)

3 The normalization is computed with respect to the maximum value derived above.

In Figure 8.6 the results obtained with the Gaussian reward functions (σ = 2 and k = 7.2) are depicted. The world utility trends achieved with the different reward functions start from the same point, but the one obtained with the WLU function rapidly increases to 85% after 3,000 weeks. The agents using the TG and the UD utility functions show a clearly decreasing trend of the world utility, which then stabilizes at about 35-40% of the available world utility. It is interesting to qualitatively compare the TG and UD trends of Figure 8.5 (exponential reward function) with those of Figure 8.6 (Gaussian reward function): ignoring the starting values, we clearly note that the former have an increasing trend in the first weeks, while the latter rapidly decrease towards 40% in the first 4,000 weeks (recall the signal-to-noise ratio explained in Section 4.2.3). As we will see below, this behavior is mainly due to the reward function.

Figure 8.6: Results of the first bar configuration with the Gaussian reward functions (normalized world utility versus weeks for the WLU, TG and UD utility functions)

It is interesting to compare the results obtained with the exponential reward function and the Gaussian one. At first sight we might compare the world utilities depicted in Figure 8.5 and Figure 8.6 directly, but this is incorrect, since they are normalized with respect to different maximum values (one related to the exponential reward function and one to the Gaussian one). We can nevertheless appreciate the gap between the world utility values achieved by both WLU functions and those obtained by the TG and the UD utility functions. This may lead us to deduce that the exponential WLU function is less selective than the Gaussian one, and so it may produce less accurate results.

Apart from these qualitative remarks, a better way to formally compare the results obtained is to use the relative entropy (also known as Kullback-Leibler distance, KLd for short). Given two probability distributions p(x) and q(x) of a discrete variable X, the KLd is defined as:

\[
KLd(p \parallel q) = \sum_{x \in X} p(x) \cdot \log \frac{p(x)}{q(x)} \tag{8.1}
\]

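The KLd is always non-negative (Gibbs’ inequality) and it equals zero if and only if p = q. So, in Equation (8.1) we set q equal to the probability distribution of one of the five different optimal bar configurations and p equal to the bar configuration in a given week (X represents the days of a week), e.g.:

\[
\begin{aligned}
p &= \left[ \tfrac{10}{30}, \tfrac{7}{30}, \tfrac{1}{30}, \tfrac{4}{30}, \tfrac{8}{30} \right], \\
q_1 &= \left[ \tfrac{22}{30}, \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{2}{30} \right], \\
q_2 &= \left[ \tfrac{2}{30}, \tfrac{22}{30}, \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{2}{30} \right], \\
&\;\;\vdots \\
q_5 &= \left[ \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{22}{30} \right]
\end{aligned}
\]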

Thus the KLd for week w is modified as follows:

\[
KLd_w(p_w \parallel q) = \min_{i = 1, \ldots, 5} \sum_{x \in X} p_w(x) \cdot \log \frac{p_w(x)}{q_i(x)} \tag{8.2}
\]

As a consequence, for each week w we can compute the distance between the probability distribution pw(x) of that week and the closest optimal probability distribution qi(x) using Equation (8.2), and thus obtain the trend of the KLd (which is influenced by the different utility functions).

We expect the probability distribution obtained with the Gaussian WLU function to have a KLd lower than (or at least similar to) the one obtained with the exponential WLU function, as a consequence of the considerations explained in Section 6.3. Notice that we may have zero probability of attending the bar on a specific night x, and in this situation we cannot compute the KLd. To avoid this problem we add 1 to each probability value.
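The weekly distance of Equation (8.2) can be sketched in Python as follows. This is an illustration with hypothetical names. It assumes that the shift of 1 is applied to every entry of both p and q without renormalizing; the text does not spell out these details. Under this assumption the shifted vectors still have equal totals, so the distance remains non-negative and is zero exactly when the weekly attendance matches one of the optimal configurations.

```python
import math

N_AGENTS, N_DAYS, OPTIMUM = 30, 5, 2

def optimal_distributions():
    """The five optimal configurations q_1, ..., q_5 (one crowded day)."""
    crowded = N_AGENTS - OPTIMUM * (N_DAYS - 1)  # 22 agents
    for day in range(N_DAYS):
        counts = [OPTIMUM] * N_DAYS
        counts[day] = crowded
        yield [c / N_AGENTS for c in counts]

def kld(p, q, shift=1.0):
    """Equation (8.1) on shifted values (assumed form of the smoothing)."""
    return sum((pi + shift) * math.log((pi + shift) / (qi + shift))
               for pi, qi in zip(p, q))

def min_kld(attendance):
    """Equation (8.2): distance to the closest optimal configuration."""
    p = [a / N_AGENTS for a in attendance]
    return min(kld(p, q) for q in optimal_distributions())

week = [10, 7, 1, 4, 8]          # the example attendance p of the text
print(min_kld(week))
print(min_kld([2, 22, 2, 2, 2]))  # an optimal week
```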

Figure 8.7 depicts the resulting KLd of the probability distributions obtained with the Gaussian and the exponential WLU functions over 10 different runs of the first 2,000 weeks. It is interesting to note that both WLU functions have the same trend: in the early weeks (the first 400) the exponential WLU function seems to reach the minimum distance more quickly than the Gaussian one, but after week 400 they show the same behavior.

Figure 8.7: Moving average of the relative entropy of the WLU functions (relative entropy versus weeks for the Gaussian WLU, std. dev. = 2.0, and the exponential WLU)

For completeness of exposition we also present the KLd of the probability distributions obtained with the TG and UD utility functions in Figure 8.8(a) and Figure 8.8(b). It is interesting to note that the TG utility functions (both Gaussian and exponential) show about the same, quite noisy, behavior, and their relative entropy converges to the starting value of the relative entropy obtained with the exponential WLU. The relative entropy related to the agents using the UD utility functions is clearly worse: it shows an increasing behavior (see footnote 4), so we can assert that both UD utility functions lead the agents to avoid the optimal configuration.

Figure 8.8: Moving average of the relative entropy of the TG and UD utility functions. (a) TG; (b) UD. Each panel plots the relative entropy versus weeks for the Gaussian (std. dev. = 2.0) and the exponential reward functions.

4 Recall that the relative entropy measures the distance between a probability distribution p and an optimal probability distribution q, so the lower it is, the better.

Given the theoretical results obtained with the KLd, it is interesting to evaluate the agents’ behavior when they use the Gaussian WLU function or the exponential one. Figure 8.9 depicts the bar attendance: the red boxes represent the optimal configuration, while the blue error lines represent the fluctuation of the number of agents around the optimal configuration on each day of the last 10 weeks. With the exponential WLU function of Figure 8.9(a) the overcrowded day has a lower deviation than the overcrowded day obtained with the Gaussian WLU function of Figure 8.9(b), but the optimal days have larger fluctuations than the ones achieved with the Gaussian WLU function. These are small differences, because this bar configuration involves a limited number of agents. Anyway, these results confirm what was stated in Section 6.3 about the agents’ behavior when we change the utility function.

Figure 8.9: Attendance of the first bar configuration. (a) Exponential WLU function; (b) Gaussian WLU function (σ = 2.0). [Plots: agents attending the bar vs. days of week (1 to 5), showing the optimal configuration and the deviations.]


8.2.2 Second Bar Configuration

In this experiment we consider a larger environment with 60 agents, 7 days per week and an optimal attendance of 4 agents per day. The results are averaged over 10 different runs of 20,000 weeks each. As in Section 8.2.1, the maximum performance is obtained when 4 agents attend the bar on each of 6 days of the week and the remaining 36 agents attend on the remaining day. In the same way, we compare the results obtained with the exponential reward function and the Gaussian one (here taken with σ = 2.5 and k = 9.0) using the minimum KLd between the current probability distribution and the optimal probability distribution of the bar attendance. All agents use an ε-greedy exploration policy (ε = 0.1, decreasing over time), learning factor α = 0.5 and discount factor γ = 0.95.
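The learning parameters above enter the standard tabular Q-learning update. The following is a minimal sketch of one update step in its textbook form, using α = 0.5 and γ = 0.95 from the text. The function name `q_update`, the dictionary-based Q-table and the two-state toy example are assumptions made for illustration; they are not taken from the implementation used in the experiments.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.95):
    """One textbook tabular Q-learning step; returns the updated Q-value."""
    # Value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-state, two-action example (not the bar environment itself).
q = defaultdict(float)
q[(1, 0)], q[(1, 1)] = 1.0, 2.0
new_value = q_update(q, state=0, action=1, reward=1.0, next_state=1, actions=[0, 1])
```

With α = 0.5 the new estimate lies halfway between the old value and the target, which is why a large learning factor makes agents react quickly to changes in the other agents' behavior.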

Figure 8.10: Results with the second bar configuration with the exponential reward functions. [Plot: normalized world utility vs. weeks (0 to 20,000) for the WLU, TG and UD functions.]

In Figure 8.10 we see the normalized results obtained with the exponential reward functions. All agents start from the same world utility value and show the same growth rate during the first 1,000 weeks. After that, the growth rate of the agents using the TG and UD functions slightly decreases, and these agents reach about the same world utility value at week 4,000. On the other hand, the world utility of the agents using the WLU function continues to increase until week 6,000, when it reaches about 90% of the available value. In this situation we may say that the exponential reward functions seem to distribute similar rewards to the agents, so they might pick non-optimal nights to attend the bar.

Figure 8.11: Results of the second bar configuration with the Gaussian reward functions. [Plot: normalized world utility vs. weeks (0 to 20,000) for the WLU, TG and UD functions.]

Figure 8.11 depicts the results obtained with the Gaussian reward functions, which are qualitatively different from those of Figure 8.10. All the world utilities start around the same value, but the one obtained with the WLU function increases rapidly and settles at about 80% within 3,000 weeks. On the other hand, the agents using the Gaussian TG and UD utility functions obtain a lower world utility than those using the exponential TG and UD ones, and their slow growth stops within the first 1,000 weeks. As a consequence, we can further argue that the Gaussian reward functions seem to distribute reward more selectively than the exponential ones.

To see whether the Gaussian reward functions are better than the exponential ones, it is interesting to compare the results obtained with the two families of utility functions, and to notice how the values achieved by both WLU utility functions differ from those obtained with the TG and UD ones. To assess the quality of the Gaussian WLU function with respect to the exponential one, we again use the minimum KLd between the probability distribution of the bar attendance $p_w(x)$ at week w and the optimal ones $q_{1 \ldots 7}(x)$, computed as in Equation (8.2) (in this case i ranges from 1 to 7, and $q_i(x)$ and $p_w(x)$ are changed accordingly).
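The minimum-KLd measure can be sketched as follows. Equation (8.2) is not reproduced in this section, so the sketch assumes the usual definition D(p‖q) = Σ p(x) log(p(x)/q(x)) and assumes that each optimal distribution $q_i$ places the overcrowded day on day i; the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(p || q) with natural logarithm (assumed form)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def optimal_distributions(n_agents=60, n_days=7, optimum=4):
    """One candidate per overcrowded day: `optimum` agents on every other day,
    all remaining agents on the overcrowded one."""
    crowded = n_agents - optimum * (n_days - 1)
    dists = []
    for i in range(n_days):
        counts = [optimum] * n_days
        counts[i] = crowded
        dists.append([c / n_agents for c in counts])
    return dists


def min_kld(attendance):
    """Smallest D(p_w || q_i) over the candidate optimal distributions q_i."""
    total = sum(attendance)
    p = [a / total for a in attendance]
    candidates = optimal_distributions(total, len(attendance))
    return min(kl_divergence(p, q) for q in candidates)


perfect = min_kld([4, 4, 36, 4, 4, 4, 4])   # an optimal week, whichever day is crowded
spread = min_kld([9, 9, 9, 9, 8, 8, 8])     # agents spread almost evenly
```

Taking the minimum over the candidates makes the measure insensitive to which day happens to be the overcrowded one.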

Figure 8.12: Moving average of the relative entropy for the WLU functions. [Plot: relative entropy vs. weeks (0 to 2,000) for the Gaussian WLU (std. dev. = 2.5) and the exponential WLU.]

Figure 8.12 depicts the KLd of the probability distributions obtained with the Gaussian and the exponential WLU functions over 10 different runs of 2,000 weeks. While in the previous bar configuration (Section 8.2.1) the differences between the probability distributions induced by the Gaussian and the exponential WLU functions were slight, here the advantages of the Gaussian utility function are more evident: the probability distribution induced by the Gaussian WLU function rapidly lowers the relative entropy within the first 500 weeks, that is, it gets close to the optimal probability distribution of the bar attendance. The probability distribution induced by the exponential one reaches about the same value after 800 weeks, and both remain constant after week 1,200, although the former seems to have lower relative entropy values than the latter and is less noisy. These results confirm what was stated in Section 6.3 and the previous considerations about the quality of the Gaussian reward functions with respect to the exponential ones.

Figure 8.13: Moving average of the relative entropy for the TG and UD utility functions. (a) TG: Gaussian TG and exponential TG; (b) UD: Gaussian UD (std. dev. = 2.0) and exponential UD. [Plots: relative entropy vs. weeks, 0 to 2,000.]

For completeness, Figure 8.13(a) and Figure 8.13(b) show the KLd of the probability distributions obtained with the TG and UD utility functions, in both the exponential and the Gaussian form. It is interesting to note that the probability distributions induced by the TG and UD utility functions are far from the optimal one, as in the first Bar Problem configuration (Section 8.2.1). The probability distribution obtained with the TG utility function tends to move away from the optimal probability distribution, while the one obtained with the UD utility function is heavily noisy, mainly because of the signal-to-noise ratio described in Section 4.2.3.

Given the theoretical results obtained with the KLd, it is interesting to evaluate the agents' behavior when they adopt the Gaussian WLU function or the exponential one. Figure 8.14 depicts the attendance of the bar: the red boxes represent the optimal configuration, while the blue error lines represent the fluctuation of the number of agents attending the bar around the optimal configuration on every day of the last 10 weeks. With the Gaussian WLU function the bar attendance shows larger fluctuations around the optimal configuration than with the exponential one, but these deviations are not simultaneous; indeed, the distance from the probability distribution of the optimal configuration is lower with the Gaussian WLU function than with the exponential one (Figure 8.12). In some cases the agents may implicitly create a coalition and then change the overcrowded day on which they attend the bar, so as not to lower the world utility. This particular behavior influences neither the KLd nor the world utility, since both are based only on the distance from the optimal configuration.

Figure 8.14: Attendance of the second bar configuration. (a) Exponential WLU function; (b) Gaussian WLU function (σ = 2.5). [Plots: agents attending the bar vs. days of week (1 to 7), showing the optimal configuration and the deviations.]

8.2.3 Q-learning Dynamics of the Bar Problem

As explained in Section 6.4, with Equations (6.4), (6.5), (6.7) and (6.8) to (6.13) we can compute the Q-learning dynamics of the Bar Problem and examine how the agents' behavior changes over the weeks. The change in the behavior of agent $\eta_i$ is influenced by the behavior of the other agents $\{\eta_{-i}\}$: when τ increases (see Equations (6.4) and (6.5)) agents are led to exploration, that is, they disregard their past experience (in the extreme case agents act randomly); conversely, when τ tends to 0 the Q-learning dynamics lead agents to consider only their past experience, thus discarding exploration. Hence, the temperature usually starts from a high value (more exploration) and is decreased over time (more exploitation).
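The role of the temperature can be illustrated with the standard Boltzmann (softmax) action-selection rule. This is only a sketch of that well-known rule, assumed here as an illustration of how τ trades exploration against exploitation; it does not reproduce Equations (6.4) and (6.5), and the Q-values are hypothetical.

```python
import math


def boltzmann_policy(q_values, tau):
    """Action probabilities proportional to exp(Q / tau)."""
    m = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 0.8]                        # hypothetical Q-values for (Monday, Tuesday)
hot = boltzmann_policy(q_values, tau=10.0)   # high temperature: nearly uniform
cold = boltzmann_policy(q_values, tau=0.01)  # low temperature: nearly greedy
```

At high temperature the two actions are chosen almost equally often regardless of the Q-values, while at low temperature the action with the larger Q-value is chosen almost always.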

We initialize Equation (6.7) as follows (initial conditions for the system of

differential equations (6.2)):

$$\Pi = \begin{bmatrix} 0.2 & 0.21 & 0.22 & 0.23 & 0.24 & 0.25 & 0.26 & 0.27 \\ 0.8 & 0.79 & 0.78 & 0.77 & 0.76 & 0.75 & 0.74 & 0.73 \end{bmatrix}^{T} \qquad (8.3)$$


and we set α = 0.1 and τ = 0.1; as a consequence we obtain the results depicted in Figures 8.15(a), 8.15(b) and 8.15(c). Notice that here we always consider the Bar Problem with a Gaussian reward function (with optimal number of agents attending the bar c = 1, standard deviation σ = 0.5 and constant k = 1), because in Sections 8.2.1 and 8.2.2 we saw that the Gaussian reward function gives better results than the exponential one.

Figure 8.15 shows how poorly the UD utility function performs: all agents converge to a uniform policy, that is, they have the same probability of attending the bar on Monday or on Tuesday, and their probability distributions follow the same evolution.

The TG and WLU functions induce the same behavior: with these utilities the agent with the highest initial probability of attending the bar on Monday (agent 8, see Equation (8.3)) increases that probability towards 1, while the others converge to attending the bar on Tuesday. In this case we reach an admissible optimal bar configuration.

It is interesting to find the critical temperature τ at which the WLU and TG utility functions lead agents towards a uniform policy like the one depicted in Figure 8.15(c). We keep α = 0.1 fixed and no longer consider the UD function, since it always leads agents towards a uniform probability distribution. Experimentally we found $\tau_{WLU} = 0.140$ for the WLU function and $\tau_{TG} = 0.154$ for the TG utility function.

Figure 8.16 shows the results for the WLU and TG utility functions at these temperatures. The WLU function with $\tau_{WLU}$ converges to the optimal policy in about 200 weeks; for larger values of $\tau_{WLU}$ we ran into a singularity when numerically solving the system of differential equations stated in Equation (6.2). The TG utility function with $\tau_{TG}$ converges to the optimal policy in about 750 weeks; for larger values of $\tau_{TG}$ agents tend towards a uniform policy.

Given the results depicted in Figure 8.16, it is interesting to analyze how the agents behave dynamically with the TG and WLU functions. Figure 8.17 depicts how the agents act when the temperature τ is fixed to 0.14 (note that this temperature is critical for the WLU function but not for the TG utility function). The agents configured with the TG utility function reach a 90% probability of attending the bar on Monday after 140 weeks, while those configured with the WLU function do so after 155 weeks. Both reach a 100% probability of attending the bar on Monday after 200 weeks, with the same rate of increase (even though agent 6 behaves slightly differently).

Figure 8.15: Bar dynamics with 8 agents (τ = 0.1, α = 0.1). (a) WLU function; (b) TG utility function; (c) UD utility function. [Plots: probability of each agent (agents 1 to 8) to attend the bar on Monday vs. weeks, 0 to 70.]

Figure 8.16: τ for the bar dynamics with 8 agents. (a) WLU function, weeks 0 to 200; (b) TG utility function, weeks 0 to 800. [Plots: probability of each agent (agents 1 to 8) to attend the bar on Monday vs. weeks.]

Figure 8.17: Bar dynamics with 8 agents and τ = 0.14. (a) WLU function; (b) TG utility function. [Plots: probability of each agent (agents 1 to 8) to attend the bar on Monday vs. weeks, 0 to 200.]

To justify these results, it is important to note that in this bar configuration the agents converge towards the optimal bar allocation with both the WLU and TG utility functions. This is not too surprising, since we are dealing with a heavily reduced problem (2 days per week and only 8 agents). In such a situation the signal-to-noise ratio (which in the general case always penalizes the TG utility function, see Section 4.2.3) does not affect the TG utility function, since each agent can clearly discern the consequences of its actions.

Notice that in these experiments we used a particular policy matrix Π (Equation (8.3)), with which the system converges to the optimal configuration under both the WLU and TG utility functions. Apart from the initial transient (see Figure 8.15(a) and Figure 8.15(b)), the agent with the highest initial probability of attending the bar on Monday always converges on that day because of the first term of both Equation (6.4) and Equation (6.5). Initially, the expected number of agents attending the bar on Monday is 1.88 (greater than the optimal one), so all agents decrease their probability of attending the bar on Monday (and Tuesday becomes overcrowded). After the fifth week agent 8 slowly increases its probability of attending the bar on Monday, because the expected number of agents attending on that day is now about 0.64. Agent 8 is thus led to attend the bar on Monday: it continuously receives higher rewards than expected, so it follows its policy “attend the bar on Monday”.
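The initial expected attendance quoted above follows directly from the policy matrix of Equation (8.3): since each agent attends independently, the expected number of agents on a day is the sum of the individual probabilities for that day. A minimal check, with variable names chosen only for illustration:

```python
# Monday probabilities of the initial policy matrix in Equation (8.3), one entry per agent.
monday = [0.20, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27]
# Each agent attends on Tuesday with the complementary probability.
tuesday = [1.0 - p for p in monday]

expected_monday = sum(monday)    # about 1.88, above the optimum c = 1
expected_tuesday = sum(tuesday)  # about 6.12
```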

To verify this behavior it is interesting to see how the system evolves when we

use particular initial policy matrices Πs. Let us start with this policy matrix:

$$\Pi_u = \begin{bmatrix} 0.9999 & 0.53 & 0.52 & 0.51 & 0.49 & 0.48 & 0.47 & 0.46 \\ 0.0001 & 0.47 & 0.48 & 0.49 & 0.51 & 0.52 & 0.53 & 0.54 \end{bmatrix}^{T} \qquad (8.4)$$

Figure 8.18: Uniform policies obtained with Πu. [Plot: probability of each agent (agents 1 to 8) to attend the bar on Monday vs. weeks, 0 to 100.]

In this situation Equation (6.4) and Equation (6.5) tell us that the initial expected number of agents attending the bar on Monday, excluding agent 1, is 3.46⁵, so agents 2 to 8 will act randomly (uniform probability policy). Agent 1 expects (including itself) 4.4599 agents attending the bar on Monday and 3.5401 agents attending the bar on Tuesday, so it will converge to a uniform (random) policy even though it starts with a high probability of attending the bar on Monday. This behavior does not depend on the utility function (TG, WLU or UD), but only on Equation (6.4) and Equation (6.5), in particular on the expected number of agents attending the bar on Monday or Tuesday. Figure 8.18 shows how the system behaves: all the previous considerations hold, and all agents converge towards a uniform policy, that is, they act randomly.

⁵ Obviously, if we include agent 1 these results do not change, apart from the numerical values.
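As a worked check of the expected Monday attendance under Πu (Equation (8.4)), write $\pi_j(\mathrm{Mon})$ for agent j's initial probability of attending on Monday (a notation introduced here only for illustration):

```latex
\sum_{j=2}^{8} \pi_j(\mathrm{Mon}) = 0.53 + 0.52 + 0.51 + 0.49 + 0.48 + 0.47 + 0.46 = 3.46,
\qquad
\sum_{j=1}^{8} \pi_j(\mathrm{Mon}) = 0.9999 + 3.46 = 4.4599 .
```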

Another interesting policy matrix is the following:

$$\Pi_c = \begin{bmatrix} 0.9 & 0.22 & 0.23 & 0.24 & 0.21 & 0.20 & 0.19 & 0.18 \\ 0.1 & 0.78 & 0.77 & 0.76 & 0.79 & 0.80 & 0.81 & 0.82 \end{bmatrix}^{T} \qquad (8.5)$$

Figure 8.19: Policies of agents obtained with Πc. [Plot: probability of each agent (agents 1 to 8) to attend the bar on Monday vs. weeks, 0 to 100.]

This configuration is similar to the previous one, in that one agent (agent 1) has a high probability of attending the bar on Monday. In the first weeks we expect this agent to decrease its probability of attending the bar on Monday (with the WLU and TG utility functions; with the UD function all agents always converge towards a uniform policy) because on that day, agent 1 excluded, there are 1.47 expected agents, while the others increase their probability of attending the bar on Monday because on Tuesday there are 5.63 expected agents (including agent 1). At some point Monday appears overcrowded, so agents 2 to 8 lower their probability of attending the bar on that day. Meanwhile, agent 1 is decreasing its probability of attending the bar on Monday, but the coalition formed by agents 2 to 8 is decreasing those probabilities too, so agent 1 realizes that Tuesday is more overcrowded than Monday and increases its probability of attending the bar on Monday (while the others keep decreasing that probability: they do not follow the policy “attend the bar on Monday”, since it returns lower rewards than the average policy). Thus, the system converges towards an admissible optimal configuration. These considerations are well depicted in Figure 8.19.

8.3 Cooking Teams Problem

In Section 7.3 we used a problem well known in the literature to formalize a new kind of game that involves both task allocation and coalition formation. We discussed the curse of the state space size in such problems, which grows exponentially with the number of agents of each type. To deal with the state space size, in all these experiments we introduce a new characteristic function (Equations (8.6) and (8.7)) to evaluate the quality of a coalition, different from the one used to evaluate the world utility (Equation (7.5)), so that we can compare the results obtained with the two characteristic functions.

$$N_2(f_c, f_h, \mu_c, \mu_h) = \kappa \cdot \frac{1}{2 \cdot \pi \cdot \sigma^2} \cdot \exp\left(-\frac{(f_c - \mu_c)^2 + (f_h - \mu_h)^2}{2 \cdot \sigma^2}\right) \qquad (8.6)$$

In Equation (8.6), $f_c$ and $f_h$ indicate the number of cooks and helpers respectively, $\mu_c$ and $\mu_h$ the optimal values according to Equation (7.5), and κ is a positive real number. Here we take the standard deviation σ = 1.5. This function is used to

evaluate the coalition (n, 2 · n), where, according to Equation (7.5), it refers to

coalition (n, 2 · n) (respectively, number of cooks and number of helpers). The

evaluation of such function is given if and only if n− n = ±1 (n, n ∈ N), elsewhere

the Gaussian characteristic function returns 0.


For the coalition (4 · n, 0) we use the following Gaussian function:

$$N_2(f_c, \mu_c) = \kappa \cdot \frac{1}{\sigma \cdot \sqrt{2 \cdot \pi}} \cdot \exp\left(-\frac{(f_c - \mu_c)^2}{2 \cdot \sigma^2}\right) \qquad (8.7)$$

where, as before, $f_c$ and $\mu_c$ indicate the actual number and the optimal number of cooks according to Equation (7.5), κ is a positive real number and the standard deviation σ is equal to 1.5. The evaluation of such function is given

if and only if n − n = ±2 (n, n ∈ N), elsewhere it returns 0.

Finally, for the coalition (0, n) we used the evaluation of the same coalition

described in Equation (7.5), that is n (obviously n ∈ N).
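The two Gaussian characteristic functions can be sketched as plain functions. This is only an illustration of the Gaussian shapes of Equations (8.6) and (8.7) with σ = 1.5 from the text; the value of κ is not given in this section, so the default `kappa=1.0` is a placeholder, and the conditions under which each function is evaluated or returns 0 are deliberately not modelled. The function names are hypothetical.

```python
import math


def gaussian_2d(f_c, f_h, mu_c, mu_h, kappa=1.0, sigma=1.5):
    """Isotropic bivariate Gaussian in the form of Equation (8.6), scaled by kappa."""
    sq_dist = (f_c - mu_c) ** 2 + (f_h - mu_h) ** 2
    return kappa / (2 * math.pi * sigma ** 2) * math.exp(-sq_dist / (2 * sigma ** 2))


def gaussian_1d(f_c, mu_c, kappa=1.0, sigma=1.5):
    """Univariate Gaussian in the form of Equation (8.7), scaled by kappa."""
    norm = kappa / (sigma * math.sqrt(2 * math.pi))
    return norm * math.exp(-((f_c - mu_c) ** 2) / (2 * sigma ** 2))


peak = gaussian_1d(4, 4)   # value at the optimum (4 cooks, placeholder kappa)
near = gaussian_1d(3, 4)   # one cook short of the optimum: smaller but still positive
```

Unlike a strict characteristic function, the value decays smoothly with the distance from the optimum, so a nearly optimal coalition still receives a graded signal.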

8.3.1 Nonempty State Space

In this subsection we will analyze the results obtained with the Cooking Teams

Problem using the state space described in Section 7.3.3. These results refer to the

four bar configurations proposed in Section 7.3.1. We used different parameters

to configure the environment and agents in order to see whether we can reach an

optimal solution (or a near one).

In the following, we show and analyze the results of different experiments. The ε-greedy exploration policy uses a parameter ε, the probability that an agent explores the environment rather than following its policy. This probability decreases over the weeks w according to the following law:

$$\epsilon_w = \frac{\epsilon_{0w}}{1 + w \cdot dr_\epsilon}, \qquad (8.8)$$

where $\epsilon_w$ is the current exploration ratio at week w, $\epsilon_{0w}$ is the initial exploration ratio and $dr_\epsilon$ is the exploration decreasing rate (greater than 0).
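The decay law of Equation (8.8) can be written as a one-line function. The sketch below assumes an initial exploration ratio of 0.1, as used in the experiments; the decreasing rate is not stated in this section, so `decay_rate=0.001` is a purely hypothetical value.

```python
def exploration_rate(week, eps0=0.1, decay_rate=0.001):
    """Exploration probability at `week`: eps0 / (1 + week * decay_rate)."""
    return eps0 / (1.0 + week * decay_rate)


start = exploration_rate(0)       # equals the initial exploration ratio
halved = exploration_rate(1000)   # with this hypothetical rate, half the initial ratio
```

The schedule decreases monotonically and approaches 0 only asymptotically, so some exploration remains at every week.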

Standard Configuration

In this test, all agents use the well-known Q-learning algorithm to find an optimal policy. The learning rate α is equal to 0.5 and each agent uses an ε-greedy exploration policy with ε = 0.1 decreasing over time. Here we compare the performance obtained in Bar-1 and in Bar-4, first using the characteristic function of Equation (7.5), then Equations (8.6) and (8.7), in order to see whether we can boost the results achieved by the agents. Each graph is obtained by averaging over 10 different runs of 150,000 weeks and then computing a moving average of the result.
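The post-processing described above (average over runs, then a moving average) can be sketched as follows. The window length is not stated in the text, so `window` is a hypothetical parameter, and the toy series are made-up numbers rather than experimental results.

```python
def average_runs(runs):
    """Pointwise mean of several equally long world-utility series."""
    return [sum(values) / len(values) for values in zip(*runs)]


def moving_average(series, window):
    """Trailing moving average; early points use the samples available so far."""
    out, running = [], 0.0
    for i, x in enumerate(series):
        running += x
        if i >= window:
            running -= series[i - window]  # drop the sample leaving the window
        out.append(running / min(i + 1, window))
    return out


runs = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]   # toy data, not thesis results
smoothed = moving_average(average_runs(runs), window=2)
```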

Figure 8.20: Bar-1 and Bar-4 with α = 0.5, ε = 0.1; the standard characteristic function of Equation (7.5) is used to compute both the world utility and the quality of a coalition of agents attending the bar. (a) Bar-1; (b) Bar-4. [Plots: world utility vs. weeks (0 to 160,000) for the WLU, SU, TG and UD functions.]

In Figure 8.20 we can clearly see the influence of the state space on performance. Even though the world utility grows remarkably in the first weeks, it settles on values that are low compared with the maximum ones reported in Table 7.1. Here we can also observe the notion of problem difficulty described in Section 7.3.3. In Figure 8.20(b) the agents using the UD utility function tend to lower the world utility, while in Figure 8.20(a) they show a slight increase (Bar-4 is more difficult than Bar-1, since it has many cooks relative to helpers, so the latter are shared resources sought after by the cooks). The agents using the SU function, instead, tend to increase the world utility, but it remains clearly lower than the optimal one.

To assess how well Equation (7.5) assigns rewards to agents, we keep using that function to evaluate the world utility, while Equations (8.6) and (8.7) compute the reward assigned to each agent.

In Figure 8.21 we still perceive the lower difficulty of Bar-1 with respect to Bar-4. While in Figure 8.21(a) the agents using the SU function obtain a lower world utility than in Figure 8.20(a), in Figure 8.21(b) all the reward functions achieve better results than in Figure 8.20(b). This is due to the smoothness of the Gaussian functions used by the reward function, which create attraction fields around the optimal points in the state space. With Equation (7.5), instead, these points are evaluated negatively or yield a zero reward. As a consequence, all agents tend to stay away from them, because they receive negative rewards there, and prefer to visit the states that provide zero reward.

Looking at the Gaussian characteristic function of Equation (8.7), we can see that if more than two cooks attend the bar, they gain a useful reward (either positive or negative) that leads them towards optimal states. For example, suppose we have the coalition S = {3, 0}. In this case the characteristic function of Equation (7.5) returns 0 to each cook. Conversely, with the Gaussian characteristic function of Equation (8.7) each cook gains 2.01380 (assuming they use the SU function). In this situation each cook is led to visit that state and, hopefully, the nearby optimal one (that is, S = {4, 0}).

Figure 8.21: Bar-1 and Bar-4 with α = 0.5, ε = 0.1; the characteristic function of Equation (7.5) is used to compute the world utility, while the characteristic functions of Equations (8.6) and (8.7) are used to evaluate the quality of a coalition of agents attending the bar. (a) Bar-1; (b) Bar-4. [Plots: world utility vs. weeks (0 to 160,000) for the WLU, SU, TG and UD functions.]

It is interesting to check which states of each agent's Q-table are visited most often in the Bar-4 configuration, in order to understand which coalitions are formed during the learning phase. As stated above, we expect most cooks to form homogeneous coalitions composed only of cooks, while helpers are shared resources in this configuration; thus the optimal choice for a cook is to form a coalition with 2 helpers.

Figure 8.22 depicts the Q-table visits for each type of agent. In Figure 8.22(a) the most visited states correspond to coalitions formed by 3 to 7 cooks and 0 helpers. We have experimentally verified that most cooks tend to form coalitions composed only of cooks, in order to approach the optimal coalition $(f_c, 0)$ (where $f_c$ is a multiple of 4).

Only in a small number of cases do some cooks form the optimal coalition $(f_c, f_h = 2 \cdot f_c)$ ($f_c$ cooks and $f_h$ helpers). Figure 8.22(b) shows this situation: the most visited state is state number 44, which corresponds to 0 cooks and 2 helpers (thus cooking 2 cookies). State number 45 (which corresponds to 1 cook and 2 helpers) is visited far less often than state 44, because each cook tends to form coalitions of only cooks, as the corresponding states are more numerous than those associated with the coalition $(f_c, f_h = 2 \cdot f_c)$. Since each cook tends to form such coalitions and each pair of helpers waits for a cook (who is unavailable), each helper tends to form coalitions composed only of helpers.

The same considerations hold for the situation described in Figure 8.22(c): the most visited state is state number 66, which corresponds to 0 cooks and 3 helpers. As before, state number 67 (1 cook and 3 helpers) is visited far less often than state 66, because no cooks are available⁶.

Exploring the Environment

In the previous subsection we discussed the advantages of a Gaussian characteristic function over a “strict” characteristic function for assigning rewards in this problem. Now let us increase the exploration ratio in order to evaluate whether this results in better performance. The reason for this increase is that here we are dealing with a bounded state space. The quality of each state depends on the number of cooks and helpers, so without an appropriate exploration strategy all agents tend to visit the same non-optimal states.

⁶ Recall that when the Gaussian functions of Equations (8.6) and (8.7) are used to compute the rewards, this coalition is evaluated negatively for each helper and positively for each cook. In any case, the world utility is computed using Equation (7.5), so that coalition does not improve it.

Figure 8.22: Q-table visits for cooks and helpers in Bar-4 with α = 0.5, ε = 0.1; the characteristic function of Equation (7.5) is used to compute the world utility, while the characteristic functions of Equations (8.6) and (8.7) are used to evaluate the quality of a coalition of agents attending the bar. (a) Cook, states from 0 to 10; (b) Helper, states from 43 to 47; (c) Helper, states from 65 to 69. [Plots: number of visits per state and action.]

The simplest method to induce exploration is ε-greedy exploration, where a (decreasing) probability ε is used to explore the environment rather than exploit it. Another method is to set the initial Q-values of each agent to a high value (they are usually set equal to $r_{max}/(1-\gamma)$, where $r_{max}$ is the maximum obtainable reward⁷ and γ is the discount factor of the future expected reward used in the Q-learning update formula⁸). The former method provides a random exploration policy, since during exploration an agent chooses the action to execute at random, while the latter enables a more uniform exploration policy.
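The two exploration mechanisms described above can be sketched side by side. The sketch uses $r_{max} = 10$ as stated in the footnote; the discount factor for this problem is not given in this section, so γ = 0.95 is borrowed from the Bar Problem experiments as an assumption. The function names and the dictionary-based Q-table are hypothetical.

```python
import random


def optimistic_q_table(states, actions, r_max=10.0, gamma=0.95):
    """Initialize every Q-value to r_max / (1 - gamma), an upper bound on the return."""
    q0 = r_max / (1.0 - gamma)
    return {(s, a): q0 for s in states for a in actions}


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


q = optimistic_q_table(states=range(3), actions=range(2))
action = epsilon_greedy(q, state=0, actions=[0, 1], epsilon=0.1)
```

With optimistic initial values, every untried action looks at least as good as any tried one, so a greedy agent is driven to try each action in turn; this is the sense in which the resulting exploration is more uniform than the random choices of ε-greedy.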

⁷ Here we impose $r_{max} = 10$.

⁸ Equation (2.3).

In Figure 8.23 we can see that, in this case too, the Gaussian characteristic functions of Equations (8.6) and (8.7) give smoother rewards than the characteristic function of Equation (7.5). With the former the world utility keeps increasing after the first 5,000 weeks, while with the latter it tends to reach a constant value after a few weeks.

Figure 8.23: Bar-4 with α = 0.5, ε = 0.3 over 150,000 weeks (here we used the characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are averaged over 10 different runs. (a) Equation (7.5); (b) Equations (8.6) and (8.7) are used as characteristic function, while Equation (7.5) is used to evaluate the world utility of Equation (7.6). [Plots: world utility vs. weeks (0 to 160,000) for the WLU, SU, TG and UD functions.]

It is interesting to note how the exploration induced by increasing the initial value of the ε parameter results in better performance. The world utility depicted in Figure 8.23(b) has about the same convergence speed as that of Figure 8.21(b), but the former keeps increasing after week 20,000, for all four utility functions, until week 100,000, while the latter reaches a constant value after week 20,000. Thus, the exploration strategy seems to lead agents towards better performance.

The same considerations still hold if we use the characteristic function described by Equation (7.5): the world utility obtained by the agents using the SU function in Figure 8.23(a) converges slightly more slowly than the world utility achieved by the agents using the SU function in Figure 8.20(a). The agents in the former case continue to improve their policy even after week 60,000, while the agents in the latter case reach a convergence value lower than the one obtained in Figure 8.23(a).

The previous experiment suggests that a stronger exploration strategy gives slightly better results. Figure 8.24 depicts the performance of Bar-4 obtained by the agents using the SU function and the characteristic functions of Equations (8.6) and (8.7). The different exploration values make a difference around week 80,000, where the agents with a greater exploration factor achieve increasing world utility values. Their policy is better tuned than that of the agents using other exploration rates. This is mainly due to the higher exploration rate: in the earliest weeks these agents explore the environment in such a way that they visit near-optimal states, so their policy is positively affected in later weeks.

From this experiment, we can infer another important consideration about the state space. Looking at Figure 8.24 we can see that the agents using the SU function with ε = 0.9 obtain better performance (albeit in a greater number of weeks) than the others (which also use the SU function). This exploration value means that these agents act in a semi-random fashion: in the first weeks they choose an action at random with probability 0.9. This behavior suggests that the state space is heavily bounded, since agents seem to achieve better payoffs by acting in a semi-random fashion (at least in the first weeks).
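The ε-greedy selection discussed above can be sketched as follows. This is only an illustrative sketch: the function name, the random tie-breaking rule and the example Q-values are assumptions, not taken from the code used in the experiments.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        # Exploration: every action is equally likely (the greedy one included).
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Exploitation: break ties at random so that no index is systematically favored.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

# With epsilon = 0.9 the agent acts almost at random; with epsilon = 0.0 it is purely greedy.
semi_random_action = epsilon_greedy([0.0, 2.0, 1.0], 0.9)
greedy_action = epsilon_greedy([0.0, 2.0, 1.0], 0.0)
```

Note that with n actions the greedy action is still chosen with probability (1 − ε) + ε/n, because a random draw may also land on it.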

We have experimentally verified that keeping the exploration rate fixed does not improve performance. If all agents use a fixed exploration rate, they converge to a lower value than the one they reach with a decreasing exploration rate. This is mainly because they continue to explore the

Figure 8.24: Comparison between the performance of Bar-4 (SU function) obtained using different values of ε (0.1, 0.3, 0.5, 0.7, 0.9, 1.0), with Equation (7.5) used for the world utility and Equations (8.6) and (8.7) used for the characteristic function. The plot shows the world utility against weeks, with one curve per value of ε. Each experiment is the mean of 5 different runs, and we plot one world utility value every 100 values (that is, this experiment was executed over 500,000 weeks).

environment rather than exploiting it. This further exploration is useless for the agents: it does not allow them to improve their policy, because it induces a random action selection with probability ε. Random action selection is useful in the first weeks, but it becomes less and less useful over time.
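To make the difference between a fixed and a decreasing exploration rate concrete, the sketch below counts the expected number of weeks in which an agent acts at random. The update formula of Equation (8.8) is not reproduced in this section, so the exponential schedule and its `decay` parameter are only an assumed stand-in, and all names are hypothetical.

```python
def decayed_epsilon(eps0, week, decay):
    """Hypothetical exponential schedule; Equation (8.8) may define a different one."""
    return eps0 * decay ** week

def expected_random_picks(eps0, weeks, decay=None):
    """Expected number of weeks in which the agent selects its action at random."""
    if decay is None:
        # Fixed exploration rate: the same probability in every week.
        return eps0 * weeks
    return sum(decayed_epsilon(eps0, w, decay) for w in range(weeks))

fixed = expected_random_picks(0.3, 1000)
decreasing = expected_random_picks(0.3, 1000, decay=0.999)
```

Under these assumptions the fixed schedule keeps spending the same share of weeks on random actions, while the decreasing one concentrates them in the early weeks, which is where the text above finds them useful.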

Up to now we have analyzed and discussed the results obtained using high ε values, but it is interesting to compare these results with those obtained using high initial Q-table values (keeping ε = 0.1 as in the standard configuration of the experiments described here).

Figure 8.25 depicts the result obtained with Bar-4 (since it is the most difficult configuration to solve). Even though the ε-greedy exploration policy is a random exploration policy, the agents using it (with the SU function) seem to obtain better world utility values (and a remarkable convergence time) than those using high initial Q-values. These values cause agents to choose equally among the different actions. In the first weeks these actions give agents poor rewards and the Q-learning

Figure 8.25: Comparison between the performance of Bar-4 (SU function) obtained using ε = 0.3 and high initial Q-values, with Equation (7.5) used for the world utility and Equations (8.6) and (8.7) used for the characteristic function. The plot shows the world utility against weeks over 150,000 weeks, with one curve for ε = 0.3 and one for high Q-values.

algorithm decreases the value of that state-action pair according to Equation (2.3). The problem is that this update lowers the value, but not fast enough for the agent to start preferring Q-values that are better estimates (though not the highest). All agents behave this way, so there is a high probability of obtaining a joint action that corresponds to a poor coalition structure CS. Using different values of the ε parameter, instead, the agents choose the actions that promise a high expected reward already in the first weeks.
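The slow decrease of optimistic initial values can be illustrated with the standard Q-learning update on a toy table. This is a sketch under stated assumptions: a single state with four actions, a constant poor reward of 1.0, γ = 0.9, and initial values set to rmax / (1 − γ) = 100 with rmax = 10; only α = 0.5 and rmax = 10 come from the text, and the function names are hypothetical.

```python
def q_update(q, action, reward, alpha, gamma):
    """One Q-learning update on a single-state table (a list of action values)."""
    # The bootstrap term uses the best current value, which is still optimistic.
    target = reward + gamma * max(q)
    q[action] += alpha * (target - q[action])

def steps_until_below(q0, threshold, reward, n_actions=4, alpha=0.5, gamma=0.9):
    """Greedy steps needed before every action value has dropped below threshold."""
    q = [q0] * n_actions
    steps = 0
    while max(q) >= threshold:
        action = q.index(max(q))  # purely greedy choice among the inflated values
        q_update(q, action, reward, alpha, gamma)
        steps += 1
    return steps

q = [100.0] * 4
q_update(q, 0, 1.0, 0.5, 0.9)   # the value only moves from 100.0 to 95.5
slow = steps_until_below(100.0, 20.0, 1.0)
```

A single poor reward removes only a small fraction of the inflated value because the target still bootstraps from the other optimistic entries, so the agent cycles through all actions many times before the table reflects the rewards actually received.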

Exploiting the Environment

In the previous subsection we have discussed the exploration strategy and we have seen in particular that, by increasing the exploration ratio (that is, the ε value, since we are using an ε-greedy policy), the agents using the SU function can reach slightly higher world utility values. In order to better exploit the environment, each agent must be configured with a suitable value of α, used in the Q-learning update formula (Equation (2.3)): greater values of α lead agents to discard their past

Figure 8.26: Bar-1 and Bar-4 with αS = 0.5, αNS = 0.1, ε = 0.1 over 150,000 weeks (here we used the characteristic functions of Equations (7.5), (8.6) and (8.7)). Both panels plot the world utility against weeks for the WLU, SU, TG and UD utility functions: (a) Bar-1; (b) Bar-4. These graphs are an average over 10 different runs. Each agent runs the CoLF algorithm.

experience, while lower values lead agents to rely on their past experience and to largely disregard the new estimate of the expected future reward. The CoLF algorithm [1] proposes to use different learning rates depending on the rewards received from the environment. A non-stationary learning rate is used when an agent receives an unexpected payoff; otherwise the agent uses a stationary learning rate (which is greater than the non-stationary one).
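The idea of switching between two learning rates can be sketched as below. This is a deliberately simplified illustration and not the CoLF algorithm of [1]: the rule used to decide whether a payoff is unexpected (a fixed `tolerance` around the current estimate) and the function name are assumptions; only the values αS = 0.5 and αNS = 0.1 come from the experiments.

```python
def two_rate_update(q, reward, alpha_s=0.5, alpha_ns=0.1, tolerance=1.0):
    """Update a scalar value estimate with one of two learning rates."""
    # Assumption: a payoff is 'unexpected' when it is far from the current estimate.
    unexpected = abs(reward - q) > tolerance
    # The non-stationary rate is the smaller one, so surprises move the estimate less.
    alpha = alpha_ns if unexpected else alpha_s
    return q + alpha * (reward - q)

expected_case = two_rate_update(5.0, 5.5)    # uses alpha_s = 0.5
surprise_case = two_rate_update(5.0, 9.0)    # uses alpha_ns = 0.1
```

The design intent captured here is that a payoff probably caused by the other agents' changing behavior should perturb the estimate less than a payoff that agrees with what the agent already expects.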

Figure 8.26 depicts the results obtained in the Bar-1 and Bar-4 configurations, where all agents use the CoLF algorithm. Comparing these results with those shown in Figure 8.21, it is clear that not even CoLF is able to improve the world utility value reached by the agents at convergence. On the other hand, the convergence time of the agents using the WLU and TG utility functions seems to improve in the Bar-4 configuration (Figure 8.26(b)), but in this case they reach about the same convergence value as in Figure 8.21(b).

These results confirm what was said above, that is, we are dealing with a heavily bounded state space. Each agent must deal with that state space and with its action set. It chooses an action according to the present state, hence if it is in a bad state it chooses actions that do not improve the quality of that state. Therefore, an agent must deal with two orthogonal components (the state space and the action space), and the state space is the factor degrading the overall system performance. As seen above in Figure 8.24, an easy but expensive way to partially avoid this drawback is to increase the exploration ratio (in the first weeks all agents choose their action in a semi-random fashion), but then we need more and more weeks to let the agents learn a suboptimal policy.

8.3.2 Empty State Space

In this subsection we analyze the results obtained with the Cooking Teams Problem using an empty state space. These results refer to the four bar configurations proposed in Section 7.3.1. We used different parameters to configure the environment and the agents in order to see whether we can reach an optimal solution (or a near-optimal one). In all these experiments all agents use an ε-greedy exploration policy, where the initial value of ε is equal to 0.1 and decreases over time following the update formula of Equation (8.8).

Standard Configuration

In this test all agents use the well-known Q-learning algorithm in order to find an optimal policy. The learning rate α is equal to 0.5. Here we compare the performance obtained in all four bar configurations presented in Section 7.3, first using the characteristic function of Equation (7.5), then those of Equations (8.6) and (8.7), in order to see how these characteristic functions model this problem. Each graph is obtained as the average over 10 different runs of 100,000 weeks, and on the final average we compute a moving average.
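The post-processing just described (pointwise average over the runs, then a moving average) can be sketched as follows. The window length and the choice of a trailing window are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
def average_runs(runs):
    """Pointwise mean of equally long world-utility series, one series per run."""
    n = len(runs)
    return [sum(values) / n for values in zip(*runs)]

def moving_average(series, window):
    """Trailing moving average; the first points use the samples available so far."""
    out, total = [], 0.0
    for i, x in enumerate(series):
        total += x
        if i >= window:
            total -= series[i - window]  # drop the sample that left the window
        out.append(total / min(i + 1, window))
    return out

mean_series = average_runs([[1, 2, 3], [3, 4, 5]])
smoothed = moving_average([2.0, 4.0, 6.0, 8.0], window=2)
```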

Figure 8.27: Bar-1 and Bar-2 with α = 0.5 and ε = 0.1 over 100,000 weeks (here we used the characteristic functions of Equations (7.5), (8.6) and (8.7)). All panels plot the world utility against weeks for the WLU, SU, TG and UD utility functions: (a) Bar-1 standard; (b) Bar-1 Gaussian; (c) Bar-2 standard; (d) Bar-2 Gaussian. These graphs are an average over 10 different runs.

In Figures 8.27 and 8.28 we can clearly see how agents behave in this environment configuration. In order to see how the characteristic functions induce different behaviors among agents, we first show the results obtained using the characteristic function of Equation (7.5) to compute both the rewards and the world utility (Figures 8.27(a), 8.27(c), 8.28(a) and 8.28(c)). Furthermore, we followed the same approach presented in Section 8.3.1: we used the characteristic functions of Equations (8.6) and (8.7) to compute rewards, while Equation (7.5) is used to compute the world utility (Figures 8.27(b), 8.27(d), 8.28(b) and 8.28(d)).

Figure 8.28: Bar-3 and Bar-4 with α = 0.5 and ε = 0.1 over 100,000 weeks (here we used the characteristic functions of Equations (7.5), (8.6) and (8.7)). All panels plot the world utility against weeks for the WLU, SU, TG and UD utility functions: (a) Bar-3 standard; (b) Bar-3 Gaussian; (c) Bar-4 standard; (d) Bar-4 Gaussian. These graphs are an average over 10 different runs.

Looking at the agents' behavior depicted in Figures 8.27(a), 8.27(c), 8.28(a) and 8.28(c), we note that it always outperforms the behavior shown in Figures 8.27(b), 8.27(d), 8.28(b) and 8.28(d). In this problem configuration, the agents coordinate themselves, so they reach an equilibrium of the game (that is, an optimal coalition structure CS∗). This coordination is mainly due to the absence of the state space. In the previous experiments, where we considered the state space, the agents never reach the optimal coalition structure, because they must deal with both the state space and the action space. The former is heavily bounded because the goodness of a specific state is a function of the number of cooks and helpers visiting that state. As a consequence, all agents choose actions based upon the goodness of that state, thus they have no incentive to explore the environment (they might be in a local maximum of the world utility function). In this configuration, instead, without any state space the goodness of a specific day for an agent is based only upon the action it has chosen, hence each agent focuses only on its action space.
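A learner with an empty state space reduces to one value per action, as in the minimal sketch below. The class name, the zero initial values and γ = 0.9 are assumptions made for illustration; α = 0.5 and ε = 0.1 are the values of the standard configuration, and the decay of ε is omitted for brevity.

```python
import random

class StatelessQLearner:
    """Q-learner with a single (empty) state: the table holds one value per action."""

    def __init__(self, n_actions, alpha=0.5, gamma=0.9, epsilon=0.1, seed=None):
        self.q = [0.0] * n_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor (assumed value)
        self.epsilon = epsilon  # exploration rate, kept constant in this sketch
        self.rng = random.Random(seed)

    def act(self):
        """Epsilon-greedy choice over the action values."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q))
        return self.q.index(max(self.q))

    def learn(self, action, reward):
        """Q-learning update in which the next state is always the same single state."""
        target = reward + self.gamma * max(self.q)
        self.q[action] += self.alpha * (target - self.q[action])

learner = StatelessQLearner(n_actions=3, epsilon=0.0, seed=0)
learner.learn(action=1, reward=10.0)
choice = learner.act()
```

Because there is only one state, the value of an action depends solely on the rewards that followed it, which is the property the paragraph above credits for the improved coordination.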

Looking at the optimal values reported in Table 7.1, we see that in Bar-1, Bar-2 and Bar-4 (Figures 8.27(a), 8.27(c) and 8.28(c)) agents reach an admissible optimal coalition structure CS∗ (or a near-optimal one). Instead, in Bar-3 (Figure 8.28(a)) the agents do not form an optimal coalition structure: in particular, the agents using the TG and WLU functions clearly perform worse than those using the SU and UD ones. In this configuration they probably need more exploration and/or time in order to reach the optimal coalition structure.

In the previous case all agents use the characteristic function of Equation (7.5) both to compute rewards and to evaluate the world utility value. If we use the characteristic functions of Equations (8.6) and (8.7) to compute rewards, we can clearly see that the agents do not form an optimal coalition structure (Figures 8.27(b), 8.27(d), 8.28(b) and 8.28(d)). This behavior is due to the fact that the characteristic functions used to compute rewards rate as good, for example, a coalition S = {7, 0}, so the agents tend to choose that coalition. On the other hand, the world utility computed according to Equation (7.5) assigns 0 to that coalition, hence the agents' behavior is not fully aligned with that characteristic function.

Chapter 9

Conclusions and Future Works

The important thing is not to stop questioning.

Contents

9.1 Conclusions

9.2 Future Works

9.1 Conclusions

In this thesis we have proposed a new methodology to study interactions among different types of agents acting in the same environment. In particular, we have focused on cooperative interactions aimed at achieving a goal shared by all agents, which learn using different RL techniques. Agents acting in a generic environment are completely independent, that is, they do not know the intentions (policies) of the others and we do not allow any kind of communication among them.

At first, we studied a simplified problem, in which many agents of the same type act in different environments, both stationary and non-stationary. Some examples are the Bar Problem (Section 5.3, [19]) and the Gridworld (Section 5.2, [18]).

In a single-agent environment the focus is on the policy learning algorithm, that is, on the best way to use the reward assigned by the environment in order to learn an optimal policy that lets the agent reach a goal. In the multiagent case we have further difficulties related to the presence of many agents, and thus to the interactions among them, which induce more non-stationarity. These interactions may be formalized as cooperative or competitive. In this thesis we have focused on cooperative interactions, always adopting the constraints described above (policy unawareness, no communication). This case is clearly more difficult than the single-agent one. While in the single-agent case we focus on how best to learn a policy (thus on the behavior of the algorithm), in the multiagent one we also have to focus on the reward assignment, so that it can model a cooperative or a competitive behavior. If we use the same methods as in the single-agent case (in the same environment too), we will induce a greedy behavior in the agents (selfish agents).

In this case COIN ([23], [6]) is useful to induce a cooperative behavior in a multiagent system rather than a competitive one. Unfortunately, while studying COIN we found some gaps related both to the problem approach and to its theoretical grounds. As stated by ’t Hoen ([18]), the WLU function used to compute the reward to be assigned to agents is symmetric, thus we can run into slow learning speed. Furthermore, we have discovered some other gaps in the problem approach, in particular related to the reward function of the Bar Problem, which is not selective enough (see Section 6.3). Finally, we used the Q-learning dynamics approach ([20]) in order to understand how an environment evolves during the learning phase.

At this point, we have introduced new constraints on the previous approach. In the real world there exist many situations where the presence of different players induces them to group themselves into coalitions in order to reach a goal. Moreover, coalition formation may be mandatory to reach a goal. This new working environment causes new difficulties, because here we must consider two further facets (in addition to the interactions among agents and the non-stationarity due to the presence of many agents, as stated above):

• interactions among coalitions (thus more non-stationarity and environment bounds);

• distribution of a reward among all agents belonging to a given coalition.

In this framework ([16]) there are two main approaches used to distribute a reward among agents (the core and the Shapley value, see Section 6.5.3), focusing on different characteristics (respectively, coalition stability and payoff distribution among agents).

Before this thesis, this framework lacked a formalization for a particular kind of real environment: those where different types of agents must join suitable coalitions in order to reach a given goal.

In this thesis, we have introduced and formalized a new type of game (task allocation via coalition formation games) matching these characteristics. This new game brings together the interesting aspects of dispersion games (like the Bar Problem) and coalition formation games, leaving out computationally heavy (and, here, useless) characteristics like the computation of the Shapley value. In this formalization, we have defined new methodologies to distribute the reward among the agents of a coalition that, together with those used to assign rewards to a coalition (TG, WLU, SU, UD), play a fundamental part in achieving a learning goal. Furthermore, we have applied this framework to the Cooking Teams Problem ([25]) in order to see whether agents can reach an optimal coalition structure.

Another important feature to consider is the size of the state space. Already in the multiagent case we may have to deal with an a priori large state space. With coalition formation games this size may become even larger, because it may depend on the number of agents that form the different coalitions. By using a state space based on features, its size will be lower, but it is still heavily bounded: each state reached depends on the joint action, and so does its goodness. As a consequence, we do not use any kind of state space (to be precise, the state space size is equal to 1) in order to avoid these constraints. With this configuration we studied the Cooking Teams Problem and obtained encouraging results about these techniques.

9.2 Future Works

This thesis aims to give a more realistic approach to problems in multiagent systems. Although coalition formation games have already been studied in the literature, those studies do not focus on systems with many different types of agents. An interesting direction is to try this formalization on known environments used in dispersion games in order to assess its usefulness.

Another important problem to investigate is the curse of the state space size, that is, which is the best state space representation for this kind of problem. In our testbed problem we experimentally verified that, using an empty state space, agents can reach an optimal coalition structure. This might not be true a priori, since there may be different problems where the state space plays a fundamental role in finding an optimal coalition structure. It may then be necessary to investigate how to find a useful state space representation. We might apply an approach like LEAP ([2]), meant as a way to find a valid mapping function from a state space to a feature space. Alternatively, instead of using a state space elaboration like LEAP, it is interesting to investigate the case where all agents in a coalition use a state space representation (thus a feature space) different from that of other coalitions, discarding all useless states. This case is particularly interesting, since we can have agents in a coalition discarding the states related to other coalitions.

An appealing facet is related to the definition of marginal contribution, which is deeply related to the characteristic function used to model a problem. It would be useful to test different characteristic functions in our testbed problem, thus modeling different behaviors (e.g. a moderator). As a consequence, it is necessary to study how the agents' performance may change using marginal contribution.

While studying marginal contribution, we realized that it gives a way to create coalitions, but it does not formalize a method to distribute a reward among the different agents belonging to a coalition. Marginal contribution is at the heart of the Shapley value, which is used to distribute a reward among agents. Since the Shapley value is computationally heavy, it is interesting to find an association between marginal contribution and the Shapley value in order to rebuild the Shapley value from all the experience of an agent (thus exploiting all the marginal contribution values obtained by that agent). We may then obtain an approximate Shapley value as well as an expected future Shapley value, which can be used to distribute the reward obtained among the agents of a coalition.
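One standard way to approximate the Shapley value from marginal contributions is to average them over randomly sampled arrival orders, instead of enumerating all permutations. The sketch below illustrates this on a toy game; the characteristic function `toy_value` (a coalition is worth 10 only if it contains the cook and at least one helper) and all names are hypothetical, and this is not the method proposed in the thesis, which is left as future work.

```python
import itertools
import random

def toy_value(coalition):
    """Hypothetical characteristic function used only for this example."""
    has_helper = any(h in coalition for h in ("helper1", "helper2"))
    return 10.0 if "cook" in coalition and has_helper else 0.0

def add_marginals(v, order, totals):
    """Add to totals each agent's marginal contribution for one arrival order."""
    seen = frozenset()
    for agent in order:
        totals[agent] += v(seen | {agent}) - v(seen)
        seen = seen | {agent}

def shapley_exact(v, agents):
    """Exact Shapley value: average marginal contribution over all orders."""
    totals = {a: 0.0 for a in agents}
    orders = list(itertools.permutations(agents))
    for order in orders:
        add_marginals(v, order, totals)
    return {a: t / len(orders) for a, t in totals.items()}

def shapley_sampled(v, agents, samples, seed=0):
    """Monte Carlo estimate: average marginal contribution over random orders."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in agents}
    for _ in range(samples):
        order = list(agents)
        rng.shuffle(order)
        add_marginals(v, order, totals)
    return {a: t / samples for a, t in totals.items()}

agents = ["cook", "helper1", "helper2"]
exact = shapley_exact(toy_value, agents)
approx = shapley_sampled(toy_value, agents, samples=2000, seed=1)
```

The exact computation grows factorially with the number of agents, while the sampled estimate costs one pass over the agents per sample; in both cases the values assigned to the agents sum to the value of the whole coalition.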


Bibliography

[1] A. Bonarini, A. Lazaric, E. Munoz de Cote, and M. Restelli. Improving Cooperation among Self-Interested Reinforcement Learning Agents. 2005.

[2] A. Bonarini, A. Lazaric, E. Munoz de Cote, and M. Restelli. LEAP: an Adaptive Multi-Resolution Reinforcement Learning Algorithm. Journal of Machine Learning Research 1, 2006. To appear.

[3] M. Bowling and M. Veloso. An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning. 2000.

[4] M. Bowling and M. Veloso. Multiagent Learning Using a Variable Learning Rate. In Artificial Intelligence, volume 136(2), pages 215–250, January 2002.

[5] C. Claus and C. Boutilier. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. In American Association for Artificial Intelligence, pages 746–752, 1998.

[6] O. Etzioni, J. P. Muller, and J. M. Bradshaw, editors. General Principles of Learning-Based Multi-Agent Systems, New York, May 1999. Proceedings of the Third Annual Conference on Autonomous Agents, ACM Press.

[7] S. Hoberg. Reinforcement Learning for Autonomous Agents in a simulated Multi-Player Game. June 2004.

[8] J. Hu and M. P. Wellman. Nash Q-learning for General-Sum Stochastic Games. Journal of Machine Learning Research 4, pages 1039–1069, November 2003.

[9] International Conference on Machine Learning. Correlated-Q learning, Washington DC, 2003.

[10] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement Learning: A Survey. In Journal of Artificial Intelligence Research, chapter 4, pages 237–285. May 1996.

[11] M. Lauer and M. Riedmiller. An Algorithm for Distributed Reinforcement Learning in Cooperative Multi-Agent Systems.

[12] J. Laumonier and B. Chaib-draa. Multiagent Q-learning: Preliminary Study on Dominance between the Nash and Stackelberg Equilibriums. July 2005.

[13] T. Mitchell. Machine Learning. McGraw-Hill, 1997.

[14] E. Munoz de Cote. Learning to Form Coalitions. April 2006.

[15] M. Restelli. A Multi-Agent System for Multi-Agent Learning. PhD thesis, Politecnico di Milano.

[16] T. Sandholm, K. Larson, M. Andersson, O. Shehory, and F. Tohmé. Coalition Structure Generation with Worst Case Guarantees. In E. S. B.V., editor, Artificial Intelligence 111, number 111, pages 209–238. 1999.

[17] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[18] P. J. ’t Hoen and S. M. Bohte. Collective Intelligence with Sequences of Actions - Coordinating Actions in Multi-Agent Systems. In Lecture Notes in Artificial Intelligence, volume LNAI 2837, 2003.

[19] P. J. ’t Hoen and S. M. Bohte. COllective INtelligence with Task Assignment. Report SEN-E0315, Stichting Centrum voor Wiskunde en Informatica, P.O. Box 94979, 1090 GB Amsterdam (NL) Kruislaan 413, 1098 SJ Amsterdam (NL), December 2003.

[20] P. J. ’t Hoen and K. Tuyls. Analyzing Multi-Agent Reinforcement Learning using Evolutionary Dynamics.

[21] K. Tuyls, K. Verbeeck, and T. Lenaerts. A Selection-Mutation Model for Q-learning in Multi-Agent Systems. ACM, July 2003.

[22] D. H. Wolpert and K. Tumer. Using Collective Intelligence to Route Internet Traffic. 1999.

[23] D. H. Wolpert and K. Tumer. An Introduction to Collective Intelligence. Technical Report 99-63, NASA-ARC-IC, June 2005.

[24] D. H. Wolpert, K. Tumer, and A. Agogino. Learning Sequences of Actions in Collectives of Autonomous Agents. July 2002.

[25] M. Wooders. The Tiebout Hypothesis: Near Optimality in Local Public Good Economies. In Econometrica, pages 1467–1486. 1980.
