POLITECNICO DI MILANO
Corso di Laurea in Ingegneria Informatica
Dipartimento di Elettronica e Informazione
Distributed Algorithms for Learning
Balanced Partitions in
Heterogeneous Multiagent Systems
AI & R Lab
Artificial Intelligence
and Robotics Laboratory of the Politecnico di Milano
Advisor:
Prof. Andrea Bonarini
Co-Advisors:
Eng. Marcello Restelli
Eng. Enrique Munoz de Cote
Thesis Dissertation of:
Maurizio Lattuada, matricola 666971
Academic Year 2005-2006
To Mom, for everything she has given and taught me.
To Dad, because he would have been very proud of me and happy to see “ul me fioeu
diventa ingegne” (“my son become an engineer”).
Summary
Nowadays, Reinforcement Learning (RL) techniques are widely applied in many real
environments as a plain trial-and-error framework. This peculiar characteristic
makes RL techniques very tricky to apply when no model of the environment dynamics
is available. RL has also been applied to multiagent systems, where we can formalize
the different behaviors of the players as well as the interactions among them.
The aim of this thesis is to study how and why different types of agents form
coalitions (and thus cooperate) in order to satisfy a goal that has certain
characteristics. In this dissertation we formalize a new type of game in order to
study such social behaviors. We study how to formalize an environment in which such
agents act in order to reach an optimal configuration. We have obtained encouraging
results with this framework. We have also demonstrated the existence of social
cooperation among coalitions of different types of agents, and that this kind of
cooperation is needed to reach a goal.
Sommario (Extended Summary)
In this thesis we have studied Reinforcement Learning (RL) techniques in a
multiagent setting. RL aims to imitate living beings, in particular the way they
learn skills in an environment unknown to them. Through learning, an organism learns
to be autonomous and to interact optimally with the environment in which it operates
(see [17]). This field of Artificial Intelligence (AI) models this learning activity
through an agent that operates in an environment and, from this interaction,
formalized as a reinforcement signal, refines its action policy.
We have just used the term “agent”: but what is an agent? We can model an agent as
an entity that perceives the environment in which it operates through sensors and
acts in that environment through actuators. Consequently, we can identify as an
agent a robot moving in a room or a program that suitably balances the load in a
computer network.
Another fundamental characteristic that agents must have in RL concerns the
rationality of their actions. An agent is said to be rational if its actions allow
it to reach a goal, which is usually formalized as the extremization of a utility
function.
RL has been widely studied in the single-agent case and nowadays it can be
considered a mature discipline. The learning phase is completely autonomous and is
in no way subject to an initial supervision in which the agent would be instructed
with known examples. Instead, in this case we are dealing with a pure
trial-and-error paradigm, in which the agent learns how to behave on the basis of
its own past experience.
A clear and immediate extension of this setting involves several agents operating
in the same environment (multiagent systems, RL-MAS). The knowledge gained in the
single-agent case has also been employed in this extension, with suitable
modifications. Obviously, this new formalization presents greater difficulties than
the single-agent case: one must understand the interactions between the agents and
the environment and among the agents themselves. Consequently, we are dealing with a
non-stationary environment, because it is influenced by the policy of every agent.
Moreover, the interactions among the agents can be modeled in a cooperative or
competitive way in order to satisfy the goal to be reached (usually a Nash
Equilibrium, NE). In this thesis we focus on cooperative behaviors, where the goal
requires some form of cooperation among the agents in order to be satisfied. This
cooperative behavior can be induced through suitable immediate reinforcement signals
(rewards) assigned to the agents.
At this point it is natural to consider a further extension of the multiagent case
by introducing the possibility of forming coalitions among agents. Clearly, in this
case the interactions among the coalitions of agents must also be taken into
account. The reinforcement now takes on a global character, since it is assigned to
each coalition. Mechanisms are therefore needed to divide this reinforcement among
all the agents of the coalition (the best known are the core and the Shapley value,
see [16]); with these mechanisms, different importance and/or priority can be given
to particular agents.
The contribution of this thesis is twofold: first, several methods for coordinating
a set of agents have been analyzed, identifying their pros and cons in particular
environments known in the literature. Then, we have proposed a new type of games
(task allocation via coalition formation games) concerning resource allocation
problems in the presence of heterogeneous agents.
The coordination among agents analyzed here is based on the COIN (COllective
INtelligence, [23]) approach of Wolpert et al. and aims to induce forms of
cooperation among agents without needing a model of the dynamics of the world and
without the agents being able to communicate with each other, but only through the
interaction with the environment (naturally, through the immediate reinforcement
signals). This methodology has been analyzed and verified on problems known in the
literature that require cooperation among agents in order to reach a goal (in
particular, the classic grid world and Brian Arthur's Bar Problem, [24] and [6]).
The peculiarities of this approach have been highlighted and, at the same time, its
shortcomings in particular non-stationary environments have been analyzed. A major
point in favor of this approach, besides the fact that it needs no model of the
dynamics, is the availability of a global utility function (world utility), which is
used to evaluate the global behavior of the world that emerges from the individual
behaviors of the agents. With this global utility function, the immediate
reinforcement signals to be distributed to the agents are suitably computed so as to
induce a form of implicit cooperation. The analysis of this approach has revealed
some weaknesses: some concern the performance in non-stationary environments (which
is nevertheless good, but not optimal as presented in the literature), others
concern the way the immediate reinforcement signals are computed (these can be
symmetric, so convergence speeds may be suboptimal).
At this point we decided to complicate the problem further by introducing a
diversification among the agents, that is, the presence of several agents with
different roles. These different types of agents have a twofold purpose: they must
find a form of coalition, and with it they must try to reach a given goal. In the
literature, the field that studies the creation of coalitions (coalition formation)
has been studied and analyzed in its various aspects ([16]), but what we propose is
a further extension that introduces coalitions into games dealing with resource
allocation. A reward assignment methodology has been defined that pays particular
attention to computational requirements, so that each agent is able to discern the
best coalition structure to form in order to then reach the goal; if the goal has
not been reached, the agents are able to change the coalition structure autonomously
(that is, without any possibility of communication, hence without any form of
explicit coordination) and try again to reach the given goal.
The results obtained with this approach are particularly encouraging, especially on
the problem analyzed (an extension of the Bar Problem, [25]), for which several
configurations were considered in order to verify the extension of the problem that
we propose.
This study opens new fields of future research in several directions. It is
interesting to study the influence of the state space on these particular types of
problems: here an approach similar to LEAP ([2]) could be useful, in which one moves
from a space to a more compact feature space, or a different characterization of the
state space carried out by each agent in relation to the coalitions formed so far.
Another aspect to be studied concerns the definition of marginal contribution that
we propose. Since this function is closely tied to the definition of the
characteristic function of a coalition, different behaviors can be formalized with
the latter, and it is therefore necessary to study the different performance
obtained. Moreover, the marginal contribution function is mainly concerned with
assigning a reinforcement to the coalition of agents, but not with how this is then
divided among them. The marginal contribution lies at the heart of the Shapley
value. Since the latter is computationally heavy to compute, it would be interesting
to find a relation between this value and the marginal contribution function, so as
to reconstruct an approximation, or an expected future Shapley value, that could
then be used to actually divide the reinforcement among all the agents of a
coalition.
Ringraziamenti (Acknowledgments)
First of all I would like to thank my Advisor, Prof. Andrea Bonarini, for having
introduced me to this work and motivated me in it, and for the great availability
he granted me.
A gigantic thank-you goes to Marcello, Enrique and Alessandro for all the time they
devoted to me, for the countless discussions (and laughs) and above all for the
closeness they showed me during particular “extra-thesis” periods. Without their
help, and above all without their friendship, I would not have reached this
milestone. Thank you from the bottom of my heart.
I also thank the friends who accompanied me in this adventure called “Poli”
(Gabriele, Emanuele, Massimo, Bedda, il Guso, il Vince, Ciusipp, ...) and all those
with whom I shared my time in AIRLab (Simone, Daniel and his sangria, Mario, il
Lazza, Matteo, ...), exchanging technical advice on how to take apart the PRLT (to
Alessandro's delight). Of course I do not forget other friends either: Homer, Centu,
Mini, Lindi, Monfro, Passe, those of “LZD”, Lele, Dino, ...
A more than due thank-you also goes to Jessica, who had the patience and the
strength to put up with me during this thesis year, and for many other things that I
will not write here.
Finally, an enormous thank-you goes to Mom and Dad for ... for ... for everything!
To Mom, so that she may be proud of me. To Dad, because he managed to read a very
first draft of this thesis and because he so much wanted to be there ... even though
he must now be in Some other place.
Contents
Summary
Sommario (Extended Summary)
Ringraziamenti (Acknowledgments)
List of Figures
List of Tables
1 Introduction
1.1 Overview
1.2 Main Contributions
1.3 Outline of the Thesis
Part I State of the Art
2 Reinforcement Learning
2.1 Learning from Interaction
2.1.1 TD
2.1.2 Q-learning
2.2 Multi-Agent Learning
2.2.1 Change or Learn Fast
2.2.2 Change & Keep
2.2.3 Minimax Q-learning
2.2.4 Nash Q-learning and Friend or Foe Q-learning
3 COIN: COllective INtelligence
3.1 Preamble
3.2 Background
3.2.1 Artificial Intelligence and Machine Learning
3.2.2 Social Science-Inspired Systems
4 A Framework Designed for COINs
4.1 Problems with a Model-Based Approach
4.2 Nomenclature
4.2.1 Preliminary Definitions and Terminology
4.2.2 Intelligence
4.2.3 Learnability
4.3 A Descriptive Framework for COINs
4.3.1 Candidate Salient Characteristics
4.3.2 Factoredness
4.3.3 Wonderful Life Utility
4.3.4 How to Induce these Salient Characteristics?
5 Experimental Applications
5.1 Packet Routing in a Network
5.1.1 COIN for Network Routing
5.1.2 Experimental Results
5.2 Learning Sequences Of Actions
5.2.1 COIN Solution
5.2.2 Results
5.3 Bar Problem
5.3.1 Results
Part II Innovation
6 Theoretical Considerations
6.1 Class of Games
6.1.1 Matrix Games
6.1.2 Stochastic Games
6.1.3 Differences between Grid world and Bar Problem
6.2 Delayed Reward
6.3 Reward Function of the Bar Problem
6.4 Q-learning Dynamics
6.5 Introduction to Coalition Formation
6.5.1 Coalition Structure Generation
6.5.2 Optimization within a Coalition
6.5.3 Payoff Division
7 Task Allocation via Coalition Formation
7.1 Game Outline
7.1.1 Curse of the State Space Size
7.1.2 Fuzzy Games and Groups of Agents
7.2 Utility Functions of the Game
7.2.1 Reward Distribution among Agents
7.2.2 Characteristic and Reward Functions
7.3 Testbed Problem: Cooking Teams
7.3.1 Configurations
7.3.2 Reward Functions
7.3.3 State Space
8 Results
8.1 Grid world
8.1.1 First Grid
8.1.2 Second Grid
8.2 Bar Problem and its Reward Functions
8.2.1 First Bar Configuration
8.2.2 Second Bar Configuration
8.2.3 Q-learning Dynamics of the Bar Problem
8.3 Cooking Teams Problem
8.3.1 Nonempty State Space
8.3.2 Empty State Space
9 Conclusions and Future Works
9.1 Conclusions
9.2 Future Works
Bibliography
List of Figures
5.1 Network architectures (from [22])
5.2 Overall delay of the networks (from [22])
5.3 System performance with 10 agents on a 10×10 grid (from [24])
5.4 System performance with 100 agents on a 32×32 grid (from [24])
5.5 Average performance of the Bar Problem (from [6])
6.1 Exponential functions of the Bar Problem
6.2 WLU rewards
8.1 Untypical grid
8.2 Results of the untypical grid
8.3 Grid proposed by ’t Hoen and Bohte (from [18])
8.4 Results of the grid proposed by ’t Hoen and Bohte
8.5 Results of the first bar configuration with the exponential reward functions
8.6 Results of the first bar configuration with the Gaussian reward functions
8.7 Mobile mean of the WLU functions relative entropy
8.8 Mobile mean of the TG and UD utility functions relative entropy
8.9 Attendance of the first bar configuration
8.10 Results with the second bar configuration with the exponential reward functions
8.11 Results of the second bar configuration with the Gaussian reward functions
8.12 Mobile mean of the WLU functions relative entropy
8.13 Mobile mean of the TG and UD utility functions relative entropy
8.14 Attendance of the second bar configuration
8.15 Bar dynamics with 8 agents
8.16 τ for the bar dynamics with 8 agents
8.17 Bar dynamics with 8 agents and τ = 0.14
8.18 Uniform policies obtained with Π_u
8.19 Policies of agents obtained with Π_c
8.20 Bar-1 and Bar-4 with α = 0.5, ε = 0.1; the standard characteristic function of
Equation (7.5) is used to compute both the world utility and the quality of a
coalition of agents attending the bar
8.21 Bar-1 and Bar-4 with α = 0.5, ε = 0.1; the characteristic function of Equation
(7.5) is used to compute the world utility, while the characteristic functions of
Equations (8.6) and (8.7) are used to evaluate the quality of a coalition of agents
attending the bar
8.22 Q-table visits for cooks and helpers in Bar-4 with α = 0.5, ε = 0.1; the
characteristic function of Equation (7.5) is used to compute the world utility,
while the characteristic functions of Equations (8.6) and (8.7) are used to evaluate
the quality of a coalition of agents attending the bar
8.23 Bar-4 with α = 0.5, ε = 0.3 over 150,000 weeks (here we used the characteristic
functions of Equations (7.5), (8.6) and (8.7)). These graphs are an average mean
over 10 different runs
8.24 Comparison between the performance of Bar-4 (SU function) obtained using
different values of ε (0.1, 0.3, 0.5, 0.7, 0.9, 1.0), with Equation (7.5) used for
the world utility and Equations (8.6) and (8.7) used for the characteristic
function. Each experiment is a mean of 5 different runs and we plot one world
utility value every 100 values (that is this experiment was executed over 500,000
weeks)
8.25 Comparison between the performance of Bar-4 (SU function) obtained using
ε = 0.3 and high q-values, with Equation (7.5) used for the world utility and
Equations (8.6) and (8.7) used for the characteristic function
8.26 Bar-1 and Bar-4 with α_S = 0.5, α_NS = 0.1, ε = 0.1 over 150,000 weeks (here we
used the characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs
are an average mean over 10 different runs. Each agent runs the CoLF algorithm
8.27 Bar-1 and Bar-2 with α = 0.5 and ε = 0.1 over 100,000 weeks (here we used the
characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are an
average mean over 10 different runs
8.28 Bar-3 and Bar-4 with α = 0.5 and ε = 0.1 over 100,000 weeks (here we used the
characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are an
average mean over 10 different runs
List of Tables
5.1 Source–destination pairings for the three traffic loads
7.1 Optimal coalition structures for the four bar problems
7.2 Another admissible coalition structure for the Bar-2
Chapter 1 Introduction
Joyful the sound, the world goes around
From father to son, to son. . .
“Father to Son” – Queen
Contents
1.1 Overview
1.2 Main Contributions
1.3 Outline of the Thesis
1.1 Overview
In this thesis we study Reinforcement Learning (RL) techniques in a multiagent
setting. RL aims to mimic living beings, in particular the way they learn in an
uncertain environment. An organism learns to be autonomous and to interact optimally
with the environment in which it operates. This field of Artificial Intelligence
(AI) models the learning activity of an agent acting in an environment; from that
interaction the agent hones its action policy.
First, what is an agent? There is no unique definition: we can think of an agent as
an entity that perceives the environment in which it acts through sensors and acts
upon that environment through actuators. Under this general definition, a robot
moving in a room and a program that controls the load in a computer network are both
agents.
Another important characteristic of agents is their rationality. An agent is said
to be rational if its actions can achieve one of its goals; to achieve that goal, it
usually maximizes a utility function.
We have mentioned the learning phase and said that an agent learns how to interact
with a perceived environment. According to [13], a computer program is said to learn
from experience E with respect to some class of tasks T and performance measure P if
its performance at tasks in T, as measured by P, improves with experience E. So,
what does an agent learn? In an RL problem, an agent learns a policy, i.e., which
action to perform in each state in order to reach a goal by maximizing some measure
of the long-term expected payoff.
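To make the notion of a learned policy concrete, the following is a minimal sketch
of tabular Q-learning (the algorithm listed in Section 2.1.2) on a toy problem. The
five-cell corridor, the reward of 1 at the last cell, and the parameter values
(alpha, gamma, epsilon) are illustrative assumptions and are not taken from the
experiments of this thesis.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells (states 0..4).
# Actions: 0 = left, 1 = right. Entering the last cell gives reward 1 and
# ends the episode; every other step gives reward 0.
N_STATES = 5
ACTIONS = (0, 1)


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (assumed settings)."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state])  # exploit, breaking ties at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Bootstrapped target: immediate reward plus discounted best value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Policy: the highest-valued action in each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

Under these assumed settings, `greedy_policy(q_learning())` selects "right" in every
non-terminal cell, which is the policy that maximizes the discounted long-term
payoff in this toy corridor.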
RL has been widely studied in the single-agent case, which nowadays may be
considered mature. The learning phase is autonomous and is not the result of
supervised learning, where an agent is taught how to behave through known examples.
Hence, we are dealing with a plain trial-and-error framework, where an agent learns
how to act given only its past experience, without any kind of external direction.
However, at the beginning its internal parameters must be carefully tuned by a
designer. Clearly, this framework still lacks the grounding knowledge that would
make this artificial learning technique comparable with the natural one of living
beings (reuse of past knowledge in similar domains, composition of complex actions
from simple ones, ...).
A clear extension of this framework foresees the presence of more than one agent
acting in the same environment. This extension is usually known as RL MultiAgent
System (RL-MAS) and is relatively new compared with the single-agent case. The
extension is natural, since it lets us deal with large and complex problems
involving distributed reasoning, knowledge and data management. It is also more
complex: we must understand the interactions between agents and environment as well
as those among the agents themselves. As a consequence, agents face a non-stationary
environment, because it is influenced by each agent's policy, and the learning phase
becomes more and more intricate.
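The non-stationarity faced by each learner can be illustrated with a minimal
sketch. The 2x2 coordination game and the policies used below are hypothetical; they
only show that the reward one agent expects from a fixed action changes as soon as
the other agent changes its policy.

```python
# Hypothetical 2x2 coordination game: an agent receives 1 when both agents
# pick the same action and 0 otherwise.
# Rows: action of agent 1; columns: action of agent 2.
PAYOFF = [[1.0, 0.0],
          [0.0, 1.0]]


def expected_reward(action, other_policy):
    """Expected reward of agent 1 for `action`, given agent 2's mixed policy."""
    return sum(p * PAYOFF[action][other] for other, p in enumerate(other_policy))


def best_response(other_policy):
    """Action of agent 1 with the highest expected reward."""
    return max(range(2), key=lambda a: expected_reward(a, other_policy))
```

For instance, `best_response([0.9, 0.1])` is action 0, while
`best_response([0.2, 0.8])` is action 1: from the point of view of agent 1, the
environment has changed although only the policy of agent 2 did.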
In addition, agents' behavior may be formalized in a cooperative fashion or in a
selfish one, and that behavior must be compatible with the agents' main goal (which
is usually a Nash Equilibrium). In this thesis we mainly focus on cooperative
behavior: we have a goal that, in order to be fully satisfied, must be reached
cooperatively by all agents. If we allow any kind of selfish behavior, we may run
into two main situations: the Tragedy of the Commons (TOC) or a liquidity trap
([23]). The former concerns the avarice of each individual, which works to lower the
world utility, i.e., the measure of how the overall emerging behavior of the agents
is rated. The latter happens when a behavior of a subset of agents, if adopted by
all agents, results in lower values of the world utility.
Even in this case the RL approach is valid, since it focuses on a functional
treatment of goal-oriented problems. The goal is formalized as a reward signal
assigned to each agent by the environment and related to the agents' behavior.
Hence, RL is mainly based on the reward signal and on how agents use it. Through the
definition of the reward signal we can model different behaviors, especially in the
multiagent case. Through the use of the reward signal we can impose on an agent
different ways of learning an optimal action policy. An innovative approach is
COllective INtelligence (COIN), proposed by D. Wolpert and K. Tumer (see [6]), which
computes the reward to be assigned to each agent in an intuitive way (see Chapter
4). This reward computation is compared with other simple approaches (team game,
selfish utility, ...; see [6] and [24]) and is analyzed using the Q-learning
dynamics (see [20]) in order to show how COIN can induce a cooperative behavior
among agents while obtaining good performance.
Furthermore, there exists another extension of this framework: we allow a subset of
agents to form a coalition in order to reach a goal with peculiar characteristics
([16]). Once again, the RL approach is still useful in this new MAS case study.
Besides the formalization of the interactions between environment and agents, here
we must deal with the interactions among coalitions of agents in order to find a
suitable learning policy to reach a goal. Another critical point is the use of the
reward, which is assigned to each coalition of agents by the environment. The way we
split the reward signal among agents is crucial in order to assign different
priorities and/or importance to each agent belonging to a coalition. The two most
important techniques used to split the reward among agents are the Shapley value and
the core ([16]). The former relies on the order in which agents join a coalition, so
the reward of each agent depends on that joining order as well as on the joint
action. The latter focuses on the stability of a coalition: no further coalition
change occurs, because no agent can achieve more by changing its policy.
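As an illustration of these two payoff-division concepts, the sketch below computes
the Shapley value in its standard textbook form (the average marginal contribution
of an agent over all joining orders) and checks whether a payoff vector lies in the
core (the total value is fully distributed and no group of agents could obtain more
on its own). The three agents and the values in the table are hypothetical and do
not correspond to the characteristic functions defined later in this thesis.

```python
from itertools import combinations, permutations

# Hypothetical characteristic function for three agents (illustrative
# numbers only): the value obtained by each possible coalition.
AGENTS = ["a1", "a2", "a3"]
VALUES = {
    frozenset(): 0.0,
    frozenset({"a1"}): 2.0,
    frozenset({"a2"}): 0.0,
    frozenset({"a3"}): 0.0,
    frozenset({"a1", "a2"}): 6.0,
    frozenset({"a1", "a3"}): 6.0,
    frozenset({"a2", "a3"}): 0.0,
    frozenset({"a1", "a2", "a3"}): 8.0,
}


def v(coalition):
    """Value of a coalition, read from the hypothetical table above."""
    return VALUES[frozenset(coalition)]


def shapley_values(agents, value):
    """Average each agent's marginal contribution over all joining orders."""
    totals = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition = frozenset()
        for agent in order:
            joined = coalition | {agent}
            totals[agent] += value(joined) - value(coalition)
            coalition = joined
    return {a: totals[a] / len(orders) for a in agents}


def in_core(agents, value, payoff, tol=1e-9):
    """True if the payoff is efficient and no sub-coalition is underpaid."""
    if abs(sum(payoff[a] for a in agents) - value(agents)) > tol:
        return False
    for size in range(1, len(agents)):
        for group in combinations(agents, size):
            if sum(payoff[a] for a in group) < value(group) - tol:
                return False
    return True
```

With this table, `shapley_values(AGENTS, v)` assigns 16/3 to `a1` and 4/3 to each
of `a2` and `a3`, and this division lies in the core, whereas an equal split of 8/3
each does not, because `a1` and `a2` together could obtain 6 on their own. Note that
the enumeration over all joining orders grows factorially with the number of agents,
which is why exact computation quickly becomes expensive.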
1.2 Main Contributions
The aim of this thesis is twofold. First, we analyze different methods used to
coordinate a set of agents and study their pros and cons in particular environments
known in the literature. Next, we propose a new type of games (task allocation via
coalition formation games), which involve task allocation problems with a set of
heterogeneous agents. Plain task allocation problems focus on how to assign a number
of tasks to a (large) number of agents with a suitable partition ([6] and [19]).
Coalition formation games ([16]) focus on games where we have a set of coalitions of
agents, and each coalition may be seen as a super-agent acting in an environment.
Each coalition gains a reward based on the joining order of agents in that coalition
as well as on the joint action. Furthermore, the reward must be distributed among
the agents belonging to the coalition. The new class of games proposed here takes
the most significant features of coalition formation games and task allocation games
in order to formalize different real situations, in which different types of agents
face a set of tasks that must be executed with some given balance of agent types.
The way the agents learn to coordinate themselves induces the overall behavior and,
given that behavior, we can evaluate how these agents have acted.
By adopting this new kind of game we have tackled a problem well known in the
literature (the Cooking Teams Problem, [25]) and we show how these games formalize
agents' behavior under different configurations of the environment and of the
agents. Here we have found interesting results about the curse of the state space
size, and we explain why this game does not work well with every kind of state
information under different configurations of agents and environment.
1.3 Outline of the Thesis
This thesis is organized as follows:
Chapter 2: we give a brief introduction to RL, from both the single-agent and the
multiagent viewpoints. We describe the most popular algorithms and show how
they were adapted to the multiagent case. Furthermore, we describe how some
concepts of Game Theory are used in a multiagent environment, together with
the pros and cons of the solutions found.
Chapter 3: we give an overview of the COllective INtelligence (COIN) theory and
its theoretical roots in different scientific fields, in order to understand
how it is structured and what its key underlying idea is.
Chapter 4: we analyze COIN in depth, together with the kind of problems it is
designed to solve. A key aspect of this theory is that it avoids building a
model of the dynamics of the environment in order to find a suitable solution,
that is, a way to induce in the agents acting in that environment a behavior
that satisfies a goal. Instead, with this backward approach, given an
environment, a set of agents and a goal, we are able to find a convenient
solution to the learning problem without building a dynamic model of the
interactions among agents. First we show some useful functions used to measure
the quality of learning, then we describe the desired characteristics of such
functions in order to obtain good, inexpensive and reusable solutions to the
learning problem.
Chapter 5 : we briefly describe different environments where COIN has been
applied, and we show how it has been applied, the results obtained and the
key factors of that theory.
Chapter 6 : we introduce different challenging problems of RL related to COIN
and to the examples presented in Chapter 5. We motivate these difficulties
using the Q-learning dynamics. Finally we introduce the coalition formation
approach that will be extended in Chapter 7.
Chapter 7 : we introduce a new kind of games involving both coalition formation
games and task allocation ones. This new class of games takes the most
significant features from task allocation and coalition formation games in
order to formalize different real environments, where we have different types
of agents undertaking tasks.
Chapter 8 : we explain the results obtained with the experiments proposed in
Chapter 5 and how COIN was changed in order to improve agents’ perfor-
mance. Furthermore we propose the results obtained with the new class of
games in a well known problem in literature (Cooking Teams problem, [25]).
Chapter 9 : we discuss the work developed in this dissertation and we present
some future directions that can be further studied starting from this
thesis.
Part I
State of the Art
Chapter 2 Reinforcement Learning
Reinforcement learning is learning what to do
R. Sutton, A. Barto
Contents
2.1 Learning from Interaction
2.1.1 TD
2.1.2 Q-learning
2.2 Multi-Agent Learning
2.2.1 Change or Learn Fast
2.2.2 Change & Keep
2.2.3 Minimax Q-learning
2.2.4 Nash Q-learning and Friend or Foe Q-learning
As reported in [17], with reinforcement learning (RL for short) a generic agent
(hardware or software) discovers which actions to execute by trying them. It is
interesting to note that in most cases the actions taken may affect not only the
immediate reward, but also the future ones. These characteristics (trial-and-error
and delayed reward) are the most significant of the reinforcement learning problem
formulation.
Reinforcement Learning has become a standard for studying how agents can
learn a course of actions when they act in an unknown or merely uncertain environment.
Notice that reinforcement learning is completely different from supervised
learning: in the latter case we have a set of pre-classified examples given by a
supervisor that is used by the learner to build an approximate relationship between
the elements and the results obtained from that set.
In the next sections we briefly introduce these facets of Artificial Intelligence
explaining some methodologies applied in this thesis. A deeper presentation may
be found in [15] and [17].
2.1 Learning from Interaction
In many problems we may not have a deeper knowledge of the environment where
an agent acts. The aim of reinforcement learning is to build an action policy based
on one of the following two methodologies:
model based : with these methods agents use their past experience to build
a model of the environment where they act, which allows them to construct an
approximation of the state transition and reward functions.
model free : these methods try to learn value functions directly from the reward
signal without building any kind of model of the environment.
Usually, the model free methods aim to estimate the expected value of the
reward signal using a formula like
$$X_k = X_{k-1} + \alpha_k \cdot (x_k - X_{k-1}), \qquad (2.1)$$
where $X_k$ is the new estimated value, which includes information from $k$ samples
$x_k$ of a random variable, while $\alpha_k \in [0, 1]$ is the learning rate parameter. Given
some particular values of αk we obtain different update rules from Equation (2.1):
• with $\alpha_k = 1/k$ we have the sample mean of the instances $\{x_1, x_2, \dots, x_k\}$;
• with $\alpha_k = \alpha$, $0 < \alpha < 1$, we have a weighted mean that favors recent values of $x_k$.
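As a minimal sketch of Equation (2.1), the two choices of the learning rate can be compared on a short list of samples. The function name `incremental_estimate`, the initial value X_0 = 0 and the sample values are illustrative assumptions, not taken from the thesis.

```python
def incremental_estimate(samples, alpha=None):
    """Apply X_k = X_{k-1} + alpha_k * (x_k - X_{k-1}) over all samples.

    alpha=None uses alpha_k = 1/k (sample mean); a float in (0, 1) uses a
    constant rate, which weights recent samples more heavily.
    X_0 = 0 is an assumption of this sketch.
    """
    estimate = 0.0
    for k, x in enumerate(samples, start=1):
        alpha_k = 1.0 / k if alpha is None else alpha
        estimate += alpha_k * (x - estimate)
    return estimate


samples = [2.0, 4.0, 6.0, 8.0]
mean_estimate = incremental_estimate(samples)         # alpha_k = 1/k
recent_estimate = incremental_estimate(samples, 0.5)  # constant alpha = 0.5
```

With these samples the first call reproduces the arithmetic mean (5.0), while the constant rate gives 6.125, closer to the most recent values.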
In RL many algorithms exist working in different ways. Some of the most used
algorithms belong to the class of Temporal Difference (TD) algorithms. They are
classified as on policy and off policy : the former evaluate and try to improve the
same policy used in the learning stage and in the control stage (e.g. SARSA), while
the latter typically use two different policies for the learning step and the control
step (e.g. Q-learning). Many other methodologies in this wide area of artificial
intelligence may be found in [15] and [17].
2.1.1 TD
These methods learn directly from past experience without any kind of model of the
environment (model free methods). The easiest method is TD(0) which updates
the value of function V (state) with the following update rule:
$$V(s_t) = V(s_t) + \alpha \left[ r_{t+1} + \gamma \cdot V(s_{t+1}) - V(s_t) \right], \qquad (2.2)$$
where $r_{t+1}$ is the reward obtained, $s_t$ is the current state at time step $t$, $\alpha$ is the
learning rate and $\gamma$ is the discount factor. The agent chooses an action $a_t$ based on
its policy $\pi(s_t)$ and, after executing it, finds itself in state $s_{t+1}$; it then updates
the value of function $V(\cdot)$ as stated in Equation (2.2).
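The update of Equation (2.2) can be sketched in a few lines. This is only an illustration: the table representation (a dictionary defaulting to 0), the state labels and the parameter values are assumptions.

```python
from collections import defaultdict


def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: V(s) = V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    td_error = r_next + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error


V = defaultdict(float)            # unseen states start at 0 (assumption)
error = td0_update(V, "A", 1.0, "B")
```

After this single step from state "A" to state "B" with reward 1, the temporal difference error is 1.0 and V["A"] becomes 0.1.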
2.1.2 Q-learning
One of the most important and most used methods in RL is the Q-learning
algorithm: given an action $a_t$ based on the policy $\pi(s_t)$ and a reward $r_{t+1}$
obtained in $s_{t+1}$ after executing that action, this algorithm updates the function
Q(state, action) as follows:
$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \cdot \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \qquad (2.3)$$
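A minimal tabular sketch of the update in Equation (2.3) follows. The dictionary-based table, the action labels and the parameter values are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step using the greedy value of the next state."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r_next + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


ACTIONS = ("left", "right")        # hypothetical action set
Q = defaultdict(float)             # unseen pairs start at 0 (assumption)
Q[("s1", "right")] = 2.0
new_q = q_learning_update(Q, "s0", "left", 1.0, "s1", ACTIONS)
```

Here the target is 1 + 0.9 · 2 = 2.8, so the updated entry is 0.1 · 2.8 = 0.28. The maximization over the next actions is what makes the method off policy: the update does not depend on the action actually executed in the next state.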
2.2 Multi-Agent Learning
The algorithms sketched above are widely used in systems where only one agent
operates. The Q-learning algorithm has also been applied in environments
where many agents act, with some underlying changes to allow cooperation
between them rather than operating in a self-interested fashion.
First of all it is important to note that multiagent environments are intrinsi-
cally non-stationary, because the agents are learning, so their policies are changing.
These non-stationary changes cannot be foreseen by other agents and related pay-
offs may be misleading, thus negatively affecting cooperation.
Some learning algorithms for multiagent systems focus on each single agent’s
behavior to find some admissible policies leading it to an equilibrium (typically the
Nash equilibrium). We point out that these algorithms may impose strong con-
ditions to converge and, with the presence of some other conditions (e.g. there is
more than one Nash equilibrium), we need some coordination mechanisms, treated
by other theories rather than reinforcement learning. Some algorithms more widely
studied and applied are minimax Q-learning (minimax-Q) in [7], Nash Q-learning
(Nash-Q) in [8] and [12], Friend or Foe Q-learning (FoF-Q) and Correlated Equi-
librium Q-learning (CE-Q) in [9].
Other algorithms focus on maximizing the reward obtained by an agent acting
in an environment, assuming that its actions have no side effect on
other agents. As a consequence, an agent learns a policy that fits actions of other
agents. Some algorithms are Infinitesimal Gradient Ascent (IGA), Win or Learn
Fast Gradient Ascent (WoLF-IGA) and Win or Learn Fast Policy Hill Climbing
(WoLF-PHC ), all discussed in [4].
Many other algorithms learn an optimal policy for an agent focusing on co-
operation among other agents acting in the same environment. These algorithms
may require more or less strong conditions to hold like knowledge about actions
taken by other agents. In this class we include Change or Learn Fast (CoLF )
and Change & Keep (CK ) (see [1]), Independent Learner (IL) and Joint Action
Learner (JAL) in [5], Distributed Q-learning in [11].
In the following sections we briefly explain some interesting algorithms men-
tioned before.
2.2.1 Change or Learn Fast
The CoLF algorithm [1] suggests a variable learning rate to learn quickly while the
agent is losing and slowly while the agent is winning. To improve learning, these
different learning rates were proposed to foster cooperation: if an agent receives
an unexpectedly changed payoff, then it learns slowly, otherwise it learns fast.
For each pair 〈state, action〉 (which is the argument of function Q(·, ·)) we
calculate P and S-values. P -values are exponential averages of the collected payoffs
with weight factor λ (λ ∈ (0, 1)), while S-values are exponential averages of the
absolute differences between the current payoff and the respective P -value. This
algorithm uses two different learning rates αNS (payoffs have rapid variations)
and αS (agents have nearly stationary policies) such that αNS < αS. The choice of
which learning rate must be used in the update phase of Q(s, a) depends on whether
the absolute difference between the current payoff and the respective P -value is
greater than the respective S-value.
P-values are an estimate of the expected payoffs and S-values are an estimate of
their variability. If the actual payoff is close to the expected one, then we assume
the environment is sufficiently stationary and we update the Q-values with a high
learning rate (αS). On the other hand, when the current payoff differs markedly
from the associated P-value we use αNS to update the Q-values in order to
reduce the effects of the non-stationarity of the environment.
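The rule described above can be sketched as follows. This is a hedged illustration, not the implementation of [1]: the class name `CoLFLearner`, the zero initialization, the order of the updates (the payoff is compared with P and S before they are refreshed), the Q-learning style target and all parameter values are assumptions.

```python
from collections import defaultdict


class CoLFLearner:
    """Sketch of a CoLF-style learner with two learning rates."""

    def __init__(self, actions, alpha_ns=0.01, alpha_s=0.1, lam=0.1, gamma=0.9):
        assert alpha_ns < alpha_s
        self.actions = actions
        self.alpha_ns, self.alpha_s = alpha_ns, alpha_s
        self.lam, self.gamma = lam, gamma
        self.Q = defaultdict(float)
        self.P = defaultdict(float)  # exponential average of payoffs
        self.S = defaultdict(float)  # exponential average of |payoff - P|

    def update(self, s, a, r, s_next):
        key = (s, a)
        deviation = abs(r - self.P[key])
        # Unexpected payoff: treat the environment as non-stationary
        # and learn slowly; otherwise learn fast.
        alpha = self.alpha_ns if deviation > self.S[key] else self.alpha_s
        best_next = max(self.Q[(s_next, b)] for b in self.actions)
        self.Q[key] += alpha * (r + self.gamma * best_next - self.Q[key])
        # Refresh the estimates of variability and expected payoff.
        self.S[key] += self.lam * (deviation - self.S[key])
        self.P[key] += self.lam * (r - self.P[key])
        return alpha


learner = CoLFLearner(actions=("c", "d"))
first_rate = learner.update("s", "c", 1.0, "s")   # surprising payoff
second_rate = learner.update("s", "c", 0.1, "s")  # payoff matches P
```

In this run the first payoff deviates from the (zero) P-value, so the slow rate αNS is used; the second payoff equals the current P-value, so the fast rate αS is used.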
2.2.2 Change & Keep
The CK algorithm (see [1]) is based on the following plain observation: when an
agent chooses a different action due to either learning or exploration, it typically
obtains an uninformative payoff. These misleading payoffs cannot be foreseen by
other agents and consequently their payoffs may be deceptive, thus negatively
affecting cooperation.
The idea proposed is to discard the payoff received immediately after a change
in the action selection, so the update of the Q-value is suspended. The agent
repeats the same action and only then is the related Q-value updated. This
temporary suspension of the update phase gives time for the other agents to react
to its new action, thus having a more informative payoff for the update of the
related Q-value.
This behavior can be described by a simple finite state machine: starting
from state sC, as long as an agent selects the same action it remains in that state,
where it updates the associated Q-value and chooses a new action according to
its strategy. When the selected action changes, the agent moves to state sK, where it
suspends the update phase, selects the same action again, updates the related
Q-value and then returns to sC.
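The finite state machine can be sketched as a small helper that tells the learner whether the payoff of the current step should be used. The class and method names are hypothetical, and the sketch assumes that the caller repeats the last action whenever `must_repeat()` is true.

```python
class ChangeAndKeep:
    """Two-state machine: 'C' (normal operation) and 'K' (keep the action)."""

    def __init__(self):
        self.state = "C"
        self.last_action = None

    def must_repeat(self):
        """True when the next action must be the same as the last one."""
        return self.state == "K"

    def step(self, action):
        """Register the executed action; return True if Q should be updated."""
        if self.state == "K":
            # Second play of the new action: the payoff is informative again.
            self.state = "C"
            self.last_action = action
            return True
        changed = self.last_action is not None and action != self.last_action
        self.last_action = action
        if changed:
            # Discard this payoff and repeat the action at the next step.
            self.state = "K"
            return False
        return True


ck = ChangeAndKeep()
update_flags = [ck.step(a) for a in ["x", "x", "y", "y", "y"]]
```

For this action sequence only the payoff received right after switching from "x" to "y" is discarded, so `update_flags` is `[True, True, False, True, True]`.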
2.2.3 Minimax Q-learning
The Q-learning algorithm converges for any learning rate α that satisfies the usual
decay conditions (of the kind listed in Section 2.2.4), but the values chosen for
that factor may influence the learning speed. This algorithm converges
to the true value of the pair 〈state, action〉 with a continuous refinement of
estimates given by payoffs and by current estimates.
In zero-sum games with two players (they are usually identified as max and
min), the payoff associated to a state st is related to the best action chosen by
the other player in state st+1 (see footnote 1). The estimate of the Q function is computed in a
different way depending on the player: max player computes mina′ Q(st+1, a′) in
the update equation of the Q function, while min player computes maxa′ Q(st+1, a′)
in the same equation.
When two players execute at the same time their actions we have a different
definition of the Q function:
$$V(s) = \max_{a \in A} \min_{o \in O} Q(s, a, o), \qquad (2.4)$$
where a indicates the action chosen by an agent, while o is the action chosen by
the opponent.
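Equation (2.4) can be evaluated directly on a small table, as in the sketch below. The table contents and the names are illustrative assumptions. The sketch restricts both players to pure actions, exactly as the equation is written; the original minimax-Q algorithm maximizes over mixed strategies, which requires solving a linear program and is not shown here.

```python
def minimax_value(Q, s, actions, opponent_actions):
    """V(s) = max over a of min over o of Q(s, a, o), pure actions only."""
    return max(
        min(Q[(s, a, o)] for o in opponent_actions)
        for a in actions
    )


# Hypothetical Q-values for a single state "s".
Q = {
    ("s", "a1", "o1"): 3.0, ("s", "a1", "o2"): -1.0,
    ("s", "a2", "o1"): 1.0, ("s", "a2", "o2"): 2.0,
}
value = minimax_value(Q, "s", ("a1", "a2"), ("o1", "o2"))
```

Action a1 guarantees only -1 against the worst reply, while a2 guarantees 1, so the value of the state is 1.0.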
Footnote 1: This is due to the term maxa′ Q(st+1, a′) of Equation (2.3), since that
player in turn chooses its action to execute.
2.2.4 Nash Q-learning and Friend or Foe Q-learning
The Nash-Q algorithm is closely related to the previous one, but now each agent
keeps a copy of the Q function of the other agents: as a consequence it will be defined as
$Q(s, a^1, a^2, \dots, a^n)$, where $a^i$ indicates the action of the i-th agent. There must be
one and only one Nash equilibrium in order to guarantee the convergence of the algorithm:
this is a strong condition, since in non-zero-sum games there may exist more
than one Nash equilibrium. In that case some coordination mechanism must exist
in order to have all agents reach the same equilibrium.
In [8] the authors prove the conditions which must hold in order to have con-
vergence in Markovian non zero-sum games:
• each state must be visited an infinite number of times;
• the learning rate α must satisfy the following hypotheses:
– $0 \le \alpha_t(s, a^1, \dots, a^n) \le 1$, $\sum_{t=0}^{\infty} \alpha_t(s, a^1, \dots, a^n) = \infty$, $\sum_{t=0}^{\infty} [\alpha_t(s, a^1, \dots, a^n)]^2 < \infty$;
– $\alpha_t(s, a^1, \dots, a^n) = 0$ if $(s, a^1, \dots, a^n) \neq (s, a^1_t, \dots, a^n_t)$.
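As an illustration of learning rates of this kind, one common choice is the reciprocal of the number of visits to the entry being updated, with rate zero for every other entry: the harmonic series diverges while the series of its squares converges, provided the entry is visited infinitely often. The sketch below is an assumption made for illustration, not the schedule used in [8]; all names are hypothetical.

```python
from collections import defaultdict

visit_counts = defaultdict(int)   # visits per (state, joint action) entry


def learning_rate(entry, visited_entry):
    """Return alpha_t for `entry` when `visited_entry` is the one just played."""
    if entry != visited_entry:
        return 0.0                # only the visited entry is updated
    visit_counts[entry] += 1
    return 1.0 / visit_counts[entry]


played = ("s", "a", "b")          # hypothetical state and joint action
rates = [learning_rate(played, played) for _ in range(3)]
other_rate = learning_rate(("s", "b", "b"), played)
```

The three consecutive visits give the rates 1, 1/2 and 1/3, and the entry that was not played gets rate 0.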
Bowling has studied this kind of games [3] and he has refuted the previous
theorem by giving a counterexample; he added some stronger conditions which must
hold on the initial values of the Q function to guarantee the convergence of this
algorithm.
The algorithm Friend or Foe Q-learning proposed by Littman overcomes some
limitations introduced by the previous one. The convergence of the Nash-Q algo-
rithm is guaranteed towards a Nash equilibrium if a Nash equilibrium exists for
the opponents and a coordination equilibrium exists, both defined for all games
associated to the Q functions seen in the learning phase. This condition implies a
priori knowledge of which equilibrium an agent may reach: if an agent is friendly
(see the adjective “friend” in the name of this algorithm), it reaches a coordina-
tion equilibrium (then it applies the classic Q-learning algorithm), otherwise (“foe”
case) it reaches an equilibrium of the opponents (then it applies the minimax-Q
algorithm).
Chapter 3 COIN: COllective INtelligence
The rich behavior of social insect colonies arises not from
the sophistication of any individual entity in the colony,
but from the interactions among those entities
D.H. Wolpert, K. Tumer
Contents
3.1 Preamble
3.2 Background
3.2.1 Artificial Intelligence and Machine Learning
3.2.2 Social Science-Inspired Systems
3.1 Preamble
In the last decade two fields of computer science have been studied in depth
whose intersection may prove very fruitful. The first concerns ways to adaptively
control distributed systems that have little (if any) centralized communication,
with minimal knowledge of their dynamic behavior. The second is Reinforcement
Learning (RL), a branch of machine learning concerned with an agent acting in an
environment from which it receives rewards evaluating its behavior (see Chapter
2, [15] and [17]). The goal of an RL algorithm is to define how, using those rewards,
an agent may update its policy to maximize its utility.
We might hope that RL may be used in the control scenario introduced above,
since RL is adaptive and is not restricted to a particular domain. However,
already in the single agent case, an agent acting in a generic environment must face
computational limitations, since the policy space may be too large. We might
introduce more agents (recall that each agent has its own utility function), each
controlling only part of the system. Unfortunately, in doing so we have implicitly
introduced a new global reward function (from now on we refer to it as world utility)
evaluating the overall behavior emerging from the environment. All agents have their
private utility: how can we map the world utility into the private utilities of the
agents? Moreover, we need a criterion for choosing the utility function of each agent
such that, while each agent maximizes its own utility function, the overall behavior
increases the world utility.
With the term COllective INtelligence (COIN) we refer to any pair consisting of a large,
distributed collection of interacting computational processes, among which there is
little to no centralized communication or control, and a world utility function that
rates the possible dynamic histories of the collection. If each process uses an RL
algorithm, we are interested in studying how to set the utility functions of each
RL algorithm so as to achieve high values of the world utility without having any prior
knowledge of the dynamics nor any model of the system.
3.2 Background
There are many computationally distributed systems where restrictions on central-
ized communication or on the central controller may exist. Moreover, the controller
may be uncertain about what kind of algorithm it may use for the control phase.
Just a few of the potential examples include:
• vehicular traffic control;
• control system for routing over a communication network;
• control system for a team of planetary exploration rovers.
These systems may be controlled with an artificial COIN, although COINs may reach
beyond these engineered fields. The COIN design problem (that is, how to induce a
cooperative behavior among a set of self-interested agents) is an inverse problem,
whereas most of the related scientific fields are concerned with systems which are best
characterized by a "forward problem": given the dynamic laws of each
single part of a system, determine its overall behavior. We instead want a way
to configure each dynamic law so as to induce an expected global behavior.
As an example we may consider a generic country with capitalist economy
where the world utility is a mean of the gross domestic product (GDP), while the
reward functions for agents may be related to the achievements of their private
goal.
In general, to achieve high world utility values in a COIN, we must prevent agents,
which do not necessarily have common goals (self-interested agents), from working
at cross-purposes. Otherwise the system may suffer from the economic phenomenon
known as the Tragedy of the Commons (TOC), where the avarice of each individual works
to lower world utility. Another undesired phenomenon is the liquidity trap, in which
a subset of agents behaving in a certain manner helps the world utility, but this
behavior, if employed by all agents, results in lower values of the world utility.
To have a clear viewpoint, this is what we mean by COIN:
1. there are many processes running concurrently, performing actions which
affect one another;
2. there is little (if any) centralized communication among processes, but we do
not prohibit a broadcast communication started by a single process;
3. there is little to no centralized personalized control, but we do not prohibit
the communication of a single control signal to all other processes;
4. there is a well specified task in the form of extremizing a utility function that
is related with the behavior of the overall distributed system.
The following elements distinguish the typical approach to COIN:
• the approach satisfies (4) and is scalable to very large numbers of processes;
• the approach for tackling (4) is widely applicable, since it works with little
(if any) broadcasting communication as stated in (2) and (3). Moreover it
is adaptive and robust and it doesn't need a deeper knowledge about how the
system is formalized;
• each individual process is implemented as an RL algorithm (but this is not a
necessity).
3.2.1 Artificial Intelligence and Machine Learning
There is an extensive body of work in AI and Machine Learning related to COIN: in
the following subsections we explain how it can be applied to approach COIN.
Reinforcement Learning
In RL (see Chapter 2) we find some interesting features suited for any distributed
environment where there is not a primary controller nor a model of that system used
to learn strategies by agents. Rather, an agent must successfully learn strategies
based on the rewards it receives from the environment. Typical RL algorithms such as
TD(λ) (which uses value functions) and Q-learning (which uses an evaluation of the
〈state, action〉 pair) have been investigated and applied in real environments.
These features may appear suitable to COIN. Unfortunately, a single RL algorithm
will not in general perform well on large distributed heterogeneous problems,
because the policy-action space is very large. Usually one should try many RL
algorithms rather than one and check their performance in order to choose the
appropriate one.
Distributed Artificial Intelligence
This field is essentially a natural extension of AI, where tasks have migrated to-
wards parallel implementation, so we have different modules each one with different
tasks concurrently working towards a common goal. To do this we have to guaran-
tee that the task to accomplish will be well modularized to improve convergence.
As a consequence we need a central controller scheduling various sub-tasks and
processing the associated results.
Despite this evolution, distributed artificial intelligence focuses on the traditional
AI ideas (reasoning, understanding, planning, learning, . . . ) rather than on their
cumulative character.
Multi-Agent Systems
This field is concerned with interactions among members of a set of agents as well
as the way they act. The design of a multi-agent system with a central coordinator
may involve:
• decomposing a global task into tractable sub-tasks for each agent;
• establishing communication channels that provide a minimal amount of in-
formation to each agent to enable the execution of that sub-tasks;
• coordinating agents in such a way to guarantee cooperation towards the
global task avoiding any kind of conflicting strategies.
In point of fact, agents act selfishly (each one may have many utility functions)
and we need to provide incentives to each agent to improve cooperation in order to
avoid the TOC. In this instance we may use coordination, negotiations, coalition
formation or contracting. Unfortunately these approaches completely forget the
optimization of the system at the expense of scalability and reliability.
3.2.2 Social Science-Inspired Systems
Some human economies provide examples of naturally occurring systems that may be
characterized as COINs. They consider the extremization of a constrained world utility
where there are strong conditions on agents and their interactions.
In this section we summarize two economic concepts related to COIN, in that
they deal with how a large number of agents can cooperate.
Mechanism Design
Mechanism design is concerned with the incentives that must be given to a set
of agents interacting with each other. Usually these incentives induce Pareto optimal
(PO) joint actions, where no agent can do better without hurting another agent.
One important scheme used as an incentive is the auction, which is applied when there
are many agents interacting in an environment exchanging goods. All auctions
pursue the same goal: matching supply and demand of goods. A mechanism such as an
auction that induces PO does not necessarily extremize the world utility function.
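Pareto optimality of a joint action can be checked by brute force on a small payoff table, as in the following sketch. The payoff values (a standard prisoner's dilemma) and the function name are illustrative assumptions.

```python
def is_pareto_optimal(joint_action, payoffs):
    """True if no other joint action helps some agent without hurting another.

    `payoffs` maps each joint action to a tuple with one utility per agent.
    """
    current = payoffs[joint_action]
    for other, candidate in payoffs.items():
        if other == joint_action:
            continue
        no_one_worse = all(c >= u for c, u in zip(candidate, current))
        someone_better = any(c > u for c, u in zip(candidate, current))
        if no_one_worse and someone_better:
            return False
    return True


# Hypothetical prisoner's dilemma: C = cooperate, D = defect.
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
pareto_set = [ja for ja in payoffs if is_pareto_optimal(ja, payoffs)]
```

In this table mutual defection is the only joint action that is not Pareto optimal, because mutual cooperation improves the payoff of both agents.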
Perfect Rationality Noncooperative Game Theory
The simplest form of a game foresees the existence of two or more agents, each of
which has a set of possible actions it can perform and a utility function (also known
as payoff matrix for finite games) mapping any joint action chosen to an associated
utility value for agent i, i.e.: Ai → R.
There are many variants, both in the action selection phase and in the strategy
selection phase. The former is related to extensive form games (each agent in turn
selects its action to be executed) and the latter concerns the action that must be
chosen given the state of the environment (it may be deterministic or stochastic).
A solution of a game (also called equilibrium) is a profile in which every agent
behaves rationally. With a Nash equilibrium (NE) we have a configuration where
each agent chooses the best strategy given the strategies of the other agents. As a
consequence, if all agents have found a NE, they have no incentive to leave
that equilibrium. A game may have zero, one or many NE in the pure-strategy
space, while in the mixed-strategy space (there is a probability distribution
associated to each strategy) Nash's theorem always guarantees the existence
of at least one NE.
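For small two-player games the pure-strategy NE can be enumerated by checking every joint action for profitable unilateral deviations. The sketch below does this; the function name and the two payoff tables (a prisoner's dilemma and matching pennies) are illustrative assumptions.

```python
def pure_nash_equilibria(payoffs, actions_1, actions_2):
    """Return the joint actions from which no player gains by deviating alone."""
    equilibria = []
    for a1 in actions_1:
        for a2 in actions_2:
            u1, u2 = payoffs[(a1, a2)]
            best_1 = all(payoffs[(b, a2)][0] <= u1 for b in actions_1)
            best_2 = all(payoffs[(a1, b)][1] <= u2 for b in actions_2)
            if best_1 and best_2:
                equilibria.append((a1, a2))
    return equilibria


prisoners_dilemma = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
matching_pennies = {
    ("H", "H"): (1, -1), ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1), ("T", "T"): (1, -1),
}
pd_equilibria = pure_nash_equilibria(prisoners_dilemma, "CD", "CD")
mp_equilibria = pure_nash_equilibria(matching_pennies, "HT", "HT")
```

The prisoner's dilemma has a single pure NE, mutual defection, while matching pennies has none in pure strategies, which illustrates the first part of the statement above.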
In cooperative game theory all agents are able to enter binding contracts
with each other, so they coordinate their strategies. In this way, agents avoid NE that
are not PO.
Chapter 4 A Framework Designed for COINs
Do not worry about your difficulties in Mathematics.
I can assure you mine are still greater.
Albert Einstein (1879-1955)
Contents
4.1 Problems with a Model-Based Approach
4.2 Nomenclature
4.2.1 Preliminary Definitions and Terminology
4.2.2 Intelligence
4.2.3 Learnability
4.3 A Descriptive Framework for COINs
4.3.1 Candidate Salient Characteristics
4.3.2 Factoredness
4.3.3 Wonderful Life Utility
4.3.4 How to Induce these Salient Characteristics?
In Chapter 3 we saw that changing an existing scientific field to encompass
systems meeting all of the requirements of COINs can be very hard. In this chapter
we motivate a new framework designed to analyze COINs, illustrating both the
nomenclature used and the basic mathematical theory (see [23]).
4.1 Problems with a Model-Based Approach
The most natural approach used to build a COIN involves the following steps:
1. we build a stochastic model of the COIN’s dynamics parametrized by a vector
ϑ (it can contain parameters used by microlearners, the world and local
utilities, . . .);
2. we solve the function f(ϑ) which maps the parameters of that model in the
resulting stochastic dynamics;
3. we wish to have a high expected value of a generic world utility wu;
4. finally we aim to solve the inverse problem, i.e. we would have to search the
ϑ that, via f(·), results in a high value of E(wu|ϑ).
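The four steps can be illustrated on a toy problem in which the parameter is a single number and the expectation is estimated by simulation. Everything in this sketch is a hypothetical stand-in: the quadratic utility, the noise level, the candidate grid and the function names are assumptions chosen only to make the inverse problem concrete.

```python
import random


def simulate_world_utility(theta, rng):
    """Toy stand-in for the stochastic dynamics f(theta): one noisy run."""
    return -(theta - 2.0) ** 2 + rng.gauss(0.0, 0.1)


def expected_world_utility(theta, n_runs=200, seed=0):
    """Monte Carlo estimate of E(wu | theta) from repeated simulated runs."""
    rng = random.Random(seed)
    total = sum(simulate_world_utility(theta, rng) for _ in range(n_runs))
    return total / n_runs


# Inverse problem: search the candidate parameters for the best estimate.
candidates = [0.0, 1.0, 2.0, 3.0, 4.0]
best_theta = max(candidates, key=expected_world_utility)
```

Even in this one-dimensional toy the search needs many simulated runs per candidate, which hints at why the approach becomes impractical for large systems.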
Now we examine some of the challenges of the present approach:
• we are mainly interested in very large, complex, noisy systems which often
operate in a non-stationary environment composed of many microlearners
running simultaneously. Building a detailed model of such a system will be
very difficult; in any case, if we had such a model it would be very complicated
and hard to use;
• even for a simpler model, some difficulties may arise when deriving
the function f(·);
• even if we have the function f(·), the inverse problem may be impossible to
solve in practice;
• in addition to these difficulties, we wish to have a high level model which
allows us to change the microlearning algorithm of each agent without having
to redo the entire model each time.
There is an alternative approach which avoids these difficulties by setting up a small
model at a higher level that has little to do with the dynamics: it identifies salient
characteristics such that, if a COIN has them, its world utility will benefit.
Of course, these salient characteristics must be easy to induce in a COIN.
4.2 Nomenclature
In this section we concentrate on the four salient characteristics of COIN:
intelligence : it quantifies how well a microlearner learns and performs;
learnability : it is a characteristic of a utility function that we would expect
to be well-correlated with how well a microlearning algorithm can learn to
optimize it;
factoredness : a utility function is factored if whenever its value increases, then
the overall system will benefit;
Wonderful Life Utility (WLU) : it is an example of a utility function that is
both factored and learnable.
4.2.1 Preliminary Definitions and Terminology
1. With the term microlearning the authors refer to a single RL algorithm used
by each agent of the system to modify its behavior. With COIN initialization
the authors refer to the initial construction of a COIN potentially based upon
salient characteristics. With the term macrolearning the authors refer to
any imposed run-time modifications to COIN which are based on statistical
inference concerning salient characteristics.
2. For convenience we suppose time $t \in \mathbb{Z}$. During the initialization phase we
suppose $t \le 0$.
3. All variables affecting a COIN are identified as components of Euclidean
vectors representing the states of various discrete nodes. The authors define
$\zeta_{\eta,t}$ to be a vector in the Euclidean vector space $Z_{\eta,t}$, where $\zeta_{\eta,t}$
indicates the state of the node $\eta$ at time $t$; the $i$th component of such a vector
is indicated by $\zeta_{\eta,t;i}$.
4. For convenience we indicate with $\zeta_{,t} \in Z_{,t}$ the vector of the states of all nodes
at time $t$; $\zeta_{-\eta,t} \in Z_{-\eta,t}$ refers to the vector of the states of all nodes except $\eta$
at time $t$; finally, $\zeta$ refers to the global vector of the states of all nodes at
all times. Moreover we will use a shorthand notation for the gradient, e.g.:
$\partial_{\zeta_{,t}} F(\zeta)$ indicates the vector of the partial derivatives of $F(\zeta)$ with respect
to the elements of $\zeta_{,t}$. With $\zeta_{,t<t'}$ we will refer to all the components $\zeta_{,t}$
of $\zeta$ with $t < t'$.
5. The binary operator • is used to indicate the vector formed by concatenating
the components of two vectors, e.g.: α • β refers to the vector formed by
concatenating the components of α with the components of β.
6. The universe in which a COIN operates is assumed to be completely deterministic,
since the real world obeys deterministic physics. A COIN, being embedded
in such a world, must be deterministic too, as must any learning algorithm
acting in a COIN. Any deterministic COIN may nevertheless be based on stochastic
concepts at a higher level of the working environment: a high level exists
where the feasible policies are chosen (thus it might be stochastic), while
the COIN level is deterministic because the action to be executed is known
with certainty.
7. To consider the determinism we bundle all variables we are not directly con-
sidering (but important for the dynamics of the system) in an environment
node.
8. The dynamics of the system is expressed by writing $\zeta_{,t'>t} = C(\zeta_{,t})$, that is
a subset of the set of $\zeta \in Z$ that are consistent with the deterministic laws
governing COIN. Although $C(\cdot)$ is defined for any argument like $\zeta \in Z$ for any
$t$, generally not all the $\zeta \in Z$ lie in $C_{,t}$.
9. The authors do not impose particular boundaries either on what we mean by
"COIN", whose dynamics is given by C(·), or on what we call "macrolearning"
(perturbations instigated from outside). Macrolearning goes beyond the
definition of C and it may refer, for example, to any statistical inference process
modifying the private utilities at runtime to induce the salient characteristics
(see Section 4.2). We must pay attention to the system to discern what
is due to C and what was changed from outside. Besides these considerations,
whatever the boundary of the system used to distinguish C from
the macrolearning, the mathematical formalization of COIN is restricted to
a system evolving according to C irrespective of the macrolearning.
10. We are provided with some world utility G : Z → R that ranks the various conceivable worldlines of a given COIN. Since the environment node is never observed, we implicitly assume that G is not a function of its state; moreover, G does not depend on time t. Furthermore we are provided with personal utilities g_{η,t} : Z ⊗ Z → R, which are regarded as “virtual” private functions typically used to analyze the behavior of the system.
11. As mentioned above, the state of any node may contain variables which represent the utility function that the associated learning algorithm (microlearning) is trying to extremize. These local variables are members of ζ and they represent the private utility functions. Recall that the personal utilities {gη} do not exist in the COIN and are not specified by any element of ζ: these functions are just used to mathematically formalize the private utilities.
4.2.2 Intelligence
Given a system and a world utility function G we need to evaluate the performance of the system in terms of that utility function. The evaluation needs a mapping from an arbitrary worldline ζ, a utility function and an arbitrary dynamic law C to R (it is a function of a worldline). Such a measure will also allow us to quantify how the behavior of each microlearning algorithm is reflected in the value assumed by the personal utility evaluated at a given ζ.
We would prefer a less model-dependent approach that uses only an arbitrary utility function, a state ζ and C; this performance measure must not be a raw utility value like gη(ζ), since that is not invariant with respect to monotonic transformations of gη; moreover, it must not penalize a microlearner for failing to achieve a prefixed result that is impossible to achieve due to C and/or to the actions of other nodes.
A first natural approach is to generalize the game-theoretic concept of “best response strategy” and consider the problem of how well η performs given the actions of the other nodes. In particular we might compare the utility of the present worldline ζ to that of the set of other worldlines ζ′ with ζ_{−η,t} = ζ′_{−η,t}, and use those comparisons to quantify the performance of the node η.
To compare the various worldlines we concentrate only on the future contributions of replacing ζ with ζ′; if we allowed arbitrary ζ′_{,t<0}, then the differences between the past components of ζ′ and ζ could modify the value of the utility regardless of the effects of any difference in the future components. For many COINs of interest we may restrict attention to those ζ′ where ζ′_{,t<0} differs from ζ_{,t<0} only in the internal parameters of the microlearner of η, differences that manifest themselves in the utility function only at times t > 0. Since these changes do not affect the t < 0 components, and since we are only interested in changes of ζ that affect the utility, we leave ζ_{η,t<0} unchanged.
In quantifying the performance of η for the behavior given by ζ, we compare ζ with a set of ζ′ restricted to those sharing the past of ζ (i.e., ζ′_{,t<0} = ζ_{,t<0} and ζ′_{−η,0} = ζ_{−η,0}) and with ζ′_{,t>0} ∈ C_{,t>0}. Since ζ′_{η,0} is free to vary while ζ′_{,t<0} is not, ζ′ ∉ C in general. Considering these dynamically impossible ζ′ is thus equivalent to considering a restricted set of ζ′ with modified internal parameters, all of which belong to C.
More formally, given C and a generic measure dμ(ζ_{,0}) demarcating which points in Z_{η,0} we are interested in, in [23] the authors define the intelligence for node η of a point ζ with respect to a utility U as follows:
\[
\epsilon_{\eta,U}(\zeta) \equiv \int d\mu(\zeta'_{,0}) \cdot \Theta\!\left[ U(\zeta) - U\!\left(\zeta_{,t<0} \bullet C(\zeta'_{,0})\right) \right] \cdot \delta\!\left(\zeta'_{-\eta,0} - \zeta_{-\eta,0}\right), \tag{4.1}
\]
where Θ(·) is the Heaviside theta function (it equals 0 if its argument is less than 0, and 1 otherwise), while δ(·) is the Dirac delta function, for which we assume ∫ dμ(ζ′_{−η,0}) = 1.
Intuitively, ε_{η,U}(ζ) measures the fraction of alternative states of η (it follows that 0 ≤ ε_{η,U}(ζ) ≤ 1) where the performance of η does not improve. For example, ε_{η,U}(ζ) = 0.5 means that in 50% of the alternative states η does not improve its performance; as a particular case, note that with ε_{η,U}(ζ) = 1 the node η is fully rational.
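The reading of Equation (4.1) as a fraction of alternative states can be illustrated with a minimal sketch. It assumes a finite set of alternative states with a uniform measure, an identity dynamics, and a hypothetical utility (`example_utility`); none of these choices come from [23].

```python
def intelligence(utility, own_state, other_states, alternatives):
    """Fraction of alternative own states that do not beat the actual one.

    Discrete analogue of Equation (4.1): the measure is assumed uniform over
    `alternatives`, and the Dirac delta is enforced by keeping the states of
    the other nodes fixed.
    """
    actual = utility(own_state, other_states)
    # Heaviside theta: count an alternative when the actual utility is at least as good.
    not_better = sum(
        1 for alt in alternatives if actual - utility(alt, other_states) >= 0
    )
    return not_better / len(alternatives)


# Hypothetical utility: the node is rewarded for matching the mean of the others.
def example_utility(own, others):
    target = sum(others) / len(others)
    return -(own - target) ** 2


others = [2, 3, 4]            # frozen states of the other nodes (mean = 3)
candidates = list(range(7))   # alternative states 0..6 for the node

best = intelligence(example_utility, 3, others, candidates)
poor = intelligence(example_utility, 0, others, candidates)
```

In this sketch the state 3 cannot be improved upon, so its intelligence is 1, while the state 0 is matched or beaten by all but two of the seven alternatives.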
If the learning algorithm of the node η is effective at improving U, its intelligence is close to 1: we therefore expect good learning algorithms to have high intelligence values. Given a particular ζ_{−η,0}, the conditional probability that ζ_{η,0} = p is a monotonically increasing function of ε_{η,gη}(ζ_{,t<0} • C(p • ζ_{−η,0})). Since for a given ζ_{−η,0} the intelligence ε_{η,gη} is a monotonically increasing function of gη, the probability that ζ_{η,0} = p is a monotonically increasing function of gη(ζ_{,t<0} • C(p • ζ_{−η,0})). It follows that the better the microlearning algorithm, the more tightly peaked the associated probability distribution over intelligence values is.
At any point ζ which is a Nash equilibrium (NE) in the set of the personal utilities {gη}, the intelligence of all nodes η must equal 1. Since this is the maximum value of the intelligence, every point that is a NE in {gη} is also PO in the associated intelligences (no deviation from such a ζ can raise any of the intelligences). If there exists at least one NE in the set {gη}, then there is no PO point in the set {ε_{η,gη}(ζ)} that is not a NE.
4.2.3 Learnability
Intelligence can be a difficult quantity to work with, e.g.: fix η and consider a region centered on any ζ, with whatever utility U, where ζ is not a local maximum of U. Then, by increasing the values U takes in that region, the intelligence ε_{η,U}(ζ) will increase. Necessarily, the intelligence of points outside that region will decrease. So intelligence has a non-local character: we cannot directly modify it to ensure that it is simultaneously high for any and all ζ.
A second general problem of intelligence concerns the specification of the details of a microlearner: if these details are not available, then it can be extremely difficult to predict which of two private utilities the microlearner will be better able to learn. Moreover, even with the details, the prediction can be nearly impossible. From these considerations it emerges that it can be difficult to determine which intelligence values will accrue to various choices of private utilities. As a consequence, macrolearning that involves modifying private utilities to directly increase intelligence with respect to those utilities can be fairly difficult.
In a team game we have gη = G for all η: using those {gη} as the private utilities of the microlearners (especially in a COIN with many agents) results in a very poor signal-to-noise ratio, since it may be hard for any agent η to discern the effects of its actions. The effects of its actions upon its utility function (and so upon G) can be undetectable because many other processes (players) contribute to determining the values assumed by G. So agent η will find it difficult to decide how best to act once the learning phase has completed, since there is nothing η can do to achieve high intelligence.
We want a measure of U that captures these effects without depending on any kind of function maximization (or, more generally, extremization) nor on any other aspect of how the node determines its actions. Given a measure dμ(ζ′_{,0}) restricted to C, we define the utility learnability of a utility U for a node η at ζ at t = 0 as follows:
\[
\Lambda_{\eta,U}(\zeta) \equiv \frac{\int d\mu(\zeta'_{,0}) \cdot \left| U\!\left(\zeta_{,t<0} \bullet C(\zeta_{-\eta,0} \bullet \zeta'_{\eta,0})\right) - U(\zeta) \right|}{\int d\mu(\zeta'_{,0}) \cdot \left| U\!\left(\zeta_{,t<0} \bullet C(\zeta'_{-\eta,0} \bullet \zeta_{\eta,0})\right) - U(\zeta) \right|} \tag{4.2}
\]
The intelligence learnability is defined in the same way as Equation (4.2), replacing U(·) with ε_{η,U}(·).
Equation (4.2) may be interpreted as a signal-to-noise ratio. The integrand in the numerator reflects how much of the change in U that results from replacing ζ_{,0} with ζ′_{,0} is due to the change at t = 0 of the state of node η (see the term ζ′_{η,0}; this is the “signal”). The integrand in the denominator reflects how much of the change in U that results from replacing ζ with ζ′ is due to the change at t = 0 of the states of the nodes other than η (see the term ζ′_{−η,0}; this is the “noise”). Learnability therefore quantifies how easy it is for a microlearner to discern the consequences of its behavior in the utility function U. We presume that a microlearning algorithm will achieve higher intelligence values if provided with a more learnable private utility.
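The signal-to-noise reading of Equation (4.2) can be sketched numerically. The following is only an illustration under stated assumptions: the integrals are replaced by averages over hand-picked samples, the dynamics is the identity, and `team_utility` is a hypothetical world utility (a plain sum of node states) used as a team-game private utility.

```python
from statistics import fmean


def learnability(utility, own, others, own_samples, others_samples):
    """Sample-based signal-to-noise ratio in the spirit of Equation (4.2)."""
    base = utility(own, others)
    # Signal: mean change in utility when only this node's state is replaced.
    signal = fmean(abs(utility(alt, others) - base) for alt in own_samples)
    # Noise: mean change in utility when only the other nodes' states are replaced.
    noise = fmean(abs(utility(own, alt) - base) for alt in others_samples)
    return signal / noise  # raises ZeroDivisionError if the utility ignores the others


def team_utility(own, others):
    # Hypothetical world utility: plain sum of all node states.
    return own + sum(others)


own_moves = [1, -1]
small = learnability(team_utility, 0, (0, 0), own_moves, [(1, 1), (-1, -1)])
large = learnability(team_utility, 0, (0, 0, 0, 0), own_moves,
                     [(1, 1, 1, 1), (-1, -1, -1, -1)])
```

With these samples the ratio halves when the number of other nodes doubles, which matches the intuition that a team-game utility becomes harder to learn as the system grows.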
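The differential learnability of a utility function U at ζ is the learnability with dμ restricted to an infinitesimal n-ball about ζ:
\[
\lambda_{\eta,U}(\zeta) = \frac{\left\| \partial_{\zeta_{\eta,0}} U\!\left(\zeta_{,t<0} \bullet C(\zeta_{,0})\right) \right\|}{\left\| \partial_{\zeta_{-\eta,0}} U\!\left(\zeta_{,t<0} \bullet C(\zeta_{,0})\right) \right\|} \tag{4.3}
\]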
By itself the value given by Equation (4.3) has no significance; we are interested in the ratio of differential learnabilities for different U’s at the same ζ, which gives a decisive criterion for selecting a particular utility function U. Another significant feature is that it does not depend on the choice of some measure dμ(·). Usually, with this kind of learnability, we consider an expected value over a region with lower intelligence, since at those ζ with higher intelligence we have λ_{η,U}(ζ) = 0.¹
In this form, learnabilities are not meant to capture all factors that will affect
how high an intelligence value a particular microlearner will achieve. These factors
are typically incorporated in the microlearners, so this measure may be preferably
used as a guide to improve performance.
4.3 A Descriptive Framework for COINs
In this section we present a descriptive framework for COIN, in particular the
salient characteristics and the relationship between these characteristics and per-
sonal utilities.
4.3.1 Candidate Salient Characteristics
In a framework like this one it is useful to identify certain characteristics that we expect to be associated with a COIN having large world utility. These characteristics formalize the intuition that we want COINs for which private utilities, if well initialized, will result in large values of the world utility without any bottleneck, TOC (see Section 3.2) or the like.
¹ If ζ is a maximum, then U(ζ) will be a maximum too, so its derivative will be 0.
One candidate for such a characteristic, related to PO, is weak triviality. Consider two worldlines ζ and ζ′ consistent with the dynamics C of the system, where for every node η we have gη(ζ) > gη(ζ′). A system is weakly trivial if, for any such pair of worldlines where one Pareto dominates the other, it is necessarily true that G(ζ) > G(ζ′). In these systems, if the microlearners collectively modify ζ in a way that ends up helping all of them, then the world utility also rises. As a consequence the maxima of G are PO points for the personal utilities (note that the reverse may not hold).
Weakly trivial systems can evolve to a world utility minimum. For example, let us consider automobile traffic in the absence of any traffic control system; let each node be a driver, let the private utilities gη(·) quantify how quickly the drivers get to their destinations (gη(ζ) is large if driver η gets to his destination in a short amount of time), and let G be the sum of all private utilities. This system is clearly weakly trivial (for every pair ζ and ζ′, if gη(ζ) > gη(ζ′) for all η then G(ζ) > G(ζ′)): yet if there is a traffic jam (rush hour, accidents and the like) and each driver tries to get to his destination as fast as he can, then the system does not achieve an acceptable throughput as a whole (in fact G will be low)². However, this kind of system is used in some cases, since each agent, regardless of how the others behave, can guarantee that its private utility is greater than a certain level. If we assume each agent has a large enough number of actions to guarantee such a behavior, then a weakly trivial system guarantees that the world utility is not too low. In the extreme case where each agent knows its utility for every one of its actions, the PO points are NE, so the point maximizing G is a NE too.
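The claim that a world utility equal to the sum of the private utilities is weakly trivial can be checked by brute force on a finite set of worldlines. The sketch below is illustrative only: the worldlines, the per-driver scores and the helper `is_weakly_trivial` are hypothetical and not taken from the cited works.

```python
from itertools import permutations


def is_weakly_trivial(worldlines, private_utilities, world_utility):
    """True if every strict Pareto improvement also raises the world utility."""
    for better, worse in permutations(worldlines, 2):
        dominates = all(g(better) > g(worse) for g in private_utilities)
        if dominates and not world_utility(better) > world_utility(worse):
            return False
    return True


# Hypothetical worldlines: each tuple holds one "speed" score per driver.
worldlines = [(1, 2), (2, 3), (3, 1)]
drivers = [lambda z: z[0], lambda z: z[1]]   # each private utility reads one driver's score

sum_is_trivial = is_weakly_trivial(worldlines, drivers, lambda z: sum(z))
negated_sum_is_trivial = is_weakly_trivial(worldlines, drivers, lambda z: -sum(z))
```

The check passes for the sum and fails for its negation; it says nothing about which worldline the greedy microlearners actually reach, which is the weakness discussed in the text.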
The main problem emerging from weak triviality is that the individual microlearners are greedy: in a COIN there is no incentive to replace ζ with a different worldline ζ′ that would improve the personal utility of every agent, as stated in the definition of weak triviality. Rather, the incentives applied to each microlearner motivate the learners to behave in a way that may hurt some of them. For these reasons weak triviality is not an optimal choice as a salient characteristic of a COIN.
² Obviously here we do not allow any change to private utilities.
We can assume that if the microlearners are well designed, then each one will be doing close to as well as it can given the behavior of the other nodes. So, the system is more likely to be in ζ rather than in ζ′ if for all η we have ε_{η,gη}(ζ) > ε_{η,gη}(ζ′). A system is called coordinated if, for any such pair ζ, ζ′ ∈ C with ε_{η,gη}(ζ) > ε_{η,gη}(ζ′) for all η, we have G(ζ) > G(ζ′).
4.3.2 Factoredness
In this section, we discuss a third candidate characteristic which does not suffer from the negative aspects of weak triviality. In this case we do not replace the personal utilities {gη} with the intelligences {ε_{η,gη}} as coordination does, but rather we consider different worldlines whose differences at time 0 involve a single node (this is more closely related to the concept of NE than to that of PO).
Say that the worldline of our COIN is ζ, while ζ′ is another worldline for which ζ_{,t<0} = ζ′_{,t<0} and ζ′_{,t>0} ∈ C_{,t>0}; let us restrict our attention to those ζ′ that at t = 0 differ from ζ only for node η. If for all such ζ′ we have
\[
\mathrm{sgn}\!\left[ g_\eta(\zeta) - g_\eta\!\left(\zeta_{,t<0} \bullet C(\zeta'_{,0})\right) \right] = \mathrm{sgn}\!\left[ G(\zeta) - G\!\left(\zeta_{,t<0} \bullet C(\zeta'_{,0})\right) \right] \tag{4.4}
\]
and if this is true for all nodes η, then that COIN is factored for the utilities {gη} at ζ at t = 0 with global utility G. Equation (4.4) states that, for any node η, given the rest of the system, if the state of that node at t = 0 changes in a way that improves the utility of that node, then it necessarily improves the world utility. So, the better a microlearner performs, the larger the values of G.
For a factored system we have
\[
\epsilon_{\eta,g_\eta}(\zeta) = \epsilon_{\eta,G}(\zeta) \quad \forall \eta \tag{4.5}
\]
and the NE are local maxima of world utility.
It is important to note that having a factored system does not mean that a change to ζ_{η,0} improving gη(ζ) cannot also hurt g_{η′}(ζ) for some η′ ≠ η: the side effects on the rest of the system due to the increase of the utility of η do not end up decreasing world utility, but they may have arbitrary effects on other private utilities.
Another fact to consider is that if g_{η,t′} is factored with respect to G, then a change at ζ_{η,t′} improving g_{η,t′}(ζ_{,t<t′}, C(ζ_{,t′})) improves G(ζ_{,t<t′}, C(ζ_{,t′})), but it may hurt some g_{η,t″≠t′}(ζ_{,t<t′}, C(ζ_{,t′})) and/or ε_{(η,t″),g_{η,t″}}(ζ_{,t<t′}, C(ζ_{,t′})).
In general we cannot have both perfect learnability and factoredness: suppose that ∀t, Z_{η,t} = Z_{−η,t} = R and that the dynamics is the identity operator (∀t, C(ζ_{,0})_{,t} = ζ_{,0}). If G(ζ_{,0}) = ζ_{η,0} · ζ_{−η,0} and we assume the system is perfectly learnable, then it will never be perfectly factored: a perfectly learnable gη cannot depend on ζ_{−η,0}, yet any change to ζ_{η,0} improving gη may help or hurt G depending on the sign of ζ_{−η,0}. From these considerations, we prefer to keep the system as factored as possible so that it stays closer to a NE (this will be the goal of macrolearning).
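The sign condition of Equation (4.4) can be checked numerically on a two-node example of this kind. The sketch below assumes that G is the product of the two scalar states and that the private utility depends on the node's own state only; both functions, and the helper `is_factored`, are illustrative assumptions rather than code from the cited works.

```python
def sign(x):
    return (x > 0) - (x < 0)


def is_factored(g, G, own, others, alternatives):
    """Check the sign condition of Equation (4.4) for one node on a finite set."""
    return all(
        sign(g(own, others) - g(alt, others)) == sign(G(own, others) - G(alt, others))
        for alt in alternatives
    )


def world_utility(own, others):
    return own * others          # assumed G: product of the two scalar states


def private_utility(own, others):
    return own                   # depends on the node's own state only


alternatives = [-2.0, -1.0, 0.5, 3.0]
ok_when_positive = is_factored(private_utility, world_utility, 1.0, 2.0, alternatives)
ok_when_negative = is_factored(private_utility, world_utility, 1.0, -2.0, alternatives)
team_game = is_factored(world_utility, world_utility, 1.0, -2.0, alternatives)
```

The own-state utility satisfies the sign condition only while the other state is positive, whereas the team-game choice gη = G satisfies it trivially at the price of lower learnability.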
If a system is factored for some utilities {gη}, then it will be factored for any utilities {g′η} where, for all η, g′η is a monotonically increasing function of gη.
Theorem 4.1. A system is factored at all ζ ∈ C if and only if for all those ζ and for all η we can write:
\[
g_\eta(\zeta) = \Psi_\eta\!\left(\zeta_{,t<0},\, \zeta_{-\eta,0},\, G(\zeta)\right) \tag{4.6}
\]
for some function Ψη(·, ·, ·) such that ∂_G Ψη(ζ_{,t<0}, ζ_{−η,0}, G(ζ)) > 0 for all ζ ∈ C.
With Theorem 4.1 (the proof is in [23, Section 4.3.2]) the authors guarantee that the system is factored without any concern for C. As an example, consider a team game (see Section 4.2.3) where gη = G ∀η: these COINs are obviously factored regardless of C; in fact, if gη increases then G necessarily increases too.
4.3.3 Wonderful Life Utility
In practice, team game utilities are often poor choices of personal utilities, due both to their low learnability and to the fact that they require centralized communication. Let us define, for t = 0, the effect set C^eff_η(ζ) of node η at ζ as the set of all components ζ_{η′,t} for which ∂_{ζ_{η,0}} (C(ζ_{,0}))_{η′,t} ≠ \vec{0}: this is the set of all components ζ_{η′,t} which would be affected by a change in the state of node η at t = 0. Moreover we define C^eff_η, without the dependence on ζ, as ⋃_{ζ∈C} C^eff_η(ζ), and ¬C^eff_η as the set of the components of the space Z which are not in C^eff_η.
For any set σ of components (η′, t), define CL_σ(ζ) as the vector formed by clamping the components of ζ in σ to a prefixed arbitrary value (here \vec{0} for all the components of σ). Consider a Wonderful Life set σ: the value of the wonderful life utility (WLU for short) for the set σ at ζ is defined as follows:
\[
\mathrm{WLU}_\sigma(\zeta) = G(\zeta) - G\!\left(\mathrm{CL}_\sigma(\zeta)\right) \tag{4.7}
\]
The WLU for the effect set of node η is G(ζ) − G(CL_{C^eff_η}(ζ)), which for ζ ∈ C can be written as G(ζ_{,t<0} • C(ζ_{,0})) − G(CL_{C^eff_η}(ζ_{,t<0} • C(ζ_{,0}))).
The WLU for the effect set of node η can be viewed as the change in world utility that would result if node η had never existed. The CL operation produces a new ζ without any concern for the dynamics C of the system (so the clamped ζ may not lie in C): this independence from the dynamics is a crucial strength of the WLU, because to evaluate it we do not need to infer how the world would have evolved from t = 0 after setting the state of η to \vec{0}.
If the set of all nodes is partitioned in subworlds such that all nodes belonging
to the same subworld ω share the same effect set, then those nodes will have
essentially the same personal utilities. If they have large intelligence values, this
utility sharing means that all the nodes of subworld ω behave in a coordinated
way.
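As a minimal sketch of Equation (4.7), the clamping operator and the WLU can be written for a worldline stored as a dictionary indexed by (node, time). The representation, the world utility and the numbers below are assumptions chosen only for illustration.

```python
def clamp(worldline, components, value=0.0):
    """CL operator: copy the worldline with the given components set to `value`."""
    return {key: (value if key in components else state)
            for key, state in worldline.items()}


def wonderful_life_utility(G, worldline, components):
    """Equation (4.7): WLU = G(zeta) - G(CL_sigma(zeta))."""
    return G(worldline) - G(clamp(worldline, components))


# Hypothetical world utility: negative squared distance of the total load from 10.
def world_utility(worldline):
    return -(sum(worldline.values()) - 10.0) ** 2


# Worldline indexed by (node, time); the values are illustrative only.
zeta = {("a", 0): 4.0, ("b", 0): 3.0, ("c", 0): 5.0}

wlu_a = wonderful_life_utility(world_utility, zeta, {("a", 0)})
wlu_c = wonderful_life_utility(world_utility, zeta, {("c", 0)})
```

Note that the clamped worldline is evaluated directly by G, without re-running any dynamics, which mirrors the independence from C stressed in the text.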
Theorem 4.2. 1. A system is factored for all ζ ∈ C if and only if for all ζ and for all η we can write
\[
g_\eta(\zeta) = \Psi_\eta\!\left(\zeta_{\neg C^{\mathrm{eff}}_\eta},\, G(\zeta)\right) \tag{4.8}
\]
for some function Ψη(·, ·) such that ∂_G Ψη(ζ_{¬C^eff_η}, G) > 0 ∀ζ ∈ C.
2. A COIN is factored for the set of personal utilities equal to the associated
effect set WLU.
As a generalization of point (2) of Theorem 4.2 we note that a system is factored if the personal utility of every node η is the WLU of a set σ_η containing C^eff_η (the proof of Theorem 4.2 is in [23, Section 4.3.3]).
To keep the presentation clear, for the remainder of this section we omit the argument ζ_{,t<0}.
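Theorem 4.3. Let σ be a set containing C^eff_η, then:
\[
\frac{\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)} = \frac{\left\| \partial_{\zeta_{-\eta,0}} G\!\left(C(\zeta_{,0})\right) \right\|}{\left\| \partial_{\zeta_{-\eta,0}} G\!\left(C(\zeta_{,0})\right) - \partial_{\zeta_{-\eta,0}} G\!\left(\mathrm{CL}_\sigma(C(\zeta_{,0}))\right) \right\|} \tag{4.9}
\]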
If we expect a large ratio of gradient magnitudes, then the effect set WLU has much higher learnability than the team game utility, e.g.: suppose we have a large COIN where η represents only a small part of the system; given the predominance of the nodes η′ ≠ η, the way G changes with the states ζ_{η′} of those nodes is essentially independent of ζ_{η,0}. In such circumstances Theorem 4.3 (the proof is explained in [23, Section 4.3.2]) tells us that the effect set WLU for η will have a larger learnability than the world utility.
For a fixed σ, if we redefine the CL function (i.e., we clamp to another fixed value rather than \vec{0}), then we change the mapping from ζ_{,0} to CL_σ(C(ζ_{,0})) and, as a consequence, the mapping (ζ_{η,0}, ζ_{−η,0}) → G(CL_σ(C(ζ_{,0}))) too. Such a change of the clamping operation can affect ∂_{ζ_{−η,0}} G(CL_σ(C(ζ_{,0}))); therefore, by Theorem 4.3, it changes λ_{η,WLU_σ}(ζ). Consequently, for any choice of σ we should set the CL function in such a way as to maximize learnability.
Now, consider the case where for some node η we can write G(ζ) as G1(ζ_{C^eff_η}) + G2(ζ_{,t<0} • ζ_{¬C^eff_η}) and where the effect set of η (C^eff_η) has few elements. Then the values of G(·) are much larger than those of G1(·), which means that the partial derivatives of G(·) are larger than those of G1(·). As a consequence, the effect set WLU is more learnable than the world utility, due to the following results.
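Theorem 4.4. If ∃ η, σ : C^eff_η ⊆ σ and ∃ G1(ζ_σ ∈ Z_σ), G2(ζ_{−σ} ∈ Z_{−σ}) : G(ζ) = G1(ζ_σ) + G2(ζ_{−σ}), then
\[
\frac{\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)} = \frac{\left\| \partial_{\zeta_{-\eta,0}} G\!\left(C(\zeta_{,0})\right) \right\|}{\left\| \partial_{\zeta_{-\eta,0}} G\!\left(\mathrm{CL}_{-\sigma}(C(\zeta_{,0}))\right) \right\|} \tag{4.10}
\]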
A special case of Theorem 4.4 (proofs are presented in [23, Section 4.3.2]) is
the following:
Corollary 4.1. If for some node η we can write
1. G(ζ) = G1(ζ_σ) + G2(ζ_{−σ,t>0}) + G3(ζ_{,t<0})
for some set σ containing C^eff_η, and if
2. ‖∂_{ζ_{−η,0}} G(C(ζ_{,0}))‖ ≫ ‖∂_{ζ_{−η,0}} G1(C_σ(ζ_{,0}))‖
then
\[
\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta) \gg \lambda_{\eta,G}(\zeta) \tag{4.11}
\]
In practice, ensuring that condition (1) of this corollary is met might require that σ be a proper superset of C^eff_η. Conversely, ensuring that condition (2) is met will usually force us to keep σ as small as possible.
More generally, if there is a set σ′ ⊆ C^eff_η such that for each component (η; 0; i) the chain rule term
\[
\sum_{(\eta',t) \in \sigma'} \left[ \partial_{\zeta_{\eta',t}} G(\zeta) \right] \cdot \left[ \partial_{\zeta_{\eta,0;i}} \left[ C(\zeta_{,0}) \right]_{\eta',t} \right] = 0,
\]
then the effects on G of changes to ζ_{η,0} that are mediated by the members of σ′ cancel each other out. In this case we can usually remove the elements of σ′ from C^eff_η with no ill effects.
4.3.4 How to Induce these Salient Characteristics?
As described above, this framework offers theorems relating the fundamental characteristics of a COIN to general properties of its past: we want the COIN to be in a global state ζ* for which there is a set {gη} such that the COIN is factored at ζ* for the utilities {gη}, and the intelligence ε_{η,gη}(ζ*) is as large as possible for all η.
A first approach is to have each microlearner explicitly try to steer the worldline towards such a point ζ*. Initializing the COIN (i.e., setting ζ_{,0}) includes setting the algorithm controlling η, so we choose ζ_{,0} in such a way that there is some special gη for which C(ζ_{,0}) is factored with respect to gη and for which ε_{η,gη}(C(ζ_{,0})) is large. The main problem is to find such a gη: this implies a careful modeling of the system, where that is possible at all, which is clearly in contrast with the observations stated in Section 4.1.
Other possible approaches are related both to COIN initialization and to macrolearning. In this case we use {gη} as private utilities at some t < 0, inducing a factored COIN to be as intelligent as possible. Since we deploy private utilities we can use learnability rather than intelligence, so we choose some {gη} which are as learnable as possible while still being factored. The authors usually use inference in COIN initialization, e.g.: the effect set C^eff of a node is composed of those ζ_{η′,t>0} which have non-zero correlation with ζ_{η,0}. Theorem 4.2 guarantees that the system is factored for effect set WLU personal utilities and, by Corollary 4.1, for small effect sets the effect set WLU has a large differential learnability with respect to G (see Equation (4.11)). From this scenario we conclude that the framework advises us to use WL private utilities based on the associated effect sets rather than the team game private utilities.
When doing macrolearning, the authors initialize the system with an initial estimate of the effect set of η (an initial guessed effect set) and impose the association between private utilities and WLU. Next, they let the system run, observe the correlations among the components of ζ, and then change the components of ζ belonging to the effect set of η (so changing the personal utility of η accordingly).
Chapter 5 Experimental Applications
Who neglects learning in his youth,
Loses the past and is dead for the future
“Phrixus” – Euripides (484 BC, 406 BC)
Contents
5.1 Packet Routing in a Network
5.1.1 COIN for Network Routing
5.1.2 Experimental Results
5.2 Learning Sequences Of Actions
5.2.1 COIN Solution
5.2.2 Results
5.3 Bar Problem
5.3.1 Results
It is possible to formalize many real systems as a COIN (see Section 3.2). This framework is widely applied, e.g. to vehicular traffic control, learning sequences of actions, and packet routing in a network. In general, this distributed control methodology is useful wherever many processes (packets, rovers, agents) interact in a way that is hard to formalize, and where a centralized controller faces both design difficulties (how to build the controller) and computational difficulties (choosing the best actions may take a long time, depending on the domain to be controlled).
In the following sections we describe cases from the literature in which a world populated by many agents has been formalized and controlled as a COIN:
packet routing in a network : we need to route packets in a network composed of routers and computers, using both the well-known SPA (Shortest Path Algorithm) and the new approach introduced by COIN (see [22]);
learning sequences of actions : we have a grid world where agents act trying to take as many tokens as possible (see [24]);
attendance model of El Farol Bar Problem : in this well-known case many agents must choose when to attend the bar during a week, avoiding both overcrowded days and days with few agents (see [6]).
The environments mentioned above are extremely different from each other: if we try to formalize them with just a simple RL algorithm, we face the problem of how to assign rewards to all the agents acting in these environments; this may force us to introduce significant approximations that necessarily affect the behavior of the system at runtime. For this reason, the authors explain how COIN was applied to known cases from the literature in order to induce cooperative behavior among a set of agents, and they compare the performance of COIN with that of other utility functions in order to verify its usefulness.
5.1 Packet Routing in a Network
We face the problem of how to route information packets in a network so that they reach their destinations according to a suitable metric on the chosen path. A well-known metric is SPA, where each packet is routed on a link such that it can reach its destination in as few steps as possible. In this situation, the microlearner sets the internal parameters of its router running the SPA. Such microlearners do not fully address the problem of guaranteeing that private utilities do not induce the learning algorithms to work at cross-purposes with respect to the main goal of the routing.
In this case, we use 3 algorithms: SPA and COIN both with full knowledge
(FK) of the true rewards obtained following a specified path (with reward being
the time taken by a packet to be routed) and COIN memory based (MB – it has
just local knowledge).
5.1.1 COIN for Network Routing
In this example we concentrate on the two networks depicted in Figure 5.1, where traffic originated by the routers represented by white boxes has only the routers represented by dark boxes as ultimate destinations (note that in both networks router 2 is a bottleneck). By the standard definition, traffic at a router is a pair 〈r, d〉, where r is a real number (source router) and d is the destination tag (e.g.: a computer) to be reached. At each timestep each router sums all the traffic received from upstream routers to obtain the amount of traffic, then it chooses where to send that load.
(a) Network A (b) Network B
Figure 5.1: Network architectures (from [22])
We keep a running average of the total load of each router over the L previous timesteps; this average is the argument x of a function W(x) that gives the total delay accumulated at this timestep by all the packets traversing this router. Each router has its own definition of W(x) (depending on the hardware, queue length, . . . ): in this testbed routers 1 and 3 have W(x) = x^3, while router 2 (the bottleneck) has W(x) = log(x + 1). The overall goal is to minimize the total delay encountered by all traffic.
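The running average and the two delay functions can be sketched as follows. This is an illustrative reading of the description above, not the code used in [22]: the `Router` class is hypothetical, and the natural logarithm is assumed for log(x + 1).

```python
import math
from collections import deque


class Router:
    """Keeps a running average of the load over the last L timesteps."""

    def __init__(self, delay_fn, window):
        self.delay_fn = delay_fn
        self.loads = deque(maxlen=window)   # drops the oldest load automatically

    def step(self, load):
        self.loads.append(load)
        average = sum(self.loads) / len(self.loads)
        return self.delay_fn(average)       # W(x) evaluated at the averaged load


def cubic(x):
    return x ** 3               # W(x) = x^3, routers 1 and 3 in the text


def logarithmic(x):
    return math.log(x + 1)      # W(x) = log(x + 1), router 2 (natural log assumed)


router1 = Router(cubic, window=2)
delays = [router1.step(load) for load in (2.0, 4.0, 6.0)]
```

With a window of two timesteps the averaged loads are 2, 3 and 5, so the cubic router reports delays of 8, 27 and 125; a larger window smooths the load estimate further.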
In COIN, η identifies the pair 〈router, destination〉, so ζ_{η,t} is the vector of the traffic sent at time t along all the links exiting from the router of η (for the packets traversing this router with the destination of η). As a subworld we identify each set of routers whose packets share the same ultimate destination.
In the classic SPA each node η tries to set ζ_{η,t} to minimize the sum of the delays that its traffic will accumulate on the way to its ultimate destination. In COIN we use a complementary approach, i.e., η tries to set ζ_{η,t} so as to optimize g_ω for the subworld ω containing η. By “full knowledge” we mean that at time t all routers know the average loads of all routers at time t − 1 and assume that those values will be the same at time t (this can be a good assumption for large values of L), so they can make accurate estimates of how best to route their traffic according to their respective criteria.
Having limited knowledge, COIN routing can only predict the WLU value resulting from each routing decision. More precisely, for each pair 〈router, destination〉 the microlearner estimates the mapping from the load on all outgoing links to the WLU-based reward; then each router sends packets along the path with the best estimated reward. In this case, we use a more conservative method, i.e. we randomly choose between the path with the best estimated reward and the path chosen by FK SPA.
The load of a router r at time t is given by ζ, while W_{r,t}(ζ) is the function W(x) of node r at time t. The world utility function is given by the total delay, i.e.
\[
G(\zeta) = \sum_{r,t} W_{r,t}(\zeta).
\]
Using WLU to set the local utility of each microlearner we have
\[
g(\zeta) = \sum_{r,t} \Delta_{\omega,r,t}(\zeta), \qquad \Delta_{\omega,r,t}(\zeta) = W_{r,t}(\zeta) - W_{r,t}\!\left(\mathrm{CL}_\omega(\zeta)\right).
\]
At each time step the MB COIN uses Σ_r Δ_{ω,r,t}(ζ) as the reward signal for trying to optimize this full WLU. This reward is computed in a decentralized way: every packet has a header containing a running sum of the Δs encountered in all the routers it has traversed so far. Each destination node sums the values of the headers and sends this sum back to all the routers that routed traffic to it.
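The per-timestep signal Σ_r Δ_{ω,r,t}(ζ) can be illustrated with a small sketch. It assumes that each router's load is split by destination subworld and that clamping a subworld means setting its traffic to zero; the loads, the subworld labels `d6` and `d7`, and the helper names are hypothetical, and the natural logarithm is assumed for router 2.

```python
import math

# Delay functions from the testbed description; the natural log is an assumption.
DELAY = {1: lambda x: x ** 3, 2: lambda x: math.log(x + 1), 3: lambda x: x ** 3}


def delta(router, loads, subworld):
    """W_r(zeta) - W_r(CL_omega(zeta)) for one router at one timestep."""
    total = sum(loads.values())
    without = total - loads.get(subworld, 0.0)   # clamp the subworld's traffic to 0
    return DELAY[router](total) - DELAY[router](without)


def wlu_signal(loads_by_router, subworld):
    """Sum of the deltas over all routers: the per-timestep signal for a subworld."""
    return sum(delta(r, loads, subworld) for r, loads in loads_by_router.items())


# Hypothetical loads: router -> {destination subworld: traffic}.
loads_by_router = {
    1: {"d6": 1.0, "d7": 1.0},
    2: {"d7": 3.0},
    3: {"d6": 2.0},
}

signal_d6 = wlu_signal(loads_by_router, "d6")
signal_d7 = wlu_signal(loads_by_router, "d7")
```

Since G is a total delay, a smaller value of this signal corresponds to a smaller contribution of the subworld to the overall delay. In the decentralized scheme described above, the same sum would be accumulated hop by hop in the packet headers rather than computed centrally as here.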
5.1.2 Experimental Results
The networks discussed above (Figure 5.1) were tested under light, medium and
heavy traffic loads as depicted in Table 5.1; moreover, from each source router a
new packet was fed at each time step.
Network   Source   Dest. (light)   Dest. (medium)   Dest. (heavy)
A         4        6               6;7              6;7
A         5        7               7                6;7
B         4        7;8             7;8;9            6;7;8;9
B         5        6;9             6;7;9            6;7;8;9
Table 5.1: Source–destination pairings for the three traffic loads
(a) Network A (b) Network B
Figure 5.2: Overall delay of the networks (from [22])
As depicted in Figure 5.2 (results are averaged over 50 different executions with
a window size L = 50), FK COIN outperforms FK SPA: with COIN the system operates
in a way that reduces the average total delay for all packets, rather than in a
greedy fashion like SPA. Moreover, MB COIN performs better than FK SPA, which
suggests that COIN can outperform algorithms that estimate the shortest path even
when they have full knowledge.
5.2 Learning Sequences Of Actions
Another typical application of RL is the Grid world, where an agent navigates in
a two-dimensional grid and at each time step receives a reward related to the
action chosen. In the episodic version of the Grid world an agent moves for a
certain number of time steps and is then returned to its initial position; in this
situation we need a learner that optimizes the sum of the rewards obtained
(Q-learning and SARSA are typically used for this problem).
In this problem there are many agents navigating in the grid simultaneously
and affecting each other's rewards. These interactions are modeled through tokens
with different values laid on the grid: each token has a value between 0 and 1 and
each cell may hold at most one token. When an agent moves into a cell with a token
it receives a reward equal to the value of the token and then removes that token
(so that reward is no longer available to agents entering that cell later). At the
end of the episode all the tokens are reset and each agent is returned to its
initial position. The main goal is to collect the highest total token value in a
fixed number of time steps.
Interactions among agents are a useful formalization to examine coordination
and selfish behavior, and hence situations such as TOC and the like.
5.2.1 COIN Solution
Here we pose this problem in the form of a COIN and we define:
• $L_{\eta,t}$ is the matrix representing the location of agent η at time t. If the agent
is in location (x, y) then $L_{\eta,t,x,y} = 1$, otherwise $L_{\eta,t,x,y} = 0$. With
$\{L_{\eta,t}\}$ we denote the set of location matrices.
• $L^{a}_{\eta,t}$ is the location that agent η would have had at time t had it taken
action a at time step t − 1.
• $L_{\eta}$ is the location matrix of agent η across all time ($L_{\eta} = \sum_{t} L_{\eta,t}$).
• $L_{\eta,<t}$ is the location matrix of agent η across all time before t
($L_{\eta,<t} = \sum_{t'<t} L_{\eta,t'}$).
• $L$ is the location matrix of all agents across all time ($L = \sum_{t} \sum_{\eta} L_{\eta,t}$).
• $L_{<t}$ is the location matrix of all agents across all time before t
($L_{<t} = \sum_{t'<t} \sum_{\eta} L_{\eta,t'}$).
• $L_{-\eta}$ is the location matrix of all agents but η across all time ($L_{-\eta} = L - L_{\eta}$).
• $L_{-\eta,<t}$ is the location matrix of all agents other than η across all time before
t ($L_{-\eta,<t} = L_{<t} - L_{\eta,<t}$).
• Θ stores the initial values and locations of all tokens.
The space Z is composed of Θ and the set $\{L_{\eta,t}\}$ of all location matrices, while
a worldline ζ is a point in that space. We define the function V(L, Θ), which returns
the total value of the tokens collected according to a location matrix L, as follows:
$$V(L, \Theta) = \sum_{x,y} \Theta_{x,y} \cdot \min(1, L_{x,y}) \tag{5.1}$$
The world utility function G(ζ) is given by the sum of the values of all the tokens
taken during an episode:
$$G(\zeta) = V(L, \Theta) \tag{5.2}$$
To formulate the WLU in this problem let us suppose that the operator $CL_{\eta}$
sets the state of η to the null vector, so we have the WLU where the agent is
removed from the worldline:
$$WLU^{\vec{0}}_{\eta}(\zeta) = G(\zeta) - V(L_{-\eta}, \Theta) \tag{5.3}$$
The utility stated in Equation (5.3) differs from the one that sums the values of
the tokens present in the locations visited by agent η (such a function is known
as Selfish Utility, SU for short). $WLU^{\vec{0}}$ returns the values of the tokens in
locations visited by η but not by other agents, i.e. the values of the tokens that
would not have been taken if agent η had not been in the system.
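As a hedged illustration of Equations (5.1)-(5.3), the following Python sketch computes V, G, the WLU and the SU on a tiny grid. The 2×2 grid, the token values and the helper names (`token_value`, `add_matrices`, `world_utility`, `wlu`, `selfish_utility`) are assumptions made only for this example; they are not taken from the experiments in [24].

```python
def token_value(L, theta):
    """V(L, Theta): sum of the token values in cells visited at least once."""
    return sum(theta[x][y] * min(1, L[x][y])
               for x in range(len(theta))
               for y in range(len(theta[0])))


def add_matrices(matrices, rows, cols):
    """Element-wise sum of a list of location matrices."""
    return [[sum(m[x][y] for m in matrices) for y in range(cols)]
            for x in range(rows)]


def world_utility(L_by_agent, theta):
    """G = V(L, Theta), with L the sum of all the agents' location matrices."""
    L = add_matrices(L_by_agent, len(theta), len(theta[0]))
    return token_value(L, theta)


def wlu(L_by_agent, eta, theta):
    """WLU of agent eta: G minus the value collected when eta is removed."""
    others = [L for i, L in enumerate(L_by_agent) if i != eta]
    L_minus_eta = add_matrices(others, len(theta), len(theta[0]))
    return world_utility(L_by_agent, theta) - token_value(L_minus_eta, theta)


def selfish_utility(L_by_agent, eta, theta):
    """SU of agent eta: value of all the tokens in the cells eta visited."""
    return token_value(L_by_agent[eta], theta)


# Hypothetical 2x2 example: theta[x][y] is the token value in cell (x, y).
theta = [[0.5, 0.2],
         [0.0, 0.9]]
# Visit counts over the whole episode for two agents.
L_agent0 = [[1, 0], [0, 1]]   # visits (0, 0) and (1, 1)
L_agent1 = [[1, 1], [0, 0]]   # visits (0, 0) and (0, 1)
agents = [L_agent0, L_agent1]

G = world_utility(agents, theta)
wlu0 = wlu(agents, 0, theta)
su0 = selfish_utility(agents, 0, theta)
```

In this toy case SU credits agent 0 also for the token in (0, 0), which agent 1 visits as well, while WLU credits it only for the token in (1, 1).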
These utility functions are based on the performance over a full episode: to learn
an optimal sequence of actions we introduce a reward related to a single time step.
To that end, let us decompose an arbitrary utility function U as follows (the sum
telescopes, assuming that U is zero for the empty location matrix):
$$U(L) = \sum_{t} \left[ U(L_{<t+1}) - U(L_{<t}) \right]$$
The reward for a single time step is given by:
$$R_t(L) = U(L_{<t+1}) - U(L_{<t}) \tag{5.4}$$
As a consequence, for the two utilities described above (the global and the WL ones)
we introduce the associated single-time-step rewards:
$$GR_t(\zeta) = V(L_{<t+1}, \Theta) - V(L_{<t}, \Theta) \tag{5.5}$$
$$WLUR^{\vec{0}}_{\eta,t}(\zeta) = GR_t(\zeta) - \left( V(L_{-\eta,<t+1}, \Theta) - V(L_{-\eta,<t}, \Theta) \right) \tag{5.6}$$
5.2.2 Results
In the experiments we used 3 different utility functions (suitably converted into
single-time-step rewards as stated in Equation (5.4)):
• SU (Selfish Utility): each agent receives the discounted sum of the values of
the tokens that it alone collected;
• TG (Team Game utility): each agent receives the full world utility;
• WLU (Wonderful Life Utility): each agent receives its contribution to the
token collection, i.e. the difference in the total token collection with and
without that agent.
Each agent was controlled by a Q-learner: the input space of each learner consists
of its location in the grid, while the action space is given by the 4 directions an
agent can choose. The discount parameter γ is set to 0.95 and actions are chosen
stochastically based on the Q-values, so the probability that an agent chooses
action $a_i$ in state s is given by
$$P^{s}_{a_i} = \frac{k^{Q(s,a_i)}}{\sum_{j} k^{Q(s,a_j)}}, \qquad k = 50$$
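The action-selection rule above can be sketched in Python as follows. This is only an illustrative fragment: the function names, the example Q-values and the seeded random generator are assumptions, and subtracting the maximum Q-value is a numerical precaution that leaves the probabilities unchanged. Since $k^{Q} = \exp(Q \ln k)$, the rule is a Boltzmann distribution with temperature 1/ln k.

```python
import random


def action_probabilities(q_values, k=50.0):
    """P(a_i | s) = k**Q(s, a_i) / sum_j k**Q(s, a_j)."""
    # Subtracting the maximum Q-value leaves the ratios unchanged
    # and avoids overflow for large Q-values.
    q_max = max(q_values)
    weights = [k ** (q - q_max) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def choose_action(q_values, k=50.0, rng=random):
    """Sample an action index according to the probabilities above."""
    probs = action_probabilities(q_values, k)
    return rng.choices(range(len(q_values)), weights=probs)[0]


# Hypothetical Q-values for the 4 directions in some state s.
q_s = [0.0, 1.0, 0.5, 0.0]
probs = action_probabilities(q_s)
action = choose_action(q_s, rng=random.Random(0))
```

With k = 50 a difference of 1 between two Q-values makes the better action 50 times more likely than the other, so the selection is strongly biased towards exploitation.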
Figure 5.3 clearly shows that the SU function produces poor results, worse
than random actions, because each agent tries to collect as many tokens as possible,
thus competing with the others. The TG utility seems to be quite good with respect
to SU, but the learning time is extremely long, because each agent receives a noisy
reward (it does not clearly perceive the consequences of its own actions alone).
This problem does not occur with WLU (there is also the Aristocrat Utility in [24],
which is not examined here), because an agent can clearly discern how its actions
affected the world reward.
Figure 5.3: System performance with 10 agents on a 10×10 grid (from [24])
Figure 5.4 shows results qualitatively similar to those of Figure 5.3. The TG
utility has a harder time learning than on the small grid, because in this case the
payoff is even noisier. Instead, agents using WLU are able to cooperate because
they can discern the consequences of their actions from the rewards obtained.
5.3 Bar Problem
Here we apply COIN to a problem that is well known and widely examined in the
literature (see [6, Section 4]); it is an instance of a dispersion game, where we
have n agents and k tasks to be assigned to those agents. Here we have n agents,
each of whom picks one of seven nights to attend a bar in the following week (this
process is repeated every week), trying to avoid both overcrowded nights and boring
ones (nights where few agents attend the bar). Each week, every agent uses its
own RL algorithm to choose which night to attend the bar so as to maximize its
utility.
Figure 5.4: System performance with 100 agents on a 32×32 grid (from [24])
The world utility function is given by:
$$G(\zeta) = \sum_{t} \sum_{k=1}^{7} \gamma_k(x_k(\zeta, t)) \tag{5.7}$$
where $x_j(\zeta, t)$ is the j-th component of the vector x(ζ, t), i.e. the number of agents
attending night j at week t, while $\gamma_k(y) \equiv \alpha_k \cdot y \cdot \exp(-y/c)$ (c and {αk} are
real values, where c is the given optimal number of agents attending the bar and
{αk} are weight factors). This world utility is the sum of the rewards for each night
in each week.
This G is chosen to reflect the attendance at the bar in different night
configurations: with too few or too many agents on a night, γk returns a small
reward (note that γk(u) has its maximum at u = c).
The vector α weights the different nights to give them more or less importance. In
[6] the authors choose two different vectors to test the effectiveness of COIN:
α1 = [1 1 1 1 1 1 1] and α2 = [0 0 0 7 0 0 0].
In these experiments the authors set c = 6 and a number n of agents 4 times
larger than the number of agents necessary to have c agents attending the bar on
each of the seven nights (n = 7 · 6 · 4 = 168 agents).
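To make the shape of γk and the setting n = 168 more concrete, the following Python sketch evaluates the inner sum of Equation (5.7) for one week. The function names and the two attendance profiles (`uniform` and `dispersed`) are hypothetical choices made for illustration; they are not configurations reported in [6].

```python
import math


def gamma(y, c, alpha=1.0):
    """Reward of a single night: alpha * y * exp(-y / c)."""
    return alpha * y * math.exp(-y / c)


def weekly_world_utility(attendance, c, alphas):
    """Inner sum of Equation (5.7) for one week."""
    return sum(gamma(x, c, a) for x, a in zip(attendance, alphas))


c = 6
alpha1 = [1, 1, 1, 1, 1, 1, 1]
n = 7 * c * 4  # 168 agents, as in the experiments described above

# Attendance that maximizes the reward of a single night (integer search).
best_single_night = max(range(n + 1), key=lambda y: gamma(y, c))

# Two hypothetical weekly attendance profiles for the same 168 agents.
uniform = [24] * 7                  # agents spread evenly over the week
dispersed = [c] * 6 + [n - 6 * c]   # six nights with c agents, one crowded night

g_uniform = weekly_world_utility(uniform, c, alpha1)
g_dispersed = weekly_world_utility(dispersed, c, alpha1)
```

With these numbers the integer search confirms that a single night is best attended by c agents, and the second profile yields a larger weekly utility than the even spread, because 24 agents per night is far beyond c.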
Each agent is configured with different reward functions¹:
• Uniform Division reward (UD for short):
$$UD(d_{\omega}(t), \zeta, t) \equiv \gamma_{d_{\omega}}(x_{d_{\omega}}(\zeta, t)) / x_{d_{\omega}}(\zeta, t)$$
• Global Reward (GR for short):
$$GR(d_{\omega}(t), \zeta, t) \equiv \sum_{k=1}^{7} \gamma_k(x_k(\zeta, t))$$
• Wonderful Life reward (WL for short):
$$WL(d_{\omega}(t), \zeta, t) \equiv \sum_{k=1}^{7} \gamma_k(x_k(\zeta, t)) - \sum_{k=1}^{7} \gamma_k(x_k(CL_{\omega}(\zeta), t))$$
$$= \gamma_{d_{\omega}}(x_{d_{\omega}}(\zeta, t)) - \gamma_{d_{\omega}}(x_{d_{\omega}}(CL_{\omega}(\zeta), t))$$
where $d_{\omega}$ is the night chosen by subworld ω (to keep the problem simple we
assume that each subworld ω is composed of only one agent).
With the GR utility each agent receives the same reward as the others: the system
is obviously factored, but evaluating this function requires centralized
communication. This characteristic (to be avoided) is not present in WL, since each
agent only needs to know the total attendance on the night it attended.
(a) Performance with α1 (b) Performance with α2
Figure 5.5: Average performance of the Bar Problem (from [6])
¹ Take care not to confuse reward functions with utility functions: the former reflect
the state of the system at a given time step (and are potentially observable), while
the latter formalize the main goal of each agent (they depend on the past behavior
of each agent).
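The three reward functions defined above (UD, GR and WL) can be sketched in Python as follows. We assume here that the clamping operator $CL_{\omega}$ simply removes the agent, so that the attendance of the night it chose drops by one; this assumption, the function names and the attendance vector are illustrative only.

```python
import math


def gamma(y, c, alpha=1.0):
    """Reward of a single night: alpha * y * exp(-y / c)."""
    return alpha * y * math.exp(-y / c)


def ud_reward(night, attendance, c, alphas):
    """Uniform Division: the night's reward split among its attendees."""
    x = attendance[night]
    return gamma(x, c, alphas[night]) / x


def gr_reward(attendance, c, alphas):
    """Global Reward: needs the attendance of all seven nights."""
    return sum(gamma(x, c, a) for x, a in zip(attendance, alphas))


def wl_reward(night, attendance, c, alphas):
    """Wonderful Life: needs only the attendance of the chosen night."""
    x = attendance[night]
    return gamma(x, c, alphas[night]) - gamma(x - 1, c, alphas[night])


def wl_reward_full(night, attendance, c, alphas):
    """Same quantity computed as a difference of two global rewards."""
    without_agent = list(attendance)
    without_agent[night] -= 1   # assumed effect of clamping the agent
    return gr_reward(attendance, c, alphas) - gr_reward(without_agent, c, alphas)


# Hypothetical attendance for one week, with c = 6 and uniform weights.
attendance = [3, 6, 10, 6, 6, 6, 5]
alphas = [1.0] * 7
c = 6

wl_quiet = wl_reward(0, attendance, c, alphas)     # agent on a night with 3 agents
wl_crowded = wl_reward(2, attendance, c, alphas)   # agent on a night with 10 agents
```

Under these assumptions an agent on the quiet night gets a positive WL reward and an agent on the crowded night a negative one, and the local and the global computation of WL coincide because all the other nights cancel out.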
5.3.1 Results
Figure 5.5 depicts the performance averaged over 50 separate runs; in both
Figure 5.5(a) and Figure 5.5(b) the top curve is WL, the middle one is GR and the
bottom one is UD. Using WL we have convergence to near-optimal performance,
so we deduce that the Bar Problem is well suited to inducing cooperation.
Note the convergence time of WL as compared with GR in Figure 5.5(b): WL is
about 4 times faster. With α1 (Figure 5.5(a)) the convergence time of WL is 30 times
lower than that of GR. In both cases the UD utility performs very poorly and gets
worse in later weeks, a behavior typically due to a low signal-to-noise ratio (see
team games in Section 4.2.3).
Part II
Innovation
Chapter 6
Theoretical Considerations
In mathematics you don’t understand things. You just get used to them.
Johann von Neumann (1903-1957)
Contents
6.1 Class of Games
    6.1.1 Matrix Games
    6.1.2 Stochastic Games
    6.1.3 Differences between Grid world and Bar Problem
6.2 Delayed Reward
6.3 Reward Function of the Bar Problem
6.4 Q-learning Dynamics
6.5 Introduction to Coalition Formation
    6.5.1 Coalition Structure Generation
    6.5.2 Optimization within a Coalition
    6.5.3 Payoff Division
In this chapter we discuss coordination problems in dispersion games. Coordination
problems are a key factor in RL when we have many agents and our aim is to
induce them to cooperate. First, we introduce two classes of games and the
theoretical differences between the Grid world and the Bar Problem, together with
the bothersome problem of delayed reward, which is often present in RL and which
we illustrate with an intuitive example; then we analyze the Bar Problem, in
particular its reward function and its behavior (the latter using the Q-learning
dynamics, following the approach proposed in [20]).
The analysis of the Q-learning dynamics lets us understand how the policy of each
agent evolves during the learning phase under different reward functions.
Finally, we introduce the theoretical grounds of Coalition Formation theory,
which will be extended in Chapter 7.
6.1 Class of Games
In the following sections we discuss two main classes of games, how their
structure influences the behavior of an environment formalized with such games,
and how they can avoid or bring out some bothersome aspects of RL. After a brief
theoretical introduction, we analyze the Bar Problem and the Grid world in the
light of these definitions.
6.1.1 Matrix Games
Definition 6.1. A matrix game is defined as a tuple $\langle N, A_{1 \ldots N}, R_{1 \ldots N} \rangle$, where N
is a collection of n agents, $A_i$ is the set of actions available to agent i (let A be
the joint action space, i.e. $A_1 \times A_2 \times \ldots \times A_n$) and $R_i : A \to \mathbb{R}$ is the reward
function of agent i.
In such a game, each agent chooses actions to maximize its own reward function
(which can be viewed as an n-dimensional matrix), which depends on the actions
chosen by the other agents.
What does it mean to “solve” such a game? Solving the game means finding
the agent’s best response policy, which allows the agent to collect the highest
reward given the other agents’ policies. A stationary strategy can be evaluated
only if the strategies of the other agents are known (for instance in the prisoner's
dilemma, the matching pennies game, . . . ). Agents can also play mixed strategies,
i.e. select actions according to a probability distribution. This leads us to define
an opponent-dependent solution (i.e., the best response to the joint action of the
other players), and thus to the definition of Nash equilibrium.
These games can be purely collaborative (agents share the same reward function)
or purely competitive (each agent has a reward function opposed to that of the
others, so these games belong to the class of zero-sum games); both are special
cases of general-sum games. They are usually played for an undefined number of
iterations. In particular, in these games each agent can perceive the actions of
the others, but it knows neither their intentions nor their reward functions
(for such a game see [1, Section 3]).
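As a minimal sketch of Definition 6.1 and of the notion of best response, the following Python fragment encodes a two-agent matrix game and searches for its pure-strategy Nash equilibria by enumeration. The payoff values are the usual textbook prisoner's dilemma, chosen here only for illustration, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical two-player prisoner's dilemma.
# Actions: 0 = cooperate, 1 = defect.
# R[i][a0][a1] is the reward of agent i for the joint action (a0, a1).
R = [
    [[3, 0], [5, 1]],   # rewards of agent 0
    [[3, 5], [0, 1]],   # rewards of agent 1
]
ACTIONS = [0, 1]


def reward(i, joint):
    """R_i evaluated on a joint action."""
    return R[i][joint[0]][joint[1]]


def best_responses(i, joint):
    """Actions of agent i that maximize R_i when the other action is fixed."""
    def with_action(a):
        changed = list(joint)
        changed[i] = a
        return tuple(changed)

    best = max(reward(i, with_action(a)) for a in ACTIONS)
    return [a for a in ACTIONS if reward(i, with_action(a)) == best]


def pure_nash_equilibria():
    """Joint actions in which every agent plays a best response."""
    return [joint for joint in product(ACTIONS, repeat=2)
            if all(joint[i] in best_responses(i, joint) for i in range(2))]


equilibria = pure_nash_equilibria()
```

In this example the only joint action from which no agent wants to deviate unilaterally is mutual defection, even though mutual cooperation gives both agents a higher reward.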
6.1.2 Stochastic Games
Definition 6.2. A stochastic game is defined as a tuple $\langle N, S, A_{1 \ldots N}, T, R_{1 \ldots N} \rangle$,
where N is a collection of n agents, S is the set of all the possible states of the
game, $A_i$ is the set of actions available to agent i (let A be the joint action space,
i.e. $A_1 \times A_2 \times \ldots \times A_n$), $T : S \times A \times S \to [0, 1]$ is a transition function and
$R_i : S \times A \to \mathbb{R}$ is the reward function of agent i.
Definition 6.2 looks very similar to Definition 6.1; in fact, each state of a
stochastic game can be viewed as a matrix game with the payoff for each joint
action determined by $R_i(s, a)$: after playing the matrix game and receiving their
payoffs, the agents are transitioned to another state (associated with a new matrix
game) determined by their joint action.
6.1.3 Differences between Grid world and Bar Problem
The Grid world and the Bar Problem are formalized as different types of games:
the former belongs to the class of stochastic games (Section 6.1.2), the latter to
that of matrix games (Section 6.1.1). This classification is based upon the structure
of a game. Moreover, we can have episodic games (i.e., games played a certain
number of times) as well as games where the environment is dynamically changed
by the agents' behavior.
Following the notation introduced in Sections 6.1.1 and 6.1.2, in the Bar Problem
(Section 5.3) $A_i$ is the set of nights available to agent i and $R_i$ depends only on a
function of the joint action (see the function $x_k(\cdot, \cdot)$ of Equation (5.7)).
Furthermore, the state space is drastically reduced, since each agent knows only
the number of agents attending the bar on the night it has chosen and not their
identity¹, so the problem can be viewed as a single-agent one (with a reduced state
space too). With this particular configuration and, above all, since the Bar Problem
belongs to the class of matrix games, we avoid the delayed reward issue: the reward
of a joint action (i.e. the night on which each agent attends the bar) is immediately
available at the next time step, so agents have a factual evaluation of their own
actions.
The Grid world environment (Section 5.2) is a stochastic non-stationary game,
since the environment where agents act is dynamically changed by the actions they
choose (i.e. collected tokens are no longer present), and mainly because each single
reward depends upon both the joint action and the system's state. In this kind of
game we may face the delayed reward problem. As we will see in Section 6.2, the
TG, SU and UD utility functions are immune to this problem, since the reward
returned to each agent relies only on the joint action executed at the previous time
step (it obviously depends on the environment configuration, but this does not
involve a delayed reward). With the WLU function these considerations do not
hold, as shown by the example proposed in Section 6.2.
¹ That is, having agents η1 and η2 or agents η41 and η79 makes no difference.
6.2 Delayed Reward
RL is based upon the concept of reward: an agent (or more than one) perceives a
state $s_t$ of the environment at time t, chooses an action $a_t$ complying with its
policy π (so it reaches another state $s'_t$) and then receives a reward signal $r_{t+1}$
(which usually ranges over ℝ). Given that reward, the agent can change its policy
following, for instance, Q-learning or SARSA (or any other RL algorithm). It is
desirable that this reward signal be immediately available, in particular if it
is fundamental to evaluate the action $a_t$ chosen in state $s_t$: in this case, the agent
can understand the consequences of its action (“it is a good action” or “it is not a
good action”) because the reward immediately obtained is directly related to that
action. If the reward signal is given to the agent with a certain delay τ (τ ∈ ℕ,
τ > 0) and the agent is not aware of this delay, it cannot understand that the
reward is related to action $a_t$ rather than to the current action $a_{t+\tau}$. In this case
the agent changes its policy according to that reward, but it attributes it to the
wrong situation (i.e. to the pair $\langle s_{t+\tau}, a_{t+\tau} \rangle$ rather than $\langle s_t, a_t \rangle$).
This situation is known as delayed reward in the literature (see [10] and [17,
Chapter 7]); next we see a simple example.
Recall the Grid world presented in Section 5.2, where two or more agents move
in a grid collecting tokens in such a way as to avoid gathering tokens that would be
taken by other agents anyway (this is a way of modeling the problem that induces
cooperation among them). Let us consider Equation (5.6): the first term computes
the sum of the values of the tokens collected at time t, while the second one
computes the sum of the values of the tokens that would be collected at time t
without agent η. Suppose that the following sequence of events holds:
• agent η1 takes the token k1 with value 2.5 at time t;
• from time t + 1 to time t + τ − 1 (τ ∈ ℕ, τ > 1) no tokens are collected;
• at time t + τ agent η2 is on the cell where the token k1 was at time t,
while agent η1 is far away from agent η2.
In such a situation the first term of Equation (5.6) is zero, since at time step
t + τ no tokens are found, while the second one, for agent η1, is equal to the sum
of the values of the tokens that would be collected at this time step had η1 never
existed, that is the value of k1. So agent η1 receives −2.5 as a reward, which
penalizes the fact that it took the token k1, even though at time t + τ it is far away
from that token.
Note that this problem affects only the WLU function in the Grid world, and not
the TG, SU and UD utility functions. This can be derived by looking at the
definition of these utility functions (see Section 5.2.2): TG, SU and UD are related
only to the tokens collected in the current time step, without considering tokens
taken in the past by other agents.
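The example above can be reproduced numerically with Equations (5.5) and (5.6). The following Python sketch is only illustrative: the grid coordinates, the two paths, the choice τ = 3 (with time 0 playing the role of t) and all function names are assumptions.

```python
def value(visited, tokens):
    """V: total value of the tokens lying in the visited cells."""
    return sum(v for cell, v in tokens.items() if cell in visited)


def visited_before(paths, t, skip=None):
    """Cells occupied at some time t' < t by any agent other than `skip`."""
    return {path[u]
            for name, path in paths.items() if name != skip
            for u in range(min(t, len(path)))}


def global_step_reward(paths, tokens, t):
    """Single-time-step global reward, Equation (5.5)."""
    return (value(visited_before(paths, t + 1), tokens)
            - value(visited_before(paths, t), tokens))


def wlu_step_reward(paths, tokens, agent, t):
    """Single-time-step WLU reward of `agent`, Equation (5.6)."""
    without_agent = (value(visited_before(paths, t + 1, skip=agent), tokens)
                     - value(visited_before(paths, t, skip=agent), tokens))
    return global_step_reward(paths, tokens, t) - without_agent


# Hypothetical scenario: a single token k1 with value 2.5 in cell (0, 0).
tokens = {(0, 0): 2.5}
paths = {
    "eta1": [(0, 0), (0, 1), (0, 2), (0, 3)],   # takes k1, then moves away
    "eta2": [(3, 0), (2, 0), (1, 0), (0, 0)],   # reaches k1's cell at time 3
}

rewards_eta1 = [wlu_step_reward(paths, tokens, "eta1", t) for t in range(4)]
rewards_eta2 = [wlu_step_reward(paths, tokens, "eta2", t) for t in range(4)]
```

In this sketch agent η1 is rewarded with 2.5 when it takes the token and penalized with −2.5 three steps later, when it is far from that cell: the penalty arrives with a delay and is unrelated to the action η1 is executing at that time.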
6.3 Reward Function of the Bar Problem
Looking more closely at the Bar Problem, we can discuss the exponential function
γ(·) of its world utility (Equation (5.7)). It is not symmetric and it has a long
right tail, which grows as c increases, as depicted in Figure 6.1.
[Figure: plot titled “Exponential functions”; x-axis: Agents (0 to 20); y-axis: Rewards (0 to 3); curves for c = 4, 5, 6, 7, 8]
Figure 6.1: Exponential functions of the Bar Problem
The Bar Problem aims to introduce an implicit coordination among many
agents to avoid both overcrowded nights and boring ones. The exponential function
proposed in [6] has its maximum at n = c = 6 (see the blue line of Figure 6.1): when
n < c agents attend the bar the reward changes rapidly with n, while for n > c the
differences among the rewards returned to agents are minimal, e.g. from n = 6 to
n = 10 agents the reward decreases by only about 14.43%², so agents are induced
to consider the nights with 10 agents attending the bar as safe. This non-uniform
reward might return unclear information to agents, so that they might fail to
concentrate on coordination to obtain the highest possible reward values. This
characteristic becomes more important with the WLU function, since it computes
the difference between two bar configurations to obtain the reward to be returned.
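The asymmetry discussed above can be checked with a few lines of Python. The sketch assumes the unweighted exponential reward n · exp(−n/c) with c = 6; the comparison with n = 2 (the same distance from c as n = 10, but on the left side) is an addition made here only to illustrate the asymmetry.

```python
import math


def exponential_reward(n, c=6):
    """Unweighted exponential reward of a night with n agents."""
    return n * math.exp(-n / c)


r6 = exponential_reward(6)     # reward at the optimum n = c
r10 = exponential_reward(10)   # four agents more than the optimum
r2 = exponential_reward(2)     # four agents fewer than the optimum

right_drop = (r6 - r10) / r6   # relative decrease on the crowded side
left_drop = (r6 - r2) / r6     # relative decrease on the empty side
```

Under this assumption the relative decrease from n = 6 to n = 10 is about 14.4%, while moving the same distance towards the empty side costs more than twice as much.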
In such a case, it may be worth adopting a symmetric function that propagates
rewards that are as uniform as possible, based on the absolute value of the
difference between the optimal number of agents and the number of agents
attending the bar. A well-known function belonging to this class is the Gaussian
function with mean µ equal to the optimal number c of agents attending the bar
and with an appropriate variance σ². The variance should not be too large,
otherwise we fall back into the previous problem: we have a function that still
returns uniform rewards, but the difference between the reward obtained by an
agent when there are n + 1 agents and when there are n agents can be very small.
On the other hand, it cannot be too small, otherwise learning is very slow and the
convergence time is large. For instance, suppose we have agents using the WLU
function. If we impose σ² = 1 we could think that with this small variance agents
will accurately learn optimal actions. However, we have an important problem: if
n = 6 agents attend the bar, these agents receive a reward r = 0.157, so they learn
that the night they chose is good; on the contrary, if n = 7 agents attend the bar
they receive a reward r = −0.157, so they learn that the night they chose is bad.
With any other value of n, agents receive a reward r ≃ 0. With this simple example
it is easy to understand that agents are inclined to pick overcrowded nights (r ≃ 0)
or boring ones (r ≃ 0 because there are few agents), thus discarding the main goal,
because they prefer to obtain a reward r = 0 (or close to 0) rather than a negative
one. Therefore, we must choose a suitable variance to obtain fast learning and a
small convergence time, and the Gaussian function should not be wider than the
exponential one (thus its variance must not be too high) in order to avoid poor
(difference) rewards.
² With 6 agents the exponential reward function returns $r_6 = 6 \cdot \exp(-1) \simeq 2.21$, and
with 10 agents it returns $r_{10} = 10 \cdot \exp(-5/3) \simeq 1.89$.
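The values r = 0.157 and r = −0.157 quoted above can be reproduced under one specific reading of the Gaussian reward, which we state here as an assumption: the reward of a night is the normal probability density with mean c = 6 and σ = 1, and the WL reward of an agent is the difference between the density at n and at n − 1. The function names are hypothetical.

```python
import math


def gaussian(n, mean, sigma):
    """Normal probability density with the given mean and standard deviation."""
    z = (n - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def wl_gaussian_reward(n, c=6, sigma=1.0):
    """Assumed WL reward of one agent on a night with n agents (itself included)."""
    return gaussian(n, c, sigma) - gaussian(n - 1, c, sigma)


r_at_6 = wl_gaussian_reward(6)     # agent on a night with exactly c agents
r_at_7 = wl_gaussian_reward(7)     # agent on a night with c + 1 agents
r_at_12 = wl_gaussian_reward(12)   # agent on a heavily overcrowded night
```

With this reading the rewards at n = 6 and n = 7 match the values in the text, while a night far from c (here n = 12) gives a reward that is practically zero, which is the signal that can mislead agents towards overcrowded nights.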
[Figure: plot titled “Gaussian and Exponential WL Reward Functions”; x-axis: Agents attending the bar (0 to 26); y-axis: Reward (−0.5 to 0.6); curves: Gaussian with std. deviation 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, and Exponential]
Figure 6.2: WLU rewards
Figure 6.2 depicts the shape of the exponential and the Gaussian reward
functions when agents use the WLU function. We multiplied each Gaussian function
by a positive real number k so that, with n = c agents attending the bar, it has
approximately the same maximum value as the WL exponential utility function.
It is easy to see that the WL exponential utility function returns good rewards
when there are fewer than 6 agents, but its shape is too smooth with 7 or more
agents, where it returns rewards that are too similar: for instance, between n = 11
and n = 17 agents the offset is only about 16.7%.
With the WL Gaussian utility function we have more significant rewards. When
fewer than 6 agents attend the bar on a specific night, they are induced to attend
the bar on that night, but when slightly more than 6 agents attend, they are
induced to attend the bar on another night. In the extreme case when n ≫ c
(e.g. n = 14 with σ = 2.5), agents receive essentially zero reward and they are
induced to consider this night as quite good.
In Chapter 8 we present the results obtained from all the previous considerations,
that is, we compare the performance of the Bar Problem with the exponential
reward function and with the Gaussian one, the Q-learning dynamics with the
latter, as well as the performance of the Grid world.
6.4 Q-learning Dynamics
To understand how Q-learning works, we need to analyze its dynamics. Following
[20] and [21] we draw on the Replicator Dynamics (RD) model from Evolutionary
Game Theory (EGT). The concepts and techniques developed in EGT were initially
formulated in the context of evolutionary biology; here we have a population
composed of the strategies of all agents, and these strategies evolve: analyzing the
expected value of this process gives an approximation called RD. This evolutionary
process usually combines selection and mutation: the former favors some varieties
over others, the latter provides variety in the population. RD mainly focuses
on the role of selection, describing how a system consisting of different strategies
changes over time.
The general form of a replicator dynamics is the following:
$$\frac{dx_i}{dt} = \left[ (Ax)_i - x \cdot Ax \right] x_i \tag{6.1}$$
where $x_i$ represents the density of strategy i in the population and A is the payoff
matrix, which describes the payoff values each strategy receives when interacting
with the others in the population. The state of the population is described by
$x = (x_1, x_2, \ldots, x_J)$ and represents the densities of all the different types
of strategies in the population. As a consequence, $(Ax)_i$ is the payoff received by
strategy i and $x \cdot Ax$ is the average payoff in the population. The growth rate
$\frac{dx_i}{dt} / x_i$ of the population using strategy i equals the difference between the
current payoff of the strategy and the average payoff in the population.
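As a minimal numerical sketch of Equation (6.1), the following Python fragment integrates the replicator dynamics with explicit Euler steps. The payoff matrix (a prisoner's dilemma in which the second strategy dominates), the initial densities, the step size and the function name are assumptions chosen only for illustration.

```python
def replicator_step(x, A, dt):
    """One explicit Euler step of dx_i/dt = [(Ax)_i - x.Ax] x_i."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    average = sum(xi * p for xi, p in zip(x, Ax))
    return [xi + dt * (p - average) * xi for xi, p in zip(x, Ax)]


# Hypothetical payoff matrix: strategy 1 earns more than strategy 0
# against every opponent, so selection should favor it.
A = [[3.0, 0.0],
     [5.0, 1.0]]
x = [0.9, 0.1]   # initial densities of the two strategies

for _ in range(2000):
    x = replicator_step(x, A, dt=0.01)
```

Since the growth terms sum to zero whenever the densities sum to one, the densities remain a probability vector, and in this example the dominant strategy takes over almost the whole population.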
Let us have 2 different populations p and q using Q-learning: we need a system
of two differential equations, which corresponds to an RD for asymmetric games.
If $A = B^{T}$ Equation (6.1) holds again; in general we can write:
$$\frac{dp_i}{dt} = \left[ (Aq)_i - p \cdot Aq \right] p_i$$
$$\frac{dq_i}{dt} = \left[ (Bp)_i - q \cdot Bp \right] q_i \tag{6.2}$$
The growth rate of the strategies in each population is now determined by the
performance of the other population.
In this situation each agent has a probability vector $[x_1, x_2, \ldots, x_n]$ over its
action set $\{a_1, a_2, \ldots, a_n\}$. The Boltzmann distribution is described by
$$x_i(t) = \frac{\exp\left( \frac{1}{\tau} \cdot Q_{a_i}(t) \right)}{\sum_{j=1}^{n} \exp\left( \frac{1}{\tau} \cdot Q_{a_j}(t) \right)} \tag{6.3}$$
where $x_i(t)$ is the probability of playing strategy i at time step t, τ is the
temperature³ (τ ∈ ℝ and τ > 0) and $Q_{a_i}(t)$ is the Q-value of action $a_i$ at time t.
The temperature controls the necessary tradeoff between exploration (high
temperature) and exploitation (low temperature), hence it is decreased over time.
If we have payoff matrices A and B for 2 players we can compute the
continuous-time limit (it is well explained in [21]), obtaining for each player⁴:
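$$\frac{dx_i}{dt} = x_i \alpha \frac{1}{\tau} \left[ (Ay)_i - x \cdot Ay \right] + x_i \alpha \sum_{j} x_j \ln\left( \frac{x_j}{x_i} \right) \tag{6.4}$$
$$\frac{dy_i}{dt} = y_i \alpha \frac{1}{\tau} \left[ (Bx)_i - y \cdot Bx \right] + y_i \alpha \sum_{j} y_j \ln\left( \frac{y_j}{y_i} \right) \tag{6.5}$$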
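The first term of Equations (6.4) and (6.5) is exactly the RD and thus takes
care of the selection mechanism; the mutation term for Q-learning is given by the
second term, which can in fact be written as:
$$\begin{aligned}
x_i \alpha \sum_{j} x_j \ln\left( \frac{x_j}{x_i} \right) &= x_i \alpha \left( \sum_{j} x_j \ln(x_j) - \ln(x_i) \right) \\
&= \alpha \left( -x_i \ln(x_i) \right) - x_i \alpha \left( -\sum_{j} x_j \ln(x_j) \right) \\
&= \alpha \left( S_i - x_i S_n \right)
\end{aligned} \tag{6.6}$$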
³ Notice that this τ has nothing to do with the one used in Section 6.2.
⁴ In [21] the authors use τ instead of 1/τ; in the literature this term can appear either in
the numerator or in the denominator.
where $S_i = -x_i \ln(x_i)$ is the entropy of strategy i (how much we know about that
strategy) and $S_n = -\sum_{j} x_j \ln(x_j)$ is the entropy concerning the entire distribution.
If we have more than 2 populations, we can extend Equations (6.4) and (6.5)
using the theoretical interpretation of Equation (6.1): the first term equals the
payoff of the strategy chosen and the second one equals the average payoff of the
population.
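The two-player dynamics of Equations (6.4) and (6.5), together with the entropy form of the mutation term in Equation (6.6), can be sketched in Python as follows. The payoff matrices, the mixed strategies and the values of α and τ are hypothetical, and the function names are introduced only for this example; the strategies must be strictly positive for the logarithms to be defined.

```python
import math


def q_learning_derivative(x, y, A, alpha, tau):
    """Right-hand side of Equation (6.4); swap roles and use B for (6.5)."""
    Ay = [sum(a * yj for a, yj in zip(row, y)) for row in A]
    average = sum(xi * p for xi, p in zip(x, Ay))
    derivative = []
    for xi, payoff in zip(x, Ay):
        selection = xi * alpha / tau * (payoff - average)
        mutation = xi * alpha * sum(xj * math.log(xj / xi) for xj in x)
        derivative.append(selection + mutation)
    return derivative


def mutation_entropy_form(x, alpha):
    """alpha * (S_i - x_i * S_n), the last line of Equation (6.6)."""
    S_n = -sum(xj * math.log(xj) for xj in x)
    return [alpha * (-xi * math.log(xi) - xi * S_n) for xi in x]


# Hypothetical symmetric 2x2 game and interior mixed strategies.
A = [[3.0, 0.0], [5.0, 1.0]]   # payoffs of the first player
B = [[3.0, 0.0], [5.0, 1.0]]   # payoffs of the second player
x, y = [0.6, 0.4], [0.3, 0.7]
alpha, tau = 0.1, 0.5

dx = q_learning_derivative(x, y, A, alpha, tau)
dy = q_learning_derivative(y, x, B, alpha, tau)

# With a null payoff matrix only the mutation term survives.
zero = [[0.0, 0.0], [0.0, 0.0]]
mutation_only = q_learning_derivative(x, y, zero, alpha, tau)
```

Both the selection and the mutation terms sum to zero over the strategies, so the derivatives keep x and y on the probability simplex, and the mutation term computed directly agrees with its entropy form.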
We now derive the Q-learning dynamics of the Bar Problem following the
previous theory. For ease of computation we analyze the Bar Problem with 8
agents and 2 days per week, with the following policy matrix:
$$\Pi = \begin{bmatrix} a(t) & b(t) & \ldots & h(t) \\ 1 - a(t) & 1 - b(t) & \ldots & 1 - h(t) \end{bmatrix}^{T} \tag{6.7}$$
where a(t) is the probability that the first agent attends the bar on Monday (and
obviously 1 − a(t) is the probability that it attends the bar on Tuesday, and so on),
while the strategy i analyzed for each player is “attend the bar on Monday”.
The two terms of Equations (6.4) and (6.5) rely on the utility function we use;
for instance, if we use the TG utility function with the strategy “attend the bar on
Monday”, the quantities appearing in the first term of those equations for strategy
(agent) i may be written as follows:
$$(Ay)_i = \Psi(1 + \mathrm{agents}_{-i}(\mathrm{Monday}), c) + \Psi(\mathrm{agents}_{-i}(\mathrm{Tuesday}), c) \tag{6.8}$$
$$\begin{aligned}
x \cdot Ay = {} & P_i[\mathrm{Monday}] \cdot \left[ \Psi(1 + \mathrm{agents}_{-i}(\mathrm{Monday}), c) + \Psi(\mathrm{agents}_{-i}(\mathrm{Tuesday}), c) \right] + {} \\
& P_i[\mathrm{Tuesday}] \cdot \left[ \Psi(1 + \mathrm{agents}_{-i}(\mathrm{Tuesday}), c) + \Psi(\mathrm{agents}_{-i}(\mathrm{Monday}), c) \right]
\end{aligned} \tag{6.9}$$
where:
• Ψ(·, ·) is the reward function (in [6] it matches γk(·)); it depends on the number
of agents attending the bar and on the optimal number c of agents;
• $\mathrm{agents}_{-i}(d)$ gives the expected number of agents, excluding agent i, attending
the bar on day d (it can be easily computed with Equation (6.7));
• $P_i[d]$ gives the probability that agent i attends the bar on day d (see Equation
(6.7)).
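Instead, if we use the WLU function with the same strategy as above we have:
$$(Ay)_i = \Psi(1 + \mathrm{agents}_{-i}(\mathrm{Monday}), c) - \Psi(\mathrm{agents}_{-i}(\mathrm{Monday}), c) \tag{6.10}$$
$$\begin{aligned}
x \cdot Ay = {} & P_i[\mathrm{Monday}] \cdot \left[ \Psi(1 + \mathrm{agents}_{-i}(\mathrm{Monday}), c) - \Psi(\mathrm{agents}_{-i}(\mathrm{Monday}), c) \right] + {} \\
& P_i[\mathrm{Tuesday}] \cdot \left[ \Psi(1 + \mathrm{agents}_{-i}(\mathrm{Tuesday}), c) - \Psi(\mathrm{agents}_{-i}(\mathrm{Tuesday}), c) \right]
\end{aligned} \tag{6.11}$$
Finally, for the UD utility function with the same strategy as above we have:
$$(Ay)_i = \Psi(1 + \mathrm{agents}_{-i}(\mathrm{Monday}), c) \tag{6.12}$$
$$x \cdot Ay = P_i[\mathrm{Monday}] \cdot \Psi(1 + \mathrm{agents}_{-i}(\mathrm{Monday}), c) + P_i[\mathrm{Tuesday}] \cdot \Psi(1 + \mathrm{agents}_{-i}(\mathrm{Tuesday}), c) \tag{6.13}$$
For instance, $P_1[\mathrm{Monday}] = a(t)$ and $\mathrm{agents}_{-1}(\mathrm{Monday}) = b(t) + c(t) + d(t) +
e(t) + f(t) + g(t) + h(t)$ (see Equation (6.7)).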
In Section 8.2.3 we will use this approach and we will show the results obtained.
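A hedged Python sketch of Equations (6.8)-(6.13) follows. It assumes that Ψ is the exponential reward n · exp(−n/c) and that c = 4 for the reduced problem with 8 agents and 2 days; both choices, the policy values and the function names are illustrative assumptions and not the settings used in Section 8.2.3.

```python
import math


def psi(n, c):
    """Assumed reward function: the exponential n * exp(-n / c)."""
    return n * math.exp(-n / c)


def expected_others(p_monday, i):
    """Expected number of agents other than i on (Monday, Tuesday)."""
    monday = sum(p for j, p in enumerate(p_monday) if j != i)
    tuesday = sum(1.0 - p for j, p in enumerate(p_monday) if j != i)
    return monday, tuesday


def payoff_terms(p_monday, i, c, utility):
    """Return ((Ay)_i, x.Ay) for the strategy "attend the bar on Monday"."""
    mon, tue = expected_others(p_monday, i)
    if utility == "TG":      # Equations (6.8) and (6.9)
        on_monday = psi(1 + mon, c) + psi(tue, c)
        on_tuesday = psi(1 + tue, c) + psi(mon, c)
    elif utility == "WLU":   # Equations (6.10) and (6.11)
        on_monday = psi(1 + mon, c) - psi(mon, c)
        on_tuesday = psi(1 + tue, c) - psi(tue, c)
    elif utility == "UD":    # Equations (6.12) and (6.13)
        on_monday = psi(1 + mon, c)
        on_tuesday = psi(1 + tue, c)
    else:
        raise ValueError(f"unknown utility: {utility}")
    average = p_monday[i] * on_monday + (1.0 - p_monday[i]) * on_tuesday
    return on_monday, average


# Hypothetical policies: first row of Equation (6.7) for the 8 agents.
policy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.5, 0.5, 0.5]
c = 4
selection = {}
for name in ("TG", "WLU", "UD"):
    payoff, average = payoff_terms(policy, 0, c, name)
    selection[name] = payoff - average   # bracket of the first term of (6.4)
```

With these policies Monday is expected to be the more crowded day, and the bracket is negative for all three utilities, i.e. the selection term pushes the first agent away from Monday. Note that in these expected-value formulas the TG and WLU brackets coincide, because the terms that do not depend on the choice of agent i cancel out in the difference between the payoff and the average.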
6.5 Introduction to Coalition Formation
In many domains we may need to work with a large number of agents in order to
reach a goal. In such cases, we can try to model their interactions with normal
form games, but this model might be neither accurate nor useful, because it
becomes huge and unreliable. Another way to consider this problem is to study
such games in a more abstract setting, from the point of view of cooperative game
theory, as characteristic function games. In such games, the value of each coalition
of agents S is given by a characteristic function v(S). It may be interpreted as the
value created when the members of S come together and interact. As a
consequence, a cooperative game is a pair 〈N, v(·)〉, where N is the finite set
of players and v(·) is a function mapping subsets of N to numbers.
Coalition formation involves three main activities:
Coalition structure generation : this is the phase in which agents form coalitions
such that agents within each coalition coordinate their activities, while there
is no coordination between coalitions. This means partitioning the set of
agents N into exhaustive and disjoint coalitions; such a partition is called a
coalition structure (CS). Notice the difference between a coalition and a CS:
a coalition is any element of the powerset of N (which has $2^n$ elements,
where |N | = n), while a CS must satisfy a set of constraints (e.g., in a
coalition structure no agent η may belong to two different coalitions C1 and C2).
Solving the optimization problem of each coalition : this means pooling the
tasks and resources of the agents in the coalition and maximizing monetary
value, that is the money received from outside the system for accomplishing
tasks minus the cost of using resources.
Dividing the value of the general solution among agents : the reward is
assigned to each coalition, but agents need that reward in order to update
their policies. As a consequence, that reward must be divided up among
them so that they can evaluate how good their actions were.
These activities may overlap and they are not independent, e.g.: the coalition that
an agent wants to join depends on the portion of the value that the agent would
be allocated in each potential coalition.
6.5.1 Coalition Structure Generation
Let N be the set of all agents, and n = |N |, while S is a generic coalition. Here we
assume that each coalition’s value v = v(S) is nonnegative. In a coalition structure
CS each agent belongs to exactly one coalition and some agents may be alone in
their coalitions. We denote the set of all coalition structures by M . The value of a
coalition structure is given by:
\[ V(CS) = \sum_{S \in CS} v(S) \tag{6.14} \]
Usually, the goal is to maximize the social welfare of agents by finding a coalition
structure CS∗ such that:
\[ CS^{*} = \arg\max_{CS \in M} V(CS) \tag{6.15} \]
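For a handful of agents, Equations (6.14) and (6.15) can be evaluated by brute force. The sketch below enumerates every coalition structure (set partition) of N and keeps the one with the highest value. The characteristic function `toy_value` is a made-up example, and all names are illustrative assumptions.

```python
from typing import Callable, FrozenSet, Iterator, List

Coalition = FrozenSet[int]


def partitions(agents: List[int]) -> Iterator[List[Coalition]]:
    """Yield every coalition structure (set partition) of `agents`."""
    if not agents:
        yield []
        return
    first, rest = agents[0], agents[1:]
    for smaller in partitions(rest):
        # Add `first` to each existing coalition in turn...
        for k, coalition in enumerate(smaller):
            yield smaller[:k] + [coalition | {first}] + smaller[k + 1:]
        # ...or let it form a coalition of its own.
        yield smaller + [frozenset({first})]


def structure_value(cs: List[Coalition], v: Callable[[Coalition], float]) -> float:
    """V(CS): sum of the coalition values, Equation (6.14)."""
    return sum(v(s) for s in cs)


def best_structure(agents: List[int], v: Callable[[Coalition], float]) -> List[Coalition]:
    """CS*: exhaustive arg max over all coalition structures, Equation (6.15)."""
    return max(partitions(agents), key=lambda cs: structure_value(cs, v))


def toy_value(s: Coalition) -> float:
    """Hypothetical game: singletons are worth 1, pairs 3, anything else 0."""
    return {1: 1.0, 2: 3.0}.get(len(s), 0.0)


if __name__ == "__main__":
    best = best_structure([1, 2, 3, 4], toy_value)
    print(best, structure_value(best, toy_value))
```

The exhaustive search is only viable for very small n, because the number of coalition structures grows extremely fast with the number of agents.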
The number of coalition structures is very large ($O(n^n)$ and $\omega(n^{n/2})$), so
the coalition structures cannot all be enumerated unless the number of agents is small.
The exact number of coalition structures is:
\[ \sum_{i=1}^{n} Z(n, i), \tag{6.16} \]
where Z(n, i) is the number of coalition structures with i coalitions. This quantity
is also known as the Stirling number of the second kind and it is captured by the
following recurrence:
\[ Z(n, i) = i \cdot Z(n - 1, i) + Z(n - 1, i - 1), \tag{6.17} \]
where Z(n, n) = Z(n, 1) = 1. The first term counts the number of coalition
structures formed by adding the new agent to one of the existing coalitions (there
are i choices because the existing coalition structure has i coalitions). The second
term considers putting the new agent in a coalition of its own, and therefore counts
the existing coalition structures with only i − 1 coalitions.
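The count can be checked with a direct implementation of the standard recurrence for Stirling numbers of the second kind, Z(n, i) = i · Z(n − 1, i) + Z(n − 1, i − 1), summed over i as in Equation (6.16). The function names below are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, i: int) -> int:
    """Number of coalition structures of n agents with exactly i coalitions."""
    if i < 1 or i > n:
        return 0
    if i == 1 or i == n:
        return 1
    # New agent joins one of the i existing coalitions, or forms its own.
    return i * stirling2(n - 1, i) + stirling2(n - 1, i - 1)


def count_coalition_structures(n: int) -> int:
    """Total number of coalition structures of n agents (Equation (6.16))."""
    return sum(stirling2(n, i) for i in range(1, n + 1))


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, count_coalition_structures(n))
```

Already for a few tens of agents the total exceeds what an exhaustive search could visit, which motivates the approximate and learning-based approaches discussed later.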
Recall that with n agents the number of possible coalitions (that is, the size of the
powerset of N) is $2^n - 1$ (not counting the empty set). If we decide to exclude some
coalitions a priori, this exclusion might cause the value of the best remaining
coalition structure to be arbitrarily far from the optimum.
In the literature many researchers have mostly focused on superadditive games
($v(S \cup T) \geq v(S) + v(T)$ for all disjoint coalitions S,T ⊆ N): in such games
coalition structure generation is trivial, because all agents are best off forming
the grand coalition, in which they operate together. However, many games are not
superadditive, because forming a coalition may have a cost or be subject to
constraints (coordination overhead, anti-trust penalties, limited time to carry out
the communications and computations). Such games may be subadditive
(v(S ∪ T) < v(S) + v(T) for all disjoint coalitions S,T ⊆ N), where agents are best
off operating alone, or they may be neither superadditive nor subadditive, where
some coalitions are best off merging while others are not.
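For small games the three cases can be told apart by checking every pair of disjoint coalitions. This is a minimal sketch, assuming the weak inequality v(S ∪ T) ≥ v(S) + v(T) for superadditivity and the strict inequality v(S ∪ T) < v(S) + v(T) for subadditivity; the example games are invented for illustration.

```python
import math
from itertools import combinations
from typing import Callable, FrozenSet, Iterator, List

Coalition = FrozenSet[int]


def nonempty_subsets(agents: List[int]) -> Iterator[Coalition]:
    """Yield every nonempty coalition of `agents`."""
    for size in range(1, len(agents) + 1):
        for combo in combinations(agents, size):
            yield frozenset(combo)


def classify(agents: List[int], v: Callable[[Coalition], float]) -> str:
    """Return 'superadditive', 'subadditive' or 'neither' for the game (agents, v)."""
    subsets = list(nonempty_subsets(agents))
    # All ordered pairs of disjoint, nonempty coalitions.
    pairs = [(s, t) for s in subsets for t in subsets if not (s & t)]
    if all(v(s | t) >= v(s) + v(t) for s, t in pairs):
        return "superadditive"
    if all(v(s | t) < v(s) + v(t) for s, t in pairs):
        return "subadditive"
    return "neither"


if __name__ == "__main__":
    agents = [1, 2, 3]
    print(classify(agents, lambda s: len(s) ** 2))          # synergy grows with size
    print(classify(agents, lambda s: math.sqrt(len(s))))    # diminishing returns
    print(classify(agents, lambda s: {1: 1, 2: 3}.get(len(s), 0)))  # only pairs pay off
```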
6.5.2 Optimization within a Coalition
Under unlimited and costless computation, each coalition would solve its optimization
problem exactly. However, in many domains the problem is combinatorially too hard
to solve exactly, so an approximate solution must be found. In such a case,
self-interested agents would want to strike the optimal tradeoff between solution
quality and the associated computation. This will affect the values of coalitions,
which in turn will affect which coalition structure gives the highest welfare.
6.5.3 Payoff Division
Payoff division strives to divide the value gained by a coalition structure among
agents in a fair and stable way, so that agents are motivated to stay with the
coalition structure rather than move out of it. Many payoff division methods have
been proposed in the literature. Here we discuss two of them: the core and the
Shapley’s value.
The Core
The core of a coalition formation game is a set of payoff configurations $(\vec{x}, CS)$,
where $\vec{x}$ is a vector of payoffs given to agents in such a way that no subgroup is
motivated to depart from the coalition structure CS (in this respect it resembles
the Nash equilibrium):
\[ \mathit{Core} = \Bigl\{ (\vec{x}, CS) \;\Big|\; \forall S \subset N,\ \sum_{i \in S} x_i \geq v(S) \ \wedge\ \sum_{i \in N} x_i = \sum_{S \in CS} v(S) \Bigr\} \tag{6.18} \]
The core is the strongest of the classical solution concepts in coalition formation.
In many cases it may be empty, because there is no way to divide the social good
so that the coalition structure becomes stable; in that case there will be an infinite
sequence of steps from one payoff configuration to another. To avoid such problems,
explicit mechanisms have been proposed, like limits on negotiation rounds, contract
costs or some social norms to limit the negotiation.
The opposite problem is to have multiple payoff vectors in the core, so that all
agents have to agree on one of them (a common choice is the nucleolus, that is the
payoff vector in the center of the set of payoff vectors in the core).
A further problem related to the core is that the constraints in the definition
become numerous as the number of agents increases: because of the quantifier
∀S ⊂ N in Equation (6.18), there is one constraint per coalition, that is a number
of constraints exponential in n.
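The two conditions of Equation (6.18) translate into a direct, if exponential, membership test. The sketch below checks every nonempty coalition, including N itself (reading ⊂ as ⊆, which is an assumption), and uses the weak inequality. The example games and all names are hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List

Coalition = FrozenSet[int]


def in_core(payoff: Dict[int, float], cs: List[Coalition],
            v: Callable[[Coalition], float], tol: float = 1e-9) -> bool:
    """Check whether the payoff configuration (payoff, cs) satisfies Equation (6.18)."""
    agents = sorted(payoff)
    # Second condition: the whole value of the coalition structure is distributed.
    if not math.isclose(sum(payoff.values()), sum(v(s) for s in cs), abs_tol=tol):
        return False
    # First condition: no coalition can do better on its own (2^n - 1 checks).
    for size in range(1, len(agents) + 1):
        for combo in combinations(agents, size):
            if sum(payoff[i] for i in combo) < v(frozenset(combo)) - tol:
                return False
    return True


def majority(s: Coalition) -> float:
    """Hypothetical 3-agent majority game: any two agents win 1."""
    return 1.0 if len(s) >= 2 else 0.0


def additive(s: Coalition) -> float:
    """Hypothetical additive game: each agent contributes 1."""
    return float(len(s))


if __name__ == "__main__":
    grand = [frozenset({1, 2, 3})]
    print(in_core({1: 1 / 3, 2: 1 / 3, 3: 1 / 3}, grand, majority))
    singles = [frozenset({1}), frozenset({2}), frozenset({3})]
    print(in_core({1: 1.0, 2: 1.0, 3: 1.0}, singles, additive))
```

In the majority game every pair can claim the whole value, so the equal split is blocked by any two agents; this is a standard example of an empty core.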
The Shapley’s Value
The Shapley’s value is another policy for dividing payoffs in coalition formation
games and it is defined axiomatically. Agent i is called a dummy if
v(S ∪ {i}) − v(S) = v({i}) for every coalition S that does not include agent i.
Agents i and j are called interchangeable if v((S \ {i}) ∪ {j}) = v(S) for every
coalition S that includes agent i but not agent j. The three axioms of the
Shapley’s value are:
Symmetry : if agents i and j are interchangeable then xi = xj .
Dummies : if agent i is a dummy then xi = v({i}).
Additivity : for any two games v and w, xi in v + w equals xi in v plus xi in w,
where the game v + w is defined by (v + w)(S) = v(S) + w(S).
The Shapley’s value is the only payoff division scheme that satisfies the previous
three axioms and it is defined as follows:
\[ x_i = \sum_{S \subseteq N} \frac{(|N| - |S|)! \, (|S| - 1)!}{|N|!} \cdot \bigl( v(S) - v(S - \{i\}) \bigr) \tag{6.19} \]
This payoff can be interpreted as the marginal contribution of agent i to the
coalition structure averaged over all the possible joining orders (it recalls the basic
idea of COIN). Notice that the payoff must be averaged over all the |N |! possible
joining orders, thus it is computationally hard.
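A small sketch makes the cost of the Shapley value concrete. It uses the standard weight (|N| − |S|)! (|S| − 1)! / |N|! and sums only over the coalitions S that contain agent i, since the other terms of the sum vanish. The example game and the function names are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List

Coalition = FrozenSet[int]


def shapley_values(agents: List[int],
                   v: Callable[[Coalition], float]) -> Dict[int, float]:
    """Shapley value of every agent; v must be defined on the empty coalition."""
    n = len(agents)
    values = {}
    for i in agents:
        others = [a for a in agents if a != i]
        total = 0.0
        for size in range(len(others) + 1):
            for combo in combinations(others, size):
                s = frozenset(combo) | {i}  # coalition that contains agent i
                weight = factorial(n - len(s)) * factorial(len(s) - 1) / factorial(n)
                total += weight * (v(s) - v(s - {i}))  # marginal contribution
        values[i] = total
    return values


def squared_size(s: Coalition) -> float:
    """Hypothetical symmetric game: a coalition of m agents is worth m^2."""
    return float(len(s) ** 2)


if __name__ == "__main__":
    print(shapley_values([1, 2, 3], squared_size))
```

Each agent requires a sum over 2^(n−1) coalitions, so the cost grows exponentially with the number of agents even when the joining orders are not enumerated explicitly.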
It is interesting to note that the Shapley’s value, like the core, guarantees that
individual agents and the grand coalition are motivated to stay with the coalition
structure. However, unlike the core, it does not guarantee that all subgroups of
agents are better off in the coalition structure than by breaking off into a coalition
of their own.
In such games the joining order of agents matters, but in the real world there
may exist situations that can be formalized with games where that joining order is
irrelevant. In this case, the core and the Shapley’s value are unnecessarily hard to
compute. In Chapter 7 we will analyze these games and we will propose new
solutions to the reward distribution problem.
Chapter 7
Task Allocation via Coalition Formation
Computers are useless. They can only give you answers.
Pablo Picasso
Contents
7.1 Game Outline
  7.1.1 Curse of the State Space Size
  7.1.2 Fuzzy Games and Groups of Agents
7.2 Utility Functions of the Game
  7.2.1 Reward Distribution among Agents
  7.2.2 Characteristic and Reward Functions
7.3 Testbed Problem: Cooking Teams
  7.3.1 Configurations
  7.3.2 Reward Functions
  7.3.3 State Space
In Section 6.5 we briefly introduced coalition formation problems (recall that
in this kind of problem there is a set of agents coordinating their activities and
receiving rewards for doing so). A game is completely described by a set N of n
agents and by a characteristic function v(S) that evaluates a given coalition S. The
characteristic function of pure coalition formation games considers the joining order
of agents belonging to a generic coalition. In this chapter we introduce a new kind of
game that combines task allocation and coalition formation games.
7.1 Game Outline
As said above, in the real world there are different problems which can be formalized
as coalition formation games. However, this formalization might involve some
constraints that restrict its application to real problems. For example, the joining
order of agents in a coalition may be irrelevant to the problem being formalized.
In such situations all the payoff division methods described in Section 6.5.3 are
unnecessarily complicated and hard to apply (recall the Shapley’s value, Equation
(6.19)), so we need another way to evaluate a coalition and to split the payoff
among its agents.
To satisfy these requirements, here we introduce a new kind of game that combines
task allocation games and coalition formation games. In such games, we have an
environment with some tasks to be allocated to all agents, but each task must
be executed by an a priori fixed number of agents.
Given a set N of n agents and a set T of t tasks, we define a Dispersion Game
as a game where each agent has to decide which of the t tasks to undertake in order
to achieve full world utility. A world utility function is provided and it is used to
evaluate the overall behavior emerging from the environment. A well known game
belonging to this class is the Bar Problem (Section 5.3), where the number of
agents is large with respect to the number of tasks. In general these
games are known as anti-coordination games, since each agent tends to choose a
task in order to maximize its own reward without regard for the other agents.
Definition 7.1 (Coalition formation games). Given a set N of n agents and a
coalition S of agents, we define Coalition Formation Games the games where the
value of each coalition S is given by a characteristic function v(S) (see Section
6.5). These coalition values may represent the quality of an optimal solution for
each coalition optimization problem. Moreover, in general they may depend on non-
members’ actions due to positive and negative externalities (interactions of agents’
solutions).
Dispersion games and coalition formation games focus on different kinds of
problems. In the real world we can deal with problems that can be formalized with
dispersion games, but this modeling might be neither complete nor accurate. We
could use coalition formation games, but in these games we must necessarily face
the hard problem of computing the reward division among agents. In order to deal
with problems that share basic characteristics of both kinds of game, we define a
new kind of game that captures the main characteristics of dispersion games and
of coalition formation games.
Definition 7.2 (Task allocation via coalition formation games). These games combine
dispersion games and coalition formation games (Definition 7.1), so we have
a set N of n agents and a set T of t tasks, some of which can only be carried out
by a predetermined group of agents. Furthermore, each agent η is identified with a type
kη (kη ∈ K = {1, 2, . . . , k}), thus a generic coalition S can be formed by different
types of agents. Since we have a finite number of tasks, not all coalition structures
in M are feasible, so we have to find an optimal coalition structure CS∗ ∈ M in
order to achieve full world utility (which is given a priori).
In order to describe a coalition in the games of Definition 7.2 (see [14]), we only
need to specify how many agents of each type are participating. A coalition can
be identified with a point $S \in \mathbb{R}^k$ such that $0 \leq S \leq Q$, where
$Q \in \mathbb{R}^k_+$ specifies the total number of players of each type.
From this point of view, coalition formation is an ongoing, dynamic process with
payoff generated when coalitions form, regroup or dissolve. As a consequence, we
only consider the process of coalition generation, where agents belonging to a set
learn how to distribute themselves into exhaustive and disjoint coalitions. Under
this new learning framework, a farsighted agent will move away from a certain
coalition if and only if it expects such a deviation to increase its future payoff.
Unlike dispersion games, this class of games is subsumed by the class of goal
satisfaction problems, where we have a goal that cannot be satisfied by only one
agent.
7.1.1 Curse of the State Space Size
Now, let us consider the size of the state space, which is a well known issue of RL.
It is usually useful to estimate the state space size in order to foresee whether a
problem can be formalized (that is, whether the state space is not too large) and,
above all, whether we can find a suitable solution to the learning problem. This new
kind of game involves coalition formation games, so we must consider both the
number of agents acting in the environment and the agent types; these games
may therefore have a huge state space. The number C of possible coalitions is
\[ C = \sum_{\mathit{type}=1}^{k} \ \sum_{B \in \text{type-subset}} \ \prod_{i \in B} |Q_i| \tag{7.1} \]
where the first sum runs over the number of agent types present in a coalition (from
1 to k), type-subset is the family of all sets B containing exactly type of the k agent
types, i ranges over the types in B and |Qi| is the number of agents of type i. For
example, if we have Q = {Q1, Q2, Q3, Q4}, then
C = [Q1 · Q2 · Q3 · Q4]
+ [Q1 · Q2 + Q1 · Q3 + Q1 · Q4 + Q2 · Q3 + Q2 · Q4 + Q3 · Q4]
+ [Q1 + Q2 + Q3 + Q4]
+ [Q1 · Q2 · Q3 + Q1 · Q2 · Q4 + Q1 · Q3 · Q4 + Q2 · Q3 · Q4]
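The expansion above can be reproduced mechanically by summing, over every nonempty set of types, the product of the corresponding population sizes. The sketch below also compares the result with the product of (Qi + 1) over all types minus 1, which counts the nonzero vectors S with 0 ≤ S ≤ Q and gives the same number. Function names are illustrative assumptions.

```python
from itertools import combinations
from math import prod
from typing import Sequence


def count_coalitions(population: Sequence[int]) -> int:
    """Number of coalitions C: sum over nonempty type subsets B of prod |Q_i|."""
    k = len(population)
    total = 0
    for size in range(1, k + 1):                    # number of types in the coalition
        for subset in combinations(range(k), size):  # which types are present
            total += prod(population[i] for i in subset)
    return total


def count_coalitions_closed_form(population: Sequence[int]) -> int:
    """Same count: every type i contributes 0..Q_i agents, minus the empty coalition."""
    return prod(q + 1 for q in population) - 1


if __name__ == "__main__":
    q = [3, 2, 4, 5]
    print(count_coalitions(q), count_coalitions_closed_form(q))
```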
In this situation we have C coalitions (Equation (7.1)), but in such games we
must also consider all the coalition structures, thus we have a vector Υ such that
$|\Upsilon| = 2^C$. As a consequence, the state representation that describes the
environment perceived by any player has size $2^C \cdot C$. This size may become very
large, but not all the possible coalition structures are admissible, thus an agent
does not need to consider all the $2^C \cdot C$ coalition structures.
Even with a small number of agent types we can have a huge state space, and the
problem may become intractable. For example, suppose we have a population
Q = {3, 2, 4, 5}, which gives C = 359 coalitions. The representation space is
$2^C \cdot C \simeq 4.2156 \cdot 10^{110}$, which is obviously intractable.
Furthermore, if we are dealing with a non-stationary environment the state space
grows even larger, and we must not ignore the well known problems of such
games as depicted in Section 6.1.2 and in Section 6.2.
As a consequence of the previous considerations, we do not use any state space
information. In addition to its size, we must deal with the fact that, if we use
the state space, the goodness of each state depends on all the types of agents
belonging to the coalition visiting that state. Suppose that a coalition S1 = {0, 4}
visits the state s1 and that the characteristic function v(·) rates this coalition
(and thus the related state) as good. The four agents belonging to S1 believe that
they implicitly¹ formed a good coalition in state s1, related to a particular task
t, thus they tend to choose this task in the future (we ignore the exploration
policy, if any). Suppose that at the next time step one agent of the first type joins
S1 (thus we have a new coalition S2 = {1, 4}), all of them choose the same task t
and the characteristic function rates this new coalition as bad. The four agents
of the second type do not understand that the bad payoff is not due to
themselves, but to the agent of the first type. As a consequence,
they now rate the task t as not good (and thus choose another task), although it
was previously rated as good. In this situation, all agents tend to pick tasks
different from those picked in the past and to rate the states of the environment
wrongly.
In such a situation, the state space introduces a further source of non-stationarity
into the problem. To reach a (near) optimal state in this huge space we must ensure
coordination among the different agents belonging to a coalition, and the coalitions
must in turn be coordinated with each other. In the simple example described above
we have two uncoordinated coalitions, thus agents do not reach any optimal state.
¹ In these games we do not allow any centralized communication among agents.
7.1.2 Fuzzy Games and Groups of Agents
Let P(S) be the set of all possible coalitions. Any population defined by the vector
$Q \in \mathbb{R}^k_+$ generates a characteristic function called fuzzy game:
Definition 7.3 (Fuzzy game). A fuzzy game is a pair (Q, v) such that:
• v is the characteristic function, a mapping $v : P \setminus \{\emptyset\} \to \mathbb{R}$;
• $Q \in \mathbb{R}^k_+$.
The number of possible coalition structures is limited by the number of tasks
t. Given such a number of tasks, no more than t coalitions can be formed, thus we
introduce the following
Definition 7.4 (Coalition structure). A coalition structure $CS = \{S_1, S_2, \ldots, S_H\}$
(where $0 \leq H \leq |Q|$) is a partition of Q, that is $S_h \neq \emptyset$ for any
$h \in \{1, 2, \ldots, H\}$, $\bigcup_{h=1}^{H} S_h = Q$ and $S_i \cap S_j = \emptyset$ for any
$i, j \in \{1, 2, \ldots, H\}$ with $i \neq j$.
With this model we allow agents belonging to a set to organize themselves into
a precise coalition in order to achieve more efficient individual payoffs and possibly
to obtain large world utility values. As mentioned in Section 6.5.1, not all games
are superadditive: large organizations could operate less efficiently than the
sum of their constituent parts, so a grand coalition will not form.
Given such games, in this section we describe economies with a small group
of agents belonging to a finite number of types. The main goal is to build stable
coalitions that will end up in a stable and possibly meaningful coalition structure.
Here we need to redefine the concept of core used in Equation (6.18) because we
are dealing with a priori non superadditive characteristic functions.
Definition 7.5 (Core of a fuzzy game). The core of the fuzzy game (v,Q), that is
Fcore(v,Q), is the set of vectors x = {x1, x2, . . . , xn} such that:
1. $x_Q = \max_{CS \in P(Q)} \sum_{S \in CS} v(S)$, where P(Q) denotes the set of all
possible coalition structures and each CS is a set of coalitions;
2. $x_S \geq v(S)$ for any coalition S ∈ CS.
7.2 Utility Functions of the Game
In the following sections we discuss different utility functions for this type
of game, as well as how to find a useful way to distribute the reward obtained. These
functions are used to evaluate both a coalition S and the resulting task allocation,
so they formalize different aspects of this type of game.
7.2.1 Reward Distribution among Agents
One of the most interesting problems is finding a suitable way to distribute
the reward among the agents of a coalition; in principle we could apply the
Shapley’s value or the core as stated in Section 6.5.3.
As discussed in Section 7.1, we are now facing a new kind of game where we
do not deal with the joining order of agents in a coalition, but we must consider task
allocation. In order to find a suitable method to split rewards, we should consider its
usefulness as well as its computational cost. The Shapley’s value shares the basic
idea of COIN (notice the similarity between the meaning of the difference terms
of Equation (4.7) and Equation (6.19)) and it has some useful properties as stated
in Section 6.5.3. Unfortunately, this method may perform poorly,
since in Equation (6.19) we compute the reward to be assigned to each agent by
examining all possible coalitions of a set N of n agents. In RL, at each time step
any agent can, a priori, change the resulting coalition structure
(and thus the coalitions) according to its policy (which obviously depends on the
reward received). This means that we must compute the Shapley’s value for each
agent at every time step. Furthermore, the Shapley’s value considers the
joining order of agents in each coalition, which is useless in this kind of game
(besides being computationally costly).
In order to avoid such undesirable features we propose marginal contribution,
a new payoff assignment method that shares the interesting characteristics of the
Shapley’s value, but avoids such a heavy computational burden.
Definition 7.6. Given a coalition S = {n1, n2, . . . , nk} we define the marginal
contribution for agent type i as the following function:
\[ m_i = v(S) - v(S_{-i}), \tag{7.2} \]
where:
• v (S) is the characteristic function of the task allocation via coalition formation
problem;
• S−i is the coalition S without one agent of type i;
• ni is the number of agents of type i.
Example 1. Let S = {5, 1}. We can compute the marginal
contribution for the first agent type (m1) and for the second one (m2) obtaining:
• m1 = v (5, 1) − v (4, 1);
• m2 = v (5, 1) − v (5, 0).
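Equation (7.2) and Example 1 can be sketched as follows, with a coalition represented as a tuple of per-type agent counts. The tuple representation, the 0-based type indices and the numeric values of `toy_v` are assumptions made only for illustration; the thesis does not prescribe them.

```python
from typing import Callable, Tuple

Coalition = Tuple[int, ...]  # number of agents of each type, e.g. (5, 1)


def remove_one(coalition: Coalition, agent_type: int) -> Coalition:
    """Return S_{-i}: the coalition without one agent of the given type (0-based)."""
    if coalition[agent_type] == 0:
        raise ValueError("no agent of this type in the coalition")
    return (coalition[:agent_type]
            + (coalition[agent_type] - 1,)
            + coalition[agent_type + 1:])


def marginal_contribution(coalition: Coalition, agent_type: int,
                          v: Callable[[Coalition], float]) -> float:
    """m_i = v(S) - v(S_{-i}), Equation (7.2)."""
    return v(coalition) - v(remove_one(coalition, agent_type))


def toy_v(coalition: Coalition) -> float:
    """Hypothetical characteristic function; unlisted coalitions are worth 0."""
    table = {(5, 1): 12.0, (4, 1): 9.0, (5, 0): 5.0}
    return table.get(coalition, 0.0)


if __name__ == "__main__":
    s = (5, 1)
    print(marginal_contribution(s, 0, toy_v))  # m1 = v(5, 1) - v(4, 1)
    print(marginal_contribution(s, 1, toy_v))  # m2 = v(5, 1) - v(5, 0)
```

Only two evaluations of v are needed per agent type, regardless of the size of the population, in contrast with the exponential number of terms required by Equation (6.19).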
Recall the definition of marginal contribution in Equation (7.2): the
marginal contribution shares the basic idea of COIN (in particular the clamping
function, that is CLσ(·) in Equation (4.7)), since it evaluates the goodness of
each agent belonging to a given coalition by computing the difference between
the characteristic function value obtained with the full coalition and the
characteristic function value obtained without one agent of a certain type of that
coalition.
It is interesting to note that the marginal contribution described in Definition
7.6 computes the payoff to be assigned to each coalition S in a symmetric way
with respect to each agent of the same type. The second term of Equation (7.2)
is computed on the clamped coalition S−i, and this is done by each agent of such
coalition. Hence we consider all agents belonging to S equally important, because
in this framework we are interested in each agent’s type rather than in the
individual agent in order to satisfy a task as described in Definition 7.2.
7.2.2 Characteristic and Reward Functions
As stated in Section 6.5, the characteristic function of a game affects the game
itself, since it evaluates the goodness of a coalition and thus shapes the structure
of the game. In the literature, many researchers have focused on superadditive
games (Section 6.5.1), where coalition structure generation is trivial. However,
many games are not superadditive, so we aim for a characteristic function that is
not superadditive: this prevents the formation of the grand coalition, so that
coalitions obtain meaningful rewards.
It is useful to note that, during the formalization of a problem, we must use a
feasible (and possibly robust) characteristic function to model a desired behavior. If
we introduce a strict characteristic function (i.e., a characteristic function that
allows only one feasible coalition structure when there is a large number of tasks
and agents) we might not reach an optimal (or near optimal) coalition structure.
This problem becomes more evident if we consider the state space, since in this case,
besides the characteristic function, we must deal with a huge state space in order to
avoid negative effects on the learning process.
Example 1 illustrates a useful feature of the characteristic function and its
association with the game. In this example each marginal contribution is the
difference between two characteristic function values, but one of the coalitions
involved might be unfeasible. In such a situation, the characteristic function should
be robust: it must return a value indicating that the coalition is unfeasible, so that
the marginal contribution can still be computed, e.g.: in Example 1 if the coalition
S = {5, 0} is unfeasible, then we could impose v(S) = 0.
In RL each agent, after executing an action selected according to its policy,
receives a reward based on a reward function. Until now
we have considered only characteristic functions, which evaluate only the goodness
of a coalition, but we need to introduce a reward function that parcels out
the coalition value given by the characteristic function.
We can define the reward value gi for each agent of type i in task allocation via
coalition formation games.
Definition 7.7. The reward value gi for each agent of type i is given by
\[ g_i = g(m_i), \tag{7.3} \]
where g(mi) is the reward function that evaluates the marginal contribution mi
computed according to Equation (7.2).
With these two functions we aim to induce an implicit coordination mechanism
among different types of agents belonging to a coalition. The marginal contribution
(Equation (7.2)) provides a measure to coordinate different agents within a
coalition, while the reward function (Equation (7.3)) is used to coordinate different
coalitions in order to reach an optimal configuration (that is, an equilibrium).
In these games we are also dealing with task allocation, so we need a world
utility function that evaluates the overall behavior of the environment.
Definition 7.8. The world utility value G is given by
\[ G = G\bigl(g(m_1), g(m_2), \ldots, g(m_k)\bigr), \tag{7.4} \]
where G(·) is the given world utility function of the environment.
According to Definition 7.7 we can use the reward functions stated in Section 5.2
and in Section 5.3, so we can evaluate the goodness of an environment formalized
with RL.
As stated in Section 4.3.3, the COIN clamping function clamps the
elements of state ζ pertaining to agent η to a prefixed arbitrary value. In that
situation we are free to prefer one specific clamping value over another
(e.g., in the Bar Problem we can clamp to the null action or to a random one),
but in any case the clamping function acts on the state space. In this new type
of game we are dealing with a huge state space, thus in Equation (7.2) we proposed
a clamping function that acts on a given coalition S and clamps to a fixed value
related to the agents’ type in that coalition: it removes one agent of type i and then
evaluates the resulting coalition S−i with the characteristic function v(·) of the
game.
Let us focus on the WLU function of Equation (4.7): as said above, this utility
function is mainly based on the state, but here we must deal with a huge state
space as mentioned in Section 7.1.1. As a consequence we must pay attention to
which clamping function we use: if we clamp to a different state and not
to the null state, each agent will receive a meaningful but useless reward, since
it cannot reuse this information in the future in order to avoid or to choose
a particular action, because it has no state information (recall that, as stated
in Section 7.1.1, we do not use any state information). Even if we do consider the
state space, this reward might still be useless: the state space is huge and
each agent will rarely visit the same state again (recall that
such a visit depends on the joint action and not only on the agent’s own action).
This particular feature does not affect the other reward functions (selfish utility
function, team game utility function and uniform division reward), because
such functions do not compute any difference involving a clamping
value, and thus do not involve any state information. This may be useful with this
kind of game if we do not consider the state space. However, this feature might be
affected by whether the state space exists. For example, following the definition of
the TG utility function (see Section 4.2.3 and Section 5.2.2), in general we must
compute the sum over all tasks of the reward function evaluated in such tasks. This
could lead the learning process towards non-optimal values, since the reward value
assigned to each agent belonging to a coalition can be influenced by other agents
and coalitions related to other tasks.
7.3 Testbed Problem: Cooking Teams
Here, we introduce the cooks and helpers problem (see [25]), where we have 2 types
of players k = {cook, helper} and 3 different cooking teams:
• 1 cook and 2 helpers cook one cake;
• 4 cooks alone cook one cake (too many cooks encounter difficulties reaching
an agreement);
• 1 helper alone cooks one cookie.
Each cake is worth 10 and each cookie 1. Moreover we have some constraints about
the number of possible kitchens available, in fact we have U = 7 different kitchens
(tasks) and each one can be selected by any coalition S = {fc, fh} of fc cooks and
fh helpers.
Let us suppose that Su describes the coalitions cooking in kitchen u (u ∈ U =
{1, . . . , 7}). We define the following characteristic function (n ∈ N):
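\[ v(S) = \begin{cases}
n \cdot 10 & \text{if we have a coalition } S = \{n, 2 \cdot n\}, \\
n \cdot 10 & \text{if we have a coalition } S = \{4 \cdot n, 0\}, \\
n & \text{if we have a coalition } S = \{0, n\}, \\
0 & \text{elsewhere.}
\end{cases} \tag{7.5} \]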
The world utility is defined as:
\[ G(Q) = \sum_{u=1}^{U} v(S_u) \tag{7.6} \]
where Su is the coalition undertaking the u-th task.
The main goal of this game is to maximize Equation (7.6) with respect to the
U = 7 tasks, without communication among agents acting in this environment.
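The characteristic function of Equation (7.5) and the world utility of Equation (7.6) can be sketched directly. A coalition is represented here as a (cooks, helpers) tuple and an empty kitchen as (0, 0); both choices, as well as the example allocation, are assumptions made for illustration.

```python
from typing import List, Tuple

Coalition = Tuple[int, int]  # (number of cooks, number of helpers)


def v(coalition: Coalition) -> int:
    """Characteristic function of the cooks and helpers problem, Equation (7.5)."""
    cooks, helpers = coalition
    if cooks >= 1 and helpers == 2 * cooks:
        return 10 * cooks              # n teams of 1 cook and 2 helpers: n cakes
    if helpers == 0 and cooks >= 4 and cooks % 4 == 0:
        return 10 * (cooks // 4)       # n groups of 4 cooks alone: n cakes
    if cooks == 0 and helpers >= 1:
        return helpers                 # n helpers alone: n cookies
    return 0                           # any other coalition produces nothing


def world_utility(kitchens: List[Coalition]) -> int:
    """G(Q): sum of the coalition values over the kitchens, Equation (7.6)."""
    return sum(v(s) for s in kitchens)


if __name__ == "__main__":
    # Hypothetical allocation over U = 7 kitchens; three kitchens stay empty.
    allocation = [(1, 2), (1, 2), (4, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
    print(world_utility(allocation))
```

Note that a coalition such as (5, 0) is rated 0 even though four of its cooks could bake a cake on their own, which is what makes this characteristic function strict.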
7.3.1 Configurations
In this problem we will analyze five different possible cases in order to find an
admissible coalition structure CS∗ that maximizes G (Q).
Case 1 : the population Q = {fc, fh} consists of fc > 0 cooks and fh = 0 helpers.
If fc ≤ 4 the core is nonempty (recall that the core focuses on the stability of
an admissible coalition, and in this case any solution is stable because there
are no leftover cooks creating instability). If fc > 4, the core is nonempty if
and only if fc is an integer multiple of 4 (so they cook fc/4 cakes). In any
other case, in the population there are leftover cooks who create instability.
Case 2 : the population Q = {fc, fh} consists of fh > 0 helpers and fc = 0 cooks,
thus the core is always nonempty and it assigns 1 to each helper.
Case 3 : the population Q = {fc, fh} is composed of r1 + r2 coalitions (both
r1 and r2 greater than 0), where r1 is the number of coalitions (1, 2) (1 cook, 2
helpers) and r2 is the number of coalitions (0, 1). In this case there are many
helpers relative to the number required for teams with composition (1, 2).
Competition among helpers exists and it keeps the price of a helper down.
The core assigns 1 to each helper and 8 to each cook.
Case 4 : the population Q = {fc, fh} is composed of r1 + r2 coalitions, where
r1 is the number of coalitions (1, 2) and r2 is the number of coalitions (4, 0)
(both of them greater than 0). This is the opposite of the previous case:
here we have many cooks with respect to the number of helpers. Now the
competition exists among cooks in order to be in a coalition with 2 helpers.
In this case the core assigns 15/4 to each helper and only 5/2 to each cook.
Case 5 : the population Q = {fc, fh} is composed of r1 > 0 coalitions of type
(1, 2). Here the core contains a continuum of points and the extremes are
described by the core of Case 3 and Case 4.
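The payoffs quoted in Cases 3 to 5 can be checked with a short calculation. This is only a sketch of the underlying reasoning, where x_c and x_h are hypothetical symbols for the payoff of a single cook and of a single helper. In Case 3 a surplus helper can always cook a cookie alone, which pins the helper payoff at 1; in Case 4 surplus cooks can always form a team of four, which pins the cook payoff at 10/4:
\[
\text{Case 3:}\quad x_h = 1, \qquad x_c = 10 - 2 \cdot 1 = 8;
\qquad\qquad
\text{Case 4:}\quad x_c = \frac{10}{4} = \frac{5}{2}, \qquad x_h = \frac{1}{2}\left(10 - \frac{5}{2}\right) = \frac{15}{4}.
\]
Under the same reading, Case 5 corresponds to every split of a cake satisfying
\[
x_c + 2\,x_h = 10, \qquad 1 \le x_h \le \frac{15}{4},
\]
whose two end points are exactly the payoffs of Case 3 and Case 4.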
Let us consider the characteristic function of Equation (7.5): at first sight
it is a strict function that rates only known coalitions. Obviously, if we change
the characteristic function, the problem changes too. For example, the
characteristic function of Equation (7.5) does not formalize the fact that there
may exist only one oven. In order to allow the cooking of only one cake or only one
cookie we must impose n = 1 in Equation (7.5): i.e., both coalitions S1 = {1, 2} and
S2 = {4, 8} can cook only one cake. Moreover, we can formalize a moderator that
organizes the agents belonging to a coalition in such a way as to maximize the
welfare, e.g.: if we have the coalition S = {2, 5}, players can cook two cakes and
one cookie, while with Equation (7.5) they cannot cook anything (the coalition is
worth 0). Anyway, these simple considerations show the importance of having a good
and robust characteristic function with respect to the problem we aim to formalize.
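The moderator idea mentioned above can be sketched as follows. This is an illustrative variant, not the characteristic function used in the experiments: the function name `moderated_value` is hypothetical, and it assumes that the moderator may split a coalition freely into teams of (1 cook, 2 helpers), groups of 4 cooks and single helpers, with no limit on ovens or kitchens.

```python
def moderated_value(fc, fh):
    """Best welfare a moderator can extract from fc cooks and fh helpers.

    Assumption: the coalition can be split into (1, 2) teams, groups of
    4 cooks, and single helpers, each rated as in Equation (7.5).
    """
    best = 0
    for teams in range(min(fc, fh // 2) + 1):   # number of (1 cook, 2 helpers) teams
        cook_groups = (fc - teams) // 4         # remaining cooks, in groups of 4
        cookies = fh - 2 * teams                # remaining helpers bake alone
        best = max(best, 10 * teams + 10 * cook_groups + cookies)
    return best


print(moderated_value(2, 5))   # two cakes and one cookie
```

For the coalition S = {2, 5} this gives 21 (two cakes and one cookie), whereas Equation (7.5) rates the same coalition 0.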
To test the goodness of the characteristic function proposed in Equation (7.5)
we propose four different configurations Q = {cooks, helpers} according to the five
different cases discussed above:
Bar-1 : we have 5 cooks and 20 helpers;
Bar-2 : we have 11 cooks and 14 helpers;
Bar-3 : we have 21 cooks and 14 helpers;
Bar-4 : we have 21 cooks and 5 helpers.
Using Equation (7.5) we can obtain the optimal coalition structures described
in Table 7.1. Obviously, according to Equation (7.5), any admissible permutation
or regrouping of the coalition structures of Table 7.1 that yields the same world
utility is also optimal, e.g.: another admissible optimal coalition structure for
Bar-2 is reported in Table 7.2.
Tasks                 Bar-1          Bar-2          Bar-3          Bar-4
1 (Monday)            {1, 2} (10)    {7, 14} (70)   {7, 14} (70)   {12, 0} (30)
2 (Tuesday)           {1, 2} (10)    {4, 0} (10)    {4, 0} (10)    {1, 2} (10)
3 (Wednesday)         {1, 2} (10)    {0, 0} (0)     {4, 0} (10)    {0, 0} (0)
4 (Thursday)          {1, 2} (10)    {0, 0} (0)     {4, 0} (10)    {0, 0} (0)
5 (Friday)            {1, 2} (10)    {0, 0} (0)     {2, 0} (0)     {4, 0} (10)
6 (Saturday)          {0, 10} (10)   {0, 0} (0)     {0, 0} (0)     {4, 0} (10)
7 (Sunday)            {0, 0} (0)     {0, 0} (0)     {0, 0} (0)     {0, 3} (3)
World utility value   60             80             100            63
Table 7.1: Optimal coalition structures for the four bar problems (each cell reports
the coalition S = {cooks, helpers} and, in parentheses, its payoff)
Tasks Coalitions Payoffs
1 (Monday) S = {1, 2} 10
2 (Tuesday) S = {1, 2} 10
3 (Wednesday) S = {1, 2} 10
4 (Thursday) S = {1, 2} 10
5 (Friday) S = {2, 4} 20
6 (Saturday) S = {1, 2} 10
7 (Sunday) S = {4, 0} 10
Table 7.2: Another admissible coalition structure for the Bar-2
7.3.2 Reward Functions
In order to evaluate different coalitions, we use the marginal contribution
proposed in Section 7.2.1. Furthermore, in order to split the payoff obtained
with the marginal contribution among the agents of a coalition, we propose four
different reward functions.
Selfish utility function Each agent η of type kη = i (where
i ∈ K = {cook, helper}) obtains a payoff given by
gi = mi (7.7)
where mi is the marginal contribution computed according to Equation (7.2).
COIN utility function Each agent η of type kη = i (where
i ∈ K = {cook, helper}) obtains a payoff given by
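\[
\begin{aligned}
g_c ={} & f_c \cdot \big[ v(f_c, f_h) - v(f_c - 1, f_h) \big] + f_h \cdot \big[ v(f_c, f_h) - v(f_c, f_h - 1) \big] \\
& - \Big[ (f_c^r + 1) \cdot \big( v(f_c^r + 1, f_h^r) - v(f_c^r, f_h^r) \big) + f_h^r \cdot \big( v(f_c^r + 1, f_h^r) - v(f_c^r + 1, f_h^r - 1) \big) \Big]
\end{aligned}
\tag{7.8}
\]
\[
\begin{aligned}
g_h ={} & f_c \cdot \big[ v(f_c, f_h) - v(f_c - 1, f_h) \big] + f_h \cdot \big[ v(f_c, f_h) - v(f_c, f_h - 1) \big] \\
& - \Big[ f_c^r \cdot \big( v(f_c^r, f_h^r + 1) - v(f_c^r - 1, f_h^r + 1) \big) + (f_h^r + 1) \cdot \big( v(f_c^r, f_h^r + 1) - v(f_c^r, f_h^r) \big) \Big]
\end{aligned}
\tag{7.9}
\]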
where, as usual, v(·, ·) is the characteristic function, f_c and f_h are the numbers
of cooks and helpers on the day attended by agent η, while f_c^r and f_h^r are the
numbers of cooks and helpers on a different, randomly chosen day.
Team Game utility function Each agent η of type kη = i (where i ∈ K =
{cook, helper}) obtains a payoff given by
\[
g_i = \frac{1}{n} \cdot \sum_{d=1}^{D} v(f_c^d, f_h^d) \tag{7.10}
\]
where, as in the previous case, f_k^d is the number of agents of type k on day d
(obviously D is the number of days per week) and n is the number of agents.
Uniform Division utility function Each agent η of type kη = i (where
i ∈ K = {cook, helper}) obtains a payoff given by
\[
g_i = \frac{1}{f_c + f_h} \cdot \sum_{i \in K} f_i \cdot m_i \tag{7.11}
\]
where fi is the number of agents of type i in the day attended by agent η
and mi is the agent’s marginal contribution of type i.
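The COIN and Team Game rewards of Equations (7.8), (7.9) and (7.10) depend only on the characteristic function, so they can be sketched directly. The fragment below is an illustration under stated assumptions: the names `v`, `coin_reward` and `team_game_reward` are hypothetical, `v` follows Equation (7.5) with negative counts rated 0, and the Selfish and Uniform Division rewards are left out because they rely on the marginal contribution of Equation (7.2), which is not restated in this section.

```python
def v(fc, fh):
    """Characteristic function of Equation (7.5); negative counts are rated 0."""
    if fc < 0 or fh < 0:
        return 0
    if fc > 0 and fh == 2 * fc:
        return 10 * fc
    if fc > 0 and fc % 4 == 0 and fh == 0:
        return 10 * (fc // 4)
    if fc == 0:
        return fh
    return 0


def coin_reward(agent_type, day, random_day):
    """COIN reward of Equation (7.8) for a cook or Equation (7.9) for a helper."""
    fc, fh = day            # cooks and helpers on the day the agent attends
    rc, rh = random_day     # cooks and helpers on a different random day
    here = fc * (v(fc, fh) - v(fc - 1, fh)) + fh * (v(fc, fh) - v(fc, fh - 1))
    if agent_type == "cook":
        there = ((rc + 1) * (v(rc + 1, rh) - v(rc, rh))
                 + rh * (v(rc + 1, rh) - v(rc + 1, rh - 1)))
    else:
        there = (rc * (v(rc, rh + 1) - v(rc - 1, rh + 1))
                 + (rh + 1) * (v(rc, rh + 1) - v(rc, rh)))
    return here - there


def team_game_reward(days, n_agents):
    """Team Game reward of Equation (7.10): world utility shared by all agents."""
    return sum(v(fc, fh) for fc, fh in days) / n_agents


week = [(1, 2)] * 5 + [(0, 10), (0, 0)]
print(coin_reward("cook", (1, 2), (0, 0)), team_game_reward(week, 25))
```

In this sketch a helper attending a complete (1, 2) team gets a COIN reward of 0 when the random day holds 1 cook and 1 helper, because moving there would complete an equally valuable team.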
7.3.3 State Space
We will analyze this problem with and without any state information in order to
see how agents behave in this environment. If we use the state space, it will be
huge, so we expect to have worse performance than without any state information.
As a consequence, we introduce the concept of difficulty of a problem that is
related to the number of optimal states in the joint state space. In the four bar
configurations described above we formalized this idea. Bar-1 looks “easier” than
Bar-4: the former has many helpers with respect to the number of cooks (thus each
cook tends to cook one cake with two helpers and they will not find any difficulty),
while in the latter all cooks tend to take two helpers in order to obtain better payoffs
(helpers are shared resources). In these situations we have a joint state space with
a different amount of optimal states and they depend on the characteristic function
used. The proposed characteristic function rates only coalitions with a predefined
number of agents as described in Equation (7.5), thus the joint state space will be
formed only by the single states in which an admissible coalition occurs. For example, let us
analyze the Bar-4: in this case the characteristic function proposed identifies only
few optimal states, so we have a huge state space with a very small amount of good
states.
An easy way to deal with this a priori huge state space is to reduce it using an
appropriate state space representation. Following the idea proposed in Section 5.3,
we analyze this problem with a reduced state space. Each state s represents the
number of cooks and helpers attending the bar in a day chosen by the agent that is
evaluating its policy. Hence, the state space for each bar is drastically reduced as
follows (the original state space size, computed according to Equation (7.1), is
reported in parentheses):
Bar-1 : the new state space has 126 states (2^125 · 125 ≃ 5.3169 · 10^39).
Bar-2 : the new state space has 180 states (2^179 · 179 ≃ 1.3716 · 10^56).
Bar-3 : the new state space has 330 states (2^329 · 329 ≃ 3.5980 · 10^101).
Bar-4 : the new state space has 132 states (2^131 · 131 ≃ 3.5662 · 10^41).
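The reduced sizes listed above are consistent with counting every pair (cooks on the day, helpers on the day), i.e. (fc + 1) · (fh + 1) states. The sketch below checks this reading; the names `BARS`, `reduced_state_count` and `original_size` are hypothetical. The second function only reproduces the figures in parentheses as m · 2^m with m equal to the reduced size minus one, which is a numerical observation and not a restatement of Equation (7.1).

```python
# Populations (cooks, helpers) of the four bar configurations.
BARS = {"Bar-1": (5, 20), "Bar-2": (11, 14), "Bar-3": (21, 14), "Bar-4": (21, 5)}


def reduced_state_count(cooks, helpers):
    """Number of (cooks present, helpers present) pairs on a single day."""
    return (cooks + 1) * (helpers + 1)


def original_size(cooks, helpers):
    """Observed pattern behind the parenthesized figures: m * 2**m, m = size - 1."""
    m = reduced_state_count(cooks, helpers) - 1
    return m * 2 ** m


for name, (cooks, helpers) in BARS.items():
    size = reduced_state_count(cooks, helpers)
    print(name, size, f"{float(original_size(cooks, helpers)):.4e}")
```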
As described above, the state space size is heavily decreased, but we must bear in
mind that the goodness of each state depends on the number of cooks and helpers
attending the bar. Hence a state can be rated in a different way by cooks and
helpers, thus if a cook rates as bad a state s, a helper can rate it as good. As
a consequence, they do not agree on the evaluation of that state, so cooks and
helpers tend to visit the same set of states and the world utility will assume lower
values.
In Section 8.3 we present the results obtained with these different bar
configurations, with and without state, in order to analyze the overall behavior
of the coalitions.
Chapter 8 Results
The only real valuable thing is intuition. The intellect has
little to do on the road to discovery.
Albert Einstein (1879, 1955)
Contents
8.1 Grid world
    8.1.1 First Grid
    8.1.2 Second Grid
8.2 Bar Problem and its Reward Functions
    8.2.1 First Bar Configuration
    8.2.2 Second Bar Configuration
    8.2.3 Q-learning Dynamics of the Bar Problem
8.3 Cooking Teams Problem
    8.3.1 Nonempty State Space
    8.3.2 Empty State Space
In the previous chapters COIN has been described and analyzed together with its
theoretical grounds and its applications to real problems. In this chapter we show
some characteristics emerging from real applications to particular kinds of problems,
like the Grid world and the Bar Problem (see Section 5.2 and Section 5.3
respectively). In the following we show the results achieved with particular
environment configurations, then we analyze some unfavorable characteristics, like
the delayed reward, that in most problems managed by RL can be ignored (for instance
matrix games, see Section 6.1.1 and [1, Section 3] for further details). On the
contrary, with COIN in non-stationary environments this problem should be considered
(if possible) to obtain the best possible solutions.
Furthermore, we analyze the different reward functions used to induce cooperation
among agents from another viewpoint (in this case we analyze a reduced
version of the Bar Problem) using the Q-learning dynamics (see Section 6.4). Thanks
to this analysis, we can understand how the policy of each agent evolves during
the learning phase under different reward functions, and thus how the different
reward values drive agents towards different solutions (we will see some solutions
where agents act randomly).
Finally, we show the results obtained in the Cooking Teams Problem
(see Section 7.3) with different configurations of the agents and of the environment.
In this case the problem still involves many agents, but they must cooperate and
group into coalitions to reach a predefined goal. These results are twofold: first,
they show how the problem difficulty drastically increases if we introduce some
constraints (that is, the need for different coalitions of agents to reach a goal);
second, they are useful to understand how the different configurations proposed
can influence the dynamic behavior of each agent and of each coalition of agents.
For each configuration, we propose different experiments that depict the overall
behavior emerging from the environment and how this behavior can be substantially
modified by acting on some configurations both of the agents and of the environment.
8.1 Grid world
The Grid world presented in [24] gives a good introduction to the theoretical
aspects of COIN, but in practice the problem is not sufficiently described, because
no information is given about the arrangement of agents and tokens.
A careful inspection of the results proposed in that paper (see Figure 5.3 and
Figure 5.4) raises some doubts; in particular, from the results depicted in Figure
5.3 we see that with the WLU function all agents achieve excellent performance,
better than that achieved with the TG and the SU functions: the latter results
suggest that all agents receive rewards not significant enough to learn the (near)
optimal sequence of actions that leads to a complete collection of tokens (notice
that in Figure 5.3 and Figure 5.4 the world utility function is on the y-axis, which
is equivalent to the sum of the token values collected up to the current time step).
This result may be questionable, and in the following we discuss it.
8.1.1 First Grid
We start our analysis with the grid depicted in Figure 8.1 (tokens are represented
with letter T and agents with letter A). In this situation it is plausible to find
results different from those shown in [24]. We expect to obtain increasing utility
function values, since all agents, even acting with a random policy, are able to
collect an increasing amount of tokens laid on the grid.
Figure 8.1: Untypical grid

We executed the experiments with the following configuration:
• 4 agents, each one using Q-learning with learning factor α = 0.5 and discount
factor γ = 0.95, while an ε-greedy policy is used for the exploration with
exploration factor ε = 0.1 decreasing over time;
• 12 tokens with random values between 0.1 and 2 (50 possible values);
• each agent can execute 4 different actions (up, down, left, right) with zero
probability of making a mistake (i.e. of believing it is in position p on the grid
after executing action a, while it actually ends up in position p′), and it is
able only to perceive its position on the grid;
• results are averaged over 10 different runs, each one composed of 10,000
trials of 10 steps;
• we introduce the uniform division utility function (UD): it is similar to the
TG one, but here all agents receive a reward averaged over the number of
agents (i.e. the sum of the token values collected so far by all agents divided
by the number of agents).
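The learner described in this configuration can be sketched as a tabular Q-learning agent with an ε-greedy policy. The fragment below is a minimal illustration rather than the code used for the experiments: the function names are hypothetical, and the decay schedule of ε is an assumption, since the text only states that ε = 0.1 decreases over time.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.95                     # learning factor and discount factor
ACTIONS = ("up", "down", "left", "right")


def epsilon(trial, eps0=0.1, decay=1e-3):
    """Exploration factor; this hyperbolic decay is an assumed schedule."""
    return eps0 / (1.0 + decay * trial)


def choose_action(q, state, eps, rng=random):
    """Epsilon-greedy choice: explore with probability eps, else act greedily."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state):
    """One tabular Q-learning update of the entry (state, action)."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])


q = defaultdict(float)                        # Q-table, states are grid positions
q_update(q, (0, 0), "up", 1.0, (0, 1))
print(q[((0, 0), "up")])
```

The reward passed to `q_update` is whatever the chosen utility function (WLU, SU, TG or UD) assigns to the agent; only that signal changes between the compared settings.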
In Figure 8.2 we notice the good performance of the agents using the WLU
function: they collect 82% of the available world utility value (notice that
all the graphs reported are normalized with respect to the sum of all the token
values), with a high convergence speed already in the first 2,000 trials (collecting
about 80% of the available world utility value).
The agents using the SU and the TG utility functions tend to show a similar
convergence speed and both of them have a lower convergence value with respect
to the WLU one. The convergence speed is due to the greedy behavior of these
agents. In the earliest trials they receive rewards propelling them to collect the
same tokens in subsequent trials: in such a case they focus on tokens near them
and/or on tokens with higher value ignoring others.

Figure 8.2: Results of the untypical grid (normalized world utility versus trials
for the WLU, SU, TG and UD utility functions)
The agents using the UD utility function show a uniform behavior: their
policy results in a slow convergence speed and in a low convergence value, mainly
due to the poor signal-to-noise ratio analyzed in Section 4.2.3.
With the WLU and the SU functions the agents have better performance than
using the TG and UD ones, since with the latter they receive the same reward and
then they encounter difficulties discerning the effects of their actions, while with
the former each agent collects a reward related to its actions and then it is led
towards better understanding the effects of its actions.
8.1.2 Second Grid
To test the effective reliability of the WLU function we analyzed the third grid
presented in [18, Figure 3b] (Figure 8.3), ignoring the ability of each agent to move
diagonally and the negative rewards returned to agents which tend towards
the same path in each single trial¹.
Figure 8.3: Grid proposed by ’t Hoen and Bohte (from [18])
The experiments were executed with the following configuration:
• 8 agents, each one using Q-learning with learning factor α = 0.5 and discount
factor γ = 0.95, while an ε-greedy policy is used for the exploration with
exploration factor ε = 0.1 decreasing over time;
• 8 tokens: tokens 1 and 2 have value equal to 1.6, tokens from 3 to 6 have value
equal to 1.0, and finally tokens 7 and 8 have value equal to 1.2;
• each agent can execute 4 different actions (up, down, left, right) with zero
probability of making an error (i.e. of believing it is in position p on the grid
after executing action a, while it is actually in position p′), and it is able
only to perceive its position on the grid;
• the results are averaged over 10 different runs, each one composed of 10,000
trials of 15 steps;
• as in Section 8.1.1 we also evaluate the UD utility function.

¹ This particular agent behavior is mainly due to the agents' preference for a path
guaranteeing a non-negative reward. This means that an agent can continuously follow
the same path even if there are no tokens on it, because this gives a reward equal
to 0, instead of a possible negative reward due to tokens shared between 2 or more
agents.
In this situation we aim to have coordination among agents to collect as many
tokens as possible. So, we expect agents will not collect a significant number of
tokens with the TG and UD utility functions, since all these agents receive a reward
related to the joint action and not to each single action; on the contrary with the
SU function agents are led towards tokens with higher value.
Figure 8.4: Results of the grid proposed by ’t Hoen and Bohte (normalized world
utility versus trials for the WLU, SU, TG and UD utility functions)
The results depicted in Figure 8.4 are consistent with our expectation; in fact,
we clearly note the poor performance of the agents using the TG and the UD utility
functions which, after about 3,000 trials, tend towards about 25% of the overall
available world utility value. Conversely, the agents using the SU and the
WLU functions initially behave in a similar way. Although the agents using the SU
function show a growth rate similar to that of the agents using the WLU function in
the first 1,000 trials, the SU convergence speed then decreases, so these agents tend
towards about 65% of the global available world utility value, while the agents using
the WLU function keep collecting tokens (they reach 87% of the available world
utility). Already after 5,000 trials the agents using the WLU function collect 80% of
the available world utility value, while those using the SU one remain constant at 65%.
In this situation, what was said above still holds: with the WLU and the SU
functions all agents receive rewards related to their own actions (with good
consequences for the signal-to-noise ratio, because agents keep improving their own
policy), while with the TG and UD functions agents cannot clearly discern the
consequences of their actions.
8.2 Bar Problem and its Reward Functions
The Bar Problem is significantly different from the Grid world (the former is
classified as a dispersion game, [19, Section 3]). While the Grid world is a
stochastic and non-stationary environment (due to the agents' changing policies),
this environment is stochastic and stationary. Each agent can perceive this problem
as a single-agent, single-state environment (known as the multi-armed bandit
problem): each agent chooses a night to attend the bar and receives a reward related
not to which agents are attending the bar, but rather to the number of agents on that
night. Having n agents, each agent ηi considers agents
{η1, η2, . . . , ηi−1, ηi+1, . . . , ηn} not as opponents, but just as entities
related to the environment. With this expedient the size of the problem is greatly
reduced: we need to consider an environment with a number of states equal to the
number of agents acting in that environment plus one², i.e. n + 1.
In the following sections, we show the results obtained with two different bar
configurations, both with agents configured with an ε-greedy policy (ε = 0.1
decreasing over time), learning rate α = 0.5 and discount factor γ = 0.95.
Furthermore, for these two configurations we compare the performance obtained with
the exponential reward function (see γ(·) of Equation (5.7)) and the Gaussian one
(see Section 6.3).
² We must also consider the case where no agent attends the bar on a given night.
8.2.1 First Bar Configuration
In this environment, we have 30 agents that must choose a night to attend the bar
in a week of 5 days, where the optimal number of agents attending the bar is 2.
This experiment consists of 15 thousand weeks and the results are averaged over
10 different runs.
We compared the world utility trends obtained with the WLU, the TG and the
UD utility functions, where the TG and the UD values are divided, respectively, by
the total number of agents and by the number of agents attending the bar
on the night attended by a specific agent. Furthermore, these comparisons depend
on the mathematical form of the reward function; we show the results obtained
with these utility functions when we have an exponential function (as stated in [6,
Section 4] and in Equation (5.7)) and a Gaussian one (see Section 6.3).
The maximum achievable value is given by having 2 agents attending the bar
on 4 days of the week and 22 agents on the remaining one (the latter, moreover,
give an insignificant contribution). This is easy to verify, since we have the
following nonlinear problem to solve:
\[
\arg\max_{x_i} \sum_{i=1}^{5} \Psi(x_i, c)
\qquad \text{subject to} \qquad
\sum_{i=1}^{5} x_i = 30, \qquad 0 \le x_i \le 30 \quad \forall i = 1, \dots, 5
\]
where xi is the number of agents attending the bar on night i, c is the desired
number of agents and Ψ(·, ·) is the reward function:
1. Ψ(xi, c) ≜ xi · exp(−xi/c) if we use the exponential reward function;
2. Ψ(xi, c) ≜ k · N(c, σ²) (where k ∈ R and k > 0) if we use the Gaussian reward
function.
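For the exponential reward function, the claimed optimum can be verified by exhaustive search over integer attendances. The sketch below enumerates attendance vectors up to reordering of the days; the names `psi`, `partitions` and `best_attendance` are hypothetical, and the Gaussian case is left out because its exact functional form is given in Section 6.3 rather than here.

```python
import math


def psi(x, c):
    """Exponential reward function: x * exp(-x / c)."""
    return x * math.exp(-x / c)


def partitions(total, parts, smallest=0):
    """Yield non-decreasing tuples of `parts` non-negative integers summing to total."""
    if parts == 1:
        if total >= smallest:
            yield (total,)
        return
    for first in range(smallest, total // parts + 1):
        for rest in partitions(total - first, parts - 1, first):
            yield (first,) + rest


def best_attendance(n_agents, n_days, c):
    """Attendance vector (sorted by day load) maximizing the summed reward."""
    return max(partitions(n_agents, n_days),
               key=lambda xs: sum(psi(x, c) for x in xs))


print(best_attendance(30, 5, 2))
```

Enumerating sorted tuples is enough because the objective is symmetric in the days: every reordering of an optimal vector is optimal too, which is why five equivalent optimal configurations exist for a five-day week.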
In Figure 8.5 we see the normalized results³ achieved with the exponential
reward function. The agents using the TG and the UD utility functions behave in
the same way: they show a good convergence speed (similar to that of the agents
using the WLU function), but after 4,000 weeks they remain stuck at 55% of the
available world utility. The WLU function is more interesting: the agents using
this function, even starting from the same value as TG and UD, rapidly increase
the world utility within the first 2,000 weeks, and after week 4,000 they have a
constant growth rate that tends to stabilize after week 12,000 at about 90% of the
available world utility. From these considerations, we infer that the WLU function
gives agents meaningful rewards, so they can clearly and rapidly understand the
effective consequences of their own actions.

³ The normalization is computed with respect to the maximum value derived above.

Figure 8.5: Results of the first bar configuration with the exponential reward
functions (normalized world utility versus weeks for the WLU, TG and UD utility
functions)
In Figure 8.6 the results obtained with the Gaussian reward functions (σ = 2
and k = 7.2) are depicted. The world utility trends achieved with the different
utility functions start from the same point, but the one obtained with the WLU
function rapidly increases to 85% after 3,000 weeks. The agents using the TG and
the UD utility functions show a clearly decreasing trend of the world utility,
which then stabilizes at about 35-40% of the available world utility. It is
interesting to qualitatively compare the TG and UD trends of Figure 8.5 (exponential
reward function) with those of Figure 8.6 (Gaussian reward function): ignoring
the starting values, we clearly note that the former have an increasing trend in
the first weeks, while the latter rapidly decrease towards 40% in the first 4,000
weeks (recall the signal-to-noise ratio explained in Section 4.2.3).
As we will see below, this behavior is mainly due to the reward function.

Figure 8.6: Results of the first bar configuration with the Gaussian reward functions
(normalized world utility versus weeks for the WLU, TG and UD utility functions)
It is interesting to compare the results obtained with the exponential reward
function and the Gaussian one. At first sight we might compare the world
utilities depicted in Figure 8.5 and Figure 8.6 directly, but this would be
incorrect, since they are normalized with respect to different maximum values (one
related to the exponential reward function and one to the Gaussian one). We can
nevertheless observe the difference between the world utility values achieved by
both WLU functions and those obtained by the TG and the UD utility ones.
This may lead us to deduce that the exponential WLU function is
less selective than the Gaussian one, so it can cause less accurate results.
Apart from these qualitative remarks, a better way to formally compare the results
obtained is to use the relative entropy (also known as Kullback-Leibler distance,
KLd for short). Given two probability distributions p(x) and q(x) of a discrete
variable X, the KLd is defined as:
\[
KLd(p \,\|\, q) = \sum_{x \in X} p(x) \cdot \log \frac{p(x)}{q(x)} \tag{8.1}
\]
The KLd is always non-negative (Gibbs’ inequality) and it equals zero if and only
if p = q. So, in Equation (8.1) we set q equal to the probability distribution of
one of the five different optimal bar configurations and p equal to the bar
configuration in a given week (X represents the days of a week), e.g.:
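\[
\begin{aligned}
p   &= \left[ \tfrac{10}{30}, \tfrac{7}{30}, \tfrac{1}{30}, \tfrac{4}{30}, \tfrac{8}{30} \right], \\
q_1 &= \left[ \tfrac{22}{30}, \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{2}{30} \right], \\
q_2 &= \left[ \tfrac{2}{30}, \tfrac{22}{30}, \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{2}{30} \right], \\
    &\;\;\vdots \\
q_5 &= \left[ \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{2}{30}, \tfrac{22}{30} \right]
\end{aligned}
\]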
Thus the KLd for week w is modified as follows:
\[
KLd_w(p_w \,\|\, q) = \min_{i = 1, \dots, 5} \sum_{x \in X} p_w(x) \cdot \log \frac{p_w(x)}{q_i(x)} \tag{8.2}
\]
As a consequence, for each week w we can compute the distance between the
probability distribution pw(x) of that week and the optimal probability distribution
qi(x) using Equation (8.2), thus we can obtain the trend of the KLd (that is
influenced by the different utility functions).
We expect the probability distribution obtained with the Gaussian WLU
function to have a KLd lower than (or at least similar to) the one obtained with
the exponential WLU function, as a consequence of the considerations explained
in Section 6.3. Notice that we may have zero probability of attending the bar on a
specific night x, in which case we cannot compute the KLd. To avoid this
problem we add 1 to each probability value.
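Equation (8.2) together with the offset described above can be sketched as follows. The names `kld`, `optimal_distributions` and `kld_week` are hypothetical, and two points are assumptions rather than statements of the original procedure: the offset of 1 is applied to both p and q, and the shifted values are not renormalized.

```python
import math


def kld(p, q, offset=1.0):
    """Relative entropy of Equation (8.1), with an offset added to every value."""
    return sum((px + offset) * math.log((px + offset) / (qx + offset))
               for px, qx in zip(p, q))


def optimal_distributions(n_agents=30, n_days=5, optimum=2):
    """One optimal distribution per choice of the overcrowded day."""
    crowded = n_agents - optimum * (n_days - 1)
    return [[(crowded if day == i else optimum) / n_agents for day in range(n_days)]
            for i in range(n_days)]


def kld_week(attendance, optimum=2):
    """Equation (8.2): minimum KLd between a week and the optimal configurations."""
    total = sum(attendance)
    p = [a / total for a in attendance]
    return min(kld(p, q) for q in optimal_distributions(total, len(attendance), optimum))


print(kld_week([10, 7, 1, 4, 8]))
```

Because the shifted values of p and q have the same total, the quantity stays non-negative and is zero only when the week matches one of the optimal configurations, so it can still be read as a distance from the optimum.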
Figure 8.7 depicts the resulting KLd of the probability distributions obtained
with the Gaussian and the exponential WLU functions over 10 different runs of the
first 2,000 weeks: it is interesting to note that both WLU functions have the
same trend, but in the early weeks (the first 400) the exponential WLU function
seems to reach the minimum distance more quickly than the Gaussian one, while
after week 400 they show the same behavior.

Figure 8.7: Moving average of the relative entropy of the WLU functions (relative
entropy versus weeks; Gaussian WLU with std. dev. = 2.0 and exponential WLU)
For completeness we also present the KLd of the probability distributions
obtained with the TG and UD utility functions in Figure 8.8(a) and
Figure 8.8(b). It is interesting to note that the TG utility functions (both
Gaussian and exponential) show about the same, quite noisy, behavior, and their
relative entropy converges to the starting value of the relative
entropy obtained with the exponential WLU. The relative entropy related to the
agents using the UD utility functions is clearly worse: it shows an increasing
behavior⁴, so we can assert that both UD utility functions lead the agents
away from the optimal configuration.
⁴ Recall that the relative entropy measures the distance between a probability
distribution p and an optimal probability distribution q, so the lower it is, the
better.

Figure 8.8: Moving average of the relative entropy of the TG and UD utility functions
(relative entropy versus weeks; panel (a) TG, panel (b) UD, each with the Gaussian
(std. dev. = 2.0) and the exponential variant)

Given the theoretical results obtained with the KLd, it is interesting to evaluate
the agents’ behavior when they use the Gaussian WLU function or the exponential
one. Figure 8.9 depicts the bar attendance: the red boxes represent the optimal
configuration, while the blue error lines represent the fluctuation of the number of
agents from the optimal configuration on each day of the last 10 weeks. With the
exponential WLU function of Figure 8.9(a) the overcrowded day has a lower
deviation than the overcrowded one obtained with the Gaussian WLU
function of Figure 8.9(b), but the optimal days have larger fluctuations compared to
the ones achieved with the Gaussian WLU function. These are small differences,
because this bar configuration involves relatively few agents. Anyway,
these results confirm what was stated in Section 6.3 about the agents’ behavior when
we change the utility function.
Figure 8.9: Attendance of the first bar configuration (agents attending the bar
versus days of the week, showing the optimal configuration and the deviations;
panel (a) exponential WLU function, panel (b) Gaussian WLU function with σ = 2.0)
8.2.2 Second Bar Configuration
In this experiment, we have a larger environment with 60 agents, 7 days per
week and 4 agents as optimal number. The results are averaged over 10 different
runs, each one composed of 20,000 weeks. As in Section 8.2.1, the
maximum performance is given by 4 agents attending the bar on 6 days of
the week and 36 agents on the remaining one. In the same way, we compare
the results obtained with the exponential reward function and the Gaussian one
(here with σ = 2.5 and k = 9.0) using the minimum KLd between the
current probability distribution and the optimal probability distribution of the bar
attendance. All agents use an ε-greedy exploration policy (ε = 0.1 decreasing over
time), learning factor α = 0.5 and discount factor γ = 0.95.
Figure 8.10: Results with the second bar configuration with the exponential reward
functions (normalized world utility versus weeks for the WLU, TG and UD utility
functions)
In Figure 8.10 we see the normalized results obtained with the exponential reward functions. All agents start from the same world utility value and, during the first 1,000 weeks, they show the same growth rate. After that, the growth rate of the agents using the TG and UD functions slightly decreases, and these agents reach about the same world utility value at week 4,000. On the other hand, the world utility of the agents using the WLU function continues to increase until week 6,000, when it reaches about 90% of the available value. In this situation we may say that the exponential reward functions seem to distribute similar rewards to agents, so that agents might pick non-optimal nights to attend the bar.
[Figure: normalized world utility (y-axis, 0.3 to 0.9) versus weeks (x-axis, 0 to 20,000) for the WLU, TG and UD functions]
Figure 8.11: Results of the second bar configuration with the Gaussian reward functions
In Figure 8.11 the results obtained with the Gaussian reward functions are depicted. Here the results are qualitatively different from those of Figure 8.10: all the world utility curves start around the same value, but the world utility obtained with the WLU function increases rapidly and settles at about 80% within 3,000 weeks. On the other hand, it is interesting to note that the agents using the Gaussian TG and UD utility functions obtain a lower world utility than those using the exponential TG and UD ones, and their slow growth stops within the first 1,000 weeks. As a consequence, we could further affirm that the Gaussian reward functions seem to distribute reward more selectively than the exponential ones.
In order to see whether the Gaussian reward functions are better than the exponential ones, it is interesting to compare the results obtained with the two families of utility functions, and to notice the different values achieved by both WLU functions compared to those obtained with the TG and UD ones. We again use the minimum KLd between the probability distribution of the bar attendance p_w(x) at week w and the optimal ones q_1(x), ..., q_7(x), computed as in Equation (8.2) (in this case i ranges from 1 to 7, and q_i(x) and p_w(x) are changed accordingly), to assess the quality of the Gaussian WLU function with respect to the exponential one.
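The measure just described can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Equation (8.2): it assumes that each optimal distribution q_i places the overcrowded day on day i, that attendance counts are normalized by the number of agents, and that the divergence is taken as KL(p_w || q_i). All function names are hypothetical.

```python
import math


def optimal_distributions(n_agents=60, n_days=7, optimal=4):
    """One optimal attendance distribution per choice of overcrowded day."""
    crowded = n_agents - optimal * (n_days - 1)
    dists = []
    for crowded_day in range(n_days):
        counts = [optimal] * n_days
        counts[crowded_day] = crowded
        dists.append([c / n_agents for c in counts])
    return dists


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p(x) = 0 contribute nothing."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


def min_kl(attendance, n_agents=60, optimal=4):
    """Smallest KL divergence from the attendance to any optimal layout."""
    p = [a / n_agents for a in attendance]
    qs = optimal_distributions(n_agents, len(attendance), optimal)
    return min(kl_divergence(p, q) for q in qs)


print(min_kl([4, 4, 36, 4, 4, 4, 4]))  # an optimal week
print(min_kl([5, 4, 35, 4, 4, 4, 4]))  # one agent on the wrong day
```

Taking the minimum over all seven layouts makes the measure indifferent to which day is the overcrowded one.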
[Figure: relative entropy (y-axis, 0 to 0.045) versus weeks (x-axis, 0 to 2,000) for the Gaussian WLU (std. dev. = 2.5) and exponential WLU functions]
Figure 8.12: Mobile mean of the WLU functions relative entropy
Figure 8.12 depicts the KLd of the probability distributions obtained with the Gaussian and the exponential WLU functions, over 10 different runs of 2,000 weeks. While in the previous bar configuration (Section 8.2.1) the differences between the probability distributions induced by the Gaussian and the exponential WLU functions were slight, here the advantages of the Gaussian utility function over the exponential one are more evident: the probability distribution induced by the Gaussian WLU function rapidly lowers the relative entropy within the first 500 weeks, that is, it gets close to the optimal probability distribution of the bar attendance. The probability distribution induced by the exponential one reaches about the same value after 800 weeks, and both remain constant after week 1,200, although the former seems to have lower relative entropy values than the latter and is less noisy. These results confirm what was stated in Section 6.3 and the previous considerations about the advantage of the Gaussian reward functions over the exponential ones.
[Figure: relative entropy versus weeks (x-axis, 0 to 2,000) in two panels. (a) TG: Gaussian TG and exponential TG, y-axis 0.026 to 0.035. (b) UD: Gaussian UD (std. dev. = 2.0) and exponential UD, y-axis 0.033 to 0.0355]
Figure 8.13: Mobile mean of the TG and UD utility functions relative entropy
For completeness we also present the KLd of the probability distributions obtained with the TG and UD utility functions, in both the exponential and the Gaussian form, in Figure 8.13(a) and Figure 8.13(b). It is interesting to note that in these figures the probability distributions induced by both the TG and UD utility functions are far from the optimal one, as in the first Bar Problem configuration (Section 8.2.1). The probability distribution obtained with the TG utility function tends to move away from the optimal probability distribution, while the one achieved with the UD utility function is heavily noisy, which is mainly due to the signal-to-noise ratio described in Section 4.2.3.
Given the theoretical results obtained with the KLd, it is interesting to evaluate the agents' behavior when they adopt the Gaussian WLU function or the exponential one. Figure 8.14 depicts the attendance of the bar: the red boxes represent the optimal configuration, while the blue error lines represent the fluctuation of the number of agents attending the bar around the optimal configuration on each day of the last 10 weeks. It is easy to verify that with the Gaussian WLU function the bar configuration shows larger fluctuations from the optimal configuration than with the exponential one, but these deviations are not simultaneous; in fact, the distance from the probability distribution of the optimal configuration is lower with the Gaussian WLU function than with the exponential one (Figure 8.12). In some cases agents may implicitly create a coalition and then change the overcrowded day on which they attend the bar, so as not to lower the world utility. Obviously, this particular behavior influences neither the KLd nor the world utility; in fact, both are based only on the distance from the optimal configuration.
[Figure: two bar charts of the number of agents attending the bar (y-axis, 0 to 42) for each day of the week (x-axis, days 1 to 7), showing the optimal configuration and the deviations from it. Panels: (a) Exponential WLU function; (b) Gaussian WLU function (σ = 2.5)]
Figure 8.14: Attendance of the second bar configuration
8.2.3 Q-learning Dynamics of the Bar Problem
As explained in Section 6.4, with Equations (6.4), (6.5), (6.7) and (6.8) to (6.13) we can compute the Q-learning dynamics of the Bar Problem and examine how the agents' behavior changes over the weeks. The change in the behavior of agent ηi is obviously influenced by the behavior of the other agents {η−i}. When τ increases (see Equations (6.4) and (6.5)), agents are led to exploration, that is, they disregard their past experience (in the extreme case agents act randomly); conversely, when τ tends to 0, the Q-learning dynamics lead agents to consider only their past experience, thus discarding exploration. Hence, the temperature usually starts from a high value (more exploration) and is decreased over time (more exploitation).
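The effect of the temperature can be illustrated with a small sketch. It assumes Boltzmann (softmax) action selection, a common setting for this kind of Q-learning dynamics; Equations (6.4) and (6.5) are not reproduced here, and the Q-values below are hypothetical.

```python
import math


def boltzmann_policy(q_values, tau):
    """Softmax over Q-values with temperature tau (> 0)."""
    top = max(q_values)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp((q - top) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q = [1.0, 0.8]  # hypothetical Q-values for "Monday" and "Tuesday"
for tau in (10.0, 0.1, 0.01):
    print(tau, boltzmann_policy(q, tau))
```

At a high temperature the two actions are chosen with almost equal probability, while at a low temperature nearly all the probability mass goes to the action with the larger Q-value.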
We initialize Equation (6.7) as follows (initial conditions for the system of differential equations (6.2)):
\[
\Pi = \begin{bmatrix}
0.2 & 0.21 & 0.22 & 0.23 & 0.24 & 0.25 & 0.26 & 0.27 \\
0.8 & 0.79 & 0.78 & 0.77 & 0.76 & 0.75 & 0.74 & 0.73
\end{bmatrix}^{T} \tag{8.3}
\]
and we set α = 0.1 and τ = 0.1; as a consequence we obtain the results depicted in Figures 8.15(a), 8.15(b) and 8.15(c). Notice that here we always consider the Bar Problem with a Gaussian reward function (with optimal number of agents attending the bar c = 1, standard deviation σ = 0.5 and constant k = 1), because in Section 8.2.1 and Section 8.2.2 we saw that the Gaussian reward function gives better results than the exponential one.
In Figure 8.15 we can see how poorly the UD utility function performs: all agents converge to a uniform policy, that is, they have the same probability of attending the bar on Monday or on Tuesday, and their probability distributions follow the same evolution.
The TG and WLU functions induce the same behavior: with these utilities the agent with the highest probability of attending the bar on Monday (agent 8, see Equation (8.3)) increases that probability towards 1, while the others converge to attending the bar on Tuesday. In this case we reach an admissible optimal bar configuration.
It is interesting to find the critical temperature τ at which the WLU and TG utility functions lead agents towards a uniform policy like the one depicted in Figure 8.15(c). We keep α = 0.1 fixed and no longer consider the UD function, since it always leads agents towards a uniform probability distribution. Through some experiments we found τWLU = 0.140 for the WLU function and τTG = 0.154 for the TG utility function.
In Figure 8.16 we show the results for the WLU and TG utility functions at these critical temperatures. The WLU function with τWLU converges to the optimal policy in about 200 weeks; when increasing τWLU further, we ran into a singularity while numerically solving the system of differential equations stated in Equation (6.2). The TG utility function with τTG converges to the optimal policy in about 750 weeks; when increasing τTG further, agents tend towards a uniform policy.
[Figure: probability of each agent (agents 1 to 8) attending the bar on Monday (y-axis, 0 to 1) versus weeks (x-axis, 0 to 70), with τ = 0.1 and α = 0.1. Panels: (a) WLU function; (b) TG utility function; (c) UD utility function]
Figure 8.15: Bar dynamics with 8 agents
[Figure: probability of each agent (agents 1 to 8) attending the bar on Monday (y-axis, 0 to 1) versus weeks, with α = 0.1. Panels: (a) WLU function, τ = 0.14, weeks 0 to 200; (b) TG utility function, τ = 0.154, weeks 0 to 800]
Figure 8.16: τ for the bar dynamics with 8 agents
Given the results depicted in Figure 8.16, it is interesting to analyze how the agents behave dynamically with the TG and WLU functions. Keeping the temperature τ fixed at 0.14 (note that this temperature is critical for the WLU function but not for the TG utility one), Figure 8.17 shows how the agents act. The agents using the TG utility function reach a 90% probability of attending the bar on Monday after 140 weeks, while those using the WLU one do so after 155 weeks. Both reach a 100% probability of attending the bar on Monday after 200 weeks, with the same rate of increase (even if agent 6 behaves slightly differently).
[Figure: probability of each agent (agents 1 to 8) attending the bar on Monday (y-axis, 0 to 1) versus weeks (x-axis, 0 to 200), with τ = 0.14 and α = 0.1. Panels: (a) WLU function; (b) TG utility function]
Figure 8.17: Bar dynamics with 8 agents and τ = 0.14
To justify these results, it is important to note that in this bar configuration agents converge towards the optimal bar allocation with both the WLU and TG utility functions. This is not too surprising, since we are dealing with a heavily reduced problem (2 days per week and only 8 agents). In such a situation the signal-to-noise ratio (which in the general case always penalizes the TG utility function, see Section 4.2.3) does not affect the TG utility function, since each agent can clearly discern the consequences of its actions.
Notice that in these experiments we used a particular policy matrix Π (Equation (8.3)), with which the system converges to the optimal configuration with both the WLU and TG utility functions. This is expected: apart from the initial transient (see Figure 8.15(a) and Figure 8.15(b)), the fact that the agent with the highest initial probability of attending the bar on Monday always converges on that day is due to the first term of both Equation (6.4) and Equation (6.5). Initially, the expected number of agents attending the bar on Monday is 1.88 (greater than the optimal one), so all agents decrease their probability of attending the bar on Monday (and Tuesday becomes overcrowded). After the fifth week, agent 8 slowly increases its probability of attending the bar on Monday, because the expected number of agents attending on that day is now about 0.64. This leads agent 8 to attend the bar on Monday, because it continuously receives higher rewards than expected and therefore follows its policy “attend the bar on Monday”.
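The initial expected attendance quoted above follows directly from the policy matrix of Equation (8.3), since the expected number of agents on a day is the sum of the individual probabilities of attending on that day. A minimal check (variable names are illustrative):

```python
# Monday probabilities of agents 1..8 from Equation (8.3).
monday = [0.20, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27]
tuesday = [1.0 - p for p in monday]

expected_monday = sum(monday)
expected_tuesday = sum(tuesday)

print(f"expected on Monday:  {expected_monday:.2f}")
print(f"expected on Tuesday: {expected_tuesday:.2f}")
```

The Monday value exceeds the optimal number c = 1, which is why every agent initially lowers its probability of attending on Monday.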
To verify this behavior, it is interesting to see how the system evolves when we use particular initial policy matrices Π. Let us start with this policy matrix:
\[
\Pi_u = \begin{bmatrix}
0.9999 & 0.53 & 0.52 & 0.51 & 0.49 & 0.48 & 0.47 & 0.46 \\
0.0001 & 0.47 & 0.48 & 0.49 & 0.51 & 0.52 & 0.53 & 0.54
\end{bmatrix}^{T} \tag{8.4}
\]
[Figure: probability of each agent (agents 1 to 8) attending the bar on Monday (y-axis, 0 to 1) versus weeks (x-axis, 0 to 100); WLU bar dynamics with τ = 0.1 and α = 0.1]
Figure 8.18: Uniform policies obtained with Πu
In this situation Equation (6.4) and Equation (6.5) tell us that the initial expected number of agents attending the bar on Monday, agent 1 excluded, is equal to 3.46 (see footnote 5), so agents 2 to 8 will act randomly (uniform probability policy). Agent 1 expects (including itself) 4.4599 agents attending the bar on Monday and 3.5401 agents attending the bar on Tuesday, so it will converge to a uniform (random) policy even though it starts with a high probability of attending the bar on Monday. This behavior does not depend on the utility function (TG, WLU or UD), but only on Equation (6.4) and Equation (6.5), and in particular on the expected number of agents attending the bar on Monday or Tuesday. In Figure 8.18 we can see how the system behaves: all the previous considerations hold, and all agents converge towards a uniform policy, that is, they act randomly.
Footnote 5: Obviously, if we include agent 1 these results do not change, apart from the numerical values.
Another interesting policy matrix is the following:
\[
\Pi_c = \begin{bmatrix}
0.9 & 0.22 & 0.23 & 0.24 & 0.21 & 0.20 & 0.19 & 0.18 \\
0.1 & 0.78 & 0.77 & 0.76 & 0.79 & 0.80 & 0.81 & 0.82
\end{bmatrix}^{T} \tag{8.5}
\]
[Figure: probability of each agent (agents 1 to 8) attending the bar on Monday (y-axis, 0 to 1) versus weeks (x-axis, 0 to 100); WLU bar dynamics with τ = 0.1 and α = 0.1]
Figure 8.19: Policies of agents obtained with Πc
This configuration is similar to the previous one, in that one agent (agent 1) has a high probability of attending the bar on Monday. In the first weeks we expect this agent to decrease its probability of attending the bar on Monday (with the WLU and TG utility functions; with the UD function all agents always converge towards a uniform policy), because on that day, agent 1 excluded, there are 1.47 expected agents, while the others increase their probability of attending the bar on Monday, because on Tuesday there are 5.63 expected agents (including agent 1). At some point Monday appears overcrowded, so agents 2 to 8 lower their probability of attending the bar on that day. Meanwhile, agent 1 is decreasing its probability of attending the bar on Monday, but the coalition formed by agents 2 to 8 is decreasing those probabilities too, so agent 1 realizes that Tuesday is more overcrowded than Monday and increases its probability of attending the bar on Monday (while the others continue to decrease that probability, because the policy “attend the bar on Monday” returns lower rewards than the average policy); thus, the system converges towards an admissible optimal configuration. These considerations are illustrated in Figure 8.19.
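The expected attendance figures used in the discussion of Πu and Πc can be recomputed from Equations (8.4) and (8.5). The sketch below sums the Monday probabilities while leaving out one agent; the helper name is an illustrative assumption.

```python
def expected_without(monday_probs, agent):
    """Expected Monday attendance, excluding one agent (1-based index)."""
    return sum(p for i, p in enumerate(monday_probs, start=1) if i != agent)


pi_u = [0.9999, 0.53, 0.52, 0.51, 0.49, 0.48, 0.47, 0.46]  # Equation (8.4)
pi_c = [0.9, 0.22, 0.23, 0.24, 0.21, 0.20, 0.19, 0.18]     # Equation (8.5)

monday_u = expected_without(pi_u, agent=1)  # agents 2..8 on Monday, Pi_u
monday_c = expected_without(pi_c, agent=1)  # agents 2..8 on Monday, Pi_c
tuesday_c = sum(1.0 - p for p in pi_c)      # all agents on Tuesday, Pi_c

print(f"{monday_u:.2f} {monday_c:.2f} {tuesday_c:.2f}")
```

These are the quantities that each agent compares with the optimal number c = 1 when adjusting its policy.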
8.3 Cooking Teams Problem
In Section 7.3 we described a problem, well known in the literature, to formalize a new kind of game that involves task allocation as well as coalition formation. We showed the curse of the state space size in such problems, which grows exponentially with the number of agents of each type. In order to deal with the state space size, in all these experiments we introduced a new characteristic function (Equation (8.6) and Equation (8.7)) to evaluate the quality of a coalition, different from the one used to evaluate the world utility (Equation (7.5)), so that we can compare the results obtained with these two characteristic functions.
\[
N_2(f_c, f_h, \mu_c, \mu_h) = \kappa \cdot \frac{1}{2 \pi \sigma^2} \cdot \exp\left(-\frac{(f_c - \mu_c)^2 + (f_h - \mu_h)^2}{2 \sigma^2}\right) \tag{8.6}
\]
In Equation (8.6), fc and fh indicate the number of cooks and helpers respectively, µc and µh the optimal values according to Equation (7.5), and κ is a positive real number. Here we take the standard deviation σ = 1.5. This function is used to evaluate the coalition (n, 2 · n), where, according to Equation (7.5), it refers to coalition (n, 2 · n) (respectively, number of cooks and number of helpers). The function is evaluated if and only if n − n = ±1 (n, n ∈ N); elsewhere the Gaussian characteristic function returns 0.
For the coalition (4 · n, 0) we use the following Gaussian function:
\[
N_2(f_c, \mu_c) = \kappa \cdot \frac{1}{\sigma \sqrt{2 \pi}} \cdot \exp\left(-\frac{(f_c - \mu_c)^2}{2 \sigma^2}\right) \tag{8.7}
\]
where, as before, fc and µc indicate the actual number and the optimal number of cooks according to Equation (7.5) respectively, κ is a positive real number and the standard deviation σ is equal to 1.5. The function is evaluated if and only if n − n = ±2 (n, n ∈ N); elsewhere it returns 0.
Finally, for the coalition (0, n) we used the evaluation of the same coalition described in Equation (7.5), that is n (obviously n ∈ N).
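The two Gaussian characteristic functions can be written down as a short sketch. It implements only the formulas of Equations (8.6) and (8.7); the restriction that returns 0 outside the neighbourhood of an optimal coalition is not modelled, and the value of κ used below is a hypothetical placeholder, since the text only requires it to be positive.

```python
import math

SIGMA = 1.5  # standard deviation stated in the text


def gaussian_2d(fc, fh, mu_c, mu_h, kappa, sigma=SIGMA):
    """Equation (8.6): isotropic 2-D Gaussian scaled by kappa."""
    dist2 = (fc - mu_c) ** 2 + (fh - mu_h) ** 2
    return kappa / (2 * math.pi * sigma**2) * math.exp(-dist2 / (2 * sigma**2))


def gaussian_1d(fc, mu_c, kappa, sigma=SIGMA):
    """Equation (8.7): 1-D Gaussian scaled by kappa."""
    norm = sigma * math.sqrt(2 * math.pi)
    return kappa / norm * math.exp(-((fc - mu_c) ** 2) / (2 * sigma**2))


KAPPA = 1.0  # hypothetical scale; the thesis only says kappa is positive
for cooks in range(2, 7):
    print(cooks, round(gaussian_1d(cooks, mu_c=4, kappa=KAPPA), 4))
```

Both functions peak at the optimal coalition and decay smoothly around it, so states close to an optimum still receive a graded evaluation.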
8.3.1 Nonempty State Space
In this subsection we analyze the results obtained with the Cooking Teams Problem using the state space described in Section 7.3.3. These results refer to the four bar configurations proposed in Section 7.3.1. We used different parameters to configure the environment and the agents in order to see whether we can reach an optimal (or near-optimal) solution.
In the following, we show and analyze the results of different experiments. The ε-greedy exploration policy uses a parameter ε, the probability that an agent explores the environment rather than following its policy. This probability decreases over the weeks w according to the following law:
\[
\epsilon_w = \frac{\epsilon_{0w}}{1 + w \cdot dr_\epsilon} \tag{8.8}
\]
where εw is the current exploration ratio at week w, ε0w is the initial exploration ratio and drε is the exploration decreasing rate (greater than 0).
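The decay law of Equation (8.8) can be sketched in a few lines. The decay rate used below is a hypothetical value chosen only for illustration; the text does not state the rate used in the experiments.

```python
def epsilon_at(week, epsilon0=0.1, decay_rate=0.001):
    """Equation (8.8): exploration probability at a given week."""
    return epsilon0 / (1.0 + week * decay_rate)


for week in (0, 1_000, 10_000, 100_000):
    print(week, epsilon_at(week))
```

The probability starts at the initial ratio and decays hyperbolically, so exploration never stops completely but becomes rare in late weeks.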
Standard Configuration
In this test, all agents use the well-known Q-learning algorithm in order to find an optimal policy. The learning rate α is equal to 0.5 and each agent uses an ε-greedy exploration policy with ε = 0.1 decreasing over time. Here we compare the performance obtained in Bar-1 and in Bar-4, first using the characteristic function of Equation (7.5), then Equations (8.6) and (8.7), in order to see whether we can boost the results achieved by agents. Each graph is obtained as the average over 10 different runs of 150,000 weeks, and a moving average (mobile mean) is computed on the final average.
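The post-processing described here, averaging the runs and then smoothing the averaged series, might look like the sketch below. The window length and the choice of a trailing window are assumptions, since the text does not specify them.

```python
def average_runs(runs):
    """Pointwise mean of several equally long world-utility series."""
    n = len(runs)
    return [sum(values) / n for values in zip(*runs)]


def moving_average(series, window):
    """Trailing moving average; early points use the samples seen so far."""
    smoothed, running = [], 0.0
    for i, value in enumerate(series):
        running += value
        if i >= window:
            running -= series[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Hypothetical toy data: two short runs instead of 10 runs of 150,000 weeks.
mean_series = average_runs([[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]])
print(moving_average(mean_series, window=2))
```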
[Figure: world utility versus weeks (x-axis, 0 to 160,000) for the WLU, SU, TG and UD functions. Panels: (a) Bar-1, y-axis 0 to 30; (b) Bar-4, y-axis 0 to 26]
Figure 8.20: Bar-1 and Bar-4 with α = 0.5, ε = 0.1; the standard characteristic function of Equation (7.5) is used to compute both the world utility and the quality of a coalition of agents attending the bar
In Figure 8.20 we can clearly see the influence of the state space on the performance. Even though the growth rate is remarkable in the first weeks, the world utility settles on low values with respect to the maximum ones described in Table 7.1. Furthermore, here we can observe the concept of problem difficulty described in Section 7.3.3. In Figure 8.20(b) the agents using the UD utility function tend to lower the world utility, while in Figure 8.20(a) they show a slight increase (Bar-4 is more difficult than Bar-1, since it has many cooks relative to helpers, so the latter are shared resources sought after by the cooks). The agents using the SU function, instead, tend to increase the world utility value, but it remains clearly lower than the optimal one.
In order to assess the quality of Equation (7.5) in assigning rewards to agents, we keep using that function to evaluate the world utility, while Equations (8.6) and (8.7) compute the reward to be assigned to each agent.
In Figure 8.21 we still perceive the lower difficulty of Bar-1 with respect to Bar-4. While in Figure 8.21(a) the agents using the SU function obtain a lower world utility than in Figure 8.20(a), in Figure 8.21(b) all the reward functions achieve better results than in Figure 8.20(b). This is due to the smoothness of the Gaussian functions used by the reward function, since they create attraction fields around the optimal points in the state space. With Equation (7.5), instead, the points around the optimal ones are negatively evaluated or, at best, yield a zero reward. As a consequence, all agents tend to stay away from them, because they receive negative rewards there and prefer to visit the states that provide zero reward.
Looking at the Gaussian characteristic function of Equation (8.7), we can see that if more than two cooks attend the bar, they gain a useful reward (either positive or negative) that leads them towards optimal states. For example, suppose we have the coalition S = {3, 0}. In this case the characteristic function of Equation (7.5) returns 0 to each cook. In contrast, with the Gaussian characteristic function of Equation (8.7) each cook gains 2.01380 (supposing they are using the SU function). In this situation each cook is led to visit that state and, hopefully, the nearby optimal one (that is, S = {4, 0}).
[Figure: world utility versus weeks (x-axis, 0 to 160,000) for the WLU, SU, TG and UD functions. Panels: (a) Bar-1, y-axis 0 to 30; (b) Bar-4, y-axis 0 to 26]
Figure 8.21: Bar-1 and Bar-4 with α = 0.5, ε = 0.1; the characteristic function of Equation (7.5) is used to compute the world utility, while the characteristic functions of Equations (8.6) and (8.7) are used to evaluate the quality of a coalition of agents attending the bar
It is interesting to check which states of each agent's Q-table are visited most often in the Bar-4 configuration, in order to understand which coalitions are formed during the learning phase. As stated above, we expect most cooks to form homogeneous coalitions composed only of cooks, while helpers are shared resources in this configuration; thus the optimal choice for a cook is to form a coalition with 2 helpers.
Figure 8.22 depicts the Q-table visits for each type of agent. In Figure 8.22(a) the most visited states correspond to coalitions formed by 3 to 7 cooks and 0 helpers. We have experimentally verified that most cooks tend to form coalitions composed only of cooks in order to get close to the optimal coalition (fc, 0) (where fc is a multiple of 4).
Only in a small number of cases do some cooks form the optimal coalition (fc, fh = 2 · fc) (fc cooks and fh helpers). Figure 8.22(b) shows this situation: the most visited state is state number 44, which corresponds to 0 cooks and 2 helpers (thus cooking 2 cookies). State number 45 (which corresponds to 1 cook and 2 helpers) is visited far less often than state 44, because each cook tends to form coalitions of only cooks, since these states are more numerous than the states associated with the coalition (fc, fh = 2 · fc). Since each cook tends to form such coalitions and each pair of helpers waits for a cook (who is unavailable), each helper tends to form coalitions composed only of helpers.
The same considerations hold for the situation described in Figure 8.22(c): the most visited state is state number 66, which corresponds to 0 cooks and 3 helpers. As before, state number 67 (1 cook and 3 helpers) is visited far less often than state 66, because there are no available cooks (see footnote 6).
Exploring the Environment
In the previous subsection we discussed the advantages of a Gaussian characteristic function over a “strict” characteristic function for assigning rewards in this problem. Now let us increase the exploration ratio in order to evaluate whether it results in better performance. This increase is motivated by the fact that here we are dealing with a bounded state space. The quality of each single state depends on the number of cooks and helpers; thus, if agents lack an appropriate exploration strategy, they all tend to visit the same non-optimal states.
Footnote 6: Recall that when the Gaussian functions of Equations (8.6) and (8.7) are used to compute the rewards, this coalition is negatively evaluated for each helper and positively for each cook. In any case, the world utility is computed using Equation (7.5), so that coalition does not improve it.
[Figure: three plots of the number of Q-table visits over states and actions (actions 0 to 6). Panels: (a) Cook, states from 0 to 10; (b) Helper, states from 43 to 47; (c) Helper, states from 65 to 69]
Figure 8.22: Q-table visits for cooks and helpers in Bar-4 with α = 0.5, ε = 0.1; the characteristic function of Equation (7.5) is used to compute the world utility, while the characteristic functions of Equations (8.6) and (8.7) are used to evaluate the quality of a coalition of agents attending the bar
The simplest method used to induce exploration is ε-greedy exploration, where a (decreasing) probability ε is used to explore the environment rather than exploiting it. Another method to induce exploration is to set the initial Q-values of each agent to a high value (they are usually set equal to rmax/(1 − γ), where rmax is the maximum reward obtainable (see footnote 7) and γ is the discount factor of the future expected reward used in the Q-learning update formula (see footnote 8)). It is easy to understand that the former method provides a random exploration policy, since during exploration an agent randomly chooses the action to be executed, while the latter enables a more uniform exploration policy.
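The optimistic initialization mentioned above can be sketched as follows. The value rmax = 10 is the one imposed in the text; the discount factor γ = 0.95 and the table sizes are assumptions made only for this illustration.

```python
R_MAX = 10.0  # maximum reward imposed in the text
GAMMA = 0.95  # assumed here; this section does not restate the discount factor


def optimistic_q_table(n_states, n_actions, r_max=R_MAX, gamma=GAMMA):
    """Q-table whose entries all start at r_max / (1 - gamma)."""
    q0 = r_max / (1.0 - gamma)
    return [[q0] * n_actions for _ in range(n_states)]


# Hypothetical sizes, for illustration only.
q_table = optimistic_q_table(n_states=3, n_actions=7)
print(q_table[0][0])
```

Because every untried action starts at the largest return an agent could ever collect, a greedy agent keeps trying actions until their estimates have been lowered by experience, which spreads visits more evenly than random exploration.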
Footnote 7: Here we impose rmax = 10.
Footnote 8: Equation (2.3).
In Figure 8.23 we can see that also in this case the Gaussian characteristic functions of Equations (8.6) and (8.7) give smoother rewards than the characteristic function of Equation (7.5). With the former the world utility keeps increasing after the first 5,000 weeks, while with the latter the world utility tends to reach a constant convergence value after a few weeks.
[Figure: world utility versus weeks (x-axis, 0 to 160,000) for the WLU, SU, TG and UD functions. Panels: (a) Equation (7.5), y-axis 0 to 18; (b) Equations (8.6) and (8.7) are used as characteristic function, while Equation (7.5) is used to evaluate the world utility of Equation (7.6), y-axis 0 to 26]
Figure 8.23: Bar-4 with α = 0.5, ε = 0.3 over 150,000 weeks (here we used the characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are an average over 10 different runs
It is interesting to note how the exploration induced by increasing the initial value of the ε parameter results in better performance. The world utility depicted in Figure 8.23(b) has about the same convergence speed as in Figure 8.21(b), but in the former, after week 20,000, all four utility functions keep increasing until week 100,000, while in the latter, after week 20,000, the world utility reaches a constant value. Thus, the exploration strategy seems to lead agents towards better performance.
The same considerations hold if we use the characteristic function described by Equation (7.5): the world utility obtained by the agents using the SU function depicted in Figure 8.23(a) has a slightly lower convergence speed than the world utility achieved by the agents using the SU function depicted in Figure 8.20(a). The agents using the former still continue to improve their policy after week 60,000, while agents using the latter reach a convergence value lower than the one obtained in Figure 8.23(a).
The previous experiment suggests that the exploration strategy gives slightly better results. Figure 8.24 depicts the performance in Bar-4 obtained by the agents using the SU function and the characteristic functions of Equations (8.6) and (8.7). The different exploration values give a boost around week 80,000, where we see that the agents with a greater exploration factor achieve increasing world utility values. The policy of these agents is better tuned than that of the agents using other exploration rates. This is mainly due to the higher exploration rates: in the earliest weeks these agents explore the environment in such a way as to visit near-optimal states, so their policy is positively affected in later weeks.
From this experiment, we can draw another important consideration about the
state space. Looking at Figure 8.24, we can see that the agents using the SU
function with ε = 0.9 obtain better performance (albeit after a greater number of
weeks) than the other agents (which also use the SU function). This exploration value
means that these agents act in a semi-random fashion: in the first weeks they
choose a random action with probability 0.9. This result suggests
that the state space is heavily bounded, so that agents seem to achieve better
payoffs by acting in a semi-random fashion (at least in the first weeks).
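
As an illustration of the exploration scheme discussed above, the following minimal Python sketch shows ε-greedy action selection with an exploration rate that decreases over the weeks. The decay rule used here (a multiplicative factor `decay`) is only an assumption made for illustration: the experiments use the update formula of Equation (8.8), which is not reproduced in this sketch. The function names `epsilon_greedy` and `decayed_epsilon` are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties among equally good actions at random.
    candidates = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(candidates)


def decayed_epsilon(epsilon0, week, decay=0.9999):
    """Hypothetical schedule (NOT Equation (8.8)): epsilon0 * decay ** week."""
    return epsilon0 * decay ** week


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.0, 1.0, 0.5]
    for week in (0, 10000, 50000):
        eps = decayed_epsilon(0.9, week)
        print(week, round(eps, 4), epsilon_greedy(q, eps, rng))
```

With an initial value such as ε = 0.9 the agent behaves almost randomly in the first weeks and becomes progressively greedier, which is the semi-random behavior described in the text.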
We have experimentally verified that keeping the exploration rate fixed does
not improve the performance. If all agents use a fixed exploration rate, they converge
to a value lower than the one reached with a decreasing exploration rate.
This behavior is mainly due to the fact that they continue to explore the
environment rather than exploiting it. This further exploration is useless for the
agents: it does not allow them to improve their policy, because it
induces a random action selection with probability ε. Random action selection
is useful in the first weeks, but it becomes less and less useful over time.
[Plot: World Utility versus Weeks, one curve for each of ε = 1.0, 0.9, 0.7, 0.5, 0.3 and 0.1]
Figure 8.24: Comparison between the performance of Bar-4 (SU function) obtained using
different values of ε (0.1, 0.3, 0.5, 0.7, 0.9, 1.0), with Equation (7.5) used for the world
utility and Equations (8.6) and (8.7) used for the characteristic function. Each curve
is the mean of 5 different runs, and we plot one world utility value every 100 weeks (that is,
this experiment was executed over 500,000 weeks)
Up to now we have analyzed and discussed the results obtained using high ε
values; it is interesting to compare them with those obtained using
high initial Q-table values (keeping ε = 0.1, as in the standard configuration of the
experiments described here).
Figure 8.25 depicts the results obtained with Bar-4 (since it is the most difficult
configuration to solve). Even though the ε-greedy exploration policy is a random
exploration policy, the agents using it (with the SU function) seem to obtain
better world utility values (and a remarkable convergence time) than those using
high initial Q-values. High initial Q-values cause agents to choose evenly among the
different actions. In the first weeks these actions give the agents poor rewards, and the
Q-learning algorithm decreases the value of the corresponding state-action pair following
Equation (2.3). The problem is that this update decreases the value, but not as
fast as would be needed for the agent to consider the better Q-values (which are not
the highest ones). This behavior is followed by all agents, so there is a high probability
of obtaining a joint action that corresponds to a poor coalition structure CS. Using
different values of the ε parameter, instead, the agents choose the actions that promise a
high expected reward already in the first weeks.
[Plot: World Utility versus Weeks, curves for ε = 0.3 and for high q-values]
Figure 8.25: Comparison between the performance of Bar-4 (SU function) obtained using
ε = 0.3 and high q-values, with Equation (7.5) used for the world utility and Equations (8.6)
and (8.7) used for the characteristic function
Exploiting the Environment
[Plots: World Utility versus Weeks for (a) Bar-1 and (b) Bar-4, with curves for WLU, SU, TG and UD]
Figure 8.26: Bar-1 and Bar-4 with αS = 0.5, αNS = 0.1, ε = 0.1 over 150,000 weeks (here
we used the characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are
averages over 10 different runs. Each agent runs the CoLF algorithm
In the previous subsection we discussed the exploration strategy and
saw in particular that, by increasing the exploration ratio (that is, the ε value,
since we are using an ε-greedy policy), the agents using the SU function can reach
slightly higher world utility values. In order to better exploit the environment, each
agent must be configured with a suitable value of α, the learning rate used in the
Q-learning update formula (Equation (2.3)): greater values of α lead agents to discard
their past experience, while lower values lead agents to rely on their past experience
and give less weight to the new reward and the expected future reward. The CoLF
algorithm [1] proposes to use different learning rates to react to the rewards obtained
from the environment. A non-stationary learning rate is used when an agent receives an
unexpected payoff; otherwise the agent uses a stationary learning rate (which is greater
than the non-stationary one).
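
The role of the two learning rates can be sketched as follows. This is a minimal illustration, not the CoLF algorithm of [1]: the update is written in the standard Q-learning form (assumed here to correspond to Equation (2.3)), and the test used to decide whether a payoff is "unexpected" (a simple comparison with an expected reward and a `tolerance`) is a placeholder assumption. The names `q_update` and `choose_learning_rate` are hypothetical; the default values 0.5 and 0.1 are those of αS and αNS in Figure 8.26.

```python
def q_update(q, state, action, reward, next_state, alpha, gamma=0.9):
    """Standard Q-learning update (assumed form of Equation (2.3))."""
    target = reward + gamma * max(q[next_state])
    q[state][action] = (1.0 - alpha) * q[state][action] + alpha * target
    return q[state][action]


def choose_learning_rate(reward, expected_reward,
                         alpha_s=0.5, alpha_ns=0.1, tolerance=1e-6):
    """Return alpha_ns for an unexpected payoff, alpha_s otherwise.

    The 'unexpected' test is a placeholder assumption, not the CoLF rule.
    """
    unexpected = abs(reward - expected_reward) > tolerance
    return alpha_ns if unexpected else alpha_s


if __name__ == "__main__":
    q = [[0.0, 0.0]]          # one state, two actions
    reward = 10.0
    alpha = choose_learning_rate(reward, expected_reward=q[0][1])
    print(alpha, q_update(q, 0, 1, reward, 0, alpha, gamma=0.0))
```

A larger α moves the Q-value quickly towards the newest target, while a smaller α keeps it close to what was learned before, which is why the smaller rate is the one applied to unexpected payoffs.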
Figure 8.26 depicts the results obtained in the Bar-1 and Bar-4 configurations, where
all agents use the CoLF algorithm. Comparing these results with those shown
in Figure 8.21, it is clear that not even CoLF is able to
improve the world utility value at which the agents converge. On the other hand, the
convergence time of the agents using the WLU and TG utility functions seems to
improve in the Bar-4 configuration (Figure 8.26(b)), but even in this case they reach about
the same convergence value as in Figure 8.21(b).
These results confirm what was said above, that is, we are dealing with a heavily
bounded state space. Each agent must deal with that state space and with its
action set. It chooses an action according to the present state; hence, if it is in
a poor state, it chooses actions that do not improve the quality of that state. Therefore,
an agent must deal with two orthogonal components (the state space and the action
space), and the state space is the factor degrading the overall system performance.
As seen above in Figure 8.24, an easy but expensive way to partially avoid this
drawback is to increase the exploration ratio (in the first weeks all agents choose
their action in a semi-random fashion), but then many more weeks are needed
for the agents to learn even a suboptimal policy.
8.3.2 Empty State Space
In this subsection we analyze the results obtained with the Cooking Teams
Problem using an empty state space. These results refer to the four bar configurations
proposed in Section 7.3.1. We used different parameters to configure the
environment and the agents in order to see whether we can reach an optimal solution
(or a near-optimal one). In all these experiments all agents use an ε-greedy exploration
policy, where the initial value of ε is 0.1 and it decreases over time following
the update formula of Equation (8.8).
Standard Configuration
In this test all agents use the well-known Q-learning algorithm in order to find
an optimal policy. The learning rate α is equal to 0.5. Here we compare the
performance obtained in all four bar configurations presented in Section 7.3,
first using the characteristic function of Equation (7.5), then those of Equations (8.6)
and (8.7), in order to see how these characteristic functions model this problem. Each
graph is obtained as the average over 10 different runs of 100,000 weeks, and on the
final average we compute a moving average.
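
The post-processing described above (average over the runs, then a moving average) can be sketched as follows. This is a minimal illustration under stated assumptions: the window size and the use of a trailing window are not specified in the text, and the function names `average_runs` and `moving_mean` are hypothetical.

```python
def average_runs(runs):
    """Point-wise mean of several equally long world-utility series."""
    n_runs = len(runs)
    return [sum(values) / n_runs for values in zip(*runs)]


def moving_mean(series, window):
    """Trailing moving average; the first points use a shorter window."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(series):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= series[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


if __name__ == "__main__":
    runs = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
    mean_series = average_runs(runs)
    print(moving_mean(mean_series, window=2))
```

Averaging over runs reduces the variance due to random exploration, while the moving average smooths the week-to-week oscillations of a single averaged curve.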
[Plots: World Utility versus Weeks, with curves for WLU, SU, TG and UD, in four panels: (a) Bar-1 standard, (b) Bar-1 Gaussian, (c) Bar-2 standard, (d) Bar-2 Gaussian]
Figure 8.27: Bar-1 and Bar-2 with α = 0.5 and ε = 0.1 over 100,000 weeks (here we used
the characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are averages
over 10 different runs
Figures 8.27 and 8.28 clearly show how agents behave in this environment
configuration. In order to see how the characteristic functions induce
different behaviors among agents, we first show the results obtained using the
characteristic function of Equation (7.5) to compute both the rewards and the world
utility (Figures 8.27(a), 8.27(c), 8.28(a) and 8.28(c)). Furthermore, we followed the
same approach presented in Section 8.3.1: we used the characteristic functions
of Equations (8.6) and (8.7) to compute the rewards, while Equation (7.5) is used to
compute the world utility (Figures 8.27(b), 8.27(d), 8.28(b) and 8.28(d)).
[Plots: World Utility versus Weeks, with curves for WLU, SU, TG and UD, in four panels: (a) Bar-3 standard, (b) Bar-3 Gaussian, (c) Bar-4 standard, (d) Bar-4 Gaussian]
Figure 8.28: Bar-3 and Bar-4 with α = 0.5 and ε = 0.1 over 100,000 weeks (here we used
the characteristic functions of Equations (7.5), (8.6) and (8.7)). These graphs are averages
over 10 different runs
The behavior of the agents depicted in Figures 8.27(a), 8.27(c), 8.28(a) and
8.28(c) always outperforms the behavior represented in Figures 8.27(b),
8.27(d), 8.28(b) and 8.28(d). In this problem configuration, the agents coordinate
themselves and reach an equilibrium of the game (that is, an optimal coalition
structure CS∗). This coordination is mainly due to the absence of the state space.
In the previous experiments, where we considered the state space, the agents never
reached the optimal coalition structure, since they had to deal with both the state
space and the action space. The former is heavily bounded because the goodness
of a specific state is a function of the number of cooks and helpers visiting that
state. As a consequence, all agents choose actions based upon the goodness of
that state, and thus they have no incentive to explore the environment (they might
be in a local maximum of the world utility function). In this configuration, instead,
without any state space, the goodness of a specific day for an agent is
based only upon the action it has chosen; hence each agent focuses only on its action
space.
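
To make the empty state space setting concrete, the following minimal Python sketch shows a Q-learning agent whose table contains a single row, that is, one Q-value per action. It is an illustration under stated assumptions, not the implementation used in the experiments: the class name `StatelessQLearner` is hypothetical, the update omits the bootstrapping term (equivalent to assuming a discount factor of 0), and ε is kept fixed here for brevity.

```python
import random


class StatelessQLearner:
    """Q-learning agent with a single (empty) state: one Q-value per action."""

    def __init__(self, n_actions, alpha=0.5, epsilon=0.1, seed=None):
        self.q = [0.0] * n_actions
        self.alpha = alpha
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose(self):
        """Epsilon-greedy choice over the action set only."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q))
        best = max(self.q)
        return self.rng.choice([a for a, v in enumerate(self.q) if v == best])

    def learn(self, action, reward):
        # Assumption: no bootstrapping term (gamma = 0), so the update is an
        # exponential average of the rewards received for this action.
        self.q[action] += self.alpha * (reward - self.q[action])


if __name__ == "__main__":
    agent = StatelessQLearner(n_actions=3, seed=0)
    action = agent.choose()
    agent.learn(action, reward=10.0)
    print(action, agent.q)
```

Since the value of an action no longer depends on a state that is itself determined by the joint action, each agent only has to rank its own actions.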
Looking at the optimal values reported in Table 7.1, we see that in Bar-1,
Bar-2 and Bar-4 (Figures 8.27(a), 8.27(c) and 8.28(c)) the agents reach an admissible
optimal coalition structure CS∗ (or a near-optimal one). In Bar-3 (Figure 8.28(a)), instead,
the agents do not form an optimal coalition structure: in particular, the agents
using the TG and WLU functions perform clearly worse than those using
the SU and UD ones. In this configuration they probably need more exploration
and/or time in order to reach the optimal coalition structure.
In the previous case all agents use the characteristic function of Equation (7.5)
both to compute the rewards and to evaluate the world utility value. If we use the
characteristic functions of Equations (8.6) and (8.7) to compute the rewards, we can
clearly see that the agents do not form an optimal coalition structure (Figures
8.27(b), 8.27(d), 8.28(b) and 8.28(d)). This behavior is due to the
fact that the characteristic functions used to compute the rewards rate as good, for
example, a coalition S = {7, 0}, and thus the agents tend to choose that coalition.
On the other hand, the world utility computed according to Equation (7.5) assigns
that coalition a value of 0; hence the agents' behavior is not fully aligned with that
characteristic function.
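
The misalignment described above can be illustrated with a toy example. Both functions below are hypothetical stand-ins chosen only to reproduce the qualitative effect: they are not Equations (7.5), (8.6) and (8.7). We also read S = {7, 0} as a coalition with seven agents of one type and none of the other, which is our interpretation of the notation.

```python
import math


def v_world(cooks, helpers):
    """Hypothetical world-utility characteristic function:
    a team lacking one of the two agent types is worth nothing."""
    if cooks == 0 or helpers == 0:
        return 0.0
    return float(min(cooks, helpers))


def v_reward(cooks, helpers, mu=7.0, sigma=2.0):
    """Hypothetical Gaussian-shaped reward that looks only at the team size."""
    size = cooks + helpers
    return math.exp(-((size - mu) ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    for coalition in [(7, 0), (4, 3)]:
        print(coalition, v_reward(*coalition), v_world(*coalition))
```

In this toy setting both coalitions receive the same reward, but only the mixed one has a positive world utility, so agents that learn from the reward alone have no reason to prefer it.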
Chapter 9
Conclusions and Future Works
The important thing is not to stop questioning.
Albert Einstein (1879–1955)
Contents
9.1 Conclusions
9.2 Future Works
9.1 Conclusions
In this thesis we have proposed a new methodology to study interactions among
different types of agents acting in the same environment. In particular, we have
focused on cooperative interactions in which all agents learn, through
different RL techniques, to achieve a common goal. The agents acting in the environment
are completely independent, that is, they do not know the intentions (policies) of the
others and we do not allow any kind of communication among them.
At first, we studied a simplified problem, in which many agents of
the same type act in different environments, both stationary and non-stationary. Some
examples are the Bar Problem (Section 5.3, [19]) and the Gridworld (Section 5.2,
[18]).
In a single-agent environment the focus is on the policy learning algorithm,
that is, on the best way to use the reward assigned by the environment in
order to learn an optimal policy that lets the agent reach a goal. In the multiagent
case we have further difficulties related to the presence of many agents, and thus to
the interactions among them, which induce more non-stationarity. These interactions
may be formalized as cooperative or competitive. In this thesis we have focused on
cooperative interactions, always adopting the constraints described above (policy
unawareness, no communication). It is easy to understand that this case is more
difficult than the single-agent one. While in the single-agent case we focus on how
best to learn a policy (thus on the algorithm behavior), in the multiagent one we
have to focus also on the reward assignment, so that it can model a cooperative
or a competitive behavior. If we use the same methods as in the single-agent case (in
the same environment too), we will obviously induce a greedy behavior in the agents
(selfish agents).
In this case COIN ([23], [6]) is useful to induce a cooperative behavior in a
multiagent system rather than a competitive one. Unfortunately, studying COIN
we found some gaps related both to the problem approach and to its theoretical
grounds. As stated by ’t Hoen ([18]), the WLU function used to compute the
reward to be assigned to agents is symmetric, and thus we can run into a slow learning
speed. Furthermore, we have discovered some other gaps in the problem approach,
in particular related to the reward function of the Bar Problem, which is
not selective enough (see Section 6.3). Finally, we used the Q-learning dynamics
approach ([20]) in order to understand how an environment evolves during the
learning phase.
At this point, we have introduced new constraints on the previous approach. In
the real world there exist many situations where different players
are induced to group themselves in coalitions in order to reach a goal. Moreover,
coalition formation may be mandatory to reach a goal. This new working environment
causes new difficulties, because here we must consider two further facets
(in addition to the interactions among agents and the non-stationarity due to the
presence of many agents, as stated above):
• interactions among coalitions (thus more non-stationarity and environment
bounds);
• the distribution of a reward among all the agents belonging to a given coalition.
In this framework ([16]) there are two main approaches used to distribute a reward
among agents (the core and the Shapley value, see Section 6.5.3), which focus
on different characteristics (respectively, coalition stability and payoff distribution
among agents).
Before this thesis, this framework lacked a formalization of a particular kind
of real environment. We refer to environments with different types
of agents that, in order to reach a given goal, must join together in suitable
coalitions.
In this thesis, we have introduced and formalized a new type of game (task
allocation via coalition formation games) matching these characteristics. This new
game retains the interesting aspects of dispersion games (like the Bar Problem)
and coalition formation games, leaving out computationally heavy (and, for our
purposes, unnecessary) characteristics like the computation of the Shapley value. In this
formalization, we have defined new methodologies to distribute the reward among the
agents of a coalition that, together with those used to assign rewards to a coalition
(TG, WLU, SU, UD), play a fundamental part in achieving a learning goal. Furthermore,
we have applied this framework to the Cooking Teams Problem ([25]) in order to see
whether agents can reach an optimal coalition structure.
Another important feature to consider is the state space size. Already in the
multiagent case we may have to deal with an a priori large state space. Furthermore, with
coalition formation games the state space may become even larger, because it may
depend on the number of agents that form the different coalitions. By using a
feature-based state space, its size is reduced, but it is still heavily bounded: each
state reached depends on the joint action, and so does its goodness. As a consequence,
we do not use any kind of state space (to be precise, the state space size is equal to 1)
in order to avoid these constraints. With this configuration we studied the Cooking
Teams Problem and obtained encouraging results about these techniques.
9.2 Future Works
This thesis aims to provide a more realistic approach to problems in multiagent systems.
Although coalition formation games have already been studied in the literature, existing
work does not focus on systems with many different types of agents. An interesting
direction is to apply this formalization to known environments used in dispersion games
in order to assess its usefulness.
Another important problem to investigate is the curse of the state space size, that
is, which is the best state space representation for this kind of problem. In our
testbed problem we experimentally verified that, using an empty state space, the
agents reach an optimal (or near-optimal) coalition structure in most configurations.
This might not be true in general, since there may be different problems where the state
space plays a fundamental role in finding an optimal coalition structure. It may then be
necessary to investigate how to find a useful state space representation. We might apply
an approach like LEAP ([2]), used to find a valid mapping function from a state space to a
feature space. Alternatively, instead of using a state space elaboration like LEAP, it
is interesting to investigate the case where all agents in a coalition use a state space
representation (thus a feature space) different from that of other coalitions,
discarding all the useless states. This case is particularly interesting, since
the agents in a coalition could discard the states related to other coalitions.
An appealing facet is related to the definition of marginal contribution, which is
deeply related to the characteristic function used to model a problem. It would therefore
be useful to test different characteristic functions in our testbed problem, thus
modeling different behaviors (e.g., a moderator). As a consequence, it is necessary
to study how the agent performance may change using marginal contribution.
While studying marginal contribution, we realized that it gives a way to create
coalitions, but it does not formalize a method to distribute a reward among the different
agents belonging to a coalition. Marginal contribution is the core of the Shapley
value, which is used to distribute a reward among agents. Since
the Shapley value is computationally heavy, it is interesting to find an association
between marginal contribution and the Shapley value in order to rebuild
the Shapley value from all the experience of an agent (thus exploiting all the marginal
contribution values obtained by that agent). In this way we may obtain an
approximated Shapley value as well as an expected future Shapley value, which can
be used to distribute the reward obtained among the agents of a coalition.
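
The relation between marginal contributions and the Shapley value can be sketched as follows. The exact Shapley value of a player is its marginal contribution averaged over all orderings of the players, and a cheaper estimate averages over a sample of orderings only. This is one possible reading of the idea outlined above, not a method defined in this thesis: the toy characteristic function `v` and the function names are hypothetical.

```python
import itertools
import random


def marginal_contributions(order, v):
    """Marginal contribution of each player when joining in the given order."""
    coalition = frozenset()
    contributions = {}
    for player in order:
        with_player = coalition | {player}
        contributions[player] = v(with_player) - v(coalition)
        coalition = with_player
    return contributions


def shapley_exact(players, v):
    """Exact Shapley value: mean marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(itertools.permutations(players))
    for order in orderings:
        for p, c in marginal_contributions(order, v).items():
            totals[p] += c
    return {p: t / len(orderings) for p, t in totals.items()}


def shapley_sampled(players, v, n_samples, rng):
    """Monte Carlo estimate: mean marginal contribution over sampled orderings."""
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(n_samples):
        rng.shuffle(order)
        for p, c in marginal_contributions(order, v).items():
            totals[p] += c
    return {p: t / n_samples for p, t in totals.items()}


def v(coalition):
    """Toy game (hypothetical): worth 1 if the cook has at least one helper."""
    return 1.0 if "cook" in coalition and len(coalition) >= 2 else 0.0


if __name__ == "__main__":
    players = ["cook", "helper1", "helper2"]
    print(shapley_exact(players, v))
    print(shapley_sampled(players, v, 200, random.Random(0)))
```

The exact computation enumerates all n! orderings, which is what makes the Shapley value computationally heavy; the sampled version trades accuracy for cost, and its estimates still sum to the value of the grand coalition.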
Bibliography
[1] A. Bonarini, A. Lazaric, E. Munoz de Cote, and M. Restelli. Improving
Cooperation among Self-Interested Reinforcement Learning Agents. 2005.
[2] A. Bonarini, A. Lazaric, E. Munoz de Cote, and M. Restelli. LEAP: an
Adaptive Multi-Resolution Reinforcement Learning Algorithm. Journal of
Machine Learning Research 1, 2006. To appear.
[3] M. Bowling and M. Veloso. An Analysis of Stochastic Game Theory for
Multiagent Reinforcement Learning. 2000.
[4] M. Bowling and M. Veloso. Multiagent Learning Using a Variable Learning
Rate. In Artificial Intelligence, volume 136(2), pages 215–250, January 2002.
[5] C. Claus and C. Boutilier. The Dynamics of Reinforcement Learning in
Cooperative Multiagent Systems. In American Association for Artificial
Intelligence, pages 746–752, 1998.
[6] O. Etzioni, J. P. Muller, and J. M. Bradshaw, editors. General Principles of
Learning-Based Multi-Agent Systems, New York, May 1999. Proceedings of
the Third Annual Conference on Autonomous Agents, ACM Press.
[7] S. Hoberg. Reinforcement Learning for Autonomous Agents in a simulated
Multi-Player Game. June 2004.
[8] J. Hu and M. P. Wellman. Nash Q-learning for General-Sum Stochastic
Games. Journal of Machine Learning Research 4, pages 1039–1069, November
2003.
[9] International Conference on Machine Learning. Correlated-Q learning,
Washington DC, 2003.
[10] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement Learning:
A Survey. In Journal of Artificial Intelligence Research, chapter 4, pages
237–285. May 1996.
[11] M. Lauer and M. Riedmiller. An Algorithm for Distributed Reinforcement
Learning in Cooperative Multi-Agent Systems.
[12] J. Laumonier and B. Chaib-draa. Multiagent Q-learning: Preliminary Study
on Dominance between the Nash and Stackelberg Equilibriums. July 2005.
[13] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[14] E. Munoz de Cote. Learning to Form Coalitions. April 2006.
[15] M. Restelli. A Multi-Agent System for Multi-Agent Learning. PhD thesis,
Politecnico di Milano.
[16] T. Sandholm, K. Larson, M. Andersson, O. Shehory, and F. Tohmé. Coalition
Structure Generation with Worst Case Guarantees. In Artificial Intelligence,
volume 111, pages 209–238. Elsevier Science B.V., 1999.
[17] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT
Press, 1998.
[18] P. J. ’t Hoen and S. M. Bohte. Collective Intelligence with Sequences of
Actions - Coordinating Actions in Multi-Agent Systems. In Lecture Notes in
Artificial Intelligence, volume LNAI 2837, 2003.
[19] P. J. ’t Hoen and S. M. Bohte. COllective INtelligence with Task Assignment.
Report SEN-E0315, Stichting Centrum voor Wiskunde en Informatica, P.O.
Box 94979, 1090 GB Amsterdam (NL) Kruislaan 413, 10980 SJ Amsterdam
(NL), December 2003.
[20] P. J. ’t Hoen and K. Tuyls. Analyzing Multi-Agent Reinforcement Learning
using Evolutionary Dynamics.
[21] K. Tuyls, K. Verbeeck, and T. Lenaerts. A Selection-Mutation Model for
Q-learning in Multi-Agent Systems. ACM, July 2003.
[22] D. H. Wolpert and K. Tumer. Using Collective Intelligence to Route Internet
Traffic. 1999.
[23] D. H. Wolpert and K. Tumer. An Introduction to Collective Intelligence.
Technical Report 99-63, NASA-ARC-IC, June 2005.
[24] D. H. Wolpert, K. Tumer, and A. Agogino. Learning Sequences of Actions in
Collectives of Autonomous Agents. July 2002.
[25] M. Wooders. The Tiebout Hypothesis: Near Optimality in Local Public Good
Economies. In Econometrica, pages 1467–1486. 1980.