Sapienza Università degli Studi di Roma
Facoltà di Ingegneria
Master's Thesis in Computer Engineering
(Tesi di Laurea Specialistica in Ingegneria Informatica)

A Service-Based People Localization and Tracking System for Domotic Applications

Advisor: Ing. Massimo Mecella
Candidate: Francesco Leotta

Academic Year 2008/2009
to my mother, for the tireless love with which she raised me
to my father, for teaching me that respect is earned through work
and to my brother, for always being a confidant and a friend
Extended Abstract

The term domotics denotes the set of emerging practices aimed at developing a high degree of automation in the services offered by a home.

The goals of this thesis, developed within the SM4All project, were the study, design and implementation of a system for the localization, recognition and tracking of the subjects present inside a home. This functionality is fundamental to the development of a complete home automation system that allows the definition of "scenarios" through which users can customize the "behavior" of the home as a function of their position (exploiting the possibilities offered by modern installations and appliances).
PLT (People Localization and Tracking) systems fall into two categories, depending on whether or not they make use of markers; for the design of our system, PLaTHEA (People Localization and Tracking for HomE Automation), we chose the second approach, since we consider markers a source of physical discomfort and psychological conditioning for the users. This choice implies the use of techniques for analyzing video sequences of the scene to be monitored; from this stream of images the system needs to extract the physical coordinates of the tracked subjects, which presupposes a sense of depth that can only be obtained using two (or more) cameras; this process goes under the name of Stereo Vision and is the subject of Chapter 2.
Monitoring an entire home requires installing in every room a PLaTHEA peer composed of two network cameras (placed in the upper corner opposite the entrance door), a network switch and a computer running the developed software. The typical wiring of a home served by PLaTHEA is therefore the one depicted in Fig. 1.

Fig. 1: Two PLaTHEA peers deployed inside a home. For every room we have the basic elements of the installation.
Each of these peers provides clients (which are responsible for the composition of domestic services) with information about the identities and positions of the tracked subjects through a service-based network interface built on the UPnP (Universal Plug and Play) protocol; this information can be requested in synchronous mode (through blocking service calls) or asynchronous mode (exploiting a publish/subscribe interaction model). The component model in Fig. 2 details the interactions involving a PLaTHEA peer.
Fig. 2: The components of PLaTHEA with their responsibilities and dependencies.

The "data layer" of the system consists of the face database (used for recognition) and the database containing the calibration information of the cameras (which, as mentioned, are the main data sources); in particular, in order to start the system, the following operations have to be performed on each peer:

1. the internal calibration of the two cameras, which yields the parameters describing their behavior (both ideal and with distortion);

2. the stereo calibration, which describes the position of one camera relative to the other;

3. the external calibration, which describes the relation between the coordinate system centered on the left camera and the coordinate system of the scene.
The face database can be populated at any time and is not necessary for the operation of the system as a plain PLT system¹.

¹ In PLT systems, person recognition is not always present.
The core of the system is the component labeled Elaboration Core, whose architecture is described in Fig. 3. This component consists of five elaboration threads synchronized by means of software events.
Fig. 3: The Elaboration Core component in detail.

The video sequences (in MJPEG format) are acquired from the two cameras independently (by two separate threads that decompress the compressed format); they have the same nominal frame rate, but this often varies because of automatic light compensation; this causes an asynchrony between the two sequences that has to be resolved by a dedicated module called the synchronizer, which selects the frames to be discarded in each of the sequences so as to maintain synchrony, and outputs a sequence of pairs to be used for stereo vision (the synchronizer, which runs on an independent thread, also removes the distortion from both images and rectifies them).
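The following is a minimal sketch of such a timestamp-based pairing policy, under the assumption that each decoded frame carries an acquisition timestamp; all types and names are illustrative, not the actual PLaTHEA code.

```cpp
#include <cstdint>
#include <cstdlib>
#include <deque>
#include <optional>
#include <utility>

// Illustrative frame record: acquisition timestamp plus decoded pixels.
struct Frame { int64_t timestampMs; /* decoded image data ... */ };

// Pops frames from the two camera queues, discarding whichever frame lags
// behind whenever a pair is too far apart in time, and returns the first
// left/right pair whose timestamps differ by at most maxSkewMs.
std::optional<std::pair<Frame, Frame>>
nextSynchronizedPair(std::deque<Frame>& left, std::deque<Frame>& right,
                     int64_t maxSkewMs = 40) {
    while (!left.empty() && !right.empty()) {
        const Frame l = left.front();
        const Frame r = right.front();
        if (std::llabs(l.timestampMs - r.timestampMs) <= maxSkewMs) {
            left.pop_front();
            right.pop_front();
            return std::make_pair(l, r);   // candidate pair for stereo vision
        }
        // Drop the older frame and retry with the next one in that queue.
        if (l.timestampMs < r.timestampMs) left.pop_front();
        else right.pop_front();
    }
    return std::nullopt;                   // not enough frames buffered yet
}
```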
Localizing and tracking subjects requires the ability to distinguish them from everything in the reference scene that is static, which goes under the name of background; the background model must:

• allow a simple extraction of the foreground, that is, of all the mobile agents in the scene;

• be as insensitive as possible to sudden changes in the illumination of the scene (which may occur when a light source is switched on, a curtain is moved, and so on), but also to shadows, which in many cases produce false positives;

• be time-adaptive, so as to allow changes in the background (the displacement of furniture, ornaments and clothes, for example).
The desire to satisfy these constraints led to the analysis and experimentation of a multitude of published results, which finally resulted in the development of a hybrid method that, by combining the techniques presented in [23], [11] and [5], solves the problems related to the hardware employed (the Axis 207 cameras and their extreme sensitivity to illumination) and to the operating environment (the domestic one, with its great dynamism).
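The hybrid model itself is detailed in Chapter 4; purely as an illustration of the general mechanism it implements (a time-adaptive model, a foreground mask, shadow suppression), here is a sketch that uses OpenCV's stock MOG2 subtractor as a stand-in for the hybrid method, with an illustrative input file name.

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("room.mjpeg");    // illustrative test sequence
    // Stock time-adaptive model with shadow detection enabled; PLaTHEA's
    // own hybrid of [23], [11] and [5] replaces this component.
    cv::Ptr<cv::BackgroundSubtractorMOG2> bg =
        cv::createBackgroundSubtractorMOG2(/*history=*/500,
                                           /*varThreshold=*/16.0,
                                           /*detectShadows=*/true);
    cv::Mat frame, fgMask;
    while (cap.read(frame)) {
        bg->apply(frame, fgMask);          // updates the model over time
        // MOG2 marks shadow pixels as 127: keep only full foreground (255).
        cv::threshold(fgMask, fgMask, 200, 255, cv::THRESH_BINARY);
        cv::imshow("foreground", fgMask);
        if (cv::waitKey(30) == 27) break;  // ESC quits
    }
    return 0;
}
```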
Once the foreground pixels of the scene have been obtained, they have to be projected into a set of three-dimensional coordinates. As mentioned, stereo vision comes to our aid in this task; we need, however, an algorithm that efficiently associates the points in the left image with those in the right image; since the two images are rectified and free of distortion, corresponding points lie on the same line (the so-called epipolar line) and thus differ only in their x coordinate; this difference is called disparity. The algorithm employed is of the SAD (Sum of Absolute Differences) type, is implemented in the OpenCV library (inspired by [16]), and turns out to be very efficient (the precision obtained is lower than that of other, more expensive methods, but is nevertheless sufficient for our purposes).
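A minimal sketch of this block-matching step with OpenCV's SAD-based matcher on a rectified pair follows; the parameter values are illustrative only.

```cpp
#include <opencv2/opencv.hpp>

// SAD block matching on a rectified, undistorted pair. Both inputs must be
// 8-bit grayscale images; numDisparities and blockSize are illustrative.
cv::Mat computeDisparity(const cv::Mat& rectLeft, const cv::Mat& rectRight) {
    cv::Ptr<cv::StereoBM> bm =
        cv::StereoBM::create(/*numDisparities=*/64, /*blockSize=*/15);
    cv::Mat disp16;                              // fixed point: 16 * disparity
    bm->compute(rectLeft, rectRight, disp16);
    cv::Mat disparity;
    disp16.convertTo(disparity, CV_32F, 1.0 / 16.0);   // true pixel disparity
    return disparity;
}
```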
In PLT systems it is common practice to perform tracking using information extracted from a simulated top view, obtained by transforming the three-dimensional coordinates of the foreground pixels with respect to the camera reference system (computed using the disparity map of these pixels) into three-dimensional coordinates with respect to a reference system fixed to the room (the rotation matrix and the translation vector needed for this are obtained during external calibration). The algorithm for the projection and the identification of subjects is taken from [10], but we prefer to use a contour detector for the identification of tracking candidates. Fig. 4 shows the background model, the extracted foreground and the top view of the subject.
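A sketch of the coordinate change at the heart of this plan-view projection, assuming the rotation matrix R and translation vector t produced by the external calibration; function and variable names are illustrative.

```cpp
#include <opencv2/opencv.hpp>

// Maps a foreground point from the left-camera reference frame to the room
// reference frame: p_room = R * p_cam + t (R, t from external calibration).
cv::Point3d cameraToRoom(const cv::Point3d& pCam,
                         const cv::Matx33d& R, const cv::Vec3d& t) {
    cv::Vec3d p = R * cv::Vec3d(pCam.x, pCam.y, pCam.z) + t;
    // Dropping the vertical component of the result (and keeping it only as
    // a height statistic) yields the simulated top view used for tracking.
    return cv::Point3d(p[0], p[1], p[2]);
}
```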
Fig. 4: Example of the identification of a subject. The foreground pixels are extracted using the background as a reference. A simulated top view is then created, in which the subject is identified.

When a subject is identified, it is assigned a position, a velocity, its maximum and mean heights and a color template. The objects identified at time t are compared with those tracked at time t − 1 (whose current position is predicted using the stored velocity) on the basis of the color template (a technique taken from [25]), the positions and the mean heights. The objects identified at time t for which no correspondence could be found among the objects tracked at time t − 1 become new tracked objects (with zero initial velocity).
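A naive sketch of this correspondence step follows: each detection at time t is matched to the nearest predicted track, gated by distance and height; the color-template score of [25] would be added to the cost in the real pipeline. All names and thresholds are illustrative.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Illustrative records; positions and heights are in room coordinates.
struct Track     { double x, y, vx, vy, meanHeight; /* color template ... */ };
struct Detection { double x, y, meanHeight;         /* color template ... */ };

// Returns the index of the best matching track for a detection at time t,
// or -1 if none passes the gates (the caller then starts a new track with
// zero initial velocity). dt is the elapsed time since t - 1, in seconds.
int bestTrackFor(const Detection& d, const std::vector<Track>& tracks,
                 double dt, double maxDist, double maxHeightDiff) {
    int best = -1;
    double bestCost = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < tracks.size(); ++i) {
        const Track& tr = tracks[i];
        const double px = tr.x + tr.vx * dt;    // position predicted
        const double py = tr.y + tr.vy * dt;    // from the stored velocity
        const double dist = std::hypot(d.x - px, d.y - py);
        const double dh   = std::fabs(d.meanHeight - tr.meanHeight);
        if (dist > maxDist || dh > maxHeightDiff) continue;   // gating
        const double cost = dist + dh;          // naive combined score
        if (cost < bestCost) { bestCost = cost; best = static_cast<int>(i); }
    }
    return best;
}
```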
Face recognition is an operation carried out by an independent thread; the reason for this choice is that it is an expensive operation (and the larger the face database, the longer it takes), so putting it in sequence with the other operations would lower the frame rate handled by the system, making it unsuitable for tracking. For tracking purposes, the system should in fact maintain an average frame rate of 10 frames per second (the elaboration time is therefore 100 ms for each stereo pair of images).
For every high-resolution frame provided by the main elaboration thread, the face recognition thread has to perform the following operations (a sketch of the first step is given after this list):

1. perform face detection, that is, the extraction of the regions of the frame containing faces; for this purpose the system uses a Viola-Jones classifier;

2. for each detected face, perform recognition by matching SIFT features against each of the faces in the database; the score of a person is obtained by summing all the matches of the face under consideration in the current frame against all the faces registered for that person;

3. reproject the face to the ground, obtaining the corresponding tracked subject.
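A minimal sketch of the detection step with OpenCV's Viola-Jones cascade classifier; the cascade file path is illustrative.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Viola-Jones face detection on a high-resolution frame using one of the
// Haar cascades shipped with OpenCV (path illustrative). The returned
// rectangles are the regions later fed to the SIFT-based matching.
std::vector<cv::Rect> detectFaces(const cv::Mat& highResFrame) {
    static cv::CascadeClassifier cascade("haarcascade_frontalface_alt.xml");
    cv::Mat gray;
    cv::cvtColor(highResFrame, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);        // mitigates uneven illumination
    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(gray, faces, /*scaleFactor=*/1.1,
                             /*minNeighbors=*/3, /*flags=*/0,
                             /*minSize=*/cv::Size(40, 40));
    return faces;
}
```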
The main elaboration thread provides the face recognition thread with the high-resolution images and the information needed to reproject the faces to the ground, and queries it at every time slot:

• if the face recognition thread has finished, the information it provides is attached to the tracking information. In particular, if a tracked object is assigned the same identity three consecutive times, that identity is associated with the object;

• if the face recognition thread has not yet finished its operations, the main thread moves on to tracking on the next stereo pair provided by the synchronizer.
Great importance was given in the thesis to the installation phase. The administration GUI indeed provides simple tools for carrying out all the operations needed for this purpose.
Below we propose a brief description of every chapter of the thesis:

Chapter 1 This chapter contains a brief introduction to domotics and to how it can change the way we conceive of living at home. The SM4All project is placed in this context, providing the motivation for the development of this thesis.

Chapter 2 In this chapter the reader is given the background knowledge, about the camera model and stereo vision, that is useful in the following chapters.

Chapter 3 In this chapter we analyze the state of the art in the field of PLT and face recognition systems. This is a continuously evolving area of research. The two macro-problems are decomposed, and for each aspect the pros and cons of practices and algorithms are described. The chapter ends with a gallery of research projects in the area of PLT systems.

Chapter 4 In this chapter we describe how PLaTHEA uses and combines the techniques introduced in Chapter 3, and which original solutions we devised to solve the problems related to the use of network cameras and to the sequencing of operations.

Chapter 5 This chapter starts by analyzing the goals of the system. A detailed description of the architecture follows: which technologies are exploited, how the system is deployed, and so on.

Chapter 6 This chapter goes deeper into the system architecture, analyzing implementation problems, possible solutions and, among these, the adopted ones.

Chapter 7 This chapter starts by describing the test cases; from these, the areas in which the system can be improved are deduced. The computation times and the resources consumed by PLaTHEA are also analyzed.

Chapter 8 This chapter concludes the thesis by introducing possible future works aimed at improving the performance of the system and extending its functionality.
Contents

1 Reference Context
  1.1 Introduction to Domotics
    1.1.1 Commercial Platforms for Home Automation
    1.1.2 Architectures for Domotic Systems
  1.2 The SM4All Project
    1.2.1 The Pervasive Layer
    1.2.2 The Composition Layer and the Need of a PLT System
    1.2.3 The User Layer
  1.3 Structure of the Thesis

2 Camera Model and Stereo Vision
  2.1 Camera Pinhole Model and Camera Calibration
    2.1.1 Lens Distortion
    2.1.2 Camera Calibration
  2.2 Stereo Vision
    2.2.1 Triangulation
    2.2.2 Stereo Calibration
    2.2.3 Stereo Rectification

3 A Survey on the State of the Art
  3.1 Introduction to PLT Systems
  3.2 Typical Structure of a Stereo PLT System
    3.2.1 Stereo Computation Module
    3.2.2 Background Modeling and Foreground Segmentation Modules
    3.2.3 Plan View Projection Module
    3.2.4 Tracker Module
  3.3 Face Recognition
    3.3.1 Face Detection
    3.3.2 Face Recognition
  3.4 Projects around the World
    3.4.1 LocON Project
    3.4.2 Gator Tech Smart House Project
    3.4.3 ARGOS Project
    3.4.4 RoboCare Project

4 Our System and Related Works
  4.1 Background Modeling and Foreground Segmentation
    4.1.1 The Background Model
    4.1.2 Foreground Segmentation
    4.1.3 Foreground Refinements
  4.2 Plan View Projection and Tracking
    4.2.1 Localization
    4.2.2 Correspondence
    4.2.3 Refinements
  4.3 Face Recognition
    4.3.1 Notes on Face Detection
  4.4 Tracking and Face Recognition Combined

5 System Requirements and Architecture
  5.1 Overview on System Requirements
  5.2 A Look at the Architecture
    5.2.1 Embedding PLaTHEA
    5.2.2 The Components' Architecture
    5.2.3 The Software Dependencies
  5.3 The Storage
    5.3.1 The Camera Calibration Database
    5.3.2 The Face Database
  5.4 The Elaboration Core
  5.5 The UPnP Device
  5.6 The External Entities
  5.7 Use Cases
    5.7.1 Installation and Configuration
    5.7.2 Run Time Installation Refinements
    5.7.3 The Face Database Construction
    5.7.4 Run Time Use Cases

6 Implementation Details
  6.1 Technological Introduction
  6.2 The Elaboration Core Component
    6.2.1 Video Acquisition and Synchronization
    6.2.2 The Elaboration and the Face Recognition Threads
  6.3 The UPnP Device
    6.3.1 The UPnP Device Descriptor

7 Tests and Performance Analysis
  7.1 Tests on the PLT Sub-system
    7.1.1 Test Environment
    7.1.2 Test Results
  7.2 Tests on the Face Recognition Sub-system
  7.3 Computational Costs

8 Conclusions and Future Works
  8.1 Considerations on Vision Systems
  8.2 Future Works
Chapter 1

Reference Context

In this chapter we'll introduce the concept of domotics and we'll explain how the EU project SM4All contributes to the development of this area of research. Then we'll give a brief introduction to our PLaTHEA system and we'll explain its contribution to the project. Finally, in the last section, a brief sketch is given for each chapter.
Contents
  1.1 Introduction to Domotics
    1.1.1 Commercial Platforms for Home Automation
    1.1.2 Architectures for Domotic Systems
  1.2 The SM4All Project
    1.2.1 The Pervasive Layer
    1.2.2 The Composition Layer and the Need of a PLT System
    1.2.3 The User Layer
  1.3 Structure of the Thesis
1.1 Introduction to Domotics

Home automation (also called domotics) designates an emerging practice of increased automation of household appliances and features in residential dwellings, particularly through electronic means that allow for things impracticable, overly expensive or simply not possible in past decades [1]. The term is sometimes confused with "building automation", which refers to industrial settings and the automatic or semi-automatic control of lighting, climate, doors and windows, and security and surveillance systems; building automation features, however, are only a subset of those provided by a full-fledged home automation system.
Compared to a simple building automation system, a home automation system may provide the following features:

• control of home entertainment systems such as home cinema, hi-fi audio, surveillance and so on;

• use of domestic robots;

• elderly assistance;

• "scenes" for different events such as dinners, parties, and so on.

The last feature is particularly interesting. In our vision, "home behaviour" should be customizable. Let's consider some simple scenarios.
Scenario 1.1 Mario is an early technology adopter; he loves all the comforts offered by modern technology. Mario comes home late at night and all he wants is to watch a movie in the living room, where he placed his newly purchased home theater, with soft lights and a comfortable temperature of 25 ◦C. The domotic system may detect the arrival of Mario in the living room and then direct the light dimmers and the heating system to fulfill all of Mario's desires.
Scenario 1.2 Mario is also an apprehensive father, and he doesn't want his little daughter, Marzia, to watch certain channels in the living room without him (these channels are known to broadcast violent movies even in the daytime). The domotic system may detect the presence of Marzia in the living room without her father and then inhibit the selection of those channels.

So, a home automation system should make it possible for the users to define scenes of common life and to instruct the system itself to react to these scenes, steering all the enabled appliances and subsystems in a specific way.
The deployment of such a system requires a way for different subsystems (software and hardware) to communicate; this has driven the emergence of a set of standards, mainly in the area of wiring and communication. Furthermore, the emergence first of building automation and lately of home automation has led to an evolution in domestic wiring practice (for air-conditioning systems, for example, but also for Ethernet wiring).
Fig. 1.1: A typical domestic patch panel.
Another aspect has to be taken into account: wiring is hard to retrofit into an existing house. One solution to the problem is to embed data signals in power lines, but more frequently wireless technologies do the lion's share. Wireless is also widely employed in sensor networks (Wi-Fi (IEEE 802.11), Bluetooth (IEEE 802.15.1), ZigBee (IEEE 802.15.4) and so on).
1.1.1 Commercial Platforms for Home Automation

Many companies (some of which are already active in the area of building automation) are engaged in the development of platforms which form the basis for the design of a full-fledged home automation system.
The first example of such a system is MyHome by BTicino¹. This system is based on a proprietary bus (the SCS bus) which acts as both data bus and power supply. BTicino produces a wide variety of devices (light switches, dimmers, actuators, cameras and so on) which communicate using this bus. BTicino has also designed a family of web servers (the term is perhaps inappropriate) through which, using the OpenWebNet protocol, an external system can interact with the devices on the bus. Driving the system via software makes it possible to obtain even complex compound services. The main disadvantage of the system is that the SCS bus has a low data capacity (due to its combined role as power supply), which makes it obviously impossible to acquire high-bandwidth data from multiple devices; we will see later that this makes it impossible to design our PLaTHEA system on top of MyHome (during the thesis work we spent some days on the feasibility of this approach).

¹ See www.myhome-bticino.it for information about products and www.myopen-bticino.it for the technical forum.
The second example is CHORUS by Gewiss². The system is based on a two-wire bus (the KNX bus) similar to BTicino's SCS. Compared with BTicino, Gewiss provides a series of video surveillance devices that communicate with the CHORUS MASTER using high-capacity buses (Ethernet, for example); this is very similar to the approach chosen in the design of our PLaTHEA system.

² Visit http://chorus.gewiss.com for more details.
Other popular platforms are:

• BY-ME by Vimar. It's very similar to CHORUS (it uses the same KNX bus) and to MyHome;

• EasyDom. EasyDom is based on a backbone bus that allows for the creation of a domotic system able to incorporate ordinary electrical equipment. Once the devices are wired on the bus it is possible to handle multiple functions with single commands, and the interaction is driven by a house plan.
Even though the described platforms go in the direction of home automation, the design of modern home automation systems goes several steps beyond: in such architectures the mentioned platforms are only small bricks of huge "cathedrals of services".
1.1.2 Architectures for Domotic Systems

So far we have had a look at what a home automation system could do for the end user. Now we need the "engineer's point of view" on domotic systems; that is, we want to answer the following questions:

• What are the basic elements of a home automation system?

• How can these elements work together to achieve the system's goals?
Roughly speaking, the basic elements of a domotic system are:

• sensors. They are devices that measure a physical quantity and convert it into a signal which can be read by an observer or by an instrument;

• actuators. They are mechanical devices for moving or controlling a mechanism or system;

• hardware and software controllers. They coordinate sensors and actuators (and, in a layered architecture, other controllers too) to achieve a specific goal.
The presence of controllers is not strictly mandatory; recent advances in the field of sensor networks make it possible to distribute the system's intelligence across all the sensors. Nevertheless, the design of a complex home automation system suggests the use of a layered architecture that would benefit from the use of controllers (see section 1.2 about the SM4All project for an example).
As mentioned earlier, there is a plethora of communication standards and protocols in the home automation field: some devices (a device may act as a sensor as well as an actuator) may have a Bluetooth interface, other devices may expose a UPnP interface, thermal sensors may be arranged in a ZigBee sensor network, and so on. So a major problem to take into account is the integration of different subsystems. Consider Scenario 1.1, for example: the presence of Mario in the living room could be signaled by a recognition system (as you'll see soon, this is not a random example) in a publish-subscribe fashion; the temperature could be provided by a sensor network; the dimmer (a perfect example of a pure actuator) could expose a simple HTTP-like interface on a TCP connection (as is the case with the BTicino OpenWebNet protocol). The integration of subsystems (we can call it services integration) is not only a technological matter but also a semantic one, so the study of service composition is relevant to the system's design.
1.2 The SM4All Project

The SM4All (Smart hoMes for All) project aims at studying and developing an innovative middleware platform for the inter-working of smart embedded services in immersive and person-centric environments, through the use of composability and semantic techniques, in order to guarantee dynamicity, dependability and scalability, while preserving the privacy and security of the platform and its users. This is applied to the challenging scenario of private homes and buildings in the presence of users with different abilities and needs (e.g., young able-bodied, aged, and disabled) [26].

In section 1.1.2 we introduced the concept of a layered architecture for home automation systems. The SM4All system is constituted by a set of logical components arranged in three distinct layers [6]: the Pervasive Layer, the Composition Layer and the User Layer³.
1.2.1 The Pervasive Layer

The main goal of the Pervasive Layer is to seamlessly integrate heterogeneous networks and devices (sensors and actuators) into the SM4All middleware and to provide the devices' services and information through a common and standard abstraction interface, no matter which underlying technology the device is based on, as shown in Figure 1.2. Thus, the Pervasive Layer is responsible for integrating into the middleware different devices, which may use different communication technologies and protocols for interaction, and for providing their services to the upper layers of the middleware through a common interface.

³ Our analysis of these layers will necessarily be short. For a more detailed description visit the SM4All website at http://www.sm4all-project.eu.
Fig. 1.2: SM4All Pervasive Layer: integration of heterogeneous networks and devices.
The main requirement is the proper handling of communication among home devices and user applications. The SM4All Pervasive Layer must automatically discover heterogeneous devices. Likewise, it must be able to add new devices to the system, or remove them when necessary.

Added devices are registered into the system, which identifies them and provides information about their description and functionality. The Pervasive Layer continuously scans the network in order to know the latest status of the devices. Also, Plug and Play (UPnP, UPnP AV) must be supported as the communication standard for discovering and controlling the home devices (a common communication protocol between the devices and the home system). Data types are managed transparently, so that all basic data types can be transferred between devices with different bit sizes.
The Pervasive Layer acts as the middleware of the system, abstracting the communication with the sensors and actuators of the home network and their control. The Pervasive Layer is characterized by the following system requirements:

• scalability: the Pervasive Layer should be extendable, providing the capacity to increase the system services and to manage new devices;

• interoperability: it must support connection to and control of UPnP and non-UPnP devices;

• robustness of services: it should ensure the correct functioning of the services; therefore, when a device failure occurs, it is detected;

• connection services: the middleware provides automatic connection with UPnP and non-UPnP devices, extracting their descriptions;

• device control services: the middleware autonomously controls the devices according to the services available for each of them;

• querying and feeding the system repositories: the middleware provides data to and retrieves data from the repositories of the system;

• communication with the Composition Layer: the middleware performs actions following the Composition Layer's directives.
The Pervasive Layer also provides a common interface through which the upper layers of the middleware interact with devices, following two different patterns (a minimal sketch of this abstraction is given below):

• pull: this pattern allows the upper layers to invoke services from devices using a common interface;

• push: this pattern allows the upper layers to receive events produced by devices and their services. For example, a temperature sensor could report temperature changes to the upper layers of the middleware.
PLaTHEA is seen by the system as a UPnP-enabled device, so it has no need of a proxy container. This choice makes PLaTHEA independent of the SM4All system, yet easily integrated into it.
1.2.2 The Composition Layer and the Need of a PLT System

Figure 1.3 shows the main components of the SM4All Composition Layer. The goal of this layer is to execute a complex task invoked by the User Layer using the interaction methods provided by the Pervasive Layer.
Fig. 1.3: SM4All Composition Layer: components architecture
We'll analyze in detail the location component and the context-aware component.

The location component in the Composition Layer serves as a special and fundamental type of context information provider. The component is dedicated to processing raw location data of objects and analyzing the location relationships between objects. It fulfills the following requirements:

• the location component is able to be aware of the locations of the users in the house at any given moment;

• the location component's communication layer must be characterized in terms of scalability and interoperability;

• the location component is able to manage and associate users, bounded locations (rooms in a home environment) and time information;

• the location component feeds the time-spatial context associated with users, in order to accomplish the personalization of services;

• the location component outputs binary relationships between the user and nearby devices, e.g. toLeft, toRight, behind, inFront.
At the initial stage of the project, several technologies were proposed to implement the location services. After an evaluation of the technologies, SM4All selected the approach of Video Tracking and RFID (Radio Frequency IDentification).

Our PLaTHEA system (finally, we have the pleasure) follows the Video Tracking approach to give the positions (in the context of a specific room) of recognized users (mobile agents whose identity PLaTHEA is able to recognize) and of unrecognized mobile agents (humans as well as robots). To do this, it adds a face recognition system to the standard features of a PLT (People Localization and Tracking) system; so for each tracked object PLaTHEA adds, if possible, information about the identity of the object (of course, the tracked object must be human and must have trained the system with his or her facial biometric information; that is, a user context is defined for it). The choice of a Video Tracking approach for this task is natural: the objects to be tracked don't have to wear any marker⁴ that would limit humans' freedom and naturalness⁵. On the other hand, the Video Tracking approach suffers in dark rooms; SM4All intends to solve this problem using infrared cameras, but this feature is out of the scope of this thesis, so we limit our work to rooms with enough light to allow the cameras' light compensation.
In our vision, every room in the house will have a distinct installation of PLaTHEA. This multitude of PLaTHEA "devices"⁶ communicates with the Pervasive Layer via UPnP, and the Composition Layer then does its work. A graphical presentation of this is given in Fig. 1.4.

⁴ On the other hand, such a solution doesn't allow one to know the position of static elements like desks, chairs, keys and so on; in this field the use of markers like RFID is preferable.

⁵ This solution is nevertheless often used in practice. Bill Gates' futuristic home solves the problem of tracking by providing each guest with an active bracelet.

⁶ PLaTHEA is a software module running on a Windows machine. In the future we want to study an implementation on a lightweight operating system such as Windows Mobile or Embedded Linux.

Fig. 1.4: PLaTHEA systems as seen by the SM4All system.

In the SM4All vision, the concept of "position" is deeply linked with the concept of "context". The context can be categorized into:

• System context. System context refers to the hardware being used, the bandwidth, and the different devices available and accessible by the user in the smart home;
• User context. User context is at a central position in context management. User context is collected in the form of user profiles and represents the users' preferences with respect to the enriched environment built by the SM4All architecture; user context information flows through the system, serving as the driving force of context-aware services and of the goals of planning composite services;

• Physical context. Physical context refers to information related to the physical environment where the user is, such as location, temperature, noise, light, etc.
So the context-aware component and the location component are fundamental to achieving the main goals of a home automation system as defined in section 1.1.
1.2.3 The User Layer

Finally, we are at the highest level of the hierarchy: the SM4All User Layer. The goals of this layer are easily outlined by its main components:

• Abstract Adaptive Interface. This module interacts with the Composition Layer, gathering information about users, services, and status, and provides the concrete user interfaces (UIs) with a set of operations, partially ordered, together with visual information, e.g., icons. More in detail, the AAI analyzes all the available services, collecting from each of them the set of initial operations and the associated icons; these constitute the initial set of available operations. When a service is started, the initial set of available operations for that service changes and the AAI, as soon as the Composition Layer notifies it that the status has changed, updates the overall operation set, computing a new partial order. It is the interface used by the other User Layer interfaces to interact with the Composition Layer;

• Brain Computer Interface. Brain computer interface (BCI) technology allows a direct connection between brain and computer without any muscular activity required, and thus it offers a unique opportunity to enhance and/or to restore communication and actions in the external world for people with severe motor disabilities;

• HTTP Interface. The SM4All project proposes standard HTTP interfaces to provide rich interaction for the users at home. These interactive web applications allow web pages displayed in standard web browsers to present responsive user interfaces that approach the features expected of free-standing applications;

• Remote Interface. The SM4All system allows for remote interaction with the home automation features.
1.3 Structure of the Thesis

To conclude this introductory chapter we give a short summary of each of the chapters.

Chapter 2 In this chapter we introduce the background knowledge about camera models and stereo vision that is useful for the rest of the thesis, for those who are not familiar with these concepts and want to make their life easier.
Chapter 3 In this chapter we analyze the state of the art in the field of PLT systems and face recognition systems. This is a continuously evolving area of research. The two macro-problems will be decomposed and, for each aspect, the pros and cons of algorithms and practices will be given. The chapter ends with a gallery of research projects in the area of PLT systems.

Chapter 4 In this chapter we describe how PLaTHEA uses and combines the techniques introduced in Chapter 3 and which original solutions we have devised for other problems.

Chapter 5 This chapter starts by analyzing the goals of the system. A detailed description of the architecture follows: what technologies are used, how the system is deployed and so on.

Chapter 6 This chapter deepens the system architecture, analyzing implementation problems, possible solutions and, among these, the adopted ones (with motivations).

Chapter 7 This chapter starts by describing the test cases and inferring from them the system's drawbacks. PLaTHEA's performance is also analyzed.

Chapter 8 This chapter ends the thesis by introducing possible future works to enhance the system.
Chapter 2

Camera Model and Stereo Vision

Contents
  2.1 Camera Pinhole Model and Camera Calibration
    2.1.1 Lens Distortion
    2.1.2 Camera Calibration
  2.2 Stereo Vision
    2.2.1 Triangulation
    2.2.2 Stereo Calibration
    2.2.3 Stereo Rectification
2.1 Camera Pinhole Model and Camera Calibration
We begin by looking at the simplest model of a camera, the pinhole camera
model. In this simple model, light is envisioned as entering from the scene
or a distant object, but only a single ray enters from any particular point.
In a physical pinhole camera, this point is then “projected” onto an imaging
surface. As a result, the image on this image plane (also called the projective
plane or imager) is always in focus, and the size of the image relative to the
distant object is given by a single parameter of the camera: its focal length.
For our idealized pinhole camera, the distance from the pinhole aperture to
the screen is precisely the focal length. This is shown in Figure 2.1, where f
is the focal length of the camera, (X, Y, Z) are the object’s coordinates with
respect to the so-called center of projection, and (x, y, f) are the object's
image coordinates on the imaging plane.
Fig. 2.1: A point Q = (X, Y, Z) is projected onto the image plane by the ray passing through the center of projection, and the resulting point on the image is q = (x, y, f).
We can see by similar triangles that:

$$ \frac{x}{f} = \frac{X}{Z}, \qquad \frac{y}{f} = \frac{Y}{Z} \tag{2.1} $$
The point at the intersection of the image plane and the optical axis is
referred to as the principal point.
You might think that the principal point is equivalent to the center of the imager; yet this would imply that some guy with tweezers and a tube of glue was able to attach the imager in your camera to micron accuracy. In fact, the center of the chip is usually not on the optical axis. We thus introduce two new parameters, cx and cy, to model a possible displacement (away from the optic axis) of the center of coordinates on the projection screen. The result is a relatively simple model in which a point Q in the physical world, whose coordinates are (X, Y, Z), is projected onto the screen at some pixel location given by (x_screen, y_screen) in accordance with the following equations:

$$ x_{\text{screen}} = f_x \left( \frac{X}{Z} \right) + c_x, \qquad y_{\text{screen}} = f_y \left( \frac{Y}{Z} \right) + c_y \tag{2.2} $$
Note that we have introduced two different focal lengths; the reason for
this is that the individual pixels on a typical low-cost imager are rectangular
rather than square. The focal length fx (for example) is actually the product
of the physical focal length of the lens and the size sx of the individual
imager elements (this should make sense because sx has units of pixels per
millimeter while F has units of millimeters, which means that fx is in the
required units of pixels). Of course, similar statements hold for fy and sy.
It is important to keep in mind, though, that sx and sy cannot be measured
directly via any camera calibration process, and neither is the physical focal
length F directly measurable. Only the combinations fx = Fsx and fy = Fsy
can be derived without actually dismantling the camera and measuring its
components directly.
The relation that maps the points Q in the physical world with coor-
dinates (X, Y, Z) to the points on the projection screen with coordinates
(xscreen, yscreen) is called a projective transform. When working with such
transforms, it is convenient to use what are known as homogeneous coor-
dinates. The homogeneous coordinates associated with a point in a projec-
tive space of dimension n are typically expressed as an (n + 1)-dimensional
vector, with the additional restriction that any two points whose values are
proportional are equivalent. In our case, the image plane is the projective
space and it has two dimensions, so we will represent points on that plane
as three-dimensional vectors q = (x, y, w). Recalling that all points having
proportional values in the projective space are equivalent, we can recover the
actual pixel coordinates by dividing through by w. This allows us to arrange
the parameters that define our camera (i.e., fx, fy, cx, and cy) into a single
3-by-3 matrix, which we will call the camera intrinsics matrix:
$$ q = MQ, \quad \text{where } q = \begin{bmatrix} x \\ y \\ w \end{bmatrix}, \quad M = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \quad Q = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \tag{2.3} $$
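As a small worked example of eq. 2.3 (with made-up intrinsics), the homogeneous result (x, y, w) is divided by w = Z to recover the pixel coordinates of eq. 2.2:

```cpp
#include <iostream>
#include <opencv2/opencv.hpp>

int main() {
    // Made-up intrinsics: fx = 500, fy = 520, cx = 320, cy = 240 (pixels).
    cv::Matx33d M(500,   0, 320,
                    0, 520, 240,
                    0,   0,   1);
    cv::Vec3d Q(0.4, 0.1, 2.0);       // a point 2 m in front of the camera
    cv::Vec3d q = M * Q;              // homogeneous image point (x, y, w)
    std::cout << q[0] / q[2] << ", " << q[1] / q[2] << "\n";
    // prints 420, 266: 500 * 0.4 / 2 + 320 and 520 * 0.1 / 2 + 240
    return 0;
}
```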
2.1.1 Lens Distortion

In theory, it is possible to define a lens that introduces no distortions. In practice, however, no lens is perfect. This is mainly for reasons of manufacturing; it is much easier to make a "spherical" lens than to make a more mathematically ideal "parabolic" lens. It is also difficult to mechanically align the lens and imager exactly. Here we describe the two main lens distortions and how to model them¹. Radial distortions arise as a result of the shape of the lens, whereas tangential distortions arise from the assembly process of the camera as a whole.

¹ The approach to modeling lens distortion taken here derives mostly from [7].
We start with radial distortion. The lenses of real cameras often noticeably distort the location of pixels near the edges of the imager. This bulging phenomenon is the source of the "fish-eye" effect. With some lenses, rays farther from the center of the lens are bent more than those closer in. A typical inexpensive lens is, in effect, stronger than it ought to be as you get farther from the center. Radial distortion is particularly noticeable in cheap web cameras but less apparent in high-end cameras, where a lot of effort is put into fancy lens systems that minimize radial distortion.

For radial distortions, the distortion is 0 at the (optical) center of the imager and increases as we move toward the periphery. In practice, this distortion is small and can be characterized by the first few terms of a Taylor series expansion around r = 0. For cheap web cameras, we generally use the first two such terms, the first of which is conventionally called k1 and the second k2. For highly distorted cameras such as fish-eye lenses we can use a third radial distortion term, k3. In general, the radial location of a point on the imager will be rescaled according to the following equations:

$$ x_{\text{corrected}} = x_{\text{screen}} \, (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) $$
$$ y_{\text{corrected}} = y_{\text{screen}} \, (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) $$
The second-largest common distortion is tangential distortion. This distortion is due to manufacturing defects resulting from the lens not being exactly parallel to the imaging plane. Tangential distortion is minimally characterized by two additional parameters, p1 and p2, such that:

$$ x'_{\text{corrected}} = x_{\text{corrected}} + \left[ 2 p_1 x_{\text{corrected}} y_{\text{corrected}} + p_2 (r^2 + 2 x_{\text{corrected}}^2) \right] $$
$$ y'_{\text{corrected}} = y_{\text{corrected}} + \left[ p_1 (r^2 + 2 y_{\text{corrected}}^2) + 2 p_2 x_{\text{corrected}} y_{\text{corrected}} \right] $$

Thus, in total, there are five distortion coefficients that we require. They are typically bundled into one distortion vector; this is just a 5-by-1 matrix containing k1, k2, p1, p2, and k3 (in that order).
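Given the intrinsics matrix and this five-element distortion vector, removing both distortions from an image is a one-call operation in OpenCV; a minimal sketch:

```cpp
#include <opencv2/opencv.hpp>

// Removes radial and tangential distortion from a raw frame, given the
// intrinsics matrix and the (k1, k2, p1, p2, k3) distortion vector.
cv::Mat undistortImage(const cv::Mat& raw, const cv::Mat& cameraMatrix,
                       const cv::Mat& distCoeffs) {
    cv::Mat undistorted;
    cv::undistort(raw, undistorted, cameraMatrix, distCoeffs);
    // For video, precomputing the maps once with cv::initUndistortRectifyMap
    // and applying them per frame with cv::remap is cheaper.
    return undistorted;
}
```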
2.1.2 Camera Calibration

In this subsection we analyze what we obtain from the calibration of a single camera. In section 2.2.2 we will analyze how stereo calibration completes the information generated by single-camera calibration.

It's easy to sense that camera calibration gives us a camera intrinsics matrix and a distortion coefficients vector. The camera intrinsics matrix is perhaps the most interesting final result, because it is what allows us to transform from 3D coordinates to the image's 2D coordinates. We can also use the camera matrix to do the reverse operation, but in this case we can only compute a line in the three-dimensional world to which a given image point must correspond.

The mathematics behind camera calibration is out of the scope of this thesis. For those interested, the real "best-seller" is [27]. The OpenCV² library (the one used for the implementation of PLaTHEA) uses the method described in [32]. We will return to camera calibration in the following chapters; for now we say that calibration is done using multiple views of a constant pattern (in our case a chessboard with a known texel side).

To conclude, if we have more than one camera, camera calibration has to be done for each camera, even if the camera models are identical, due to differences in the manufacturing process.

² OpenCV is an open-source library available at http://opencv.willowgarage.com. A very good guide to this library is [4], by which this chapter is inspired.
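A condensed sketch of this chessboard-based calibration with OpenCV (whose routine implements [32]); the board size and square size are illustrative, and the input views are assumed to be 8-bit images.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Estimates cameraMatrix and distCoeffs from several chessboard views.
void calibrateFromChessboards(const std::vector<cv::Mat>& views,
                              cv::Size boardSize, float squareSize,
                              cv::Mat& cameraMatrix, cv::Mat& distCoeffs) {
    // 3D model of the board: a planar grid of inner corners, Z = 0.
    std::vector<cv::Point3f> corners3d;
    for (int i = 0; i < boardSize.height; ++i)
        for (int j = 0; j < boardSize.width; ++j)
            corners3d.emplace_back(j * squareSize, i * squareSize, 0.0f);

    std::vector<std::vector<cv::Point2f>> imagePoints;
    std::vector<std::vector<cv::Point3f>> objectPoints;
    for (const cv::Mat& view : views) {
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(view, boardSize, corners)) {
            imagePoints.push_back(corners);
            objectPoints.push_back(corners3d);
        }
    }
    std::vector<cv::Mat> rvecs, tvecs;   // per-view extrinsics, unused here
    cv::calibrateCamera(objectPoints, imagePoints, views[0].size(),
                        cameraMatrix, distCoeffs, rvecs, tvecs);
}
```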
2.2 Stereo Vision

We are all familiar with the stereo imaging capability that our eyes give us. To what degree can we emulate this capability in computational systems? Computers accomplish this task by finding correspondences between points that are seen by one imager and the same points as seen by the other imager. With such correspondences and a known baseline separation between the cameras, we can compute the 3D location of the points. Although the search for corresponding points can be computationally expensive, we can use our knowledge of the geometry of the system to narrow down the search space as much as possible. In practice, stereo imaging involves four steps when using two cameras³.
1. Mathematically remove radial and tangential lens distortion; this is
called undistortion and is detailed in section 2.1. The outputs of this
step are undistorted images.
2. Adjust for the angles and distances between cameras, a process called
rectification. The outputs of this step are images that are row-aligned
and rectified.
3. Find the same features in the left and right camera views, a process
known as correspondence. The output of this step is a disparity map,
where the disparities are the differences in x-coordinates on the image
planes of the same feature viewed in the left and right cameras: x_l − x_r.
4. If we know the geometric arrangement of the cameras, then we can turn
the disparity map into distances by triangulation. This step is called
reprojection, and the output is a depth map.
We start with the last step to motivate the first three.
³ Here we give just a high-level understanding. For details, we recommend [8].

2.2.1 Triangulation

Assume that we have a perfectly undistorted, aligned, and measured stereo rig as shown in Figure 2.2: two cameras whose image planes are exactly coplanar with each other, with exactly parallel optical axes (the optical axis
is the ray from the center of projection O through the principal point c and
is also known as the principal ray) that are a known distance apart, and with
equal focal lengths f_l = f_r. Also, assume for now that the principal points c_x^left and c_x^right have been calibrated to have the same pixel coordinates in
their respective left and right images. Please don’t confuse these principal
points with the center of the image. A principal point is where the principal
ray intersects the imaging plane. This intersection depends on the optical
axis of the lens. As we saw in Section 2.1, the image plane is rarely aligned
exactly with the lens and so the center of the imager is almost never exactly
aligned with the principal point.
Moving on, let’s further assume the images are row-aligned and that every
pixel row of one camera aligns exactly with the corresponding row in the other
camera. We will call such a camera arrangement frontal parallel. We will
also assume that we can find a point P in the physical world in the left and
the right image views at p_l and p_r, which will have the respective horizontal coordinates x_l and x_r⁴.
In this simplified case, taking x_l and x_r to be the horizontal positions of the points in the left and right imager (respectively) allows us to show that the depth is inversely proportional to the disparity between these views, where the disparity is defined simply by d = x_l − x_r. This situation is shown in Figure 2.2, where we can easily derive the depth Z by using similar triangles. Referring to the figure, we have:

$$ \frac{T - (x_l - x_r)}{Z - f} = \frac{T}{Z} \;\Rightarrow\; Z = \frac{f\,T}{x_l - x_r} \tag{2.4} $$
⁴ How these coordinates are found is an important matter, because this is a computationally expensive operation. We will analyze this aspect in the implementation chapter.

Since depth is inversely proportional to disparity, there is obviously a nonlinear relationship between these two terms. When disparity is near 0, small disparity differences make for large depth differences. When disparity is large, small disparity differences do not change the depth by much. The consequence is that stereo vision systems have high depth resolution only for objects relatively near the camera; the choice of the baseline distance T therefore has great relevance, because a higher value of T allows the system to work at greater distances.
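Plugging illustrative numbers into eq. 2.4 makes this asymmetry concrete. With f = 500 px and T = 0.2 m:

$$ Z(d) = \frac{fT}{d} = \frac{100}{d}\ \text{m}, \qquad Z(50\,\text{px}) = 2\ \text{m}, \quad Z(49\,\text{px}) \approx 2.04\ \text{m}, $$
$$ Z(10\,\text{px}) = 10\ \text{m}, \quad Z(9\,\text{px}) \approx 11.1\ \text{m}. $$

A one-pixel disparity error thus costs about 4 cm at 2 m, but more than a meter at 10 m.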
With this arrangement it is relatively easy to solve for distance. Now we must spend some energy on understanding how we can map a real-world camera setup into a geometry that resembles this ideal arrangement. In the real world, cameras will almost never be exactly aligned in the frontal parallel configuration depicted in Figure 2.2. Instead, we will mathematically find image projections and distortion maps that rectify the left and right images into a frontal parallel arrangement. When designing a stereo rig, it is best to arrange the cameras approximately frontal parallel and as close to horizontally aligned as possible. This physical alignment will make the mathematical transformations more tractable. If your cameras aren't aligned at least approximately, then the resulting mathematical alignment can produce extreme image distortions and so reduce or eliminate the stereo overlap area of the resulting images.
2.2.2 Stereo Calibration

Stereo calibration is the process of computing the geometrical relationship between the two cameras in space. In contrast, stereo rectification is the process of "correcting" the individual images so that they appear as if they had been taken by two cameras with row-aligned image planes (review Figure 2.2). With such a rectification, the optical axes (or principal rays) of the two cameras are parallel, and so we say that they intersect at infinity.
Stereo calibration outputs the following elements (a calibration sketch is given at the end of this list):

• the rotation matrix R and translation vector T between the two cameras (as depicted in Fig. 2.3);

• the essential matrix E. Given a point P, we would like to derive a relation which connects the observed locations p_l and p_r of P on the two imagers. This relationship will turn out to serve as the definition of the essential matrix; that is:

$$ p_r^T E \, p_l = 0 \tag{2.5} $$
Fig. 2.2: With a perfectly undistorted, aligned stereo rig and known correspondence, the depth Z can be found by similar triangles; the principal rays of the imagers begin at the centers of projection O_l and O_r and extend through the principal points of the two image planes at c_x^left and c_x^right.

Fig. 2.3: The essential geometry of stereo imaging is captured by the essential matrix E, which contains all of the information about the translation T and the rotation R, which describe the location of the second camera relative to the first in global coordinates.

Note that E contains nothing intrinsic to the cameras; thus, it relates points to each other in physical or camera coordinates, not pixel coordinates;
• the fundamental matrix F. In practice, we are usually interested in pixel coordinates. In order to find a relationship between a pixel in one image and the corresponding epipolar line in the other image, we have to introduce intrinsic information about the two cameras. Recalling that pixel coordinates are given by q = Mp, and substituting into 2.5, we have:

$$ q_r^T (M_r^{-1})^T E \, M_l^{-1} q_l = q_r^T F q_l = 0 \tag{2.6} $$

In a nutshell: the fundamental matrix F is just like the essential matrix E, except that F operates in image pixel coordinates whereas E operates in physical coordinates.
2.2.3 Stereo Rectification

We want to reproject the image planes of our two cameras so that they reside in the exact same plane, with image rows perfectly aligned in a frontal parallel configuration. We want the image rows between the two cameras to be aligned after rectification so that stereo correspondence (finding the same point in the two different camera views) will be more reliable and computationally tractable⁵.

Using Bouguet's method [3], which presupposes that stereo calibration has already been done, the rectification outputs the following (see the sketch after this list):

• the 3-by-3 row-aligned rectification rotations for the left and right image planes, R_l and R_r;

• the 3-by-4 left and right projection matrices, P_l and P_r;

• the 4-by-4 reprojection matrix Q, which allows a triple of screen coordinates (x_l, y_l, d) to be transformed into camera coordinates.

⁵ Note that reliability and computational efficiency are both enhanced by having to search only one row for a match with a point in the other image.
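A minimal sketch of Bouguet rectification with OpenCV, starting from the stereo calibration results:

```cpp
#include <opencv2/opencv.hpp>

// Computes the rectification rotations Rl, Rr, the projections Pl, Pr and
// the 4x4 reprojection matrix Q from the stereo calibration results.
void rectifyRig(const cv::Mat& Ml, const cv::Mat& Dl,
                const cv::Mat& Mr, const cv::Mat& Dr,
                cv::Size imageSize, const cv::Mat& R, const cv::Mat& T,
                cv::Mat& Rl, cv::Mat& Rr, cv::Mat& Pl, cv::Mat& Pr,
                cv::Mat& Q) {
    cv::stereoRectify(Ml, Dl, Mr, Dr, imageSize, R, T,
                      Rl, Rr, Pl, Pr, Q);
    // cv::initUndistortRectifyMap + cv::remap then build and apply the
    // per-camera undistort-and-rectify lookup maps to each frame.
}
```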
Chapter 3

A Survey on the State of the Art
Contents
  3.1 Introduction to PLT Systems
  3.2 Typical Structure of a Stereo PLT System
    3.2.1 Stereo Computation Module
    3.2.2 Background Modeling and Foreground Segmentation Modules
    3.2.3 Plan View Projection Module
    3.2.4 Tracker Module
  3.3 Face Recognition
    3.3.1 Face Detection
    3.3.2 Face Recognition
  3.4 Projects around the World
    3.4.1 LocON Project
    3.4.2 Gator Tech Smart House Project
    3.4.3 ARGOS Project
    3.4.4 RoboCare Project
3.1 Introduction to PLT systems
With the term People Localization and Tracking (PLT) system we denote a class of systems able to:
• Locate. That is to provide a human’s position in a complex scene;
• Track. That is, the system is able to follow a human's position at successive sampling instants. If at time t the human labeled P is located at p_t and at time t + 1 his new position is p_{t+1}, the system should recognize this and associate both positions with P.
The techniques to perform these tasks belong to two categories:
• Localization and Tracking using markers. In these systems humans wear some kind of marker (for example a bracelet). These markers can emit signals of different kinds (electric, luminous and so on); these signals are received by a specific device that converts them into three-dimensional information;
– Pros. Markers are very useful in dark rooms. They can be used for person recognition too. In the area of virtual reality (where the human wears many markers) they allow obtaining very precise models of the body;
– Cons. Physical and psychological conditioning of humans forced
to wear markers.
• Localization and Tracking without markers. These systems obtain humans' positions using only the sequence of images originated by the video acquisition device(s). This sequence can be produced by a single camera (Monocular Vision System), by two cameras (Stereo Vision System) or by more cameras (Multi-camera Vision System). In the first case we need a 3D human model to obtain three-dimensional information; this factor determines the low precision of this kind of system. In the latter two cases three-dimensional information is inferred from geometrical considerations.
– Pros. Humans don't need to wear any kind of marker; the more cameras we use, the more precise the system is; and it is possible to obtain a person's height easily;
– Cons. Problems with dark rooms and unusual illumination phenomena.
The choice of the direction that PLaTHEA had to follow was easy. Taking into account that transparency is an important issue in SM4All and that two cameras allow very good precision, we have chosen to implement our PLT system as a Stereo Vision System.
3.2 Typical Structure of a Stereo PLT System
As we’ll see in the next chapters, the PLaTHEA PLT elaboration flow
follows a well known model, used for example in [23] and in [10], and shown
in Figure 3.1.
Fig. 3.1: Model for PLT elaboration flow.
In Figure 3.1 thick lines denote links present in the PLT component of
PLaTHEA and dashed lines denote links present in other PLT systems.
In the following subsections we'll study each element of this architecture, analyzing algorithms and techniques to perform the corresponding tasks1. Note that the figure doesn't cover all PLaTHEA issues, but it is a good starting point for its analysis.
3.2.1 Stereo Computation Module
Stereo correspondence, matching a 3D point in the two different camera
views, can be computed only over the visual areas in which the views of the
two cameras overlap. Once again, this is one reason why you will tend to get
better results if you arrange your cameras to be as nearly frontal parallel as
possible. In [24] the authors give a very good review of the wide world of stereo correspondence algorithms; they identify a series of steps common to these algorithms and, for each algorithm, analyze the choices made at each step.
OpenCV implements a fast and effective block-matching stereo algorithm
that is similar to the one developed by Kurt Konolige [16]; it works by using
small “sum of absolute difference” (SAD) windows to find matching points
between the left and right stereo rectified images. This algorithm finds only
strongly matching (high-texture) points between the two images. Thus, in a
highly textured scene such as might occur outdoors in a forest, every pixel
might have computed depth. In a very low-textured scene, such as an in-
door hallway, very few points might register depth. There are three stages to
the block-matching stereo correspondence algorithm, which works on undis-
torted, rectified stereo image pairs:
• Prefiltering to normalize image brightness to reduce lighting dif-
ferences and to enhance image texture;
• Correspondence search along horizontal epipolar lines using
an SAD window. For each feature in the left image, we search the
corresponding row in the right image for a best match. After rectification, each row is an epipolar line, so the matching location in the right image must be along the same row (same y-coordinate) as in the left image; this matching location can be found if the feature has enough texture to be detectable and if it is not occluded in the right camera's view (see Fig. 3.2);
• Postfiltering to eliminate bad correspondence matches, by means of a uniqueness ratio and a texture threshold.
1 We invite the reader interested in how a disparity map helps in 3D reconstruction to read Chapter 2.
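For illustration, the corresponding calls in OpenCV's modern Python bindings (parameter values are hypothetical; the thesis implementation used the C API of the time):

```python
import cv2

# left and right must be rectified 8-bit grayscale images of equal size.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)  # SAD window side
stereo.setPreFilterCap(31)       # prefiltering: normalize brightness, boost texture
stereo.setTextureThreshold(10)   # postfiltering: discard low-texture matches
stereo.setUniquenessRatio(15)    # postfiltering: discard ambiguous matches
disparity = stereo.compute(left, right)  # fixed-point map, 16x the true disparity
```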
Fig. 3.2: Stereo correspondence starts by assigning point matches between corresponding rows in the left and right images: left and right images of a lamp (upper panel); an enlargement of a single scan line (middle panel); visualization of the correspondences assigned (lower panel).
OpenCV also implements the graph-cut algorithm described in [15]. This algorithm gives better results than the SAD algorithm, but it is too expensive for real-time processing.
3.2.2 Background Modeling and Foreground Segmentation Modules
How do we define background and foreground? If we’re watching a parking
lot and a car comes in to park, then this car is a new foreground object. But
should it stay foreground forever? How about a trash can that was moved?
It will show up as foreground in two places: the place it was moved to and
the “hole” it was moved from. How do we tell the difference? And again,
how long should the trash can (and its hole) remain foreground? If we are
modeling a dark room and suddenly someone turns on a light, should the
whole room become foreground?
Background modeling is a continuously evolving area of research. In this section we'll analyze a number of methods, showing their pros and their drawbacks. For each method we want to answer a subset of the following questions:
1. How is each background pixel modeled? A simple Gaussian distribution, a Mixture of Gaussians (MOG), chromaticity statistics and so on;
2. Is the pixel model Time Adaptive? Background, obviously, changes, and time adaptivity is an important feature (for example TAPPMOGs - Time Adaptive Per Pixel Mixture of Gaussians);
3. How does it react to sudden illumination changes and hotspots? Some modeling methods suffer from sudden changes in illumination, signaling the whole affected zone as foreground;
4. How does it react to the presence of shadows? Does it signal a shadow as a foreground zone? Does it incorporate shadows into the background after some time?
5. Can it manage a foreground object with color similar to the background? This is a sometimes forgotten detail.
6. How does it manage periodically moving objects? Think of curtains, fans and so on.
In [23] the authors maintain for each pixel a set of three Gaussians relative to: pixel intensity, disparity and borders (the last computed using a Sobel filter2). The use of disparity is particularly useful: foreground segmentation is performed by background subtraction from the current intensity and disparity images. By taking into account both intensity and disparity information, the system is able to correctly deal with shadows, detected as intensity changes but not disparity changes, and foreground objects that have the same color as the background but different disparities. The drawback in the use of disparity is that performance depends on the disparity map's accuracy3.
Very interesting in this paper (and very effective) is the time adaptivity of the system. It is based on an extension of the concept of Pixel Activity first introduced in [9] (which is a TAPPMOG system). While in [9] the pixel activity is computed starting from the difference in pixel intensity between the current frame and the previous one, in [23] the pixel activity is computed by introducing the concepts of vertical and horizontal activities, computed starting from changes in borders. The activity of a pixel is then obtained from the product of the activities of its row and column. So if a human wears a yellow shirt and moves only slightly, even the pixels that stay yellow present a high activity. Per-pixel Gaussian distributions are updated at a rate inversely proportional to pixel activity.
In [11] the authors propose a robust and efficiently computed background
subtraction algorithm that is able to cope with local illumination change
problems, such as shadows and highlights, as well as global illumination
changes. For each pixel the system maintains a Gaussian distribution for each
of the RGB channels; the pixel model is enriched using a running standard
deviation on:
• Brightness Distortion. The brightness distortion is a scalar value
that brings the observed color close to the expected chromaticity line, that is, the distance between the background pixel's luminosity and the current pixel's luminosity;
• Color Distortion. Color distortion is defined as the orthogonal distance between the observed color and the expected chromaticity line. In other words, color distortion is the real chromaticity difference between the current pixel color and the background pixel color.
2 A Sobel filter is a special kind of filter approximating the first-order derivative of intensities along the x and y directions.
3 We'll see that the disparity map computed by the OpenCV library isn't perfect from this point of view. This is the reason why we decided to use a different method for shadow detection.
A graphical description of these quantities is given in Fig. 3.3.
Fig. 3.3: The color model proposed in [11] in the three-dimensional RGB color space; the background image is statistically pixel-wise modeled. E_i represents the expected color of a given i-th pixel and I_i represents the color value of the pixel in the current image. The difference between I_i and E_i is decomposed into brightness (α_i) and chromaticity (CD_i) components.
However, this system is not time adaptive. The background subtraction algorithm experiences some problems with dark foreground pixels that are misclassified as shadow pixels; to prevent this the authors define a small change to the background subtraction, but this change worsens the otherwise excellent shadow detection capability. We'll see, while talking about plan view maps, that it is important that a moving object has almost all of its pixels correctly detected as foreground.
The maths behind this algorithm may seem computationally expensive at first reading, but our experience showed that this is not the case. For example, with respect to [23] there is only a small overhead.
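A sketch of the two distortion measures for a single pixel, ignoring the per-channel variance normalization applied in the original paper:

```python
import numpy as np

def distortions(observed, expected):
    """Decompose the RGB difference into brightness and color distortion.

    'expected' is the background model color E_i, 'observed' the current
    color I_i; alpha scales E_i to the point on the chromaticity line
    closest to I_i, and the color distortion is the residual distance.
    """
    expected = expected.astype(float)
    observed = observed.astype(float)
    alpha = observed @ expected / (expected @ expected)  # brightness distortion
    cd = np.linalg.norm(observed - alpha * expected)     # color distortion
    return alpha, cd

# Example: a darker version of the background color gives alpha < 1, small cd.
alpha, cd = distortions(np.array([90, 80, 70]), np.array([120, 110, 100]))
```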
An interesting background modeling method is described in [14]. The codebook method derives from the world of video compression. A codebook is made up of boxes that grow to cover the common values seen over time at a specific pixel (see Fig. 3.4).
Fig. 3.4: Codebooks are just "boxes" delimiting intensity values: a box is formed to cover a new value and slowly grows to cover nearby values; if values are too far away then a new box is formed.
This codebook method can deal with pixels that change levels dramatically (e.g., pixels in a windblown tree, which might alternately be one of many colors of leaves, or the blue sky beyond that tree); it is thus the only method among those seen in this section that supports periodically moving objects.
In the codebook method of learning a background model, each box is
defined by two thresholds (max and min) over each of the three color axes.
These box boundary thresholds will expand (max getting larger, min get-
ting smaller) if new background samples fall within a learning threshold
(learnHigh and learnLow) above max or below min, respectively. If new
background samples fall outside of the box and its learning thresholds, then
a new box will be started.
This method has presented the following drawbacks during our tests:
• it doesn’t manage shadows and sudden illumination changes;
• the time adaptive version of the algorithm (as described in the paper) needs the definition of multiple learning layers and seems very intricate with respect to others (such as [23]);
• its time complexity seems to rule out application in real-time scenarios.
The work presented in [5] doesn't cope with background learning but only with foreground detection. Particularly interesting is its method of shadow detection, exploiting the properties exhibited by the HSV (Hue, Saturation, Value) color space; in this color space it is proved that a shadow induces a large change in the V component, but limited changes in the H and S components with respect to the pixel's model. In our tests the method proved very effective and very cheap from a computational point of view.
Furthermore, the paper shows the application of the method to a wide variety of contexts, so it's an attractive tool to improve other foreground detection methods that have problems dealing with shadows.
3.2.3 Plan View Projection Module
The motivation behind using plan-view statistics for person tracking begins
with the observation that, in most situations, people usually do not have
significant portions of their bodies above or below those of other people. We
might therefore expect to separate people more easily, and to reduce occlusion
problems, by mounting our cameras overhead and pointing them toward the
ground. However, methods based on monocular video that exploit this idea
usually either must continue to deal with significant occlusion problems in
all but the central portion of the image (particularly if wide-angle lenses
are used), or must accept a somewhat limited field of view (particularly if
the ceiling is relatively low). Furthermore, when mounted overhead, the
cameras used for tracking are not suitable for extracting images of people’s
faces, which are desired in many applications that employ vision-based person
tracking.
With a stereo camera, we can produce orthographically projected, over-
head views of the scene that better separate people than the perspective
images produced by a monocular camera. In addition, we can produce these
images even when the stereo camera is not mounted overhead, but instead at
an oblique angle that maximizes viewing volume and preserves our ability to
see faces. All of this is possible because the depth data produced by a stereo
camera allows for the partial 3D reconstruction of the scene, from which
new images of scene statistics, using arbitrary viewing angles and camera
projection models, can be computed.
Every reliable measurement in a depth image can be back-projected, using
camera calibration information and a perspective projection model, to the
3D scene point responsible for it. By back-projecting all of the depth image
pixels, we create a 3D point cloud representing the portion of the scene
visible to the stereo camera. If we know the direction of the “vertical” axis
of the world - that is, the axis normal to the ground level plane in which we
expect people to be well-separated - we can discretize space into a regular
grid of vertically oriented bins, and then compute statistics of the 3D point
cloud within each bin. A plan-view image contains one pixel for each of
these vertical bins, with the value at the pixel being some statistic of the
3D points within the corresponding bin. This procedure effectively builds an
orthographically projected, overhead view of some property of the 3D scene.
Fig. 3.5 illustrates this idea.
All the methods that use plan view projection choose to image the same statistic of the 3D points within the vertically oriented bins, namely the count of points in each bin. In the resulting images, referred to as plan-
view “occupancy” or “density” maps, people appear as “piles of pixels” that
can be tracked as they move around the ground. Although powerful, this
representation discards virtually all object shape information in the vertical
dimension. In addition, the occupancy map representation of a person will
show a sharp decrease in saliency when the person is partially occluded by
another person or object, as far fewer 3D points corresponding to the person
will be visible to the camera.
To address these shortcomings, we image a second plan-view statistic, namely the height above the ground-level plane of the highest point within each vertical bin. This image, which we refer to as a "plan-view height map", is effectively a simple orthographic rendering of the shape of the 3D point cloud when viewed from overhead.
Fig. 3.5: Concepts important to building plan-view maps.
All the papers using such an approach follow a similar procedure in the construction of the maps. We'll analyze the approach used in [10].
As we saw in Section 2.2.3 it is possible, using the reprojection matrix Q, to obtain from the coordinates (x_screen, y_screen, disparity) a triple of coordinates (X_cam, Y_cam, Z_cam). We will see in the architecture chapter that during the installation phase we need to perform an External Calibration, which gives us a rotation matrix R_world and a translation vector T_world that allow us to obtain (always referring to Fig. 3.5):

[X_W Y_W Z_W]^T = R_world^{-1} ([X_cam Y_cam Z_cam]^T − T_world)    (3.1)
Before building plan-view maps from the 3D point cloud, we must choose a resolution δ_ground with which to quantize 3D space into vertical bins. We
would like this resolution to be small enough to represent the shapes of people in detail, but we must also consider the limitations imposed by the noise and resolution properties of our depth measurement system. In practice, we typically divide the (X_W, Y_W) plane into a square grid with resolution δ_ground of 2-4 cm.
After choosing the bounds (X_min, X_max, Y_min, Y_max) of the ground level area within which we will restrict our attention, we can map 3D point cloud coordinates to their corresponding plan-view image pixel locations as follows:

x_plan = ⌊(X_W − X_min)/δ_ground + 0.5⌋
y_plan = ⌊(Y_W − Y_min)/δ_ground + 0.5⌋
Plan-view height and occupancy maps, denoted as H and O respectively,
can be computed in a single pass through the foreground data. To do so,
we first set all pixels in both maps to zero. Then, for each pixel classified as foreground, we compute its plan-view image location (x_plan, y_plan), Z_W-coordinate, and Z_cam-coordinate. If the Z_W-coordinate is greater than the current height map value H(x_plan, y_plan), and if it does not exceed H_max, where H_max is an estimate of how high a very tall person could reach with his hands if he stood on his toes, we set H(x_plan, y_plan) = Z_W. We next increment the occupancy map value O(x_plan, y_plan) by Z_cam^2/(f_u f_v), which is an estimate of the real area subtended by the foreground image pixel at distance Z_cam from the camera. The plan-view occupancy map will therefore represent the total physical surface area of foreground visible to the camera within each vertical bin of the world space.
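A direct sketch of this single pass, with hypothetical parameter names (f_u and f_v are the focal lengths in pixels):

```python
import numpy as np

def plan_view_maps(points_world, z_cam, fu, fv, x_min, y_min, delta, shape, h_max):
    """Build height (H) and occupancy (O) maps from foreground 3D points.

    points_world: N x 3 array of (Xw, Yw, Zw) back-projected foreground pixels;
    z_cam: the N camera-frame depths of the same pixels.
    """
    H = np.zeros(shape)
    O = np.zeros(shape)
    for (xw, yw, zw), zc in zip(points_world, z_cam):
        x = int((xw - x_min) / delta + 0.5)
        y = int((yw - y_min) / delta + 0.5)
        if not (0 <= x < shape[1] and 0 <= y < shape[0]):
            continue
        if H[y, x] < zw <= h_max:
            H[y, x] = zw                    # highest point seen in this bin
        O[y, x] += zc ** 2 / (fu * fv)      # surface area subtended by the pixel
    return H, O
```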
3.2.4 Tracker Module
The vast majority of PLT systems based on plan view projection use the Kalman filter [13] during tracking. The basic idea behind the Kalman filter is that,
under a strong but reasonable set of assumptions, it will be possible, given
a history of measurements of a system, to build a model for the state of
the system that maximizes the a posteriori probability of those previous
measurements. In addition, we can maximize the a posteriori probability
without keeping a long history of the previous measurements themselves.
Instead, we iteratively update our model of a system’s state and keep only
that model for the next iteration. This greatly simplifies the computational
implications of this method.
Taking its cue from [10], we define the Kalman state of a tracked subject as a three-tuple 〈x, v, S〉, where x is the subject's position, v is the subject's vectorial velocity on the discretized plane and S represents the body configuration of the subject. It is interesting to analyze how S is modeled in different PLT systems.
In [25] S is made up of three templates:
• the H_template is extracted from the plan view height map H and is centered at x, so it represents as much as possible the subject's morphology;
• the O_template is extracted from the plan view occupancy map O; it has the same size as H_template and the same center. It represents the "entity of the presence" of the subject;
• the C_template, the so-called color template, is obtained from the object's pixels in the foreground.
In [10] the authors use only the height and occupancy templates. The operation of the tracking module can be divided into three phases:
1. localization. In this phase the system searches for all the candidate templates in the O and H maps obtained from the current frame;
2. correspondence. In this phase the distance between the objects detected in the previous phase and the tracked objects stored in the database is derived. In [10] and [25] this distance is a weighted sum of the following elements:
• the sum of absolute differences (SAD) of the detected height and occupancy templates with respect to those stored in the Kalman state;
• the difference between the position predicted by the Kalman filter and the position of the candidate object;
• the inverse of the distance of the candidate object from already associated objects: the probability of a correct association decreases if there are other objects in the neighborhood;
• only in [25], a measure of the difference between the detected color template and the color template stored in the Kalman state.
If the shortest distance is under a predefined threshold then the database is updated with the "winner's" data.
3. possible refinements. After the correspondence phase two doubtful situations may arise:
• a candidate detected during the localization phase has not been associated with any tracked object;
• a tracked object has not been associated with any candidate object.
In these situations the system has to take decisions; to this aim it is useful to define a series of states for the objects stored in the database. An example of a state set is {newobject, tracked, merged, lost, stale}. To guide state transitions it is necessary to use some kind of heuristic. This heuristic may be, for example, a Bayesian network (as in [25]) or may take a simpler form; for example, if during the last frame an object was detected near a door of the room and we haven't found any associable candidate, the person has likely exited the room.
3.3 Face Recognition
Over the last ten years or so, face recognition has become a popular area of
research in computer vision and one of the most successful applications of
image analysis and understanding. Because of the nature of the problem, not
only computer science researchers are interested in it, but neuroscientists and
psychologists also. It is the general opinion that advances in computer vision research will provide useful insights to neuroscientists and psychologists into how the human brain works, and vice versa.
A general statement of the face recognition problem (in computer vision)
can be formulated as follows: Given still or video images of a scene, identify
or verify one or more persons in the scene using a stored database of faces.
Research directions (according to Face Recognition Vendor Test - FRVT
2002):
• recognition from outdoor facial images;
• recognition from non-frontal facial images;
• recognition at low false accept/alarm rates;
• understanding why males are easier to recognize than females;
• greater understanding of the effects of demographic factors on perfor-
mance;
• development of better statistical methods for understanding perfor-
mance;
• development of improved models for predicting identification performance on very large galleries;
• effect of algorithm and system training on covariate performance;
• integration of morphable models into face recognition performance.
The literature in this area of research is really wide; a good starting reference is [31], which analyzes not only the face recognition problem but also the related problem of face detection. The face recognition and face detection problems fall in the area known as machine learning.
The goal of machine learning (ML) is to turn data into information. After
learning from a collection of data, we want a machine to be able to answer
questions about the data: What other data is most similar to this data? Is
there a face in the image?
Machine learning works on data such as temperature values, stock prices,
color intensities, and so on. The data is often preprocessed into features.
We might, for example, take a database of 10000 face images, run an edge
detector on the faces, and then collect features such as edge direction, edge
strength, and offset from face center for each face. We might obtain 500 such
values per face or a feature vector of 500 entries. We could then use machine
learning techniques to construct some kind of model from this collected data.
If we only want to see how faces fall into different groups (wide, narrow, etc.),
then a clustering algorithm would be the appropriate choice. If we want to
learn to predict the age of a person from (say) the pattern of edges detected on
his or her face, then a classifier algorithm would be appropriate. To meet
our goals, machine learning algorithms analyze our collected features and
adjust weights, thresholds, and other parameters to maximize performance
according to those goals. This process of parameter adjustment to meet a
goal is what we mean by the term learning.
Now we want to look more closely at the difference between clustering and classifier algorithms. Data sometimes has no labels; we might just want to see what
kinds of groups the faces settle into based on edge information. Sometimes
the data has labels, such as age. What this means is that machine learning
data may be supervised (i.e., may utilize a teaching “signal” or “label” that
goes with the data feature vectors). If the data vectors are unlabeled then
the machine learning is unsupervised.
Supervised learning can be categorical, such as learning to associate a
name to a face, or the data can have numeric or ordered labels, such as
age. When the data has names (categories) as labels, we say we are doing
classification. When the data is numeric, we say we are doing regression:
trying to fit a numeric output given some categorical or numeric input data.
In contrast, often we don’t have labels for our data and are interested in
seeing whether the data falls naturally into groups. The algorithms for such
unsupervised learning are called clustering algorithms. In this situation, the
goal is to group unlabeled data vectors that are “close” (in some predeter-
mined or possibly even some learned sense). We might just want to see how
faces are distributed: Do they form clumps of thin, wide, long, or short faces?
If we’re looking at cancer data, do some cancers cluster into groups having
different chemical signals? Unsupervised clustered data is also often used to
form a feature vector for a higher-level supervised classifier. We might first
cluster faces into face types (wide, narrow, long, short) and then use that as
an input, perhaps with other data such as average vocal frequency, to predict
the gender of a person.
3.3.1 Face Detection
The classifier used in PLaTHEA for face detection is the Haar classifier
that falls in the category of boosted rejection cascade. OpenCV library
implements a version of the Haar classifier technique for face detection first
developed by Paul Viola and Michael Jones and commonly known as the
Viola-Jones detector [30].
This face detector is a supervised classifier. We typically present image
patches (equalized in size and histogram) to the classifier, which are then
labeled as containing (or not containing) the object of interest, which for this
classifier is most commonly a face. The Viola-Jones detector uses a rejection
cascade of nodes, where each node is a multitree classifier designed to have
high (say, 99.9%) detection rate (low false negatives, or missed faces) at the
cost of a low (near 50%) rejection rate (high false positives, or “nonfaces”
wrongly classified). For each node, a “not in class” result at any stage of the
cascade terminates the computation, and the algorithm then declares that
no face exists at that location. Thus, true class detection is declared only if
the computation makes it through the entire cascade. For instances where
the true class is rare (e.g., a face in a picture), rejection cascades can greatly
reduce total computation because most of the regions being searched for a
face terminate quickly in a nonclass decision (see Fig. 3.6).
For the Viola-Jones rejection cascade, the weak classifiers that it boosts in
each node are decision trees that often are only one level deep (i.e., “decision
stumps”). A decision stump is allowed just one decision of the following form:
“Is the value v of a particular feature f above or below some threshold t”;
then, for example, a “yes” indicates face and a “no” indicates no face.
The Haar-like features used by the classifier are shown in Fig. 3.7. At all
scales, these features form the “raw material” that will be used by the boosted
classifiers. They are rapidly computed from the integral image representing
the original grayscale image; given a grayscale image G, the integral image I is given by:

I(X, Y) = Σ_{x≤X} Σ_{y≤Y} G(x, y)    (3.2)

Fig. 3.6: Rejection cascade used in the Viola-Jones classifier: each node represents a multitree boosted classifier ensemble tuned to rarely miss a true face while rejecting a possibly small fraction of nonfaces; however, almost all nonfaces have been rejected by the last node, leaving only true faces.
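As a sketch, equation 3.2 can be computed with two cumulative sums (OpenCV's cv2.integral does the same, adding a leading row and column of zeros); the constant-time box sum below is what makes Haar-like features cheap to evaluate at every scale:

```python
import numpy as np

def integral_image(gray):
    """I(X, Y) = sum of G(x, y) for all x <= X, y <= Y."""
    return gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def box_sum(I, x0, y0, x1, y1):
    """Sum over any rectangle in constant time via four lookups."""
    total = I[y1, x1]
    if x0 > 0: total -= I[y1, x0 - 1]
    if y0 > 0: total -= I[y0 - 1, x1]
    if x0 > 0 and y0 > 0: total += I[y0 - 1, x0 - 1]
    return total
```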
3.3.2 Face Recognition
Many approaches exist for face recognition4. In this section we'll give a brief overview of some face recognition algorithms.
Many methods of face recognition have been proposed during the past 30
years. Face recognition is such a challenging yet interesting problem that it
has attracted researchers who have different backgrounds: psychology, pat-
tern recognition, neural networks, computer vision, and computer graphics.
It is due to this fact that the literature on face recognition is vast and di-
verse. Often, a single system involves techniques motivated by different principles.
4 The face recognition field is wide. A very complete guide to face recognition is at http://www.face-rec.org
Fig. 3.7: Haar-like features (the rectangular and rotated regions are easily calculated from the integral image): in this diagrammatic representation of the wavelets, the light region is interpreted as “add that area” and the dark region as “subtract that area”.
The usage of a mixture of techniques makes it difficult to classify these
systems based purely on what types of techniques they use for feature rep-
resentation or classification. To have a clear and high-level categorization,
we instead follow a guideline suggested by the psychological study of how
humans use holistic and local features. Specifically, we have the following
categorization:
• Holistic matching methods. These methods use the whole face
region as the raw input to a recognition system. One of the most
widely used representations of the face region is eigenfaces, which are
based on principal component analysis (in this category there is a real “best-seller”, [28]);
• Feature-based (structural) matching methods. Typically, in these
methods, local features such as the eyes, nose, and mouth are first
extracted and their locations and local statistics (geometric and/or
appearance) are fed into a structural classifier. One example of this
category is the Hidden Markov Model - HMM [20];
• Hybrid methods. Just as the human perception system uses both
local features and the whole face region to recognize a face, a machine
recognition system should use both. One can argue that these methods
could potentially offer the best of the two types of methods.
One interesting feature-based matching method is based on the Scale-Invariant Feature Transform - SIFT, introduced by Lowe in [18] (the author had previously defined a method for feature matching in [2]).
The SIFT method has proved very powerful with rigid objects. The use of SIFT in face recognition has been investigated in [12] and [19].
3.4 Projects around the world
3.4.1 LocON Project
LocON aims to integrate embedded location systems and embedded wireless communication systems in a standardised way, developing a new platform to control large scale infrastructures, like airports, more efficiently, securely, robustly and flexibly.
A set of PLT systems based on different technologies has been developed in the context of this European project. All these PLT systems make use of markers for localization and tracking. The candidate localization systems in the LocON project are the following:
• Global Positioning System - GPS. It provides reliable positioning, nav-
igation, and timing services to worldwide users on a continuous basis
in all weather, day and night, anywhere on or near the Earth;
• Radio Frequency IDentification - RFID. It’s the use of an object (typ-
ically referred to as an RFID tag) applied to or incorporated into a
product, animal, or person for the purpose of identification and track-
ing using radio waves. The range of the system depends on the tag's type: passive tags don't have a power supply on board and their energy is obtained by inductive coupling with the RFID reader, so they have limited range; active tags have a wider range. Recently some manufacturers have introduced the RFID e-passport;
• Ultra WideBand - UWB. It is a radio technology that can be used at
very low energy levels for short-range high-bandwidth communications
by using a large portion of the radio spectrum. UWB has traditional ap-
plications in non-cooperative radar imaging. Most recent applications
target sensor data collection, precision locating and tracking applica-
tions;
• Wi-Fi. Based on IEEE 802.11 family ad-hoc and infrastructured net-
works;
• Local Positioning Radar - LPR. A technology developed by Symeo which uses radio signals that are not susceptible to harsh ambient conditions. Symeo equipment can be deployed indoors and outdoors under vibrations, extreme temperatures, dust and harsh weather conditions.
3.4.2 Gator Tech Smart House Project
The PLT system of this project combines two main components:
• In the localization area, the Gator Tech Smart House has embedded
sensors in the floor to determine user location. This solution is not
intrusive and guarantees the desired transparency of a pervasive com-
puting environment;
• The use of RFID technology in combination with the sensor floor allows identity detection.
3.4.3 ARGOS project
ARGOS project (Automatic Remote Grand Canal Observation System) is
a video-surveillance system for boat traffic monitoring, measurement and
management along the Grand Canal of Venice. This system answers the specific requirements of the boat navigation rules in Venice while providing a combined unified view of the whole Grand Canal waterway. Such features far exceed the performance of any commercially available product.
Therefore, specific software has been developed, based on the integration of advanced automated image analysis techniques.
Obviously the ARGOS project is not a PLT system (we could define it a BOAT Localization and Tracking system), but it's a very interesting project because its context raises several problems: background modeling is a very difficult task due to the water (which is a periodically moving background entity) and due to the length of the monitored area (the Grand Canal in Venice). In fact the ARGOS system controls a waterway of about 4 km in length and 80 to 150 meters in width, through 14 observation points (Survey Cells). The system is based on the use of groups of IR/VIS cameras, installed just below the roofs of several buildings facing the Grand Canal. Each survey cell is composed of 4 optical sensors: one central wide-angle camera (90 degrees), orthogonal to the navigation axis, two side deep-field cameras (50-60 degrees), and a pan-tilt-zoom camera for high resolution acquisition of boat details (e.g., license plates).
The main ARGOS functions are:
1. optical detection and tracking of moving targets present in the field of
view (FOV);
2. computing position, speed and heading of any moving target within the
FOV of each camera;
3. elaboration at survey cell level of any event (target appears, exits, stops, starts within the cell's FOV) and transmission of any event to the Control Center;
4. connecting all the track segments related to the same target in the different cameras' FOVs into a unique trajectory and track ID;
5. recording all the video frames together with the graphical information
related to track IDs and trajectories;
6. rectifying all the camera frames and stitching them into a composite
plain image so as to show a plan view of the whole Grand Canal;
7. allowing the operator to graphically select any target detected by the
system and automatically activating the nearest PTZ camera to track
the selected target.
3.4.4 RoboCare Project
The goal of the RoboCare project is to build a multi-agent system which gen-
erates user services for human assistance. The system is to be implemented
on a distributed and heterogeneous platform, consisting of a hardware and
software prototype.
Some of the results and publications (especially [23] and [25]) that this project has originated have had a deep influence on the design of PLaTHEA.
Chapter 4
Our System and Related Works
This chapter represents a first introduction to our system called PLaTHEA
(People Localization and Tracking for HomE Automation). Here we
build on the lessons learned in Chapter 3, focusing on how our approach is located in the state of the art.
Contents
4.1 Background Modeling and Foreground Segmentation
4.1.1 The Background Model
4.1.2 Foreground Segmentation
4.1.3 Foreground Refinements
4.2 Plan View Projection and Tracking
4.2.1 Localization
4.2.2 Correspondence
4.2.3 Refinements
4.3 Face Recognition
4.3.1 Notes on Face Detection
4.4 Tracking and Face Recognition Combined
4.1 Background Modeling and Foreground Segmentation
The choice of the background modeling technique (and hence of the foreground segmentation approach too) is one of the most important choices in the design of a PLT system. In the first part of the work we carried out experimental studies on many background modeling algorithms, including those introduced in [23], [25], [11] and [14]; in the remainder of this chapter we will discuss how our approach uses these techniques to fit our technological environment.
The stereo input to the system is given by a pair of Axis 207 network cameras. This kind of camera allows receiving a video stream using MJPEG1; the light compensation of this camera model is very sensitive, so the first constraint for our foreground segmentation method is to be insensitive to variations of image pixel intensity due to light compensation; we have noted that solving this problem is equivalent to using an algorithm insensitive to sudden illumination changes.
From this point of view our implementation of the approaches described in [23], [25] and [14] didn't give us the desired performance: the Axis 207 light compensation was too strong. The algorithm in [11], instead, showed very good results not only with respect to the camera's features but also with respect to hotspots and sudden illumination changes; however, as already stated in 3.2.2, this algorithm is not time adaptive, and this is an added constraint for PLaTHEA. With respect to this second constraint, the method based on pixel activity defined in [23] impressed us immediately; the concept of border activity is very natural and very effective, as shown by our tests.
The result of these considerations has been a hybrid solution using:
• the intensity-invariant background modeling and foreground segmentation defined in [11];
• the time adaptivity approach introduced in [23].
1 Multipart JPEG is a video format used in streaming applications. The stream is a sequence of JPEG images, each preceded by a header giving the file size and other information.
However, the algorithm so structured didn't satisfy us completely; as stated in section 3.2.2, the solution in [11] is presented in two versions: the authors state that the first one has problems with dark foreground elements and the second one has problems with shadows; we chose the second version because for a human we need as many pixels as possible detected as foreground (due to the production of the plan view maps). The use of disparity described in [23] is very interesting but, during our experimental studies, we noted that the flickering of the disparity map produced by the OpenCV library makes it harder to apply this particular technique. The solution to this issue came from reading [5]; the use of the HSV color model to detect shadows is simple, so the previous algorithm didn't need to be modified much.
In this section we will describe our proposed solution for background
modeling and foreground segmentation.
4.1.1 The Background Model
So let’s start to resume the elements of our background model. We can divide
the background model in submodels:
• the Edge Intensity Model (as seen in [23] and [25]) stores the average edge intensity and the absolute difference between the current edge intensity and the average edge intensity. Given the vertical V and horizontal H border matrices computed using the Sobel filter on the current left frame, the value of the current edge intensity matrix E for pixel (X, Y) is:

E(X, Y) = √(V^2(X, Y) + H^2(X, Y))    (4.1)

The average is, in fact, a running average whose sensitivity is given by the parameter β. So the value of the average matrix at time t, E_avg^t, at location (X, Y) is given by:

E_avg^t(X, Y) = (1 − β) E_avg^{t−1}(X, Y) + β E(X, Y)    (4.2)
The default value for β is 0.08. Due to the Axis 207 light compensation system, the difference matrix E_diff, obtained as the absolute difference between E and E_avg^t, is always noisy, so we fix a minimum value for the difference, minRumour; below this value, for a specific pixel (X, Y), we set E_diff(X, Y) = 0;
• the Activity Model (as seen in [23]) stores the average activity for all the pixels. First we introduce the vertical A_vert and horizontal A_horz activities as follows:

A_horz(Y) = Σ_x E_diff(x, Y)        A_vert(X) = Σ_y E_diff(X, y)    (4.3)

The value at (X, Y) of the average activity matrix at time t, A_avg^t, is obtained as a running average with parameter λ:

A_avg^t(X, Y) = (1 − λ) A_avg^{t−1}(X, Y) + λ A_horz(Y) A_vert(X)    (4.4)

The default value for λ is 0.2;
• the Color Model is the most articulate submodel. For each color channel of the left frame the model stores a running average (with learning parameter α), the difference between this average and the current frame, and the variance (whose learning factor is again α); for example, for the red channel the model stores the following matrices: C_avg^R, C_diff^R and C_var^R.
In addition, to support [11] we need to store the Brightness and Color distortions (matrices C_bd and C_cd respectively) for the current frame and the reference Brightness and Color running averages (matrices C_avgb and C_avgc respectively)2.
It's important to note that, according to [23], the value of α is calculated on a per-pixel basis and is inversely proportional to the pixel's activity. So if a pixel presents a high activity value its color model won't be updated; conversely, if the pixel activity is low, the color model will be updated with a learning factor dependent on the activity of the pixel. So the learning factor α_mod for the pixel (X, Y) is obtained as follows:

α_mod(X, Y) = α (1 − A_avg^t(X, Y)/η)    (4.5)

where η is an Activity Normalization Factor.
2 For the maths we suggest the reader see the original paper.
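A sketch of one update step of this model, restricted to the edge, activity and color running averages (equations 4.1-4.5); the values of α, minRumour and η are hypothetical defaults, and the clipping is a guard not present in the formula:

```python
import numpy as np
import cv2

def update_model(frame_gray, frame_bgr, E_avg, A_avg, C_avg,
                 beta=0.08, lam=0.2, alpha=0.05, min_rumour=8.0, eta=1e4):
    """One step of the background model update; all matrices are float arrays.

    Difference and variance matrices are omitted for brevity."""
    # Edge intensity from Sobel derivatives (eq. 4.1).
    gx = cv2.Sobel(frame_gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(frame_gray, cv2.CV_32F, 0, 1)
    E = np.sqrt(gx ** 2 + gy ** 2)
    E_avg = (1 - beta) * E_avg + beta * E                    # eq. 4.2
    E_diff = np.abs(E - E_avg)
    E_diff[E_diff < min_rumour] = 0.0                        # noise floor (minRumour)

    # Row/column activities (eq. 4.3) combined per pixel (eq. 4.4).
    a_h = E_diff.sum(axis=1, keepdims=True)   # one value per row
    a_v = E_diff.sum(axis=0, keepdims=True)   # one value per column
    A_avg = (1 - lam) * A_avg + lam * (a_h * a_v)

    # Per-pixel learning factor, inversely proportional to activity (eq. 4.5).
    alpha_mod = alpha * np.clip(1.0 - A_avg / eta, 0.0, 1.0)[..., None]
    C_avg = (1 - alpha_mod) * C_avg + alpha_mod * frame_bgr.astype(np.float32)
    return E_avg, A_avg, C_avg
```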
So the use of the shadow detection algorithm from [5] doesn't imply any overhead in the model: during the foreground segmentation phase the HSV model is derived directly from the RGB model.
4.1.2 Foreground Segmentation
Now, given the description of the background model, we want to describe the procedure to declare a pixel a foreground pixel. We apply in cascade the method described in [11] and then the method introduced in [5]:

1. given the algorithm parameters minCD, minBD, maxBD defined in [11], we first define for a pixel (X, Y) the following values:

brightnessRatio = (C_bd(X, Y) − 1)/C_avgb(X, Y)
colorRatio = C_cd(X, Y)/C_avgc(X, Y)

and then (X, Y) is a candidate foreground pixel if:

colorRatio > minCD or
(brightnessRatio > minBD and brightnessRatio < maxBD)

2. given the algorithm parameters minDarkening, maxDarkening, t_s, t_h defined in [5], the pixel (X, Y) isn't a foreground pixel if:

minDarkening < C^V(X, Y)/C_avg^V(X, Y) < maxDarkening and
|C^S(X, Y) − C_avg^S(X, Y)| < t_s and |C^H(X, Y) − C_avg^H(X, Y)| < t_h
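A vectorized sketch of the two tests in cascade; all threshold values are hypothetical, and the arrays are assumed to be float matrices of identical shape:

```python
import numpy as np

def foreground_mask(color_ratio, brightness_ratio, v, v_avg, s, s_avg, h, h_avg,
                    min_cd=0.5, min_bd=-0.4, max_bd=0.4,
                    min_dark=0.6, max_dark=0.95, ts=40.0, th=30.0):
    # Step 1: candidate foreground pixels according to [11].
    candidate = (color_ratio > min_cd) | \
                ((brightness_ratio > min_bd) & (brightness_ratio < max_bd))
    # Step 2: demote candidates that look like shadows according to [5]:
    # strongly darkened V, nearly unchanged S and H.
    ratio_v = v / np.maximum(v_avg, 1e-6)
    shadow = (ratio_v > min_dark) & (ratio_v < max_dark) & \
             (np.abs(s - s_avg) < ts) & (np.abs(h - h_avg) < th)
    return candidate & ~shadow
```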
In Fig. 4.1 we have a screenshot of the foreground segmentation.
Fig. 4.1: Background Modeling and Foreground Segmentation at work. In the "background" window we have the color model for the background. In the "connected components" window we have the currently detected foreground after the foreground refinements (see the next subsection). Finally, in the "Plan View Occupancy Map" window we have the occupancy map for the tracked subject (see section 4.2).
4.1.3 Foreground Refinements
Some PLT systems clean the foreground matrix. It is possible, for example, to find blobs in the foreground matrix and eliminate those too small to be people. Our experience demonstrates that this operation is not strictly necessary: another possible solution is to clean the foreground matrix using only a median filter to eliminate the so-called "salt and pepper" effect. Because this choice can be a "matter of taste", we leave it to the installer, providing him the possibility of setting it (via the administration GUI).
If the installer chooses to use the foreground contour scanner, he has to choose only one parameter, namely the Filter Perimeter Scale factor p_s. This scanner will delete from the foreground maps all the contours whose perimeter P_contour is below the selected fraction of the image frame semi-perimeter (half of the perimeter) P_image; so a contour is deleted if:

P_contour < P_image/p_s    (4.6)
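A sketch of the contour scanner under formula 4.6, using OpenCV's contour retrieval (OpenCV 4 Python bindings assumed):

```python
import cv2

def filter_foreground(mask, ps):
    """Remove from the binary foreground mask every external contour whose
    perimeter is below the image semi-perimeter divided by ps (eq. 4.6)."""
    h, w = mask.shape
    p_image = h + w  # half of the frame perimeter
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.arcLength(c, closed=True) < p_image / ps:
            cv2.drawContours(mask, [c], -1, 0, thickness=cv2.FILLED)
    return mask

# The alternative mentioned above is a plain median filter:
# mask = cv2.medianBlur(mask, 5)
```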
4.2 Plan View Projection and Tracking
The plan view projection phase in PLaTHEA is identical to that described in 3.2.3. The tracking phase, however, though inspired by [10] and [25], is innovative and the result of an iterative process of refinement. The Kalman state for each tracked person is the same used in [25].
In the remainder of this section we'll analyze the Tracking module used in PLaTHEA, following the same subdivision as section 3.2.4.
4.2.1 Localization
Instead of identifying blobs in the foreground matrix [25], our PLT system identifies candidate humans directly in the plan view occupancy map O. To solve the localization problem we have the following steps:
1. we use a contour scanner on the plan view occupancy map to retrieve all the external contours, and for each of these we calculate a bounding box. The dimension of these bounding boxes is normalized to a common template size, obtained by dividing twice the average width of a person's torso by the texel side chosen during the plan view projection;
2. we use the bounding boxes detected at the previous step to compute statistics in the corresponding areas of O and H. A bounding box contains a candidate for tracking if the following constraints are respected:
• the integral over the area subtended by the bounding box on O is greater than a certain threshold. In [10] a formula is given to obtain this threshold in a deterministic way;
• the maximum height in the area subtended by the bounding box in H is greater than a minimum height.
During the two steps just described the templates for the candidate are produced, so that during the correspondence stage they are ready to be consumed.
4.2.2 Correspondence
Now we have a set of tracked persons T = {t_1, t_2, ..., t_n} and a set of candidate objects C = {c_1, c_2, ..., c_m}; we can think of the elements of these two sets as nodes in a weighted complete bipartite graph3; see Fig. 4.2 for details.
Once a weight is defined for each edge, it is simple to find the best correspondence for each element; the problem to solve is in fact known as the minimum weighted bipartite matching problem (or, more simply, the assignment problem); this is a very well studied problem which has an efficient solution in the Hungarian Algorithm4, as sketched below.
3 In the mathematical field of graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint sets T and C such that every edge connects a vertex in T to one in C; that is, T and C are independent sets. Such a graph is also complete if every node in T is connected to every node in C. Finally, the graph is weighted because a weight is associated with every edge.
4 It was published by Kuhn in [17], who gave the name "Hungarian method" because the algorithm was largely based on the earlier works of two Hungarian mathematicians. The time complexity of the original algorithm was O(n^4); however, Edmonds and Karp noticed that it can be modified to achieve an O(n^3) running time.
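For illustration only (not part of the original implementation), SciPy ships a ready-made solver for the assignment problem; the weight matrix below is hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# weights[i, j] = distance between tracked person t_i and candidate c_j.
weights = np.array([[0.8, 2.5, 3.1],
                    [2.9, 0.4, 2.2],
                    [3.5, 2.6, 0.6]])
rows, cols = linear_sum_assignment(weights)  # Hungarian-style optimal matching
for t, c in zip(rows, cols):
    print(f"tracked {t} <- candidate {c} (weight {weights[t, c]:.2f})")
```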
Fig. 4.2: An example of a complete bipartite graph. A weight is associated with every edge.
Now let's discuss the distance measure. We initially thought of a composite measure including all the elements of the Kalman state of a tracked subject; it turned out that this solution is difficult, because to every distance we would have to associate a weight, which is hard to obtain in a consistent way. So the elaboration of the distance proceeds in steps; for each element in T and for each element in C:
1. in the first step we calculate the following distance measures:
• the difference between color templates in the way proposed in [25];
• the euclidean distance between the predicted position (via the Kalman filter) of the tracked object and the position of the candidate;
• the ratio between the average height of the tracked object and the average height of the candidate (we make it always greater than 1 so that it works as an incremental factor).
2. for each of these measures we have a different maximum value. Now we have two possibilities:
• if at least one of the measures exceeds the corresponding threshold then the edge's weight is set to the product of the three measures;
• in the other cases the edge's weight is set simply to the color difference.
The rationale behind this technique is that if all the constraints are respected there is a high probability that the candidate object corresponds to the tracked object; obviously this heuristic works better if persons wear clothes of very different colors.
After all weights are defined, the weight matrix is given as input to the Hungarian algorithm, which finds the best matching. The result of the algorithm is analyzed in the following way:
• if the algorithm has found a correspondence and the weight associated with this correspondence is under a predefined threshold, then the tracked object is updated using the candidate associated to it by the Hungarian algorithm;
• if the algorithm has found a correspondence but the weight is too high, then the tracked object is updated with the position predicted by the Kalman filter. All the templates are kept unvaried;
• if a tracked object has no correspondence (this can happen if |T| is greater than |C|) we follow the same behaviour as in the previous case;
• if a candidate object has no correspondence (this can happen if |C| is greater than |T|) we add a new tracked object using as templates those of the candidate.
4.2.3 Refinements
In PLaTHEA an object in T can be in only one of the following states:
• NEWOBJECT. It's the state assigned to a new entry in the tracked objects database. In this state an object is not really tracked and PLaTHEA doesn't provide any update about it to the clients;
• TRACKED. A NEWOBJECT enters this state if it's successfully tracked more than 5 times consecutively. Of this object we know a position assured by the correspondence found by the Hungarian algorithm;
• LOST. An object enters this state if for a frame there is no correspondence found by the Hungarian algorithm;
• STALE. An object enters this state if it's a NEWOBJECT and we find no correspondence for it before it becomes TRACKED, or if it has been LOST for more than 100 frames. An object in this state is deleted from the database.
In Fig. 4.3 we summarize these transition policies.
Fig. 4.3: The state transition diagram for PLaTHEA tracked objects. Note that this transition diagram treats a tracked object as an anonymous entity. We will review this diagram later using identity information.
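A minimal sketch of the transition policy just described, assuming (as the diagram suggests) that a LOST object returns to TRACKED as soon as a match is found again; the frame-count thresholds are those listed above:

```python
from enum import Enum

class State(Enum):
    NEWOBJECT = 0
    TRACKED = 1
    LOST = 2
    STALE = 3

def next_state(state, matched, streak):
    """streak counts consecutive successful matches for a NEWOBJECT
    and consecutive misses for a LOST object."""
    if state is State.NEWOBJECT:
        if not matched:
            return State.STALE
        return State.TRACKED if streak > 5 else State.NEWOBJECT
    if state is State.TRACKED:
        return State.TRACKED if matched else State.LOST
    if state is State.LOST:
        if matched:
            return State.TRACKED
        return State.STALE if streak > 100 else State.LOST
    return State.STALE  # stale objects are deleted from the database
```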
4.3 Face Recognition
The face recognition method used in PLaTHEA can be classified as feature-based. It uses SIFT features to create a database where, for each registered person, the system stores a set of images. When the system starts, it prepares itself for recognition with the following sequence of steps:
1. it loads for each person the corresponding set of images;
2. for each image it computes the SIFT features;
3. for each single image it sorts the SIFT features into a kd-tree [2]; this data structure allows for a fast similarity computation during the recognition phase.
For an example of the SIFT database for a single person see Fig. 4.4.
Fig. 4.4: The SIFT database for a single person. In each image the SIFT features are highlighted. In the deployed system we have at least ten images per person.
Recognition implies the following steps:
1. SIFT features are extracted from the test face;
2. we try to assign a similarity score to each person in the database with respect to the test face; this score is the sum of the scores assigned to each face related to the specific person. For each feature detected in the test face, the nearest feature in a specific face is found using the BBF - Best Bin First search on the face's kd-tree, computed during face database training. A database face's score is given by counting all the features corresponding to test face features which respect the following constraints:
• the distance from the corresponding feature is under a selected threshold;
• the ratio between this distance and the distance between the test feature and the second best match is under a selected threshold.
3. if the highest person's score is over a threshold then the algorithm assigns the face to that person.
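A sketch of the scoring procedure, using OpenCV's SIFT and a FLANN kd-tree in place of the original BBF search; the two thresholds are hypothetical placeholders:

```python
import cv2

sift = cv2.SIFT_create()
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 4},  # kd-tree index
                              {"checks": 50})

def face_score(test_face, db_face, max_dist=250.0, max_ratio=0.8):
    """Count test-face features whose best match in db_face passes both the
    distance constraint and the distinctiveness (ratio) constraint."""
    _, d_test = sift.detectAndCompute(test_face, None)
    _, d_db = sift.detectAndCompute(db_face, None)
    if d_test is None or d_db is None:
        return 0
    score = 0
    for m in flann.knnMatch(d_test, d_db, k=2):
        if len(m) < 2:
            continue
        best, second = m
        if best.distance < max_dist and best.distance < max_ratio * second.distance:
            score += 1
    return score
```

A person's overall score is then the sum of face_score over all of that person's database images.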
SIFT performs best on rigid objects. As a consequence, database training plays a prominent role in PLaTHEA's recognition performance: each person should store in the database the widest possible variety of facial expressions. This helps a SIFT-based face recognition algorithm to assign high scores to the right person. See Fig. 4.5 for an example of execution.
It is important to note, in Fig. 4.4 and 4.5, the effects of non-diffuse illumination in the room. In Fig. 4.4 the number of features detected on the left side of the face (the most illuminated one) is noticeably greater than the number detected on the right side. Likewise, in Fig. 4.5, it is easier for the algorithm to match features on the better illuminated side of the face. This is a remarkable issue for PLaTHEA's installer.
4.3.1 Notes on Face Detection
Before ending the section, we want to stress the importance of face detection in PLaTHEA. Obviously, the system has to support the presence of multiple people in a room. The left camera of the stereo rig does not supply close-ups of faces: we have an image of the whole room, from which the close-ups of the faces have to be extracted. To this aim we use the face detector presented in section 3.3.1. Our experience proves that face detection is the computationally most expensive operation (we will see in chapter 6 that this forces us to execute face detection and face recognition in a thread running in parallel with the rest of the elaboration); the only way to speed up the process is to define a minimum face size.
Fig. 4.5: Our face recognition algorithm at work. In each sample the upper image is the test image. The samples show the matched features. Note that the algorithm makes very few errors . . . and he is my brother.
The reader might think that this creates a problem for the face recognition phase, because smaller faces are not detected, so the system will not even try to recognize them. This is not the case: if a face is too small, the face recognition system cannot find correspondences anyway, because not enough features can be detected on the test face.
The faces stored in the database have a size of 150x150 pixels, while the face detection system is set to find faces with a minimum size of 75x75 pixels. Before recognition, if the detected face is bigger than the database faces, the system resizes it using cubic interpolation; conversely, if the detected face is smaller than the database face size, the system does not try to zoom it, because this would cause information loss.
Fig. 4.6: The Viola-Jones detector at work. Two faces detected.
The reference Haar features for the Haar detector employed in PLaTHEA are provided by the OpenCV library. In particular, we use the training set for frontal faces. It is possible to use multiple training sets (OpenCV also provides training sets for profiles), but our experience suggested using only the training set for frontal faces. This implies that it is useless to store people's profiles in the database.
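A sketch of the detection and normalization steps with the OpenCV 1.x C API follows; the cascade file name is the standard frontal face cascade shipped with OpenCV, while the size constants (75x75 and 150x150) are those stated above.

#include <opencv/cv.h>
#include <opencv/cxcore.h>

// Detect frontal faces not smaller than 75x75 pixels. The cascade is
// loaded once, e.g. with cvLoad("haarcascade_frontalface_alt.xml").
CvSeq* detectFaces(IplImage* grayFrame, CvHaarClassifierCascade* cascade,
                   CvMemStorage* storage)
{
    cvClearMemStorage(storage);
    return cvHaarDetectObjects(grayFrame, cascade, storage,
                               1.1,  // scale factor between pyramid levels
                               3,    // minimum neighbours for a valid hit
                               CV_HAAR_DO_CANNY_PRUNING,
                               cvSize(75, 75)); // smaller faces are cut off
}

// Bring a detected face to the 150x150 size used by the face database;
// smaller faces are left as they are, because upscaling would cause
// information loss.
IplImage* normalizeFace(IplImage* face)
{
    if (face->width <= 150 || face->height <= 150)
        return face;
    IplImage* out = cvCreateImage(cvSize(150, 150), face->depth,
                                  face->nChannels);
    cvResize(face, out, CV_INTER_CUBIC);
    return out;
}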
4.4 Tracking and Face Recognition Combined
Until now we have not faced the problem of how to combine the identity information provided by face recognition with the tracking information provided by the tracking module.
If a face is recognized at a time step, its center of mass is reprojected onto the plan used for tracking, in the same way as if it were a foreground pixel. Then we find the tracked object present at that position and assign the identity to it (actually, before an identity is assigned to a tracked object, the same identity has to be recognized for three consecutive times). Unfortunately, we had to face a problem: most of the time the stereo correspondence algorithm does not provide a disparity value for the face pixels5; so we move the aforementioned face's center of mass down to the chest (we use the face size to do this) and we reproject this point onto the floor.
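A sketch of this fallback follows; the one-face-height offset used to reach the chest is illustrative, Q is the 4x4 reprojection matrix of chapter 2, and the fixed-point scaling of the block matching disparity is omitted for brevity.

#include <opencv/cv.h>

// Move the face's center of mass down to the chest and reproject it to 3D
// camera coordinates as [X Y Z W]^T = Q [x y d 1]^T.
bool reprojectFaceToChest(CvRect face, IplImage* disparity16, CvMat* Q,
                          CvPoint3D32f* cameraPoint)
{
    int x = face.x + face.width / 2;
    int y = face.y + face.height / 2 + face.height; // face center -> chest
    short d = CV_IMAGE_ELEM(disparity16, short, y, x);
    if (d <= 0)
        return false; // no disparity at the chest either

    float p[3] = { (float)x, (float)y, (float)d };
    CvMat src = cvMat(1, 1, CV_32FC3, p);
    float q[3];
    CvMat dst = cvMat(1, 1, CV_32FC3, q);
    cvPerspectiveTransform(&src, &dst, Q); // applies the 4x4 matrix Q

    *cameraPoint = cvPoint3D32f(q[0], q[1], q[2]);
    return true;
}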
At this point the reader may have a doubt: why don't we track faces directly, instead of tracking the pixels detected via foreground segmentation? We have several answers to this question:
1. users do not always look at the camera;
2. the face detector is not perfect at all: sometimes it successfully finds a face and sometimes, due to adverse light conditions or to obstructions, it is not able to do so;
3. like the face detector, the face recognition system is not perfect: sometimes a face has no matching person with a score high enough to be sure of the identity.
So, in our vision, the combination of tracking and face recognition gives better results than simple face tracking.
5This is due to the speed optimization techniques used in the OpenCV SAD-based stereo correspondence algorithm.
Chapter 5
System Requirements and Architecture
Contents
5.1 Overview on System Requirements . . . . . . . . 66
5.2 A Look at the Architecture . . . . . . . . . . . . 67
5.2.1 Embedding PLaTHEA . . . . . . . . . . . . . . . 67
5.2.2 The Components’ Architecture . . . . . . . . . . . 68
5.2.3 The Software Dependencies . . . . . . . . . . . . . 68
5.3 The Storage . . . . . . . . . . . . . . . . . . . . . . 70
5.3.1 The Camera Calibration Database . . . . . . . . . 71
5.3.2 The Face Database . . . . . . . . . . . . . . . . . . 72
5.4 The Elaboration Core . . . . . . . . . . . . . . . . 72
5.5 The UPnP Device . . . . . . . . . . . . . . . . . . 73
5.6 The External Entities . . . . . . . . . . . . . . . . 75
5.7 Use Cases . . . . . . . . . . . . . . . . . . . . . . . 76
5.7.1 Installation and Configuration . . . . . . . . . . . 76
5.7.2 Run Time Installation Refinements . . . . . . . . . 81
5.7.3 The Face Database Construction . . . . . . . . . . 82
5.7.4 Run Time Use Cases . . . . . . . . . . . . . . . . . 82
5.1 Overview on System Requirements
The initial chapters have given you the theoretical basis of PLT systems in general and of PLaTHEA in particular. With this section we start describing the system from a practical point of view, and the best way to do so is to describe the system requirements; some of these have already emerged in the first chapters, and here we face them with a more systematic approach.
In the first place, the system has to be as transparent as possible to the user. We have already stated that a marker-based PLT system produces a psychological effect which induces rigidity and lack of naturalness in the user; so the use of cameras instead of markers gives us a first kind of transparency. In a second sense, by transparency we mean that users do not have to follow a particular behaviour to let the system work (for example, assuming particular poses or pronouncing magic words). In our vision, the only interaction the user has to have with PLaTHEA is during the training phase; that is, the user only has to build his photographic book for face recognition (of course we are talking about the interaction with PLaTHEA; if we refer to the SM4All system as a whole, the users also have to define the so-called "scenes").
In the second place, we wanted PLaTHEA to be easily integrable into a home automation system as well as into home building design. The first requirement demands that the interaction with the system be loosely coupled, and a perfect tool for this is a service-based architecture: the client of the system (the pervasive layer, in SM4All's slang) interacts with the system (in a synchronous or asynchronous fashion; we will return to this aspect later) via services (the infrastructure for this is offered by the UPnP standard). The second requirement rules out the use of special kinds of data buses: video frames, as well as service requests and replies, travel on a simple Ethernet bus (in our vision of the futuristic home, the whole home is wired with Ethernet).
In the third place, we want our system to be as cheap as possible from an economical point of view. In our vision, each room in the home should be equipped with a couple of off-the-shelf cameras1 (the 207 is the entry-level model of the network camera family produced by Axis) and with a computer (the system is currently deployed on a notebook, but we hope to deploy it on a simpler machine, possibly equipped with an embedded operating system).
Also, we need a robust system. PLaTHEA has to be started and, from that moment, never stopped (of course, this assumption may seem a little strong). So, during the implementation, a lot of attention has been dedicated to error handling, memory management and so on.
Finally, last but not least, we have given particular attention to the deployment phase. The administration interface allows the installer to make the system work in a short time; it provides an easy interface for all kinds of calibration and for the creation of the face database.
5.2 A Look at the Architecture
In this section we first describe the deployment of a set of PLaTHEA installations in a typical home, and then describe the component architecture of a single instance.
5.2.1 Embedding PLaTHEA
In Fig. 5.1 we explain our vision of how the system should be deployed in a home.
In our vision, a router represents the access point to the Internet. This router offers Wi-Fi networks as well as an Ethernet network, and is connected via Ethernet to a set of switches (one for each room in the home). Every room has a computer installed; this computer runs all the services for its particular room, including PLaTHEA. The cameras are connected to the room's switch. For us this is a good solution, because it isolates the traffic of the cameras' frames (we will see in the test chapter that this traffic may reach 10% of a 100 Mbit/s LAN) and also leaves the possibility to install other services on the computer.
1Note that there is a trade-off between costs and performance. For example, a good face recognition system requires high resolution cameras if possible.
Fig. 5.1: PLaTHEA embedded in a home. For each room we have the basic elements of the system.
5.2.2 The Components’ Architecture
Now that we have an idea of how the system is deployed, we want to analyze the structure of a single instance. The principal components of the PLaTHEA architecture are shown in Fig. 5.2. In the rest of the chapter we describe the various components of this architecture in detail.
We proceed from the lower layer up to the "presentation layer" (that is, the UPnP Device) and then discuss the external elements which interact with the system.
5.2.3 The Software Dependencies
To operate, PLaTHEA needs the set of libraries indicated in Fig. 5.3.
Fig. 5.2: The components of PLaTHEA and their responsibilities and dependencies.
Fig. 5.3: The library dependencies of PLaTHEA.
70 CHAPTER 5. SYSTEM REQUIREMENTS AND ARCHITECTURE
We give a brief introduction to each of these libraries:
• OpenCV is a computer vision library originally developed by Intel. It is free for use under the open source BSD license and is cross-platform. It focuses mainly on real-time image processing; as such, if it finds Intel's Integrated Performance Primitives on the system, it will use these commercial optimized routines to accelerate itself. The OpenCV library can be downloaded for free at http://sourceforge.net/projects/opencvlibrary/;
• LibJPEG 7 is a C library for the compression and decompression of JPEG images; the library has recently been updated to support C++ and is available for free at http://www.ijg.org/;
• CyberLink UPnP is a C++ library which follows version 1.0 of the UPnP standard; it is developed by Satoshi Konno for a wide variety of platforms; the C++ version is available at http://clinkcc.sourceforge.net/;
• Xerces is an XML library used by CyberLink UPnP (UPnP is inherently based on SOAP and hence on XML); it is available for C++ and Java; for the C++ version see http://xerces.apache.org/xerces-c/.
LibJPEG and CyberLink UPnP are linked into the application as static libraries, so they do not have to be installed just to execute the system (of course they are needed to recompile the code). OpenCV and Xerces, instead, are linked as dynamic libraries, so they have to be installed on the disk and their bin directories have to be in the path in order to execute the software.
The application is developed using Microsoft Visual Studio 2008, but we have avoided using Microsoft extensions to the C++ language, so it is easy to compile the code with another compiler.
5.3 The Storage
The system has to manage two permanent repositories of data: the camera calibration database and the face database. These repositories are not real databases, but rather collections of files necessary to the system during its run.
5.3.1 The Camera Calibration Database
The camera calibration database contains all the matrices and vectors introduced in chapter 2, plus other files.
We start with the files produced by Stereo Cameras Calibration; with this term we indicate the Camera Calibration together with the Stereo Calibration; this is due to the fact that, during the installation phase, PLaTHEA executes the two operations simultaneously. The results of this operation are:
• the intrinsic matrices of the left and right cameras, denoted respectively by Mleft and Mright and stored in "LeftIntrinsics.xml" and "RightIntrinsics.xml";
• the distortion vectors of the left and right cameras, denoted respectively by Dleft and Dright and stored in "LeftDistortion.xml" and "RightDistortion.xml";
• the rotation matrix R, stored in "Rotation.xml", and the translation vector T, stored in "Traslation.xml";
• the essential matrix E, stored in "Essential.xml", and the fundamental matrix F, stored in "Fundamental.xml";
• the reprojection matrix Q, stored in "3DReprojection.xml";
• the pixel remapping matrices for the undistortion and rectification of both cameras: for the left camera we have the files "mx LEFT.xml" and "my LEFT.xml"; for the right camera we have "mx RIGHT.xml" and "my RIGHT.xml".
Now we analyze the files produced by the so-called external calibration. We will see later how this kind of calibration is done; for now we only say that it provides the means to transform 3D camera coordinates into 3D room coordinates; this operation is necessary, as we have already seen, to create the plan view maps. The external calibration produces as output a translation vector Tworld (stored in "External Traslation.xml") and a rotation matrix Rworld (stored in "External Rotation.xml").
The last source of the camera calibration database are the room settings, stored in a file with extension ".rsf". This file contains the environment data and the persons' common data:
• the minimum and maximum values for the XW, YW and ZW coordinates, that is, the room size;
• the texel side for the plan view projection;
• the persons' minimum, maximum and average height and the persons' average width.
Before starting the elaboration, the system has to load all these files to perform the computations described in the previous chapters.
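As an illustration, the start-up loading step may look as follows, assuming the matrices were saved with cvSave so that cvLoad can read them back; error handling is reduced to a null check.

#include <opencv/cxcore.h>
#include <cstdio>

struct StereoCalibration {
    CvMat *Mleft, *Mright;   // intrinsic matrices
    CvMat *Dleft, *Dright;   // distortion vectors
    CvMat *R, *T, *E, *F;    // stereo extrinsics
    CvMat *Q;                // 3D reprojection matrix
};

static CvMat* loadMat(const char* path)
{
    CvMat* m = (CvMat*)cvLoad(path);
    if (!m) std::fprintf(stderr, "missing calibration file: %s\n", path);
    return m;
}

bool loadCalibration(StereoCalibration* c)
{
    c->Mleft  = loadMat("LeftIntrinsics.xml");
    c->Mright = loadMat("RightIntrinsics.xml");
    c->Dleft  = loadMat("LeftDistortion.xml");
    c->Dright = loadMat("RightDistortion.xml");
    c->R = loadMat("Rotation.xml");
    c->T = loadMat("Traslation.xml");
    c->E = loadMat("Essential.xml");
    c->F = loadMat("Fundamental.xml");
    c->Q = loadMat("3DReprojection.xml");
    return c->Mleft && c->Mright && c->Dleft && c->Dright
        && c->R && c->T && c->E && c->F && c->Q;
}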
5.3.2 The Face Database
The face database is not so complex. It contains a resume file, with extension ".dof", containing information about each person's identification number and name, together with the number of images stored in the database for that person; these images have names composed of the person's name and an incremental index.
5.4 The Elaboration Core
The Elaboration Core is the most important component of the system. It is the component that does the real work:
• it acquires the MJPEG streams from the cameras, synchronizes them (more on this later) and decompresses them into bitmap images (OpenCV uses a format known as IplImage);
• it models the background and performs the foreground segmentation;
• it updates the plan view maps and tracks the objects on them;
• it performs face recognition and combines this information with the tracking information.
Fig. 5.4: The principal software modules in the Elaboration Core.
5.5 The UPnP Device
The UPnP Device represents the Presentation Layer of the system. It allows the clients to interact with the system in a synchronous or asynchronous fashion.
Universal Plug and Play (UPnP) [29] is a set of networking protocols promulgated by the UPnP Forum. The goals of UPnP are to allow devices to connect seamlessly and to simplify the implementation of networks in the home (data sharing, communications, and entertainment) and in corporate environments for the simplified installation of computer components. UPnP achieves this by defining and publishing UPnP device control protocols (DCP) built upon open, Internet-based communication standards.
The term UPnP is derived from plug-and-play, a technology for dynamically attaching devices directly to a computer, although UPnP is not directly related to the earlier plug-and-play technology. UPnP devices are "plug-and-play" in that, when connected to a network, they automatically announce their network address and supported device and service types, enabling clients that recognize those types to immediately begin using the device.
Fig. 5.5: The synchronous and asynchronous UPnP interfaces used in PLaTHEA.
A UPnP control point (that is, a UPnP client) can interact with a UPnP device mainly in two ways:
• it can call synchronous methods with blocking calls;
• it can subscribe to services; a service contains a set of state variables which communicate their value changes to the service subscribers; this is the UPnP asynchronous interaction.
In Fig. 5.5 we have included the subscription methods in the set of the synchronous methods. We now describe the use of each method and of each evented variable; a client-side sketch follows the list.
• the GetListIDRegistered method returns an XML string with a set of pairs <id, name>, one for each registered user;
• the GetRoomInfo method returns an XML string containing information about the controlled room;
• the GetPositionFromPersonID method takes as input a person id and returns the position of the corresponding user, if he is present in the room;
• the GetPositionFromObjectID method takes as input a tracked object id (these IDs, differently from person IDs, are temporary) and returns the position of that object;
• the GetAllPositions method returns an XML string with the positions (and the eventual identities) of all the tracked objects;
• the notifyNewObject state variable changes its value when the system has detected one or more new objects; the variable is set to an XML string containing the tracked objects' information; to receive this update a client has to subscribe to the mainService;
• the notifyNewRecognizedObject state variable changes its value when a previously tracked object has been recognized; to receive this update a client has to subscribe to the mainService;
• the notifyAllFrames state variable changes periodically, with a period depending on which periodical service the client is subscribed to.
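From the client side, a synchronous call may look like the following sketch; the CyberLink C++ class and method names used here mirror the library's Java counterpart, and both they and the include path should be verified against the actual clinkcc headers before use.

// Best effort sketch of a control point calling GetPositionFromPersonID;
// the CyberLink API shown here is an assumption based on the Java version.
#include <cybergarage/upnp/CyberLink.h>
#include <iostream>

int main()
{
    CyberLink::ControlPoint cp;
    cp.start(); // starts UPnP discovery (SSDP)
    // ... wait for the PLaTHEA devices to be discovered ...
    CyberLink::DeviceList* devices = cp.getDeviceList();
    for (int i = 0; i < devices->size(); ++i) {
        CyberLink::Device* dev = devices->getDevice(i);
        CyberLink::Action* act = dev->getAction("GetPositionFromPersonID");
        if (act == NULL)
            continue; // not a PLaTHEA installation
        act->setArgumentValue("PersonID", "1");
        if (act->postControlAction()) // blocking SOAP call
            std::cout << act->getArgumentValue("Result") << std::endl;
    }
    cp.stop();
    return 0;
}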
5.6 The External Entities
We now have to analyze the remaining components of the architecture depicted in Fig. 5.2. An important feature of a home automation system is the identity database. When a new face database entry is registered, PLaTHEA has to advise this system of the event; this is necessary because a client obtains via UPnP from PLaTHEA an id which is valid only within our system. The identity database allows the client to complete the information about this ID.
In synthesis, the identity database gives the client a correspondence between the IDs registered in PLaTHEA and the real identities certified by the home automation system.
5.7 Use Cases
In this section we take a look at PLaTHEA at work. The operation of the system can be subdivided into two main phases: the Installation and Configuration Phase and the Elaboration Phase. It is remarkable that part of the system behaviour (the parameters introduced in Chapter 4) is configurable at "Run Time", allowing the installer to see the effect of a parameter change on the overall system.
5.7.1 Installation and Configuration
The first installation step is the physical mounting of the stereo rig. During this phase it is useful to take into account the following considerations:
• as already stated, the cameras have to be mounted as frontal parallel as possible (see Fig. 2.2 for details);
• the depth resolution of the system depends on the chosen baseline; this means that this parameter has to be fixed according to the room size: the nearer the cameras are to each other, the worse the resolution at high distances;
• it is better to mount the stereo rig in the corner of the room opposite the entry door, near the ceiling, to obtain the largest field of view;
• it is important to choose a camera model whose resolution is adequate to the room size. As we have seen, the face detection algorithm cuts off faces smaller than 75x75 pixels; with high resolution cameras we get bigger faces;
• the two cameras have to be identical.
The second step is to start an uncalibrated acquisition from the net cameras (see Fig. 5.6).
Before continuing, we describe the information requested in the acquisition window of Fig. 5.6:
Fig. 5.6: The acquisition window; during this phase it is important to uncheck the calibrated option.
• the IP addresses and ports of the stereo cameras; it is important to note that left and right are determined looking from behind the cameras towards the cameras' direction;
• the user id and password for authentication with the cameras. Axis net cameras implement a form of open HTTP authentication which requires a base 64 encoding of these data (see the sketch after this list);
• the acquisition resolution. Please note that this resolution refers to the one used for face recognition; for people localization and tracking the frames are scaled down to 320x240;
• the acquisition frame rate. It is important to note that this rate has to be adequate to the room's computer. However, the system is robust with respect to low or excessive frame rates;
• the use calibration data option, which tells the system whether to undistort and rectify the images. This option requires that the stereo cameras calibration data is loaded.
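As an example of the authentication point above, the request that opens the persistent MJPEG connection can be built as follows; the CGI path /axis-cgi/mjpg/video.cgi reflects our understanding of the Axis HTTP interface and should be checked against the camera documentation.

#include <string>
#include <cstdio>

static const char* B64 =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Minimal RFC 4648 base64 encoder for the Authorization header.
std::string base64(const std::string& s)
{
    std::string out;
    for (size_t i = 0; i < s.size(); i += 3) {
        unsigned v = (unsigned char)s[i] << 16;
        if (i + 1 < s.size()) v |= (unsigned char)s[i + 1] << 8;
        if (i + 2 < s.size()) v |= (unsigned char)s[i + 2];
        out += B64[(v >> 18) & 63];
        out += B64[(v >> 12) & 63];
        out += (i + 1 < s.size()) ? B64[(v >> 6) & 63] : '=';
        out += (i + 2 < s.size()) ? B64[v & 63] : '=';
    }
    return out;
}

// Build the HTTP request for a persistent MJPEG stream; the CGI path is an
// assumption based on the Axis HTTP interface.
std::string buildMjpegRequest(const std::string& host,
                              const std::string& user,
                              const std::string& password,
                              int width, int height, int fps)
{
    char line[256];
    std::snprintf(line, sizeof(line),
                  "GET /axis-cgi/mjpg/video.cgi?resolution=%dx%d&fps=%d HTTP/1.1\r\n",
                  width, height, fps);
    std::string req(line);
    req += "Host: " + host + "\r\n";
    // HTTP Basic authentication: base64("user:password").
    req += "Authorization: Basic " + base64(user + ":" + password) + "\r\n";
    req += "Connection: keep-alive\r\n\r\n"; // persistent TCP connection
    return req;
}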
At this point it is useful to make some corrections to the camera poses, in order to reach the best frontal parallel configuration possible.
The third step is what we called stereo cameras calibration. This is done by giving the system (already placed in the desired position) a sequence of 14 poses of a rigid chessboard pattern (see Fig. 5.7 for an example).
Fig. 5.7: The stereo calibration window; the system emits a sound to tell the installer to stay as still as possible.
After the calibration, the system estimates the error on the obtained data. A value less than or equal to 0.20 assures a good result; if the error is too high, we have to repeat the calibration.
It is important to note that calibration data remain valid as long as the relative position between the two cameras does not change; it is possible to change the absolute position of the stereo rig as a whole, but not the position of a single camera; in such a case the calibration has to be redone from scratch.
We can test our calibration by stopping the acquisition and restarting it using the calibration data. If we are satisfied, we save these data in a folder (in this folder will be placed all the files described previously).
Now, with the acquisition started and the undistortion and rectification functions active, we can do the external calibration. This step requires a preliminary operation: we have to place a series of markers in the room and, for each of them, calculate the exact position in room coordinates (in millimeters), as in Fig. 5.8.
Fig. 5.8: The external calibration markers. The room coordinate system has to be obtainable by rotation and translation of the camera's coordinate system, which is right-handed and depicted in Fig. 5.9.
Now we can use the external calibration window to select, using a viewfinder, each marker in the scene and to insert its world coordinates; we do this operation for all the markers (see Fig. 5.10). Just as after the stereo cameras calibration we cannot move one camera relative to the other, after the external calibration we cannot move the stereo rig.
After the external calibration we can save the data into a folder (we recommend using the same folder used for the stereo cameras calibration data).
The last installation step is the Room Settings phase. This action does not require acquisition from the cameras; we use the window in Fig. 5.11 to do all the work.
Fig. 5.9: The camera coordinate system.
Fig. 5.10: The external calibration tool. The installer moves the cursor over the snapshot taken from the left camera and, using a "zoom in" window, clicks on each marker and types in its world coordinates (in millimeters).
Fig. 5.11: The Room Settings window.
These settings can be saved in a file with extension ".rsf", as already stated.
5.7.2 Run Time Installation Refinements
The installer's work is not finished. We have already stated that the behaviour of the system can be changed by modifying a series of values (for example, the learning factor of a Gaussian). The administrator GUI provided with PLaTHEA allows changing these values at Run Time. After the elaboration is started, the tool window in Fig. 5.12 becomes available.
The window is divided into frames corresponding to the different elaboration tasks:
• the Background Learning frame: here we have all the settings described in section 4.1.1;
• the Disparity Map Settings frame: for details on the settings we invite the reader to study the OpenCV block matching stereo correspondence algorithm on the OpenCV online wiki or in [4];
• the Foreground Segmentation frame: for details on the parameters see [11] and [5]. Note that it is also possible to choose whether to use the shadow detector and the foreground contour scanner;
• the Plan View Map and Tracking frame: we have already analyzed how these settings modify the behavior of the system.
Fig. 5.12: The elaboration settings window.
5.7.3 The Face Database Construction
The administrator GUI allows the users to modify the face database. It is possible to add and delete users and, for each user, to add new faces to the database (it is not possible to delete single faces from the database; if a user has this need, he has to replace the whole set).
5.7.4 Run Time Use Cases
Now we want to describe how the system behaves in different situations. To do this, we will define a series of scenes. To start, we introduce a state diagram similar to the one introduced in Fig. 4.3. In this diagram we change the point of view: the subject is no longer a tracked object (which, as shown in Fig. 4.3, has a limited life with an initial state and a final state) but the person, whose identity may be associated to a tracked object for a limited period of time; when a person exits a room and re-enters after some time, his identity will probably be associated to another tracked object.
Fig. 5.13: The person’s state transition diagram.
The use cases we want to define have as main actor the client system, which we indicate with UPnPControlPoint. In the SM4All middleware this control point is a component of the Pervasive Layer. The other actors are the instances of PLaTHEA.
We will spend the rest of the chapter describing the use of the system. We have already described the interfaces supported by a single instance of PLaTHEA, so now we deepen the composite use cases.
Fig. 5.14: Examples of use cases for a home served by PLaTHEA. Many other services are obtainable.
Main Success Scenario
1. UPnPControlPoint starts the search for X
2. For each PLaTHEA installation, UPnPControlPoint searches for X by including Find Person Position
3. The installation k of PLaTHEA returns the position of X
4. UPnPControlPoint performs some operation
Extension to 3: no installation has found X
a) failure
Fig. 5.15: The Person Research Use Case
Main Success Scenario
1. UPnPControlPoint wants to know the position of every person
2. For each PLaTHEA installation, UPnPControlPoint includes Give All Persons
3. UPnPControlPoint collects all the information and performs other tasks
Fig. 5.16: The Home Snapshot Use Case
Main Success Scenario
1. UPnPControlPoint wants an update about each room every 5 seconds
2. For each PLaTHEA installation, UPnPControlPoint includes Subscribe to All Tracked
3. UPnPControlPoint waits for an update
4. UPnPControlPoint receives an update from an installation k of PLaTHEA
5. UPnPControlPoint performs some operation
6. Back to 3
Extension to 6: surveillance ends
a) end of the use case
Fig. 5.17: The Home Periodic Surveillance Use Case
Chapter 6
Implementation Details
Contents
6.1 Technological Introduction . . . . . . . . . 87
6.2 The Elaboration Core Component . . . . . . . . 88
6.2.1 Video Acquisition and Synchronization . . . . . . . 89
6.2.2 The Elaboration and the Face Recognition Threads 92
6.3 The UPnP Device . . . . . . . . . . . . . . . . . . 95
6.3.1 The UPnP device descriptor . . . . . . . . . . . . . 96
6.1 Technological Introduction
In this introductory section we describe PLaTHEA's technological context. The system is implemented in native C++ exploiting the Win32 API (on a Windows 7 operating environment) and tested on an off-the-shelf laptop (a Toshiba Satellite A300 1GY, equipped with an Intel Core 2 Duo CPU at 2.53 GHz and 4 GB of RAM).
As already stated, the cameras used during the implementation are a couple of Axis 207 net cameras; these cameras have a wired Ethernet interface. Using the HTTP interface it is possible to retrieve single images (in JPEG format) or video (following the MPEG-4 standard or as Multipart JPEG). The Axis 207 is the entry-level model of the network camera family produced by Axis; it has a maximum resolution of 640x480 at a maximum frame rate of 30 frames per second1.
6.2 The Elaboration Core Component
The goal of this section is to deepen the implementation details of the Elaboration Core Component. This is a multithreaded component2 (it consists of five synchronized threads of execution); its overall schema is given in Fig. 6.1.
Fig. 6.1: The elaboration core component in detail.
Each of the following subsections analyzes a part of this figure in detail.
1These cameras offer other features not exploited here, such as a UPnP interface, motion and audio detection, email sending and so on.
2A very interesting guide to the Win32 API and Windows architecture is [22].
6.2.1 Video Acquisition and Synchronization
Stereo Vision involves the simultaneous acquisition of two sequences of frames from the two cameras; asking the cameras for a single frame when needed was not a good solution, for the following reasons:
• requesting a frame from a net camera requires the establishment of a TCP connection, and this involves not only extra traffic (due to the connection setup) but also the problems caused by TCP Slow Start;
• the establishment of a connection is a "slow" operation, so it is very difficult to request two simultaneous images from the two cameras; when computing disparity, it is very important that the two images taken from the left and right cameras are acquired at the same instant.
So we chose to use Multipart JPEG (due to its simplicity with respect to MPEG-4); in this way we open a persistent TCP connection with both cameras. However, this choice does not remove the synchronization problem, for the following reasons:
• the cameras are independent devices; when an Axis camera performs light compensation (as stated in the camera's documentation), it slows down the frame rate, and this causes asynchrony between the two sequences of frames;
• even if the cameras transmit at exactly the same frame rate, they may start transmitting at different instants (a difference of microseconds).
The second issue is not a problem at all, because people move slowly (with respect to computer time). However, the first issue is a very important problem. In Fig. 6.2 we show the problem and the solution.
Fig. 6.2: The problem of video sequences' synchronization.
To solve this problem we have implemented a three-thread structure: we have an acquisition thread for each camera and a synchronization thread, coordinated via software events. When one of the acquisition threads has a ready image, it advises the synchronization thread; we then have the following options:
• if during the previous update we sent a stereo couple to the elaboration, the new frame is stored, waiting for a frame received from the other camera to create a new stereo couple;
• if the previous frame was received from the other camera, a new stereo couple is ready for the elaboration;
• if the previous frame was received from the same camera, it is discarded.
The threads are synchronized using a couple of automatic reset software events. The sequence diagram is shown in Fig. 6.3.
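The hand-off can be sketched with the Win32 primitives named above as follows; the single pending slot per camera and the helper functions are simplifications of the actual bookkeeping, not the real PLaTHEA code.

#include <windows.h>

struct Frame { /* decompressed bitmap */ };

// Hypothetical helpers implemented by the acquisition threads.
Frame* takeFrameFrom(int camera);
void releaseFrame(Frame* f);
void emitStereoCouple(Frame* left, Frame* right); // wakes the elaboration

HANDLE frameReady[2];          // auto reset events, one per acquisition thread
Frame* pending[2] = { 0, 0 };  // latest undelivered frame per camera

void initEvents()
{
    for (int i = 0; i < 2; ++i)
        frameReady[i] = CreateEvent(NULL, FALSE /* auto reset */,
                                    FALSE /* initially unsignaled */, NULL);
}

// Body of the synchronization thread.
DWORD WINAPI syncThread(LPVOID)
{
    for (;;) {
        // Wake up as soon as either camera has a new frame ready.
        DWORD w = WaitForMultipleObjects(2, frameReady, FALSE, INFINITE);
        int cam = (int)(w - WAIT_OBJECT_0); // 0 = left, 1 = right
        Frame* f = takeFrameFrom(cam);
        if (pending[cam] != NULL)
            releaseFrame(pending[cam]); // same camera twice: trash the old one
        pending[cam] = f;
        if (pending[0] != NULL && pending[1] != NULL) {
            emitStereoCouple(pending[0], pending[1]);
            pending[0] = pending[1] = NULL;
        }
    }
}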
Our implementation shows very good synchronization results. The synchronization thread feeds the elaboration core thread, which does the hard work. The reader might think that these three threads produce a huge overhead; this is not true, because acquisition, decompression3 (done by the acquisition threads) and synchronization are very quick operations.
This solution also produces an advantage for the elaboration thread, because it can use its "time slot" only for the elaboration (and, of course, the visualization in the administrator interface).
3The JPEG decompression into a bitmap, and the conversion into the IplImage format, is done using LibJPEG 7.
Fig. 6.3: The sequence diagram for synchronization. The yellow block represents an ignored frame. The X represents a really short elaboration.
6.2.2 The Elaboration and the Face Recognition Threads
As already stated, the synchronization thread produces a sequence of stereo couples (at an average rate corresponding to the acquisition rate); this stereo sequence is the real input to the elaboration core. The synchronization thread provides the elaboration core with the full-size stereo couple and a 320x240 copy, and signals the presence of a new stereo couple, again using an automatic reset software event.
The elaboration core executes the following operations in sequence (steps 1-3 are sketched after the list):
1. it converts the BGR stereo couple provided by the synchronization thread into a gray scale clone; the OpenCV stereo correspondence algorithm in fact works only on gray scale images;
2. it applies the Sobel filter to the left image of the small-sized stereo couple; the filter is used for both horizontal and vertical borders, obtaining two matrices;
3. it computes the stereo correspondence between the gray scale versions of the images composing the current stereo couple; OpenCV executes this task using a parallel thread to obtain a speed-up;
4. it updates the background; this update requires the color version of the small-sized stereo couple and the two border matrices computed at step 2;
5. it applies the foreground segmentation algorithm; at this step we obtain a foreground matrix where white pixels represent the foreground;
6. it executes the tracking;
7. it signals to the face recognition thread the presence of a new full-size left frame. This is an optional operation: if the face recognition thread is not working (the previous face recognition task was completed during the last time slot), it can start to process another image (details on this later);
8. if the face recognition thread has finished analyzing a full-size left frame, obtaining a series of identities, we have to merge this information with that provided by the tracking.
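Steps 1-3 can be condensed as follows with the OpenCV 1.x C API; buffers are assumed preallocated, and the block matching state created with cvCreateStereoBMState carries default parameters rather than PLaTHEA's tuned values.

#include <opencv/cv.h>

// Steps 1-3 of the elaboration sequence (buffers preallocated elsewhere;
// bmState created once with cvCreateStereoBMState(CV_STEREO_BM_BASIC, 64)).
void preprocessAndMatch(IplImage* leftBGR, IplImage* rightBGR,
                        IplImage* leftGray, IplImage* rightGray,
                        IplImage* sobelX, IplImage* sobelY, // IPL_DEPTH_16S
                        CvMat* disparity16,                 // CV_16SC1
                        CvStereoBMState* bmState)
{
    // Step 1: gray scale clones for the correspondence algorithm.
    cvCvtColor(leftBGR, leftGray, CV_BGR2GRAY);
    cvCvtColor(rightBGR, rightGray, CV_BGR2GRAY);

    // Step 2: horizontal and vertical borders of the left image.
    cvSobel(leftGray, sobelX, 1, 0, 3);
    cvSobel(leftGray, sobelY, 0, 1, 3);

    // Step 3: SAD based block matching stereo correspondence.
    cvFindStereoCorrespondenceBM(leftGray, rightGray, disparity16, bmState);
}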
In the sequence above we have described two optional operations. Face recognition is a slow task: it may take more than a single time slot to complete; this is a huge problem, because good tracking requires a high sampling rate, able to capture all the little variations of a person's pose or height. The solution, shown in Fig. 6.1, is to compute tracking and face recognition in two separate threads of execution. Immediately after the elaboration core thread has finished the tracking, it checks the state of the face recognition thread:
• if the face recognition thread has ended, the elaboration core thread acquires the results and does the interpolation;
• if the face recognition thread has not ended, the interpolation is postponed to the next time slot.
We have already stated that, after recognition, the face recognition thread reprojects the face onto the plan view used for tracking. This implies that, when the face recognition thread receives a new frame, it also needs the information about the currently tracked persons; this is why in Fig. 6.1 the face recognition is started immediately after the tracking instead of at the top of the sequence4. Fig. 6.4 can help in understanding the mechanism.
As depicted in Fig. 6.1, the face recognition thread performs the following operations in sequence:
1. it performs the face detection using a gray scale version of the left frame of the full-size stereo couple provided by the elaboration core thread;
2. for each detected face, it tries to perform face recognition;
3. for each correctly identified face, it reprojects the center of mass of the face onto the plan view and searches for a tracked object at that position.
4Of course this is another good reason to allow face recognition to take more than a time slot; otherwise we would have to wait immediately for its ending.
Fig. 6.4: The sequence diagram shows the synchronization between the elaboration core thread and the face recognition thread. The arrows represent event signaling. The red boxes represent the time slots when the elaboration core thread provides the face recognition thread with new data. The yellow box represents the time slot when the elaboration core thread does the interpolation; due to this, that elaboration burst is a little longer than the others.
As already stated, the most expensive operation among those described in the sequence above is the face detection (of course this depends on the number of faces in the database, but as long as the database contains fewer than 100 faces the claim holds).
Our experience proves that the synchronization solution we obtained is very effective. The elaboration time of the elaboration core thread for each stereo couple increases only by a small amount; however, this solution makes the elaboration time more variable, with spikes corresponding to the interpolation time slots.
6.3 The UPnP Device
The UPnP Device represents, in some sense, the presentation layer of PLaTHEA for the interaction with external entities (UPnP control points). UPnP works on SOAP (like other kinds of Web Services), and all the interactions are managed for free by the UPnP library. What our UPnP device has to add is the support for periodical updates.
To this aim, our UPnP device is implemented as a thread that periodically changes the values of the variables corresponding to the periodic publish/subscribe services. The basic update period is 333 milliseconds (corresponding to the periodic service with three updates per second) and the other periods are derived from it.
This thread runs independently from the elaboration core thread, but depends on it for updates, as shown in Fig. 5.2; for this reason the UPnP device thread owns a copy of the tracked objects' data (including identities), whose access is protected by a single writer and multiple readers (SWMR) lock. We have different contenders for this lock:
• the UPnP Device thread wants to gain a reader lock to produce the XML updates for the subscribers;
• the elaboration core thread wants to gain a writer lock to update the information;
• the UPnP Device synchronous methods want to gain a reader lock to fulfill the requests.
Once again, a sequence diagram allows us to clarify the concept.
Fig. 6.5: The sequence diagram shows the synchronization via a SWMR (Single Writer and Multiple Readers) lock. Initially the UPnP Device thread acquires reader rights; when it finishes its task, the writer rights are granted to the elaboration core thread, which had requested them in advance and was waiting for them; finally, the UPnP library can acquire reader rights to fulfill a synchronous request.
The use of a SWMR object (the request manager is in fact the operating system) allows us to avoid data inconsistencies.
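One way to realize such a lock on Windows 7 is the native slim reader/writer (SRW) lock, sketched below; whether PLaTHEA uses SRW locks or a hand-written SWMR object is not stated here, so take this only as an illustration of the access pattern.

#include <windows.h>
#include <string>

SRWLOCK trackedLock = SRWLOCK_INIT;
std::string trackedXml; // copy of the tracked objects' data, serialized

// Elaboration core thread: exclusive (writer) access to publish new data.
void publishTrackedObjects(const std::string& xml)
{
    AcquireSRWLockExclusive(&trackedLock);
    trackedXml = xml;
    ReleaseSRWLockExclusive(&trackedLock);
}

// UPnP device thread and synchronous methods: shared (reader) access.
std::string readTrackedObjects()
{
    AcquireSRWLockShared(&trackedLock);
    std::string copy = trackedXml;
    ReleaseSRWLockShared(&trackedLock);
    return copy;
}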
6.3.1 The UPnP device descriptor
We have already introduced the interfaces exposed by the UPnP device; in fact, those interfaces were described from a client's point of view. Here we want to deepen how these methods are actually described and fed to the UPnP library. In Fig. 6.6 we have a device description following the standard described in [29]. This device exposes four services; three of these services are described by the same XML Service Description file (depicted in Fig. 6.8).
<?xml version="1.0" ?>
<root xmlns="urn:schemas-upnp-org:device-1-0">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <device>
    <deviceType>urn:sm4all-dis-sapienza:device:PLTSystem:1</deviceType>
    <friendlyName>People Localization Recognition and Tracking System</friendlyName>
    <manufacturer>Dipartimento di Informatica e Sistemistica Ruberti</manufacturer>
    <manufacturerURL>http://www.dis.uniroma1.it</manufacturerURL>
    <modelName>PLTSystem</modelName>
    <UDN>uuid:disPLTSystem</UDN>
    <serviceList>
      <service>
        <serviceType>urn:sm4all-dis-sapienza:service:mainService:1</serviceType>
        <serviceId>urn:sm4all-dis-sapienza:serviceId:mainService</serviceId>
        <SCPDURL>mainService.xml</SCPDURL>
        <controlURL>control</controlURL>
        <eventSubURL>mainServiceEvent</eventSubURL>
      </service>
      <service>
        <serviceType>urn:sm4all-dis-sapienza:service:genericAllDataService:1</serviceType>
        <serviceId>urn:sm4all-dis-sapienza:serviceId:threeFrames</serviceId>
        <SCPDURL>genericAllDataSubscription.xml</SCPDURL>
        <controlURL>control</controlURL>
        <eventSubURL>threeFramesServiceEvent</eventSubURL>
      </service>
      <service>
        <serviceType>urn:sm4all-dis-sapienza:service:genericAllDataService:1</serviceType>
        <serviceId>urn:sm4all-dis-sapienza:serviceId:tenFrames</serviceId>
        <SCPDURL>genericAllDataSubscription.xml</SCPDURL>
        <controlURL>control</controlURL>
        <eventSubURL>tenFramesServiceEvent</eventSubURL>
      </service>
      <service>
        <serviceType>urn:sm4all-dis-sapienza:service:genericAllDataService:1</serviceType>
        <serviceId>urn:sm4all-dis-sapienza:serviceId:fiftyFrames</serviceId>
        <SCPDURL>genericAllDataSubscription.xml</SCPDURL>
        <controlURL>control</controlURL>
        <eventSubURL>fiftyFramesServiceEvent</eventSubURL>
      </service>
    </serviceList>
    <presentationURL>/presentation</presentationURL>
  </device>
</root>
Fig. 6.6: The UPnP description for a device contains several pieces of vendor-specific information, definitions of all embedded devices, a URL for the presentation of the device, and listings for all services, including URLs for control and eventing.
The Main Service (depicted in Fig. 6.7) exposes all the synchronous methods (except the subscription methods) described at page 74. Moreover, it exposes the two evented variables used in the publish/subscribe interaction with a UPnP control point. When a control point subscribes to this service, it receives updates whenever these two variables change their values; in fact, in UPnP it is only possible to subscribe to an entire service and not to the
<?xml version="1.0"?>
<scpd xmlns="urn:schemas-upnp-org:service-1-0">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <actionList>
    <action>
      <name>GetListIDRegistered</name>
      <argumentList>
        <argument>
          <name>Result</name>
          <relatedStateVariable>Result</relatedStateVariable>
          <direction>out</direction>
        </argument>
      </argumentList>
    </action>
    <action>
      <name>GetRoomInfo</name>
      <argumentList>
        <argument>
          <name>Result</name>
          <relatedStateVariable>Result</relatedStateVariable>
          <direction>out</direction>
        </argument>
      </argumentList>
    </action>
    <action>
      <name>GetPositionFromPersonID</name>
      <argumentList>
        <argument>
          <name>PersonID</name>
          <relatedStateVariable>ID</relatedStateVariable>
          <direction>in</direction>
        </argument>
        <argument>
          <name>Result</name>
          <relatedStateVariable>Result</relatedStateVariable>
          <direction>out</direction>
        </argument>
      </argumentList>
    </action>
    <action>
      <name>GetPositionFromObjectID</name>
      <argumentList>
        <argument>
          <name>ObjectID</name>
          <relatedStateVariable>ID</relatedStateVariable>
          <direction>in</direction>
        </argument>
        <argument>
          <name>Result</name>
          <relatedStateVariable>Result</relatedStateVariable>
          <direction>out</direction>
        </argument>
      </argumentList>
    </action>
    <action>
      <name>GetAllPositions</name>
      <argumentList>
        <argument>
          <name>Result</name>
          <relatedStateVariable>Result</relatedStateVariable>
          <direction>out</direction>
        </argument>
      </argumentList>
    </action>
  </actionList>
  <serviceStateTable>
    <stateVariable sendEvents="no">
      <name>Result</name>
      <dataType>string</dataType>
    </stateVariable>
    <stateVariable sendEvents="no">
      <name>ID</name>
      <dataType>int</dataType>
    </stateVariable>
    <stateVariable sendEvents="yes">
      <name>notifyNewObject</name>
      <dataType>string</dataType>
    </stateVariable>
    <stateVariable sendEvents="yes">
      <name>notifyNewRecognizedObject</name>
      <dataType>string</dataType>
    </stateVariable>
  </serviceStateTable>
</scpd>
Fig. 6.7: The PLaTHEA Main Service description.
single variables; moreover, it is not possible to exercise control over the messages received.
<?xml version="1.0"?>
<scpd xmlns="urn:schemas-upnp-org:service-1-0">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <serviceStateTable>
    <stateVariable sendEvents="yes">
      <name>notifyAllFrames</name>
      <dataType>string</dataType>
    </stateVariable>
  </serviceStateTable>
</scpd>
Fig. 6.8: The Periodical Updating Service description.
The remaining three services provide periodic updates at different frequencies. We have selected the frequencies thinking of the use a control point can make of the position information:
• for a continuous control of the positions in the room, we suggest subscribing to the "three per second" or "one per second" service;
• for a milder control, we suggest the "one per five seconds" service.
Obviously the control point can filter the updates following other business rules, but this is out of the scope of the system.
Chapter 7
Tests and Performance Analysis
Contents
7.1 Tests on the PLT Sub-system . . . . . . . . . 101
7.1.1 Test Environment . . . . . . . . . . . . . . . . . . 102
7.1.2 Test Results . . . . . . . . . . . . . . . . . . . . . . 104
7.2 Tests on Face Recognition Sub-system . . . . . . 111
7.3 Computational Costs . . . . . . . . . . . . . . . . 114
7.1 Tests on the PLT Sub-system
In this section we analyze the performance of PLaTHEA's People Localization and Tracking subsystem. These tests satisfy two main needs:
• we want to measure the system's error on dynamic positions. The system shows very good performance on static measures, with an error of about 10 centimeters on all the axes; so we want to analyze the response of the system while it follows a moving subject;
• the errors have to be derived from a client's point of view, so we use a UPnP client to gather the system's measurements; this client is shown in Fig. 7.1.
Fig. 7.1: The UPnP client used in our tests. It shows PLaTHEA's asynchronous interface from the client's point of view. The updates contain information about the id of the tracked subject, with the eventual identity, and the area containing the subject, defined by two corner points of a rectangle. All the measures are in plan view coordinates.
7.1.1 Test Environment
Let us start with the test environment: we have to monitor an area of approximately 16 m², with a maximum distance from the stereo rig of 6 m.
The stereo rig is placed at a height of about 2.5 m from the floor and is pointed towards the entry door of the room (the farthest point to monitor) with an angle of about 45° with respect to the parallel to the floor.
For an installer, the first problem to solve is the choice of the baseline between the cameras: the bigger the area to monitor, the larger the baseline; a baseline of 19 cm is perfect to handle a range from 1 m to 6 m. Unfortunately, the larger the baseline, the bigger the chessboard needed for the calibration1, because in each stereo snapshot it has to be visible in both imagers, and its cells have to be big enough to avoid precision errors during the computation.
The error measurements are done with respect to three paths, designed with an increasing degree of complexity. These paths are followed by three persons with different heights and builds; they intersect the paths in different ways, helping us to obtain information about the sensitivity of the system to partial and total occlusions, and to the proximity to walls and inanimate objects.
1During our tests we used a chessboard printed on an A3 sheet. However, in our experience, the bigger the chessboard, the simpler it is to achieve a good calibration.
Fig. 7.2: The stereo rig used in our tests. Please note that we have oriented the cameras to obtain the best possible frontal parallel arrangement.
            Height (m)   Weight (kg)
Subject 1      1.67           50
Subject 2      1.73           70
Subject 3      1.80           85

Fig. 7.3: Information about the heights and builds of the subjects involved in the tests.
The paths are obtained by connecting a series of points; obviously, the points provided by the system do not have an exact correspondence with them. So the error for a given point provided by the system is defined as the Euclidean distance from the nearest point on the walk (of course, not the nearest in absolute terms, but the nearest according to the video sequence). For each walk we provide the maximum and average error with respect to the real positions occupied by the subject, and the minimum, maximum and average height measured for the subject.
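In symbols, if p is a position reported by the system and q(p) is the path point the subject actually occupied at that moment (chosen by inspecting the video sequence), the statistics reported below for a walk of N measurements are:

e(p) = \lVert p - q(p) \rVert_2, \qquad
e_{\max} = \max_{p} e(p), \qquad
e_{\mathrm{avg}} = \frac{1}{N} \sum_{p} e(p)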
The deployment of the test environment has been a perfect example of a system installation: we specified the room's features, we performed the stereo and the external calibration (for the latter, we placed special markers all around the room, providing for each of them the exact hand-measured position), and we spent some time tuning the settings in order to obtain the best performance from the system; this last task is maybe the hardest to execute, because it requires a lot of experience with the system, but it can improve the system's performance a lot.
Fig. 7.4: The paths used for the tests, in increasing order of complexity. Each cell of the red grid corresponds to an area with a side of 50 cm. The stereo rig is placed in the bottom right corner of the monitored area.
7.1.2 Test Results
We start with the simplest kind of test: each subject follows all the paths alone. Despite their simplicity, these tests give us much information about the dependence of the system's performance on the quality of the disparity map.
As already stated in the previous chapters, we use the disparity map to reproject the foreground pixels onto the floor, obtaining the so-called plan view maps. The algorithm provided with the OpenCV library is very fast, but it outputs very noisy disparity maps, which makes it harder to obtain high precision in a set of situations; for example, we can see from the results that the disparity map becomes more and more imprecise as the subject approaches the wall. The choice of the appropriate stereo correspondence algorithm is the result of a trade-off between quality and efficiency: we have to find an algorithm which works in a real-time fashion while providing, at the same time, precise disparity information at least for the moving objects (some stereo correspondence algorithms, such as [15], give very good results even on static objects with very low texture information, but are computationally very expensive).
(a) Subject 1 (b) Subject 2 (c) Subject 3
Fig. 7.5: The results for the ‘Blue’ Path.
(a) Subject 1 (b) Subject 2 (c) Subject 3
Fig. 7.6: The results for the ‘Green’ Path.
Before going ahead with the analysis, we describe the main elements of the graphics:
• the red grid is made up of cells which represent a real floor area with a side of 50 cm;
• the green path is the real path that the subjects have to follow;
• the blue points represent the positions provided by the system; for each of these points we give the tolerance on the position, shown as the extent of the rectangle centered on the corresponding point;
• in each figure we have drawn the height measurements, which the reader can compare with table 7.3.
The following table reports the error measurements for these first tests:

                     MAX Error   AVG Error
Subject 1   Blue       533.67      201.04
            Green      305.94      189.31
            White      622.90      230.80
Subject 2   Blue       580.00      241.67
            Green      532.54      272.76
            White      710.21      241.03
Subject 3   Blue       449.45      217.05
            Green      556.05      215.50
            White      442.72      213.83

Fig. 7.8: Error measurements (in mm) for the walks shown in Fig.s 7.5, 7.6, 7.7.
From table 7.8 we can derive the following considerations:
• the average error keeps approximately constant at about 20 cm;
• the system is subject to spikes in the measurements, mainly due to the disparity map problems mentioned above.
Taking into account the tolerance provided by the system itself, these first results make PLaTHEA suitable for domestic use.
Now we make our tests a little bit harder: two subjects follow two paths at the same time. We have done four tests of this kind, with an increasing degree of occlusion between the two subjects.
(a) Subject 1 on the ‘Blue’ path (b) Subject 3 on the ‘Green’ path
Fig. 7.9: In this test the two subjects experience the smallest degree of occlusion.
In Fig. 7.9, Subjects 1 and 3 perform two different paths with a small amount of occlusion, so the results are not far from those of the single walks shown earlier.
A little more complex is the scenario depicted in Fig. 7.10. Here, while Subject 3 performs the first part of the path, Subject 1 (who is a little bit smaller) generates a small degree of occlusion; the result is a worsening of the disparity map, which produces an error in the "sense of distance" of the system.
(a) Subject 1 on the ‘Green’ path (b) Subject 3 on the ‘Blue’ path
Fig. 7.10: In this test the two subjects experience a little more occlusion with respect to Fig. 7.9.
In Fig. 7.11 the reader can see not only that the two subjects experience the same kind of imprecision, but also that the system starts to provide a smaller set of measurements.
Finally, in Fig. 7.12 we have the worst case: here a whole part of the path followed by Subject 1 is not detected at all. The main cause of this event is that Subject 3 (who is much bulkier than Subject 1) generates an occlusion that completely denies the system the sight of Subject 1. This problem could be partially solved by placing the stereo rig near the ceiling (we placed the rig at a relatively small height)2.
We now summarize the error results in the following table. It is remarkable that the average errors shown in table 7.13 are very similar to those presented in table 7.8.
2 We will see in the concluding chapter that another solution to the problem is the use of two stereo systems placed at opposite corners of the room.
Fig. 7.11: In this test we have the third degree of occlusion; the system begins to lose precision. (a) Subject 1 on the ‘Blue’ path; (b) Subject 3 on the ‘Blue’ path.
Fig. 7.12: This is the greatest degree of occlusion; part of Subject 1’s walk is not even detected. (a) Subject 1 on the ‘Blue’ path; (b) Subject 3 on the ‘Green’ path.
            Subject     MAX Error   AVG Error
Degree 1    Subject 1      624.82      310.82
            Subject 3      536.66      244.80
Degree 2    Subject 1      568.51      258.35
            Subject 3      565.69      226.78
Degree 3    Subject 1      796.24      238.32
            Subject 3      501.20      230.13
Degree 4    Subject 1      386.26      202.83
            Subject 3      695.70      313.28

Fig. 7.13: Error measurements (in mm) for the walks shown in Figs. 7.9, 7.10, 7.11, 7.12. The data are ordered by increasing degree of occlusion. Note that the data for Degree 4 are in some sense misleading, because they do not account for the lost measurements.
Now we take a look at the most complex test. The three subjects follow three different paths at the same time; the reader may object that there can be more than three persons in a room, but this test is useful because the paths are very intricate, so it represents a very complex human interaction scene.
Fig. 7.14: All three subjects in the test room at the same time. (a) Subject 1; (b) Subject 2; (c) Subject 3.
In Fig. 7.14 we note the same problem shown in Fig. 7.12; in particular, Subject 1 experiences the same occlusion problem already seen. The following table summarizes the error measurements:

            MAX Error   AVG Error
Subject 1      388.33      212.45
Subject 2      490.31      194.13
Subject 3      772.79      302.22

Fig. 7.15: Error measurements (in mm) for the walks shown in Fig. 7.14.
Before closing this section we want to make some considerations on the height measurements. The data shown in the figures above show that the maximum detected height for each subject is close to the real height; unfortunately, the average detected heights show a relatively high error. The problem lies, as always, in the disparity map: during the walk the subjects assume poses that prevent the algorithm from computing disparity information for the head, merging it with the background disparity. This is not the first time we have had to face this problem; as already stated, after the recognition the face recognition thread has to associate the identity with a tracked object; this operation involves the use of disparity information for reprojection onto the plan view, but the pixels of the face often do not have an associated disparity; in this case we therefore move the y coordinate down from the face to the chest, which in the vast majority of cases has an associated disparity.
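As an illustration of this fallback, here is a minimal C++ sketch (our own, not the PLaTHEA source; it assumes a CV_32F disparity map where invalid pixels hold 0, and the 5-pixel step is an arbitrary choice):

#include <opencv2/core.hpp>

// A face centred at (x, y) often has no valid disparity; probe progressively
// lower rows (towards the chest) until a valid value is found.
float faceDisparity(const cv::Mat& disparity, int x, int y) {
    for (int row = y; row < disparity.rows; row += 5) {  // step down towards the chest
        const float d = disparity.at<float>(row, x);
        if (d > 0.0f)
            return d;                                    // first valid disparity wins
    }
    return 0.0f;                                         // nothing valid below the face
}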
7.2 Tests on Face Recognition Sub-system
The reader may wonder why we did not test the face recognition subsystem using the same UPnP client used for the PLT tests; the reason is, in some sense, a technological matter: the cameras used in PLaTHEA's development are entry-level cameras with a maximum resolution of 640x480 pixels3. At this resolution, a face placed at 4 meters from the stereo rig appears on the left imager in an area of about 40x40 pixels, which is not enough for face recognition. The system correctly associates the face with a tracked object, but it is not able to assign the correct identity to it; our test with a face database containing 5 users gives a hit rate of about 30%, which is only a little better than a random choice; so we postpone this test to the future, when high definition cameras will be available (we need, approximately, a horizontal resolution of 1920 pixels).
3 The system acquires the video sequences from the stereo rig at this resolution, but they are downsampled to 320x240 for the PLT subsystem. The idea is to acquire from the cameras at the maximum resolution available in order to perform face recognition, resizing the frames for the tracking task.
However, we want to analyze the results that the face recognition technique based on SIFT features has shown as a standalone system. To this aim we have developed a test application that uses a simple webcam as video source. This application is shown in Fig. 7.16.
Fig. 7.16: The test application for the face recognition system.
The test application works exactly like the face recognition subsystem included in PLaTHEA, with the exception that the video source is not the left camera of the stereo rig. Recapitulating, it performs the following actions (a code sketch follows the list):
1. it performs face detection using the Viola-Jones classifier;
2. for each face detected, it matches the SIFT features in that image against the SIFT features of the images in the database;
3. for each user in the database, it assigns a score by summing all the matches over each image;
4. the face is assigned to the user with the maximum score, using a threshold to declare this association valid.
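The following C++ sketch reconstructs steps 2-4 (it is our own illustration, not the PLaTHEA source; it assumes OpenCV 4.x, where SIFT is available as cv::SIFT, and the Lowe ratio test used to discard ambiguous matches is our addition; FaceDb, countMatches and recognize are hypothetical names):

#include <map>
#include <string>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>

// Hypothetical container: for each user, one SIFT descriptor matrix per
// training image (10 images per user in our setup).
using FaceDb = std::map<std::string, std::vector<cv::Mat>>;

// Step 2: count the "good" SIFT matches between a detected face and one
// training image; Lowe's ratio test discards ambiguous correspondences.
static int countMatches(const cv::Mat& faceDesc, const cv::Mat& trainDesc) {
    if (faceDesc.empty() || trainDesc.empty()) return 0;
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(faceDesc, trainDesc, knn, 2);
    int good = 0;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.8f * m[1].distance)
            ++good;
    return good;
}

// Steps 3-4: sum the matches over every training image of every user and
// accept the best user only if the score reaches the threshold (we tuned
// the system with a threshold of 20 features).
std::string recognize(const cv::Mat& faceGray, const FaceDb& db,
                      int threshold = 20) {
    auto sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;
    sift->detectAndCompute(faceGray, cv::noArray(), keypoints, descriptors);

    std::string best = "unknown";
    int bestScore = 0;
    for (const auto& [user, images] : db) {
        int score = 0;
        for (const cv::Mat& trainDesc : images)
            score += countMatches(descriptors, trainDesc);
        if (score > bestScore) { bestScore = score; best = user; }
    }
    return bestScore >= threshold ? best : "unknown";
}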
The training set for the system is made up of 50 images (10 images for each of the 5 users; the training set for a single user is shown in Fig. 7.17); this training set may seem tiny, but we designed it with the final use of the system in mind: the domestic environment.
Fig. 7.17: The training set for a single user. We have a series of frontal poses; the profile poses are not useful because the Viola-Jones classifier is trained only for frontal faces.
Now, the face recognition system has two main constraints, which correspond to two test sets:
• if the face detected in the current frame corresponds to a user in the database, the system has to correctly indicate it;
• if the face detected in the current frame does not correspond to any user in the database, the system should detect this situation.
The first test set contains a series of 60 images of the users contained in the database; the system (tuned with a threshold of only 20 features) has shown very good results, which we recapitulate in table 7.18. The results are very interesting also because the test set contains all kinds of unusual poses, which explains the two non-hit cases.
The second test set is interesting because its results can be used to tune the threshold of the system; it is made up of 60 images of users not present in the database; by looking at the maximum and at the average score we can choose the best threshold for the system (for instance, any threshold above the maximum impostor score in table 7.19 would reject every unknown face in this test set).
              Number   AVG Score
Correct           58          69
No Answer          1          10
Not Correct        1          32

Fig. 7.18: The results for the first test set.
MAX Score   AVG Score
       38          25

Fig. 7.19: The results for the second test set.
While performing the tests we noted the importance of the way in which the training set is built. It is important that every user stores the same number of images in the database and that the whole training set is shot in good light conditions. We observed that in low light conditions the system is in fact not able to extract enough SIFT features for the recognition. The worst case occurs when the users' training sets have different illumination conditions; in this case the performance of the system degrades, with frequent mismatches.
7.3 Computational Costs
We close the chapter with some considerations about the computational cost of the system. Vision systems are computationally expensive, particularly as regards the stereo correspondence algorithm and the face recognition system as a whole.
For good tracking, as already stated, we need to process at least 10 frames per second; this means that the system has 100 ms to analyze a stereo snapshot. We have already seen that to this aim we need a second processing thread, working in parallel with the main processing thread, for face recognition; this parallel thread occupies 2-3 time slots to perform a single computation. In Fig. 7.20 we have a graph that explains the situation; the processing unit is a Toshiba Satellite laptop equipped with an Intel Core 2 Duo CPU at 2.53 GHz and 4 GB of RAM.
Fig. 7.20: The computation times of PLaTHEA. The horizontal axis indicates the stereo frame number; the vertical axis unit of measurement is ms.
We have already stated that isolating the network traffic generated by the cameras is very important; in fact the two Axis 207 cameras, with two video sequences composed of 640x480 frames at a rate of 10 frames per second, generate a traffic that uses 8% of the 100 Mbit Ethernet bandwidth (about 8 Mbit/s, i.e. roughly 50 KB per compressed frame per camera). The traffic will increase in the future, when high definition cameras are used for face recognition.
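To make the interaction between the two threads concrete, here is a minimal C++ sketch (illustrative only; the PLaTHEA implementation is Windows-based, cf. [22], while LatestFrameMailbox and StereoFrame are hypothetical names) of a "latest frame only" hand-off: the tracking loop posts every frame, and the slower recognition thread always consumes the most recent one, so that a single recognition may safely span 2-3 frame slots.

#include <condition_variable>
#include <mutex>
#include <optional>
#include <utility>

struct StereoFrame { /* left/right images, timestamp, ... */ };

class LatestFrameMailbox {
    std::mutex m_;
    std::condition_variable cv_;
    std::optional<StereoFrame> slot_;
public:
    void post(StereoFrame f) {                 // called by the tracking thread
        {
            std::lock_guard<std::mutex> lock(m_);
            slot_ = std::move(f);              // overwrite any unconsumed frame
        }
        cv_.notify_one();
    }
    StereoFrame take() {                       // called by the recognition thread
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return slot_.has_value(); });
        StereoFrame f = std::move(*slot_);
        slot_.reset();                         // stale frames are never processed
        return f;
    }
};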
Chapter 8
Conclusions and Future Work
Contents
8.1 Considerations on Vision Systems
8.2 Future Work
8.1 Considerations on Vision Systems
The development of PLaTHEA has been an interesting exploration of tec-
niques in the field of computer vision; the result of this effort has been the
opinion that vision based systems have very good chances to become the
standard for people localization, recognition and tracking in domestic envi-
ronment.
However, PLaTHEA itself suffers from the typical problems of vision systems; we list the most painful drawbacks:
• as we saw in the test chapter, occlusions make it harder to detect the presence of a person; in some cases, if the occluding and the occluded persons are not too close to each other, this problem may be solved by placing the stereo rig at an adequate height; problems related to occlusions become more and more noticeable as the room gets crowded;
• using the illumination and chromaticity components of pixel colours we have solved some of the problems deriving from changes in light conditions; the system is, however, very sensitive to strong illumination changes such as those caused by hotspots;
• deeply related to the previous item, the system works badly (or does not work at all) in low light conditions; as already stated in the first chapter, one of SM4All's goals is exploring the potential of infrared for night vision; however, the techniques explored during the design of PLaTHEA rely heavily on color information, while infrared vision is inherently grayscale;
• last but not least, the system is robust enough not to include in the background model relatively static persons (for example, a person seated while studying), but it has shown problems with particularly static body parts (such as legs) and particularly static bodies (such as a sleeping person); in these situations the inclusion in the background is only a matter of time; tuning the system to support such cases is very difficult, because it would cause a loss in the adaptivity properties of the background model.
The good news is that PLaTHEA's code is highly modular, so it is easily extensible; this was a necessary feature for our code, because during the development we changed our strategy many times.
In spite of “popular belief”, face recognition, using appropriate cameras, is an easier task in a domestic environment; in fact, differently from other scenarios where face recognition may be useful (banks, airports, stations or markets), a set of constraints may be neglected:
• the training set is very limited: only the house's inhabitants have to store their faces in the repository, and this fact allows the use of simple algorithms such as SIFT-based ones;
• in more delicate scenarios, the humans that have to be recognized do not want to be identified (think of criminals, terrorists, dishonest employees and so on).
As we have seen, the other main family of PLT systems is that of marker-based systems; in this area the most widespread technology is RFID (Radio Frequency IDentification); this is a very good alternative to computer vision: it removes some of the problems mentioned above, but it introduces new problems related to radio transmissions. A good example of such a system is given in [21].
8.2 Future Work
We should rather talk of ‘immediate future’ work; in the previous section we introduced a set of problems to solve; however, PLaTHEA is not a research project in the strict sense of the term; rather, we aim to obtain the best from the tools produced by the research world.
Following these intentions, the first goal to reach is the improvement and completion of the face recognition subsystem; as we have already stated, the problem with face recognition is only a matter of resolution; this means that we have roughly three alternatives:
• we can use a pair of high resolution cameras for stereo vision, replacing the Axis 207 cameras;
• we can keep only one Axis 207 camera, coupling it with a high resolution camera; obviously this solution introduces a set of problems related to calibration, rectification and stereo correspondence, but it reduces the economic cost of the system (remember that, in our opinion, in a not so distant future we will have a PLaTHEA peer in each room of every house);
• we can add a third camera provided with optical zoom, pointed at a strategic area of the room (the door area); this solution creates problems for the association of faces with tracked subjects; moreover, it seems to be a less attractive idea.
The second main goal is the choice and implementation of a better stereo correspondence algorithm; the vast majority of the system's precision problems derive from the loss of precision in the disparity map. The choice of the stereo correspondence algorithm is a trade-off between quality and speed; the algorithm has to be fast enough to allow for real-time processing of stereo frames, but it has to be precise enough to avoid excessive errors during the localization phase.
In the third place, we want to improve the tracking algorithm by adding a Bayesian network to drive the state changes of the tracked objects.
The previous goals can be considered ‘immediate future’ work; one problem of stereo vision systems, which we have ignored until now, is that the field of view of a stereo camera does not cover a whole room; this is not a trivial problem. We have not faced it at all, but in the future we plan to use two pairs of cameras, with the two stereo rigs at opposite corners of a room; this solution can also help to resolve some occlusion situations.
Bibliography
[1] Home automation. Wikipedia entry: http://en.wikipedia.org/wiki/Home_automation.
[2] Jeffrey S. Beis and David G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Proc. IEEE Conf. Comp. Vision Patt. Recog., pages 1000–1006, 1997.
[3] J.-Y. Bouguet. Camera calibration toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/index.html, 2008.
[4] Gary Bradski and Adrian Kaehler. Learning OpenCV. O’Reilly, 2009.
[5] Rita Cucchiara, Costantino Grana, Massimo Piccardi, and Andrea Prati.
Detecting moving objects, ghosts and shadows in video streams. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 25:1337–
1342, 2003.
[6] M. Mecella et al. SM4All – Architecture. http://www.sm4all-project.eu/index.php/activities/deliverables.html, November 2009.
[7] J. G. Fryer and D. C. Brown. Lens distortion for close-range photogram-
metry. Photogrammetric Engineering and Remote Sensing, 52:51–58,
1986.
[8] R. Hartley and A. Zisserman. Multiple View Geometry in Computer
Vision. Cambridge University Press, 2006.
[9] Michael Harville. A framework for high-level feedback to adaptive, per-
pixel, mixture-of-gaussian background models. In European Conference
on Computer Vision, 2002.
[10] Michael Harville. Stereo person tracking with adaptive plan-view tem-
plates of height and occupancy statistics. Image and Vision Computing,
22:127–142, 2004.
[11] Thanarat Horprasert, David Harwood, and Larry S. Davis. A robust background subtraction and shadow detection. In Proceedings of the Asian Conference on Computer Vision, 2000.
[12] Jun Luo, Y. Ma, E. Takikawa, S. Lao, M. Kawade, and Bao-Liang Lu. Person-specific SIFT features for face recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2007.
[13] R. E. Kalman. A new approach to linear filtering and prediction prob-
lems. Journal of Basic Engineering, 22, 1960.
[14] Kyungnam Kim, Thanarat H. Chalidabhongse, David Harwood, and Larry Davis. Background modeling and subtraction by codebook construction. In International Conference on Image Processing, pages 3061–3064, 2004.
[15] Vladimir Kolmogorov, Ramin Zabih, and Steven Gortler. Generalized multi-camera scene reconstruction using graph cuts. In Proceedings of the International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 501–516, 2003.
[16] K. Konolige. Small vision system: Hardware and implementation. In
Proceedings of the International Symposium on Robotics Research, pages
111–116, 1997.
[17] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.
[18] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[19] M. Bicego, A. Lagorio, E. Grosso, and M. Tistarelli. On the use of SIFT features for face authentication. In Computer Vision and Pattern Recognition Workshop, 2006.
[20] Ara V. Nefian and Monson H. Hayes III. A hidden Markov model-based approach for face detection and recognition, 1998.
[21] L.M. Ni, Y. Liu, Y.C. Lau, and A.P. Patil. LANDMARC: indoor location sensing using active RFID. In Proc. of PerCom, pages 407–415, 2003.
[22] Jeffrey Richter and Christophe Nasarre. Windows via C/C++. Microsoft
Press, 2008.
[23] S. Bahadori, L. Iocchi, G.R. Leone, D. Nardi, and L. Scozzafava. Real-
time people localization and tracking through fixed stereo vision. In
International Conference on Industrial & Engineering Applications of
Artificial Intelligence & Expert Systems, 2005.
[24] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of
dense two-frame stereo correspondence algorithms. International Jour-
nal of Computer Vision, 47:7–42, 2002.
[25] Luigi Scozzafava. Localizzazione e tracciamento di persone e robot at-
traverso la stereo visione. Master’s thesis, University of Rome Sapienza,
2003.
[26] SM4All Partners. SM4All - Description of work, March 2008.
[27] R. Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 3:323–344, 1987.
[28] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal
of Cognitive Neuroscience, 3(1):71–86, 1991.
[29] UPnP Forum. UPnP Device Architecture 1.0, 2008.
[30] P. Viola and M. J. Jones. Robust real-time face detection. International
Journal of Computer Vision, 57, 2004.
[31] Z. Zhang. A flexible new technique for camera calibration. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 22:1330–1334,
2000.
[32] W. Zhao, R. Chellappa, A. Rosenfeld, and P.J. Phillips. Face recogni-
tion: A literature survey. ACM Computing Surveys, 2003.
Acknowledgments
Writing this thesis engaged my mind and body for many months, during which I neglected many people; my first thanks therefore go to my family, who patiently put up with my absences and my delays, and especially to grandpa Ciccio and grandma Titina: I promise you a more present grandson in the future.
From the moment I enrolled at university, if there is one person who waited for my calls every evening (and who very often had to scold me for this reason), it is grandma Maria: how many times in my life have I called you “mom”?
No professor has ever judged me as harshly as uncle Gianfranco (and no employer, I think, will ever be as demanding); I owe to you my tinkerer's inclinations (and the thousand childhood photos).
And how could I forget the friends of a lifetime: Andrea, first a fellow student, then a housemate and a companion in Roman explorations; Marcello, whom I have known since primary school (and who remembers everything about that period); Ettore, who was the first to get me to stay out until late in the evening and who had the honor of giving my system its name; Asish, whose nickname I cannot report; Luigi, who in a short time earned my affection; Giuseppe, Adriano and Emilio, who have always put up with this not very present friend.
My university studies, moreover, would not have been the same without the company of Enzo, Pasquale and Valerio; how many exams did we prepare together? How many laughs did we have? With you I shared these unforgettable years.
And finally...
...the closing credits are for my Donatella; you have been a friend, a colleague and now a partner; in these months you have endured the moods of a soul in torment, giving me the strength to react in the difficult moments; I want to write all the pages that follow together with you...