
Sapienza Università degli Studi di Roma

Facoltà di Ingegneria

Master's Thesis in Computer Engineering

A Service-Based People Localization and Tracking System for Domotic Applications

Advisor: Ing. Massimo Mecella
Candidate: Francesco Leotta

Academic Year 2008/2009


to my mother, for the tireless love with which she raised me

to my father, for teaching me that respect is earned through work

and to my brother, for always being a confidant and a friend


Extended Abstract

The term domotics denotes the set of emerging practices aimed at developing a high degree of automation of the services offered by a home.

The goals of this thesis, developed within the SM4All project, were the study, design, and implementation of a system for localizing, recognizing, and tracking the subjects present inside a home. This functionality is fundamental to a complete home automation system that allows the definition of "scenes" through which users can customize the "behavior" of the house as a function of their position (exploiting the possibilities offered by modern installations and appliances).

PLT (People Localization and Tracking) systems fall into two categories, depending on whether or not they make use of markers; for the design of our system PLaTHEA (People Localization and Tracking for HomE Automation) we chose the marker-free approach, considering markers a source of physical discomfort and psychological conditioning for the users. This choice implies the use of techniques for analyzing video sequences of the scene to be monitored; from this stream of images the system must extract the physical coordinates of the tracked subjects, which presupposes a sense of depth obtainable only with two (or more) cameras; this process goes under the name of stereo vision and is the subject of Chapter 2.

Monitoring an entire home requires installing in every room a PLaTHEA peer composed of two network cameras (placed in the upper corner opposite the entrance door), a network switch, and a computer that runs the developed software. The typical wiring of a home served by PLaTHEA is therefore the one shown in Fig. 1.

Fig. 1: Two PLaTHEA peers deployed inside a home. For each room we have the basic elements of the installation.

Each of these peers provides the clients (responsible for the composition of the domestic services) with information on the identities and positions of the tracked subjects through a service-based network interface built on the UPnP (Universal Plug and Play) protocol; this information can be requested in synchronous mode (through blocking service calls) or in asynchronous mode (exploiting a publish/subscribe interaction model). The component model in Fig. 2 details the interactions involving a PLaTHEA peer.

Fig. 2: The components of PLaTHEA with their responsibilities and dependencies.

The "data layer" of the system consists of the face database (used for recognition) and of the database containing the calibration information of the cameras (which are, as said, the main data sources); in particular, in order to start the system, the following operations must be performed on each peer:

1. the internal calibration of the two cameras, which yields the parameters that describe their behavior (ideal and with distortion);

2. the stereo calibration, which describes the relative position of one camera with respect to the other;

3. the external calibration, which describes the relation between the coordinate system centered in the left camera and the coordinate system of the scene.

The face database can be populated at any time, and this is not necessary for the operation of the system as a simple PLT system¹.

¹ In PLT systems, person recognition is not always present.

The heart of the system is the component labeled Elaboration Core, whose architecture is shown in Fig. 3. This component consists of five processing threads synchronized by means of software events.

Fig. 3: The Elaboration Core component in detail.

The video sequences (in MJPEG format) are acquired from the two cameras independently (by two separate threads, which also decompress the compressed format); they have the same nominal frame rate, but it often varies because of automatic illumination compensation; this causes an asynchrony between the two sequences that must be resolved by a dedicated module called the synchronizer, which selects the frames to discard in each sequence so as to keep the two in sync, and outputs a sequence of frame pairs to be used for stereo vision (the synchronizer, which runs on an independent thread, also removes the distortion from both images and rectifies them).
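To make the synchronizer's policy concrete, here is a minimal Python sketch of timestamp-based pair selection (the queue names and the 50 ms tolerance are hypothetical; the actual module is a native thread of the Elaboration Core and also undistorts and rectifies the frames):

```python
from collections import deque

MAX_SKEW = 0.05  # hypothetical tolerance between timestamps, in seconds

left_queue: deque = deque()   # (timestamp, frame) pairs from the left camera
right_queue: deque = deque()  # (timestamp, frame) pairs from the right camera

def synchronized_pairs():
    """Yield (left, right) frame pairs whose timestamps differ by at most
    MAX_SKEW; whenever the two streams drift, the older frame is discarded."""
    while left_queue and right_queue:
        tl, left = left_queue[0]
        tr, right = right_queue[0]
        if abs(tl - tr) <= MAX_SKEW:
            left_queue.popleft()
            right_queue.popleft()
            yield left, right        # a usable stereo pair
        elif tl < tr:
            left_queue.popleft()     # left frame has no usable partner: drop
        else:
            right_queue.popleft()    # right frame has no usable partner: drop
```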

Localizing and tracking subjects requires the ability to distinguish them from everything in the reference scene that is static, which goes under the name of background; the background model must:

• allow a simple extraction of the foreground, that is, of all the mobile agents in the scene;

• be as insensitive as possible to sudden changes in the illumination of the scene (which may occur when a light source is switched on, a curtain is moved, and so on), but also to shadows, which in many cases produce false positives;

• be time-adaptive, so as to accommodate changes in the background (the displacement of furniture, furnishings, and clothes, for example).

The desire to satisfy these constraints led to the analysis and experimentation of a multitude of published results, which finally resulted in the development of a hybrid method that, by combining the techniques presented in [23], [11] and [5], solves the problems related to the hardware used (the Axis 207 cameras and their extreme sensitivity to illumination) and to the operating environment (the domestic one, with its great dynamism).
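The hybrid model itself is developed in Chapter 4 and is not reproduced here; as a rough stand-in that satisfies the adaptivity and shadow requirements listed above, OpenCV ships an off-the-shelf adaptive subtractor (MOG2). The sketch below uses it with made-up parameters and file name; it is explicitly not the method developed in this thesis:

```python
import cv2

# Stand-in for the thesis's hybrid background model: MOG2 is time-adaptive
# and labels shadows separately, two of the requirements listed above.
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)

capture = cv2.VideoCapture("room.avi")   # hypothetical recorded sequence
while True:
    ok, frame = capture.read()
    if not ok:
        break
    mask = subtractor.apply(frame)       # 255 = foreground, 127 = shadow
    foreground = (mask == 255)           # boolean mask without shadow pixels
```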

Once the foreground pixels of the scene have been obtained, they must be projected into a set of three-dimensional coordinates. As said, stereo vision comes to our aid in this task; we need, however, an algorithm that efficiently matches the points in the left image with those in the right image; since the two images are rectified and undistorted, corresponding points lie on the same line (the so-called epipolar line) and therefore differ only in their x coordinate; this difference is called disparity. The algorithm used is of the SAD (Sum of Absolute Differences) type, as implemented in the OpenCV library (inspired by [16]); it is very efficient (the precision obtained is lower than that of other, more expensive methods, but it is still sufficient for our purposes).
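As an illustration, OpenCV exposes this SAD-based block matcher as StereoBM; a minimal sketch on an already rectified, undistorted pair (file names and parameters are made up):

```python
import cv2

left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16; blockSize is the SAD window side
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right)        # fixed point, scaled by 16
disparity = disparity.astype("float32") / 16.0  # back to pixel units
```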

In PLT systems it is common practice to perform tracking using information extracted from a simulated top view, obtained by transforming the three-dimensional coordinates of the foreground pixels with respect to the camera reference system (computed using the disparity map of these pixels) into three-dimensional coordinates with respect to a reference system attached to the room (the rotation matrix and the translation vector needed for this are obtained during the external calibration). The algorithm for the projection and the identification of the subjects is taken from [10], but we prefer the use of a contour detector for the identification of the tracking candidates. Fig. 4 shows the background model, the extracted foreground, and the top view of the subject.

Fig. 4: Example of the identification of a subject. The foreground pixels are extracted using the background as a reference. A simulated top view is then created, in which the subject is identified.

When a subject is identified, it is assigned a position, a velocity, its maximum and average heights, and a color template. The objects identified at time t are compared with those tracked at time t − 1 (whose current position is predicted using the stored velocity) on the basis of the color template (a technique taken from [25]), of the positions, and of the average heights. The objects identified at time t for which no correspondence could be found among the objects tracked at time t − 1 become new tracked objects (with zero initial velocity).
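A much-simplified sketch of this correspondence step (the data structures, the additive score and the threshold are illustrative assumptions, not the thesis's actual criteria):

```python
import numpy as np

def score(track, candidate, dt):
    """Lower is better: distance from the velocity-predicted position,
    plus differences in average height and colour template."""
    predicted = track["pos"] + track["vel"] * dt
    return (np.linalg.norm(candidate["pos"] - predicted)
            + abs(candidate["avg_height"] - track["avg_height"])
            + np.linalg.norm(candidate["color"] - track["color"]))

def update_tracks(tracks, candidates, dt, threshold=1.5):
    new_tracks = []
    for cand in candidates:
        best = min(tracks, key=lambda t: score(t, cand, dt), default=None)
        if best is not None and score(best, cand, dt) < threshold:
            best["vel"] = (cand["pos"] - best["pos"]) / dt   # refresh velocity
            best.update(pos=cand["pos"], color=cand["color"],
                        avg_height=cand["avg_height"])
        else:                               # no match among tracks at t-1:
            cand["vel"] = np.zeros(2)       # new object, zero initial velocity
            new_tracks.append(cand)
    return tracks + new_tracks
```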

Face recognition is carried out by an independent thread; the reason for this choice is that it is an expensive operation (the larger the face database, the longer it takes), so running it in sequence with the other operations would lower the frame rate handled by the system, making it unsuitable for tracking. For tracking purposes the system should in fact maintain an average frame rate of 10 frames per second (that is, a processing time of 100 ms for each stereo pair of images).

For each high-resolution frame supplied by the main elaboration thread, the face recognition thread must perform the following operations (a sketch follows the list):

1. perform face detection, that is, the extraction of the regions of the frame containing faces; for this the system uses a Viola-Jones classifier;

2. for each detected face, perform recognition by matching its SIFT features against each of the faces in the database; the score of a person is obtained by summing all the matches of the face under consideration in the current frame with all the faces registered for that person;

3. reproject the face to the ground, obtaining the corresponding tracked subject.
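A minimal sketch of steps 1 and 2 using OpenCV's Viola-Jones cascade and SIFT matcher (the database layout, the parameters, and the ratio test are illustrative assumptions, not the thesis's exact code):

```python
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def recognize_faces(frame_gray, face_db):
    """face_db: {person: [SIFT descriptors of each enrolled face]} (assumed
    non-empty). Returns a (bounding box, best-scoring person) per face."""
    results = []
    for (x, y, w, h) in detector.detectMultiScale(frame_gray, 1.1, 5):
        _, desc = sift.detectAndCompute(frame_gray[y:y + h, x:x + w], None)
        if desc is None:
            continue
        scores = {}
        for person, enrolled in face_db.items():
            good = 0
            for d in enrolled:              # sum matches over all the faces
                for pair in matcher.knnMatch(desc, d, k=2):
                    if len(pair) == 2 and \
                       pair[0].distance < 0.75 * pair[1].distance:
                        good += 1           # Lowe's ratio test
            scores[person] = good
        results.append(((x, y, w, h), max(scores, key=scores.get)))
    return results
```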

The main elaboration thread supplies the face recognition thread with the high-resolution images and with the information needed to reproject the faces to the ground, and it queries the thread at every time slot:

• if the face recognition thread has finished, the information it provides is attached to the tracking information. In particular, if a tracked object is associated with the same identity three consecutive times, that identity is assigned to the object;

• if the face recognition thread has not completed its operations, the main thread moves on to tracking on the next stereo pair supplied by the synchronizer.

Great importance was given in the thesis to the installation phase. The administration GUI provides simple tools to perform all the operations needed for this purpose.

Below we give a brief description of each chapter of the thesis:

Chapter 1 This chapter contains a brief introduction to domotics and to how it can change the way we conceive of living at home. The SM4All project is placed in this context, providing the motivation for the development of this thesis.

Chapter 2 In this chapter the reader is given the background knowledge, about the camera model and stereo vision, that will be useful in the following chapters.

Chapter 3 In this chapter we analyze the state of the art in the field of PLT and face recognition systems. This is a continuously evolving research area. The two macro-problems are decomposed, and for each aspect the pros and cons of practices and algorithms are described. The chapter ends with a gallery of research projects in the field of PLT systems.

Chapter 4 In this chapter we describe how PLaTHEA uses and combines the techniques introduced in Chapter 3, and which original solutions we devised to solve the problems related to the use of network cameras and to the sequencing of the operations.

Chapter 5 This chapter starts by analyzing the goals of the system. A detailed description of the architecture follows: which technologies are exploited, how the system is deployed, and so on.

Chapter 6 This chapter deepens the system architecture, analyzing implementation problems, possible solutions and, among these, the ones adopted.

Chapter 7 This chapter starts by describing the test cases; from these, the areas in which the system can be improved are deduced. The computation times and the resources consumed by PLaTHEA are also analyzed.

Chapter 8 This chapter concludes the thesis by introducing possible future works aimed at improving the system's performance and at extending its functionality.


Contents

1 Reference Context
  1.1 Introduction to Domotics
    1.1.1 Commercial Platforms for Home Automation
    1.1.2 Architectures for Domotic Systems
  1.2 The SM4All Project
    1.2.1 The Pervasive Layer
    1.2.2 The Composition Layer and the Need of a PLT System
    1.2.3 The User Layer
  1.3 Structure of the Thesis

2 Camera Model and Stereo Vision
  2.1 Camera Pinhole Model and Camera Calibration
    2.1.1 Lens Distortion
    2.1.2 Camera Calibration
  2.2 Stereo Vision
    2.2.1 Triangulation
    2.2.2 Stereo Calibration
    2.2.3 Stereo Rectification

3 A Survey on the State of the Art
  3.1 Introduction to PLT Systems
  3.2 Typical Structure of a Stereo PLT System
    3.2.1 Stereo Computation Module
    3.2.2 Background Modeling and Foreground Segmentation Modules
    3.2.3 Plan View Projection Module
    3.2.4 Tracker Module
  3.3 Face Recognition
    3.3.1 Face Detection
    3.3.2 Face Recognition
  3.4 Projects around the World
    3.4.1 LocON Project
    3.4.2 Gator Tech Smart House Project
    3.4.3 ARGOS Project
    3.4.4 RoboCare Project

4 Our System and Related Works
  4.1 Background Modeling and Foreground Segmentation
    4.1.1 The Background Model
    4.1.2 Foreground Segmentation
    4.1.3 Foreground Refinements
  4.2 Plan View Projection and Tracking
    4.2.1 Localization
    4.2.2 Correspondence
    4.2.3 Refinements
  4.3 Face Recognition
    4.3.1 Notes on Face Detection
  4.4 Tracking and Face Recognition Combined

5 System Requirements and Architecture
  5.1 Overview on System Requirements
  5.2 A Look at the Architecture
    5.2.1 Embedding PLaTHEA
    5.2.2 The Components' Architecture
    5.2.3 The Software Dependencies
  5.3 The Storage
    5.3.1 The Camera Calibration Database
    5.3.2 The Face Database
  5.4 The Elaboration Core
  5.5 The UPnP Device
  5.6 The External Entities
  5.7 Use Cases
    5.7.1 Installation and Configuration
    5.7.2 Run Time Installation Refinements
    5.7.3 The Face Database Construction
    5.7.4 Run Time Use Cases

6 Implementation Details
  6.1 Technological Introduction
  6.2 The Elaboration Core Component
    6.2.1 Video Acquisition and Synchronization
    6.2.2 The Elaboration and the Face Recognition Threads
  6.3 The UPnP Device
    6.3.1 The UPnP Device Descriptor

7 Tests and Performance Analysis
  7.1 Tests on the PLT Sub-system
    7.1.1 Test Environment
    7.1.2 Test Results
  7.2 Tests on the Face Recognition Sub-system
  7.3 Computational Costs

8 Conclusions and Future Works
  8.1 Considerations on Vision Systems
  8.2 Future Works


Chapter 1

Reference Context

In this chapter we'll introduce the concept of domotics and we'll explain how the EU project SM4All contributes to the development of this area of research. Then we'll give a brief introduction to our PLaTHEA system and we'll explain its contribution to the project. Finally, in the last section, a brief sketch of each chapter is given.

Contents

1.1 Introduction to Domotics
  1.1.1 Commercial Platforms for Home Automation
  1.1.2 Architectures for Domotic Systems
1.2 The SM4All Project
  1.2.1 The Pervasive Layer
  1.2.2 The Composition Layer and the Need of a PLT System
  1.2.3 The User Layer
1.3 Structure of the Thesis

1.1 Introduction to Domotics

Home automation (also called domotics) may designate an emerging practice of increased automation of household appliances and features in residential dwellings, particularly through electronic means that allow for things impracticable, overly expensive or simply not possible in past decades [1]. The term is sometimes confused with "building automation", which refers to industrial settings and the automatic or semi-automatic control of lighting, climate, doors and windows, and security and surveillance systems; in fact, building automation features are only a subset of those provided by a full-fledged home automation system.

Compared to a simple building automation system, a home automation system may provide the following features:

• control of home entertainment systems such as home cinema, hi-fi audio, surveillance and so on;

• use of domestic robots;

• elderly assistance;

• "scenes" for different events such as dinners, parties, and so on.

Particularly interesting is the last feature. In our vision, "home behaviour" should be customizable. Let's consider some simple scenarios.

Scenario 1.1 Mario is an up-to-date technology adopter; he loves all the comforts offered by modern technology. Mario comes home late at night and all he wants is to watch a movie in the living room, where he placed his newly purchased home theater, with soft lights and a comfortable temperature of 25 ◦C. The domotic system may detect the arrival of Mario in the living room and then direct the light dimmers and the heating system to fulfill all Mario's desires.

Scenario 1.2 Mario is also an apprehensive father and he doesn't want his little daughter, Marzia, to be able to watch certain channels in the living room without him (these channels are known to transmit violent movies even in the daytime). The domotic system may detect the presence of Marzia in the living room without her father and then inhibit the selection of those channels.

So, a home automation system should make it possible for the users to define scenes of common life and to instruct the system itself to react to these scenes by guiding all the enabled appliances and subsystems in a specific way.

The deployment of such a system requires a way for different subsystems (software and hardware) to communicate; this has led to the emergence of a set of standards, mainly in the area of wiring and communication. Furthermore, the emergence first of building automation and lately of home automation has driven an evolution in domestic wiring practice (for air-conditioning systems, for example, but for Ethernet wiring too).

Fig. 1.1: A typical domestic patch panel.

Another aspect has to be taken into account: wiring is hard to retrofit into an existing house. One solution to the problem is to embed data signals in power lines, but more frequently wireless technologies do the lion's share. Furthermore, wireless is widely employed in sensor networks (Wi-Fi (IEEE 802.11), Bluetooth (IEEE 802.15.1), ZigBee (IEEE 802.15.4) and so on).

1.1.1 Commercial Platforms for Home Automation

Many companies (some of which are already active in the area of building automation) are active in the development of platforms which form the basis for the design of a full-fledged home automation system.

The first example of such a system is MyHome by BTicino¹. This system is based on a proprietary bus (the SCS bus) which acts as both data bus and power supply. BTicino produces a wide variety of devices (light switches, dimmers, actuators, cameras and so on) which communicate using this bus. BTicino has also designed a family of web servers (the term is perhaps inappropriate) through which, using the OpenWebNet protocol, an external system can interact with the devices on the bus. Driving the system via software makes it possible to obtain even complex compound services. The main disadvantage of the system is that the SCS bus has a low data capacity (due to its combined role as power supply), which obviously makes it impossible to acquire data from multiple devices; we will see later that this makes it impossible to design our PLaTHEA system on top of MyHome (during the thesis we spent some days on the feasibility of this approach).

¹ See www.myhome-bticino.it for information about products and www.myopen-bticino.it for the technical forum.

The second example is CHORUS by Gewiss². The system is based on a two-wire bus (the KNX bus) similar to BTicino's SCS. With respect to BTicino, Gewiss provides a series of video surveillance devices that communicate with the CHORUS MASTER using high-capacity buses (Ethernet, for example); this is very similar to the approach chosen in the design of our PLaTHEA system.

Other popular platforms are:

• BY-ME by Vimar. It's very similar to CHORUS (it uses the same KNX bus) and to MyHome;

• EasyDom. EasyDom is based on a dorsal bus that allows for the creation of a domotic system able to incorporate normal electrical equipment. Once the devices are wired on the bus, it is possible to handle multiple functions with single commands, and the interaction is driven by a house plan.

Even though the described platforms go in the direction of home automation, the design of modern home automation systems goes several steps beyond; in such architectures the mentioned platforms are only little bricks of huge "cathedrals of services".

² Visit http://chorus.gewiss.com for more details.


1.1.2 Architectures for Domotic Systems

So far we have looked at what a home automation system can do for the end user. Now we need the "engineer's point of view" on domotic systems; that is, we want to answer the following questions:

• What are the basic elements of a home automation system?

• How can these elements work together to achieve the system's goals?

Roughly speaking, the basic elements of a domotic system are:

• sensors. They are devices that measure a physical quantity and convert it into a signal which can be read by an observer or by an instrument;

• actuators. They are mechanical devices for moving or controlling a mechanism or system;

• hardware and software controllers. They coordinate sensors and actuators (and, in a layered architecture, other controllers too) to achieve a specific goal.

The presence of controllers is not really mandatory; recent advances in the field of sensor networks make it possible to distribute the system's intelligence across all the sensors. Nevertheless, the design of a complex home automation system suggests the use of a layered architecture that benefits from the use of controllers (see section 1.2 about the SM4All project for an example).

As mentioned earlier, there is a plethora of communication standards and protocols in the home automation field: some devices (a device may act as a sensor as well as an actuator) may have a Bluetooth interface, other devices may expose a UPnP interface, thermal sensors may be arranged in a ZigBee sensor network, and so on. So a huge problem to take into account is the integration of the different subsystems. Consider Scenario 1.1, for example; the presence of Mario in the living room could be signaled by a recognition system (as you'll see soon, this is not a random example) in a publish-subscribe fashion; the temperature could be provided by a sensor network; the dimmer (a perfect example of a pure actuator) could expose a simple HTTP-like interface on a TCP connection (as is the case of the BTicino OpenWebNet protocol). The integration of subsystems (we can call it services integration) is not only a technological matter but also a semantic one, so the study of service composition is relevant to the system's design.

1.2 The SM4All Project

The SM4All (Smart hoMes for All) project aims at studying and developing an innovative middleware platform for the inter-working of smart embedded services in immersive and person-centric environments, through the use of composability and semantic techniques, in order to guarantee dynamicity, dependability and scalability, while preserving the privacy and security of the platform and its users. This is applied to the challenging scenario of private homes and buildings in the presence of users with different abilities and needs (e.g., young, able-bodied, aged and disabled) [26].

In section 1.1.2 we introduced the concept of a layered architecture for home automation systems. The SM4All system is constituted by a set of logical components arranged in three distinct layers [6]: the Pervasive Layer, the Composition Layer and the User Layer³.

1.2.1 The Pervasive Layer

The main goal of the Pervasive layer is to seamlessly integrate heterogeneous networks and devices (sensors and actuators) into the SM4ALL middleware and to provide the devices' services and information through a common, standard abstraction interface, no matter which underlying technology the device is based on, as shown in Figure 1.2. Thus, the Pervasive layer is responsible for integrating into the middleware different devices, which may use different communication technologies and protocols for interaction, and for providing their services to the upper layers of the middleware through a common interface.

³ Our analysis of these layers will be necessarily short. For a more detailed description visit the SM4All website at http://www.sm4all-project.eu.


Fig. 1.2: SM4All Pervasive Layer: integration of heterogeneous networks and devices.

The main requirement is the proper handling of communication among home devices and user applications. The SM4All Pervasive layer must automatically discover heterogeneous devices. Likewise, it must be able to add new devices into the system, or remove them when necessary.

Added devices are registered into the system, which identifies and provides information about their description and functionality. The Pervasive layer continuously scans the network in order to know the latest status of the devices. Also, Plug and Play (UPnP, UPnP AV) must be supported as the communication standard for discovering and controlling the home devices (a common communication protocol between the devices and the home system). The data types are managed transparently, so that all basic data types can be transferred between devices with different bit sizes.

The Pervasive layer acts as the middleware of the system, abstracting the communication with the sensors and actuators of the home network and their control. The Pervasive layer is characterized by the following system requirements:

• scalability: the Pervasive layer should be extendable, providing the capacity to increase the system's services and to manage new devices;

• interoperability: it must support the connection to and control of both UPnP and non-UPnP devices;

• robustness of services: it should ensure the correct functioning of the services; therefore, when a device failure occurs, it is detected;

• services of connection: the middleware provides automatic connection with UPnP and non-UPnP devices, extracting their descriptions;

• services of device control: the middleware autonomously controls the devices according to the services available for each of them;

• query and feed the system repositories: the middleware provides data to and retrieves data from the repositories of the system;

• establish communication to the composition layer: the middleware performs actions following the Composition layer's guidelines.

The Pervasive Layer also provides a common interface through which the upper layers of the middleware interact with devices, following two different patterns (a minimal sketch follows the list):

• pull: this pattern allows the upper layers to invoke services from devices using a common interface;

• push: this pattern allows the upper layers to receive events produced by devices and their services. For example, a temperature sensor could report temperature changes to the upper layers of the middleware.
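A toy in-process illustration of the two patterns (hypothetical Python classes; the real Pervasive layer exposes them to the upper layers over UPnP):

```python
class TemperatureSensor:
    def __init__(self):
        self._subscribers = []
        self._value = 21.0

    def read(self):                 # pull: the upper layer invokes a service
        return self._value

    def subscribe(self, callback):  # push: the upper layer registers interest
        self._subscribers.append(callback)

    def on_new_sample(self, value):  # device-side event source
        self._value = value
        for notify in self._subscribers:
            notify(value)            # event pushed to every subscriber

sensor = TemperatureSensor()
current = sensor.read()                               # pull pattern
sensor.subscribe(lambda v: print("temperature:", v))  # push pattern
sensor.on_new_sample(23.5)                            # triggers the callback
```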

PLaTHEA is seen by the system as a UPnP-enabled device, so it has no need of a proxy container. This choice makes PLaTHEA independent of the SM4All system, yet easily integrated into it.


1.2.2 The Composition Layer and the Need of a PLT System

Figure 1.3 shows the main components of the SM4All Composition Layer. The goal of this layer is to execute a complex task invoked by the User Layer using the interaction methods provided by the Pervasive Layer.

Fig. 1.3: SM4All Composition Layer: components architecture.

We'll analyze in detail the location component and the context-aware component.

The location component in the Composition layer serves as a special and fundamental type of context information provider. The component is dedicated to processing raw location data of objects and analyzing the location relationships between objects. It fulfills the following requirements:

• the location component will be able to be aware of the locations of the users in the house at any given moment;

• the location component's communication layer must be characterized in terms of scalability and interoperability;

• the location component will be able to manage and associate users, limited locations (rooms in a home environment) and time information;

• the location component will feed time-spatial context associated to users, in order to accomplish personalization of services;

• the location component will output binary relationships between the user and nearby devices, e.g. toLeft, toRight, behind, inFront.


At the initial stage of the project, several technologies were proposed to develop the location services. After an evaluation of the technologies, SM4ALL selected the approach of Video Tracking and RFID (Radio Frequency IDentification).

Our PLaTHEA system (finally we have the pleasure) follows the Video Tracking approach to give the position (in the context of a specific room) of recognized users (mobile agents whose identity PLaTHEA is able to recognize) and of unrecognized mobile agents (humans as well as robots). To do this it adds to the standard features of a PLT (People Localization and Tracking) system a face recognition system; so for each tracked object PLaTHEA adds, if possible, information about the identity of the object (of course the tracked object must be human and must train the system with his face biometrical information; that is, a user context is defined for it). The choice of a Video Tracking approach for this task is natural: objects to be tracked don't have to wear any marker⁴, which would limit humans' freedom and naturalness⁵. On the other hand, the Video Tracking approach suffers in dark rooms; SM4All intends to solve this problem using infrared cameras, but this feature is out of the scope of this thesis, so we limit our work to rooms with enough light to allow the cameras' light compensation.

In our vision every room in the house will have a distinct installation of PLaTHEA. This multitude of PLaTHEA "devices"⁶ communicates with the Pervasive layer via UPnP, and then the Composition layer does its work. A graphical presentation of this is given in Fig. 1.4.

Fig. 1.4: PLaTHEA systems as seen by the SM4All system.

⁴ On the other hand, such a solution doesn't allow knowing the position of static elements like desks, chairs, keys and so on; in this field the use of markers like RFID is preferable.

⁵ This solution is however often used in practice. Bill Gates' futuristic home solves the problem of tracking by providing each guest with an active bracelet.

⁶ PLaTHEA is a software module running on a Windows machine. In the future we want to study an implementation on a lightweight operating system such as Windows Mobile or Embedded Linux.

In the SM4All vision the concept of "position" is deeply linked with the concept of "context". The context can be categorized in:

• System context. System context refers to the hardware being used, the bandwidth, and the different devices available and accessible by the user in the smart home;

• User context. User context is at a central position in context management. User context is collected in the form of user profiles and represents the users' preferences with respect to the enriched environment built by the SM4ALL architecture; user context information flows through the system, serving as the driving force of context-aware services and of the goals of planning composite services;

• Physical context. Physical context refers to information related to the physical environment where the user is, such as location, temperature, noise, light, etc.

So the context-aware component and the location component are fundamental to achieving the main goals of a home automation system as defined in section 1.1.


1.2.3 The User Layer

Finally, we are at the highest level of the hierarchy: the SM4All User Layer. The goals of this layer are easily defined by its main components:

• Abstract Adaptive Interface. This module interacts with the Composition Layer, gathering information about users, services, and status, and provides the concrete user interfaces (UIs) with a set of operations, partially ordered, together with visual information, e.g., icons. More in detail, the AAI analyzes all the available services, collecting from each of them the set of initial operations and the associated icons; these constitute the initial set of available operations. When a service is started, the initial set of available operations for that service changes and the AAI, as soon as the Composition Layer notifies that the status has changed, updates the overall operation set, computing a new partial order. It's the interface used by the other User Layer interfaces to interact with the Composition Layer;

• Brain Computer Interface. The brain computer interface (BCI) technology allows a direct connection between brain and computer without any muscular activity required, and thus it offers a unique opportunity to enhance and/or to restore communication and action in the external world for people with severe motor disabilities;

• HTTP Interface. The SM4All project proposes standard HTTP interfaces to provide rich user interaction for the users at home. These interactive web applications will allow web pages displayed in standard web browsers to present responsive user interfaces that approach the features expected of free-standing applications;

• Remote Interface. The SM4All system allows for remote interaction with the home automation features.


1.3 Structure of the Thesis

To conclude this introductory chapter, we give a short summary of each of the following chapters.

Chapter 2 In this chapter we introduce the background knowledge about camera models and stereo vision that is useful for the rest of the thesis, for those who are not familiar with these concepts and want to make their life easier.

Chapter 3 In this chapter we analyze the state of the art in the field of PLT systems and face recognition systems. This is a continuously evolving area of research. The two macro-problems will be decomposed, and for each aspect the pros and cons of algorithms and practices will be given. The chapter ends with a gallery of research projects in the area of PLT systems.

Chapter 4 In this chapter we describe how PLaTHEA uses and combines the techniques introduced in Chapter 3 and which original solutions we have devised for other problems.

Chapter 5 This chapter starts by analyzing the goals of the system. A detailed description of the architecture follows: what technologies are used, how the system is deployed and so on.

Chapter 6 This chapter deepens the system architecture, analyzing implementation problems, possible solutions and, among these, the adopted ones (with motivations).

Chapter 7 This chapter starts by describing the test cases, inferring from them the system's drawbacks. PLaTHEA's performance is also analyzed.

Chapter 8 This chapter ends the thesis introducing possible future works to enhance the system.


Chapter 2

Camera Model and Stereo Vision

Contents

2.1 Camera Pinhole Model and Camera Calibration
  2.1.1 Lens Distortion
  2.1.2 Camera Calibration
2.2 Stereo Vision
  2.2.1 Triangulation
  2.2.2 Stereo Calibration
  2.2.3 Stereo Rectification

2.1 Camera Pinhole Model and Camera Calibration

We begin by looking at the simplest model of a camera, the pinhole camera model. In this simple model, light is envisioned as entering from the scene or a distant object, but only a single ray enters from any particular point. In a physical pinhole camera, this point is then "projected" onto an imaging surface. As a result, the image on this image plane (also called the projective plane or imager) is always in focus, and the size of the image relative to the distant object is given by a single parameter of the camera: its focal length. For our idealized pinhole camera, the distance from the pinhole aperture to the screen is precisely the focal length. This is shown in Figure 2.1, where $f$ is the focal length of the camera, $(X, Y, Z)$ are the object's coordinates with respect to the so-called center of projection and $(x, y, f)$ are the object's image coordinates on the imaging plane.

Fig. 2.1: A point $Q = (X, Y, Z)$ is projected onto the image plane by the ray passing through the center of projection, and the resulting point on the image is $q = (x, y, f)$.

We can see by similar triangles that:

\[ \frac{x}{f} = \frac{X}{Z}, \qquad \frac{y}{f} = \frac{Y}{Z} \tag{2.1} \]

The point at the intersection of the image plane and the optical axis is referred to as the principal point.

You might think that the principal point is equivalent to the center of the imager; yet this would imply that some guy with tweezers and a tube of glue was able to attach the imager in your camera to micron accuracy. In fact, the center of the chip is usually not on the optical axis. We thus introduce two new parameters, $c_x$ and $c_y$, to model a possible displacement (away from the optic axis) of the center of coordinates on the projection screen. The result is a relatively simple model in which a point $Q$ in the physical world, whose coordinates are $(X, Y, Z)$, is projected onto the screen at some pixel location given by $(x_{screen}, y_{screen})$ in accordance with the following equations:

\[ x_{screen} = f_x \left( \frac{X}{Z} \right) + c_x, \qquad y_{screen} = f_y \left( \frac{Y}{Z} \right) + c_y \tag{2.2} \]

Note that we have introduced two different focal lengths; the reason for this is that the individual pixels on a typical low-cost imager are rectangular rather than square. The focal length $f_x$ (for example) is actually the product of the physical focal length of the lens and the size $s_x$ of the individual imager elements (this should make sense because $s_x$ has units of pixels per millimeter while $F$ has units of millimeters, which means that $f_x$ is in the required units of pixels). Of course, similar statements hold for $f_y$ and $s_y$. It is important to keep in mind, though, that $s_x$ and $s_y$ cannot be measured directly via any camera calibration process, and neither is the physical focal length $F$ directly measurable. Only the combinations $f_x = F s_x$ and $f_y = F s_y$ can be derived without actually dismantling the camera and measuring its components directly.

The relation that maps the points $Q$ in the physical world with coordinates $(X, Y, Z)$ to the points on the projection screen with coordinates $(x_{screen}, y_{screen})$ is called a projective transform. When working with such transforms, it is convenient to use what are known as homogeneous coordinates. The homogeneous coordinates associated with a point in a projective space of dimension $n$ are typically expressed as an $(n+1)$-dimensional vector, with the additional restriction that any two points whose values are proportional are equivalent. In our case, the image plane is the projective space and it has two dimensions, so we will represent points on that plane as three-dimensional vectors $q = (x, y, w)$. Recalling that all points having proportional values in the projective space are equivalent, we can recover the actual pixel coordinates by dividing through by $w$. This allows us to arrange the parameters that define our camera (i.e., $f_x$, $f_y$, $c_x$, and $c_y$) into a single 3-by-3 matrix, which we will call the camera intrinsics matrix:

\[ q = MQ, \quad \text{where} \quad q = \begin{bmatrix} x \\ y \\ w \end{bmatrix}, \quad M = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \quad Q = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \tag{2.3} \]
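As a quick check of Eq. (2.3), the following sketch projects a world point with made-up intrinsics (all numbers are purely illustrative):

```python
import numpy as np

fx, fy, cx, cy = 800.0, 790.0, 320.0, 240.0   # made-up intrinsics, in pixels
M = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

Q = np.array([0.5, 0.2, 2.0])   # (X, Y, Z) in camera coordinates, metres
q = M @ Q                       # homogeneous image point (x, y, w)
x_screen, y_screen = q[0] / q[2], q[1] / q[2]  # divide through by w
print(x_screen, y_screen)       # -> 520.0 319.0
```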


2.1.1 Lens Distortion

In theory, it is possible to define a lens that will introduce no distortions. In practice, however, no lens is perfect. This is mainly for reasons of manufacturing; it is much easier to make a "spherical" lens than to make a more mathematically ideal "parabolic" lens. It is also difficult to mechanically align the lens and imager exactly. Here we describe the two main lens distortions and how to model them¹. Radial distortions arise as a result of the shape of the lens, whereas tangential distortions arise from the assembly process of the camera as a whole.

We start with radial distortion. The lenses of real cameras often noticeably distort the location of pixels near the edges of the imager. This bulging phenomenon is the source of the "fish-eye" effect. With some lenses, rays farther from the center of the lens are bent more than those closer in. A typical inexpensive lens is, in effect, stronger than it ought to be as you get farther from the center. Radial distortion is particularly noticeable in cheap web cameras but less apparent in high-end cameras, where a lot of effort is put into fancy lens systems that minimize radial distortion.

For radial distortions, the distortion is 0 at the (optical) center of the imager and increases as we move toward the periphery. In practice, this distortion is small and can be characterized by the first few terms of a Taylor series expansion around $r = 0$. For cheap web cameras, we generally use the first two such terms, conventionally called $k_1$ and $k_2$. For highly distorted cameras such as fish-eye lenses we can use a third radial distortion term $k_3$. In general, the radial location of a point on the imager will be rescaled according to the following equations:

\[ x_{corrected} = x_{screen}\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \]
\[ y_{corrected} = y_{screen}\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \]

The second-largest common distortion is tangential distortion. This distortion is due to manufacturing defects resulting from the lens not being exactly parallel to the imaging plane. Tangential distortion is minimally characterized by two additional parameters, $p_1$ and $p_2$, such that:

\[ x'_{corrected} = x_{corrected} + \left[ 2 p_1 y_{corrected} + p_2 (r^2 + 2 x_{corrected}^2) \right] \]
\[ y'_{corrected} = y_{corrected} + \left[ p_1 (r^2 + 2 y_{corrected}^2) + 2 p_2 x_{corrected} \right] \]

¹ The approach to modeling lens distortion taken here derives mostly from [7].

Thus in total there are five distortion coefficients that we require. They are typically bundled into one distortion vector; this is just a 5-by-1 matrix containing $k_1$, $k_2$, $p_1$, $p_2$, and $k_3$ (in that order).
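In OpenCV the five coefficients are passed in exactly this order; a minimal sketch with placeholder intrinsics and coefficients (not values from a real calibration):

```python
import cv2
import numpy as np

K = np.array([[800.0, 0.0, 320.0],      # placeholder intrinsics matrix
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.25, 0.08, 0.001, -0.002, 0.0])  # k1, k2, p1, p2, k3

img = cv2.imread("frame.png")           # hypothetical distorted frame
undistorted = cv2.undistort(img, K, dist)
```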

2.1.2 Camera calibration

In this subsection we analyze what we obtain from the calibration of a single camera. In section 2.2.2 we will analyze how stereo calibration completes the information generated by single-camera calibration.

It's easy to sense that camera calibration gives us a camera intrinsics matrix and a distortion coefficients vector. The camera intrinsics matrix is perhaps the most interesting final result, because it is what allows us to transform from 3D coordinates to the image's 2D coordinates. We can also use the camera matrix to do the reverse operation, but in this case we can only compute a line in the three-dimensional world to which a given image point must correspond.

The math behind camera calibration is out of the scope of this thesis. For those interested, the real "best-seller" is [27]. The OpenCV² library (the one used for the implementation of PLaTHEA) uses the method described in [32]. We will return to camera calibration in the following chapters; for now we say that the calibration is done using multiple views of a constant pattern (in our case a chessboard with known texel side).
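A sketch of this chessboard procedure with OpenCV calls (the pattern size, square side and file names are made up; the library implements the method of [32] internally):

```python
import cv2
import glob
import numpy as np

pattern = (9, 6)   # inner corners per chessboard row and column (assumed)
square = 0.025     # chessboard square side, metres (assumed)

# 3D corner positions in the chessboard's own reference frame (Z = 0)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points, size = [], [], None
for path in glob.glob("calib_*.png"):   # the multiple views of the pattern
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# yields the intrinsics matrix of Eq. (2.3) and the distortion vector of 2.1.1
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, size, None, None)
```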

To conclude, if we have more than one camera, camera calibration has to be done for each camera, even if the camera models are identical, due to differences in the manufacturing process.

² OpenCV is an open-source library available at http://opencv.willowgarage.com. A very good guide to this library is [4], which inspired this chapter.

2.2 Stereo Vision

We are all familiar with the stereo imaging capability that our eyes give us. To what degree can we emulate this capability in computational systems? Computers accomplish this task by finding correspondences between points that are seen by one imager and the same points as seen by the other imager. With such correspondences and a known baseline separation between the cameras, we can compute the 3D location of the points. Although the search for corresponding points can be computationally expensive, we can use our knowledge of the geometry of the system to narrow down the search space as much as possible. In practice, stereo imaging involves four steps when using two cameras³.

³ Here we give just a high-level understanding. For details, we recommend [8].

1. Mathematically remove radial and tangential lens distortion; this is called undistortion and is detailed in section 2.1. The outputs of this step are undistorted images.

2. Adjust for the angles and distances between the cameras, a process called rectification. The outputs of this step are images that are row-aligned and rectified.

3. Find the same features in the left and right camera views, a process known as correspondence. The output of this step is a disparity map, where the disparities are the differences in the x-coordinates on the image planes of the same feature viewed in the left and right cameras: $x_l - x_r$.

4. If we know the geometric arrangement of the cameras, then we can turn the disparity map into distances by triangulation. This step is called reprojection, and the output is a depth map (a sketch of the whole pipeline follows the list).

We start with the last step to motivate the first three.

2.2.1 Triangulation

Assume that we have a perfectly undistorted, aligned, and measured stereo

rig as shown in Figure 2.2: two cameras whose image planes are exactly

3Here we give just a high-level understanding. For details, we recommend [8]


coplanar with each other, with exactly parallel optical axes (the optical axis

is the ray from the center of projection O through the principal point c and

is also known as the principal ray) that are a known distance apart, and with

equal focal lengths f_l = f_r. Also, assume for now that the principal points c_x^left and c_x^right have been calibrated to have the same pixel coordinates in their respective left and right images. Please don't confuse these principal points with the center of the image. A principal point is where the principal ray intersects the imaging plane. This intersection depends on the optical axis of the lens. As we saw in Section 2.1, the image plane is rarely aligned exactly with the lens, and so the center of the imager is almost never exactly aligned with the principal point.

Moving on, let’s further assume the images are row-aligned and that every

pixel row of one camera aligns exactly with the corresponding row in the other

camera. We will call such a camera arrangement frontal parallel. We will

also assume that we can find a point P in the physical world in the left and

the right image views at p_l and p_r, which will have the respective horizontal coordinates x_l and x_r⁴.

In this simplified case, taking x_l and x_r to be the horizontal positions of the points in the left and right imager (respectively) allows us to show that the depth is inversely proportional to the disparity between these views, where the disparity is defined simply by d = x_l − x_r. This situation is shown in Figure 2.2, where we can easily derive the depth Z by using similar triangles.

Referring to the figure, we have:

$$\frac{T - (x_l - x_r)}{Z - f} = \frac{T}{Z} \quad\Longrightarrow\quad Z = \frac{fT}{x_l - x_r} \qquad (2.4)$$
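As a quick numerical illustration of Equation 2.4 (the numbers are invented), take f = 500 pixels and T = 0.2 m, so that fT = 100 pixel-meters:

$$d = 10 \Rightarrow Z = 10\,\mathrm{m}; \qquad d = 11 \Rightarrow Z \approx 9.09\,\mathrm{m}$$

$$d = 100 \Rightarrow Z = 1\,\mathrm{m}; \qquad d = 101 \Rightarrow Z \approx 0.99\,\mathrm{m}$$

A one-pixel change in disparity thus moves the depth estimate by almost a meter far from the rig, but by only about a centimeter close to it.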

Since depth is inversely proportional to disparity, there is obviously a

nonlinear relationship between these two terms. When disparity is near 0,

small disparity differences make for large depth differences. When disparity

is large, small disparity differences do not change the depth by much. The

consequence is that stereo vision systems have high depth resolution only for

⁴How these coordinates are found is an important matter, because this is a computationally expensive operation. We will analyze this aspect in the implementation chapter.


objects relatively near the camera; the choice of the baseline distance T is therefore of great relevance, because a higher value of T allows the system to work at greater distances.

With this arrangement it is relatively easy to solve for distance. Now

we must spend some energy on understanding how we can map a real-world

camera setup into a geometry that resembles this ideal arrangement. In

the real world, cameras will almost never be exactly aligned in the frontal

parallel configuration depicted in Figure 2.2. Instead, we will mathematically

find image projections and distortion maps that will rectify the left and right

images into a frontal parallel arrangement. When designing a stereo rig, it

is best to arrange the cameras approximately frontal parallel and as close

to horizontally aligned as possible. This physical alignment will make the

mathematical tranformations more tractable. If you cameras aren’t aligned at

least approximately, then the resulting mathematical alignment can produce

extreme image distortions and so reduce or eliminate the stereo overlap area

of the resulting images.

2.2.2 Stereo Calibration

Stereo calibration is the process of computing the geometrical relationship

between the two cameras in space. In contrast, stereo rectification is the pro-

cess of “correcting” the individual images so that they appear as if they had

been taken by two cameras with row-aligned image planes (review Figure

2.2). With such a rectification, the optical axes (or principal rays) of the two

cameras are parallel and so we say that they intersect at infinity.

Stereo calibration outputs the following elements (a code sketch is given after the list):

• the rotation matrix R and translation vector T between the two

cameras (as depicted in Fig. 2.3);

• the essential matrix E. Given a point P , we would like to derive a

relation which connects the observed locations pl and pr of P on the

two imagers. This relationship will turn out to serve as the definition

of the essential matrix; that is:

$$p_r^T E\, p_l = 0 \qquad (2.5)$$


Fig. 2.2: With a perfectly undistorted, aligned stereo rig and known correspondence, the depth Z can be found by similar triangles; the principal rays of the imagers begin at the centers of projection O_l and O_r and extend through the principal points of the two image planes at c_x^left and c_x^right.

Fig. 2.3: The essential geometry of stereo imaging is captured by the essential matrix E, which contains all of the information about the translation T and the rotation R, which describe the location of the second camera relative to the first in global coordinates.


Note that E contains nothing intrinsic to the cameras; thus, it relates

points to each other in physical or camera coordinates, not pixel coor-

dinates;

• the fundamental matrix F . In practice, we are usually interested

in pixel coordinates. In order to find a relationship between a pixel

in one image and the corresponding epipolar line in the other image,

we will have to introduce intrinsic information about the two cameras.

Recalling that a pixel coordinate is given by q = Mp, where M is the camera intrinsics matrix, substituting into Equation 2.5 we have:

$$q_r^T (M_r^{-1})^T E\, M_l^{-1} q_l = q_r^T F q_l = 0 \qquad (2.6)$$

In a nutshell: the fundamental matrix F is just like the essential matrix

E, except that F operates in image pixel coordinates whereas E operates

in physical coordinates.
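A minimal sketch of this step with the OpenCV Python bindings follows; it reuses the chessboard data of the single-camera sketch in Section 2.1.2 (the variable names are ours, and the left/right corner lists are assumed to come from synchronized views of the same board):

    import cv2

    # obj_points: chessboard corners in pattern coordinates; img_points_l /
    # img_points_r: the corners detected in the left / right views; K_l,
    # dist_l, K_r, dist_r: per-camera intrinsics and distortion vectors;
    # image_size: (width, height) of the frames.
    rms, K_l, dist_l, K_r, dist_r, R, T, E, F = cv2.stereoCalibrate(
        obj_points, img_points_l, img_points_r,
        K_l, dist_l, K_r, dist_r, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)  # keep per-camera intrinsics fixed
    # R, T: pose of the right camera relative to the left one;
    # E, F: the essential and fundamental matrices of Equations 2.5 and 2.6.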

2.2.3 Stereo Rectification

We want to reproject the image planes of our two cameras so that they reside

in the exact same plane, with image rows perfectly aligned into a frontal

parallel configuration. We want the image rows between the two cameras

to be aligned after rectification so that stereo correspondence (finding the

same point in the two different camera views) will be more reliable and

computationally tractable⁵.

Using Bouguet's method [3], which assumes that stereo calibration has already been performed, rectification outputs the following (a code sketch follows the list):

• the 3-by-3 row-aligned rectification rotations for the left and right image

planes Rl and Rr;

• the 3-by-4 left and right projection matrices P_l and P_r;

• the 4-by-4 reprojection matrix Q, which allows us to transform a triple of screen coordinates (x_l, y_l, d) into camera coordinates.

⁵Note that reliability and computational efficiency are both enhanced by having to search only one row for a match with a point in the other image.
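Continuing the sketch above, the corresponding OpenCV calls are the following (again Python bindings; left_frame is a hypothetical input image):

    import cv2

    # Inputs as produced by the stereo calibration sketch above.
    R_l, R_r, P_l, P_r, Q, roi_l, roi_r = cv2.stereoRectify(
        K_l, dist_l, K_r, dist_r, image_size, R, T)

    # Precompute, once, the pixel remapping that warps the left view into
    # the row-aligned frontal parallel arrangement (same for the right).
    map_lx, map_ly = cv2.initUndistortRectifyMap(
        K_l, dist_l, R_l, P_l, image_size, cv2.CV_32FC1)
    rect_left = cv2.remap(left_frame, map_lx, map_ly, cv2.INTER_LINEAR)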


Chapter 3

A Survey on the State of the Art

Contents

3.1 Introduction to PLT systems
3.2 Typical Structure of a Stereo PLT System
3.2.1 Stereo Computation Module
3.2.2 Background Modeling and Foreground Segmentation Modules
3.2.3 Plan View Projection Module
3.2.4 Tracker Module
3.3 Face Recognition
3.3.1 Face Detection
3.3.2 Face Recognition
3.4 Projects around the world
3.4.1 LocON Project
3.4.2 Gator Tech Smart House Project
3.4.3 ARGOS project
3.4.4 RoboCare Project

3.1 Introduction to PLT systems

With a People Localization and Tracking (PLT) system we denote a class of systems able to:


• Locate. That is, to provide a human's position in a complex scene;

• Track. That is, the system is able to follow a human's position at successive sampling instants. If at time t the human labeled as P is located at \vec{p}_t and at time t + 1 his new position is \vec{p}_{t+1}, the system should recognize this and associate both positions with P.

The techniques to perform these tasks belong to two categories:

• Localization and Tracking using markers. In these systems humans wear some kind of marker (for example a bracelet). These markers can emit signals of different kinds (electric, luminous, and so on); these signals are received by a specific device that converts them into three-dimensional information;

– Pros. Markers are very useful in dark rooms. They can be used for person recognition too. In the area of virtual reality (where the human wears many markers) they allow very precise models of the body to be obtained;

– Cons. Physical and psychological conditioning of humans forced

to wear markers.

• Localization and Tracking without markers. These systems obtain humans' positions using only the image sequence produced by the video acquisition device(s). This sequence can be produced by a single camera (Monocular Vision System), by two cameras (Stereo Vision System) or by more cameras (Multi-camera Vision System). In the first case we need a 3D human model to obtain three-dimensional information; this factor explains the low precision of this kind of system. In the latter two cases three-dimensional information is inferred from geometrical considerations.

– Pros. Humans don't need to wear any kind of marker; the more cameras we use, the more precise the system is; it is possible to obtain a human's height easily;


– Cons. Problems with dark rooms and unusual illumination phenomena.

The choice of the direction that PLaTHEA had to follow was easy. Taking into account that in SM4All transparency is an important issue and that two cameras allow very good precision to be obtained, we have chosen to implement our PLT system as a Stereo Vision System.

3.2 Typical Structure of a Stereo PLT System

As we’ll see in the next chapters, the PLaTHEA PLT elaboration flow

follows a well known model, used for example in [23] and in [10], and shown

in Figure 3.1.

Fig. 3.1: Model for PLT elaboration flow.

In Figure 3.1 thick lines denote links present in the PLT component of

PLaTHEA and dashed lines denote links present in other PLT systems.


In the following subsections we'll study each element of this architecture, analyzing algorithms and techniques that perform the corresponding tasks¹. The figure doesn't cover all PLaTHEA issues, but it is a good starting point for its analysis.

3.2.1 Stereo Computation Module

Stereo correspondence, matching a 3D point in the two different camera

views, can be computed only over the visual areas in which the views of the

two cameras overlap. Once again, this is one reason why you will tend to get

better results if you arrange your cameras to be as nearly frontal parallel as

possible. In [24] the authors give a very good review of the wide world of stereo correspondence algorithms; they identify a series of steps common to these algorithms and, for each algorithm, they analyze the choices made at each step.

OpenCV implements a fast and effective block-matching stereo algorithm

that is similar to the one developed by Kurt Konolige [16]; it works by using

small “sum of absolute difference” (SAD) windows to find matching points

between the left and right stereo rectified images. This algorithm finds only

strongly matching (high-texture) points between the two images. Thus, in a

highly textured scene such as might occur outdoors in a forest, every pixel

might have computed depth. In a very low-textured scene, such as an in-

door hallway, very few points might register depth. There are three stages to

the block-matching stereo correspondence algorithm, which works on undis-

torted, rectified stereo image pairs:

• Prefiltering to normalize image brightness to reduce lighting dif-

ferences and to enhance image texture;

• Correspondence search along horizontal epipolar lines using

an SAD window. For each feature in the left image, we search the

corresponding row in the right image for a best match. After rectifica-

tion, each row is an epipolar line, so the matching location in the right

¹We invite the reader interested in how a disparity map helps in 3D reconstruction to read Chapter 2.


image must be along the same row (same y-coordinate) as in the left

image; this matching location can be found if the feature has enough

texture to be detectable and if it is not occluded in the right camera’s

view. See Fig. 3.2;

• Postfiltering to eliminate bad correspondence matches, using a uniqueness ratio and a texture threshold.

Fig. 3.2: Stereo correspondence starts by assigning point matches between corresponding rows in the left and right images: left and right images of a lamp (upper panel); an enlargement of a single scan line (middle panel); visualization of the correspondences assigned (lower panel).

OpenCV also implements the graph-cut algorithm described in [15]. This algorithm gives better results than the block-matching algorithm, but it is too expensive for real-time processing. A minimal sketch of the block-matching step is given below.
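The sketch below uses the OpenCV Python bindings (the parameter values are illustrative, not those tuned for PLaTHEA):

    import cv2

    # rect_left / rect_right: undistorted, rectified grayscale images.
    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)  # SAD window
    stereo.setTextureThreshold(10)  # postfiltering: drop low-texture areas
    stereo.setUniquenessRatio(15)   # postfiltering: drop ambiguous matches
    disparity = stereo.compute(rect_left, rect_right)  # 16-bit, scaled by 16

    # Reprojection (step 4 of Section 2.2): disparity map -> depth map,
    # using the Q matrix produced by stereo rectification.
    points_3d = cv2.reprojectImageTo3D(disparity.astype('float32') / 16.0, Q)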


3.2.2 Background Modeling and Foreground Segmentation Modules

How do we define background and foreground? If we’re watching a parking

lot and a car comes in to park, then this car is a new foreground object. But

should it stay foreground forever? How about a trash can that was moved?

It will show up as foreground in two places: the place it was moved to and

the “hole” it was moved from. How do we tell the difference? And again,

how long should the trash can (and its hole) remain foreground? If we are

modeling a dark room and suddenly someone turns on a light, should the

whole room become foreground?

Background modeling is a continuously evolving area of research. In this section we'll analyze several methods, showing their pros and their drawbacks. For each method we want to answer a subset of the following questions:

1. How is each background pixel modeled? With a simple Gaussian distribution, a Mixture of Gaussians (MOG), chromaticity statistics, and so on;

2. Is the pixel model time adaptive? The background obviously changes, and time adaptivity is an important feature (for example in TAPPMOGs - Time Adaptive Per Pixel Mixture of Gaussians);

3. How does the model react to sudden illumination changes and hotspots? Some modeling methods suffer from sudden changes in illumination, signaling the whole affected zone as foreground;

4. How does it react to the presence of shadows? Does it signal a shadow as a foreground zone? Does it incorporate shadows into the background after some time?

5. Can it manage a foreground object with a color similar to the background? This is a sometimes forgotten detail;

6. How does it manage periodically moving objects? Think of curtains, fans, and so on.


In [23] the authors maintain for each pixel a set of three Gaussians relative to pixel intensity, disparity, and borders (the last computed using a Sobel filter²). The use of disparity is particularly useful: foreground segmentation is performed by background subtraction from the current intensity and disparity images. By taking into account both intensity and disparity information, the system is able to correctly deal with shadows, detected as intensity changes but not disparity changes, and with foreground objects that have the same color as the background but different disparities. The drawback in the use of disparity is that performance depends on the disparity map's accuracy³.

A very interesting (and very effective) aspect of this paper is the time adaptivity of the system. It is based on an extension of the concept of Pixel Activity first introduced in [9] (which is a TAPPMOG system). While in [9] the pixel activity is computed from the difference in pixel intensity between the current frame and the previous one, in [23] the pixel activity is computed by introducing the concepts of vertical and horizontal activities, derived from changes in borders. The activity of a pixel is computed from the product of the activities of its row and column. So if a human wears a yellow shirt and moves only slightly, even the pixels that remain yellow present a high activity. The per-pixel Gaussian distributions are updated in a way inversely proportional to the pixel activity.

In [11] the authors propose a robust and efficiently computed background subtraction algorithm that is able to cope with local illumination change problems, such as shadows and highlights, as well as with global illumination changes. For each pixel the system maintains a Gaussian distribution for each of the RGB channels; the pixel model is enriched using a running standard deviation on:

• Brightness Distortion. The brightness distortion is a scalar value

that brings the observed color close to the expected chromaticity line,

²A Sobel filter is a special kind of filter approximating the first-order derivative of intensities along the x and y directions.

³We'll see that the disparity map computed by the OpenCV library isn't perfect from this point of view. This is the reason why we decided to use a different method for shadow detection.


that is the distance between the background pixel’s luminosity and the

current pixel’s luminosity;

• Color Distortion. Color distortion is defined as the orthogonal dis-

tance between the observed color and the expected chromaticity line. In

other words color distortion is the real chromaticity difference between

current pixel color and background pixel color.

A graphical description of these quantities is given in Fig. 3.3.

Fig. 3.3: The color model proposed in [11] in the three-dimensional RGB color space; the background image is statistically pixel-wise modeled. E_i represents the expected color of a given i-th pixel and I_i represents the color value of the pixel in the current image. The difference between I_i and E_i is decomposed into brightness (α_i) and chromaticity (CD_i) components.

However, this system is not time adaptive. The background subtraction algorithm experiences some problems with dark foreground pixels that are misclassified as shadow pixels; to prevent this, the authors define a small change in the background subtraction, but this change worsens the otherwise excellent shadow detection capability. We'll see, while talking about plan view maps, that it is important that a moving object has almost all of its pixels correctly detected as foreground. The mathematics behind this algorithm may seem computationally expensive at first reading, but our experience showed that this is not the case: for example, with respect to [23] there is only a small overhead. A minimal sketch of the two distortion measures is given below.
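To fix ideas, here is a small NumPy sketch of the two measures for a single pixel (our notation: E is the expected background color, I the observed color; the per-channel variance normalization used in [11] is omitted for clarity):

    import numpy as np

    def distortions(E, I):
        """Decompose the observed color I w.r.t. the expected color E."""
        E = np.asarray(E, dtype=float)
        I = np.asarray(I, dtype=float)
        # Brightness distortion: the scalar that brings E as close as
        # possible to I along the chromaticity line through E.
        alpha = np.dot(I, E) / np.dot(E, E)
        # Color distortion: the orthogonal distance of I from that line.
        CD = np.linalg.norm(I - alpha * E)
        return alpha, CD

    # A shadow dims the pixel (alpha < 1) while leaving its chromaticity
    # almost unchanged (CD near 0).
    print(distortions([200, 150, 100], [100, 75, 50]))  # -> (0.5, 0.0)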


An interesting background modeling method is described in [14]. The codebook method derives from the world of video compression. A codebook is made up of boxes that grow to cover the common values seen over time at a specific pixel location (see Fig. 3.4).

Fig. 3.4: Codebooks are just "boxes" delimiting intensity values: a box is formed to cover a new value and slowly grows to cover nearby values; if values are too far away then a new box is formed.

This codebook method can deal with pixels that change levels dramat-

ically (e.g., pixels in a windblown tree, which might alternately be one of

many colors of leaves, or the blue sky beyond that tree); it is thus the only method among those seen in this section that supports periodically moving objects.

In the codebook method of learning a background model, each box is

defined by two thresholds (max and min) over each of the three color axes.

These box boundary thresholds will expand (max getting larger, min get-

ting smaller) if new background samples fall within a learning threshold

(learnHigh and learnLow) above max or below min, respectively. If new

background samples fall outside of the box and its learning thresholds, then

a new box will be started.

This method has presented the following drawbacks during our tests:

• it doesn’t manage shadows and sudden illumination changes;


• the time adaptive version of the algorithm (as described in the paper) needs the definition of multiple learning layers and seems very intricate with respect to other approaches (such as [23]);

• the time complexity seems to rule out its application in real-time scenarios.

The work presented in [5] doesn't cope with background learning but only with foreground detection. Particularly interesting is the method of shadow detection exploiting the properties of the HSV (Hue, Saturation, Value) color space; in this color space it can be shown that a shadow induces a large change in the V component, but limited changes in the H and S components with respect to the pixel's model. In our tests the method proved very effective and very cheap from a computational point of view. Furthermore, the paper shows the application of the method to a wide variety of contexts, so it is an attractive tool for improving other foreground detection methods that have problems dealing with shadows; a sketch of the test follows.
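The following sketch shows the general form of the per-pixel decision rule (our formulation of the idea in [5]; the thresholds are illustrative and hue wrap-around is ignored for brevity):

    import cv2
    import numpy as np

    def shadow_mask(frame_bgr, bg_bgr, beta_low=0.5, beta_high=0.9,
                    tau_s=40, tau_h=30):
        """Mark pixels whose V drops w.r.t. the background model while
        H and S stay close: the HSV signature of a cast shadow."""
        f = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.int32)
        b = cv2.cvtColor(bg_bgr, cv2.COLOR_BGR2HSV).astype(np.int32)
        v_ratio = f[..., 2] / np.maximum(b[..., 2], 1)
        return ((v_ratio >= beta_low) & (v_ratio <= beta_high) &
                (np.abs(f[..., 1] - b[..., 1]) <= tau_s) &
                (np.abs(f[..., 0] - b[..., 0]) <= tau_h))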

3.2.3 Plan View Projection Module

The motivation behind using plan-view statistics for person tracking begins

with the observation that, in most situations, people usually do not have

significant portions of their bodies above or below those of other people. We

might therefore expect to separate people more easily, and to reduce occlusion

problems, by mounting our cameras overhead and pointing them toward the

ground. However, methods based on monocular video that exploit this idea

usually either must continue to deal with significant occlusion problems in

all but the central portion of the image (particularly if wide-angle lenses

are used), or must accept a somewhat limited field of view (particularly if

the ceiling is relatively low). Furthermore, when mounted overhead, the

cameras used for tracking are not suitable for extracting images of people’s

faces, which are desired in many applications that employ vision-based person

tracking.


With a stereo camera, we can produce orthographically projected, over-

head views of the scene that better separate people than the perspective

images produced by a monocular camera. In addition, we can produce these

images even when the stereo camera is not mounted overhead, but instead at

an oblique angle that maximizes viewing volume and preserves our ability to

see faces. All of this is possible because the depth data produced by a stereo

camera allows for the partial 3D reconstruction of the scene, from which

new images of scene statistics, using arbitrary viewing angles and camera

projection models, can be computed.

Every reliable measurement in a depth image can be back-projected, using

camera calibration information and a perspective projection model, to the

3D scene point responsible for it. By back-projecting all of the depth image

pixels, we create a 3D point cloud representing the portion of the scene

visible to the stereo camera. If we know the direction of the “vertical” axis

of the world - that is, the axis normal to the ground level plane in which we

expect people to be well-separated - we can discretize space into a regular

grid of vertically oriented bins, and then compute statistics of the 3D point

cloud within each bin. A plan-view image contains one pixel for each of

these vertical bins, with the value at the pixel being some statistic of the

3D points within the corresponding bin. This procedure effectively builds an

orthographically projected, overhead view of some property of the 3D scene.

Fig. 3.5 illustrates this idea.

All the methods that use plan view projection choose to image the same

statistic of the 3D points within the vertically oriented bins, namely the

count of points in each bin. In the resulting images, referred to as plan-

view “occupancy” or “density” maps, people appear as “piles of pixels” that

can be tracked as they move around the ground. Although powerful, this

representation discards virtually all object shape information in the vertical

dimension. In addition, the occupancy map representation of a person will

show a sharp decrease in saliency when the person is partially occluded by

another person or object, as far fewer 3D points corresponding to the person

will be visible to the camera.

To address these shortcomings, we image a second plan-view statistic,


Fig. 3.5: Concepts important to building plan-view maps.

namely the height above the ground-level plane of the highest point within

each vertical bin. This image, which we refer to as a “plan-view height map”,

is effectively a simple orthographic rendering of the shape of the 3D point

cloud when viewed from overhead.

All the papers using such an approach construct the maps in a similar way. We'll analyze the approach used in [10].

As we saw in Section 2.2.3 it is possible, using the reprojection matrix Q, to obtain from the coordinates (x_{screen}, y_{screen}, disparity) a triple of coordinates (X_{cam}, Y_{cam}, Z_{cam}). We will see in the architecture chapter that during the installation phase we need to perform External Calibration, which gives us a rotation matrix R_{world} and a translation vector \vec{T}_{world} that allow us to obtain (always referring to Fig. 3.5):

$$\begin{bmatrix} X_W & Y_W & Z_W \end{bmatrix}^T = R_{world}^{-1}\left( \begin{bmatrix} X_{cam} & Y_{cam} & Z_{cam} \end{bmatrix}^T - \vec{T}_{world} \right) \qquad (3.1)$$

Before building plan-view maps from the 3D point cloud, we must choose

a resolution δground with which to quantize 3D space into vertical bins. We


would like this resolution to be small enough to represent the shapes of people

in detail, but we must also consider the limitations imposed by the noise

and resolution properties of our depth measurement system. In practice, we

typically divide the (XW , YW ) plane into a square grid with resolution δground

of 2-4cm.

After choosing the bounds (X_{min}, X_{max}, Y_{min}, Y_{max}) of the ground level area within which we will restrict our attention, we can map 3D point cloud coordinates to their corresponding plan-view image pixel locations as follows:

$$x_{plan} = \lfloor (X_W - X_{min})/\delta_{ground} + 0.5 \rfloor$$

$$y_{plan} = \lfloor (Y_W - Y_{min})/\delta_{ground} + 0.5 \rfloor$$

Plan-view height and occupancy maps, denoted as H and O respectively, can be computed in a single pass through the foreground data. To do so, we first set all pixels in both maps to zero. Then, for each pixel classified as foreground, we compute its plan-view image location (x_{plan}, y_{plan}), its Z_W-coordinate, and its Z_{cam}-coordinate. If the Z_W-coordinate is greater than the current height map value H(x_{plan}, y_{plan}), and if it does not exceed H_{max}, where H_{max} is an estimate of how high a very tall person could reach with his hands if he stood on his toes, we set H(x_{plan}, y_{plan}) = Z_W. We next increment the occupancy map value O(x_{plan}, y_{plan}) by Z_{cam}^2/(f_u f_v), which is an estimate of the real area subtended by the foreground image pixel at distance Z_{cam} from the camera. The plan-view occupancy map will therefore represent the total physical surface area of foreground visible to the camera within each vertical bin of the world space.
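The following sketch condenses the whole construction for one frame (NumPy; X, Y, Z are the world coordinates of the foreground points, Zc their camera-frame depths, and the parameter values are illustrative):

    import numpy as np

    def plan_view_maps(X, Y, Z, Zc, fu, fv, bounds, delta=0.03, h_max=2.3):
        """Build the plan-view height (H) and occupancy (O) maps from the
        world coordinates of the foreground points of one frame."""
        x_min, x_max, y_min, y_max = bounds
        w = int((x_max - x_min) / delta) + 1
        h = int((y_max - y_min) / delta) + 1
        H, O = np.zeros((h, w)), np.zeros((h, w))
        xp = np.floor((X - x_min) / delta + 0.5).astype(int)
        yp = np.floor((Y - y_min) / delta + 0.5).astype(int)
        keep = (xp >= 0) & (xp < w) & (yp >= 0) & (yp < h) & (Z <= h_max)
        for x, y, z, zc in zip(xp[keep], yp[keep], Z[keep], Zc[keep]):
            H[y, x] = max(H[y, x], z)        # highest point in the bin
            O[y, x] += zc * zc / (fu * fv)   # visible surface area estimate
        return H, O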

3.2.4 Tracker Module

The vast majority of PLT systems based on plan view projection use a Kalman filter [13] during tracking. The basic idea behind the Kalman filter is that,

under a strong but reasonable set of assumptions, it will be possible, given

a history of measurements of a system, to build a model for the state of

the system that maximizes the a posteriori probability of those previous

measurements. In addition, we can maximize the a posteriori probability


without keeping a long history of the previous measurements themselves.

Instead, we iteratively update our model of a system’s state and keep only

that model for the next iteration. This greatly simplifies the computational

implications of this method.

Taking our cue from [10], we define the Kalman state of a tracked subject as a three-tuple ⟨\vec{x}, \vec{v}, S⟩ where \vec{x} is the subject's position, \vec{v} is the subject's velocity vector on the discretized plane, and S represents the body configuration of the subject. It is very interesting to analyze how S is modeled in different PLT systems.

In [25] S is made up of three templates:

• the H_template is extracted from the plan view height map H and is centered at \vec{x}; it thus represents the subject's morphology as closely as possible;

• the O_template is extracted from the plan view occupancy map O; it has the same size and the same center as the H_template. It represents the "entity of the presence" of the subject;

• the C_template, the so-called color template, is obtained from the subject's pixels in the foreground.

In [10] the authors use only the height and occupancy templates. The operations of the tracking module can be divided into three phases:

1. localization. In this phase the system searches for all the candidate templates in the O and H maps obtained from the current frame;

2. correspondence. In this phase the distance between the objects detected in the previous phase and the tracked objects stored in the database is derived. In [10] and [25] this distance is a weighted sum of the following elements:

• Sum of absolute differences (SAD) of the detected height and occupancy templates with respect to those stored in the Kalman state;

• Difference between the position produced by the prediction phase of the Kalman filter and the measured position of the candidate object;


• The inverse of the distance of the candidate object from already associated objects: the probability of a correct association decreases if there are other objects in the neighborhood;

• Only in [25], a measure of the difference between the detected color template and the color template stored in the Kalman state.

If the shortest distance is under a predefined threshold, then the database is updated with the "winner's" data.

3. possible refinements. After the correspondence phase two uncertain situations may arise:

• a candidate detected during the localization phase has not been associated with any tracked object;

• a tracked object has not been associated with any candidate object.

In these situations the system has to take decisions; to this aim it is useful to define a series of states for the objects stored in the database. An example state set is {new object, tracked, merged, lost, stale}. To guide the state transitions it is necessary to use some kind of heuristic. This heuristic may be, for example, a Bayesian network (as in [25]) or may take a simpler form; for example, if an object was detected near a door of the room during the last frame and no associable candidate has been found, the person has likely exited the room.
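As a minimal illustration of the predict/correct loop underlying the correspondence phase, here is a constant-velocity Kalman filter for the plan-view position, using OpenCV's implementation (the noise covariances are illustrative, and the full state of a real tracker also carries the templates S discussed above):

    import cv2
    import numpy as np

    # State: (x, y, vx, vy); measurement: the plan-view position (x, y).
    kf = cv2.KalmanFilter(4, 2)
    dt = 1.0  # one frame
    kf.transitionMatrix = np.array([[1, 0, dt, 0],
                                    [0, 1, 0, dt],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

    predicted = kf.predict()  # position used to score candidate objects
    matched = np.array([[1.5], [2.0]], np.float32)  # winning candidate
    corrected = kf.correct(matched)  # updated state for the next frame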

3.3 Face Recognition

Over the last ten years or so, face recognition has become a popular area of

research in computer vision and one of the most successful applications of

image analysis and understanding. Because of the nature of the problem, not

only computer science researchers are interested in it, but neuroscientists and

psychologists also. It is the general opinion that advances in computer vision

research will provide useful insights to neuroscientists and psychologists into

how the human brain works, and vice versa.


A general statement of the face recognition problem (in computer vision)

can be formulated as follows: Given still or video images of a scene, identify

or verify one or more persons in the scene using a stored database of faces.

Research directions (according to Face Recognition Vendor Test - FRVT

2002):

• recognition from outdoor facial images;

• recognition from non-frontal facial images;

• recognition at low false accept/alarm rates;

• understanding why males are easier to recognize than females;

• greater understanding of the effects of demographic factors on perfor-

mance;

• development of better statistical methods for understanding perfor-

mance;

• develop improved models for predicting identification performance on

very large galleries;

• effect of algorithm and system training on covariate performance;

• integration of morphable models into face recognition performance;

The literature in this area of research is really wide; a good reference to start with is [31], which analyzes not only the face recognition problem but also the related problem of face detection. Face recognition and face detection problems fall in the area known as machine learning.

The goal of machine learning (ML) is to turn data into information. After

learning from a collection of data, we want a machine to be able to answer

questions about the data: What other data is most similar to this data? Is

there a face in the image?

Machine learning works on data such as temperature values, stock prices,

color intensities, and so on. The data is often preprocessed into features.

We might, for example, take a database of 10000 face images, run an edge


detector on the faces, and then collect features such as edge direction, edge

strength, and offset from face center for each face. We might obtain 500 such

values per face or a feature vector of 500 entries. We could then use machine

learning techniques to construct some kind of model from this collected data.

If we only want to see how faces fall into different groups (wide, narrow, etc.),

then a clustering algorithm would be the appropriate choice. If we want to

learn to predict the age of a person from (say) the pattern of edges detected on

his or her face, then a classifier algorithm would be appropriate. To meet

our goals, machine learning algorithms analyze our collected features and

adjust weights, thresholds, and other parameters to maximize performance

according to those goals. This process of parameter adjustment to meet a

goal is what we mean by the term learning.

Now let us examine more closely the difference between clustering and classifier algorithms. Data sometimes has no labels; we might just want to see what

kinds of groups the faces settle into based on edge information. Sometimes

the data has labels, such as age. What this means is that machine learning

data may be supervised (i.e., may utilize a teaching “signal” or “label” that

goes with the data feature vectors). If the data vectors are unlabeled then

the machine learning is unsupervised.

Supervised learning can be categorical, such as learning to associate a

name to a face, or the data can have numeric or ordered labels, such as

age. When the data has names (categories) as labels, we say we are doing

classification. When the data is numeric, we say we are doing regression:

trying to fit a numeric output given some categorical or numeric input data.

In contrast, often we don’t have labels for our data and are interested in

seeing whether the data falls naturally into groups. The algorithms for such

unsupervised learning are called clustering algorithms. In this situation, the

goal is to group unlabeled data vectors that are “close” (in some predeter-

mined or possibly even some learned sense). We might just want to see how

faces are distributed: Do they form clumps of thin, wide, long, or short faces?

If we're looking at cancer data, do some cancers cluster into groups having different chemical signals? Unsupervised clustered data is also often used to form a feature vector for a higher-level supervised classifier. We might first


cluster faces into face types (wide, narrow, long, short) and then use that as

an input, perhaps with other data such as average vocal frequency, to predict

the gender of a person.

3.3.1 Face Detection

The classifier used in PLaTHEA for face detection is the Haar classifier

that falls in the category of boosted rejection cascade. OpenCV library

implements a version of the Haar classifier technique for face detection first

developed by Paul Viola and Michael Jones and commonly known as the

Viola-Jones detector [30].

This face detector is a supervised classifier. We typically present image

patches (equalized in size and histogram) to the classifier, which are then

labeled as containing (or not containing) the object of interest, which for this

classifier is most commonly a face. The Viola-Jones detector uses a rejection

cascade of nodes, where each node is a multitree classifier designed to have

high (say, 99.9%) detection rate (low false negatives, or missed faces) at the

cost of a low (near 50%) rejection rate (high false positives, or “nonfaces”

wrongly classified). For each node, a “not in class” result at any stage of the

cascade terminates the computation, and the algorithm then declares that

no face exists at that location. Thus, true class detection is declared only if

the computation makes it through the entire cascade. For instances where

the true class is rare (e.g., a face in a picture), rejection cascades can greatly

reduce total computation because most of the regions being searched for a

face terminate quickly in a nonclass decision (see Fig. 3.6).

For the Viola-Jones rejection cascade, the weak classifiers that it boosts in

each node are decision trees that often are only one level deep (i.e., “decision

stumps”). A decision stump is allowed just one decision of the following form:

“Is the value v of a particular feature f above or below some threshold t”;

then, for example, a “yes” indicates face and a “no” indicates no face.

The Haar-like features used by the classifier are shown in Fig. 3.7. At all scales, these features form the "raw material" that will be used by the boosted classifiers. They are rapidly computed from the integral image representing


Fig. 3.6: Rejection cascade used in the Viola-Jones classifier: each node represents a multitree boosted classifier ensemble tuned to rarely miss a true face while rejecting a possibly small fraction of nonfaces; however, almost all nonfaces have been rejected by the last node, leaving only true faces.

the original grayscale image; given a grayscale image G the integral image I

is given by:

$$I(X, Y) = \sum_{x \le X} \sum_{y \le Y} G(x, y) \qquad (3.2)$$
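With the pretrained frontal-face cascade distributed with the OpenCV Python package, the detection step reduces to a few lines (a sketch; the input file and scan parameters are illustrative):

    import cv2

    # Load the pretrained Viola-Jones frontal face cascade.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    gray = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # histogram equalization of the input
    faces = cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=3,
                                     minSize=(40, 40))
    for (x, y, w, h) in faces:  # one rectangle per surviving detection
        cv2.rectangle(gray, (x, y), (x + w, y + h), 255, 2)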

3.3.2 Face Recognition

Many approaches exist for face recognition⁴. In this section we'll give a brief overview of some face recognition algorithms.

Many methods of face recognition have been proposed during the past 30

years. Face recognition is such a challenging yet interesting problem that it

has attracted researchers who have different backgrounds: psychology, pat-

tern recognition, neural networks, computer vision, and computer graphics.

It is due to this fact that the literature on face recognition is vast and di-

verse. Often, a single system involves techniques motivated by different prin-

⁴The face recognition field is wide. A very complete guide to face recognition is available at http://www.face-rec.org.


Fig. 3.7: Haar-like features (the rectangular and rotated regions are easily calculated from the integral image): in this diagrammatic representation of the wavelets, the light region is interpreted as "add that area" and the dark region as "subtract that area".

ciples. The usage of a mixture of techniques makes it difficult to classify these

systems based purely on what types of techniques they use for feature rep-

resentation or classification. To have a clear and high-level categorization,

we instead follow a guideline suggested by the psychological study of how

humans use holistic and local features. Specifically, we have the following

categorization:

• Holistic matching methods. These methods use the whole face region as the raw input to a recognition system. One of the most widely used representations of the face region is eigenfaces, which are based on principal component analysis (this category includes a real "best-seller", [28]; a sketch is given after this list);

• Feature-based (structural) matching methods. Typically, in these

methods, local features such as the eyes, nose, and mouth are first

extracted and their locations and local statistics (geometric and/or

appearance) are fed into a structural classifier. One example of this

category is Hidden Markov Model - HMM [20];

• Hybrid methods. Just as the human perception system uses both


local features and the whole face region to recognize a face, a machine

recognition system should use both. One can argue that these methods

could potentially offer the best of the two types of methods.
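As an illustration of the holistic family, here is the PCA core of an eigenfaces recognizer reduced to a sketch (NumPy; gallery and probe are hypothetical arrays with one flattened grayscale face per row):

    import numpy as np

    def eigenfaces_fit(gallery, k=20):
        """PCA on a gallery of flattened face images (one per row)."""
        mean = gallery.mean(axis=0)
        centered = gallery - mean
        # SVD of the centered data: the rows of Vt are the eigenfaces.
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        basis = Vt[:k]
        return mean, basis, centered @ basis.T  # gallery projections

    def eigenfaces_match(probe, mean, basis, projections):
        """Index of the gallery face closest to probe in eigenface space."""
        p = (probe - mean) @ basis.T
        return int(np.argmin(np.linalg.norm(projections - p, axis=1)))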

One interesting feature-based matching method is based on the Scale-Invariant Feature Transform (SIFT) introduced by Lowe in [18] (the author had previously defined a method for feature matching in [2]). The SIFT method has proved very powerful with rigid objects. The use of SIFT in face recognition has been investigated in [12] and [19].

3.4 Projects around the world

3.4.1 LocON Project

LocON aims to integrate embedded location systems and embedded wireless communication systems in a standardized way, developing a new platform to control large-scale infrastructures, like airports, in a more efficient, secure, robust and flexible way.

A set of PLT systems based on different technologies has been developed in the context of this European project. All these PLT systems make use of markers for localization and tracking. The candidate localization systems in the LocON project are the following:

• Global Positioning System - GPS. It provides reliable positioning, nav-

igation, and timing services to worldwide users on a continuous basis

in all weather, day and night, anywhere on or near the Earth;

• Radio Frequency IDentification - RFID. The use of an object (typically referred to as an RFID tag) applied to or incorporated into a product, animal, or person for the purpose of identification and tracking using radio waves. The range of the system depends on the tag's type: passive tags don't have a power supply on board and the energy is supplied by inductive coupling with the RFID reader, so they have a limited range; active tags have a wider range. Recently some manufacturers have introduced the RFID e-passport;


• Ultra WideBand - UWB. It is a radio technology that can be used at

very low energy levels for short-range high-bandwidth communications

by using a large portion of the radio spectrum. UWB has traditional ap-

plications in non-cooperative radar imaging. Most recent applications

target sensor data collection, precision locating and tracking applica-

tions;

• Wi-Fi. Based on the IEEE 802.11 family of ad-hoc and infrastructure networks;

• Local Positioning Radar - LPR. A technology developed by Symeo

which uses radio signals which are not susceptible to harsh ambient con-

ditions. Symeo equipment can be deployed indoor and outdoor under

vibrations, extreme temperatures, dust and harsh weather conditions.

3.4.2 Gator Tech Smart House Project

The PLT system of this project combines two main components:

• In the localization area, the Gator Tech Smart House has embedded

sensors in the floor to determine user location. This solution is not

intrusive and guarantees the desired transparency of a pervasive com-

puting environment;

• The use of RFID technology in combination with the sensor floor allows identity detection.

3.4.3 ARGOS project

ARGOS project (Automatic Remote Grand Canal Observation System) is

a video-surveillance system for boat traffic monitoring, measurement and

management along the Grand Canal of Venice. This new system answers the specific requirements of the boat navigation rules in Venice while

providing a combined unified view of the whole Grand Canal waterway. Such

features far exceed the performance of any commercially available product.


Therefore, specific software has been developed, based on the integration of advanced automated image analysis techniques.

Obviously the ARGOS project is not a PLT system (we could define it a BOAT Localization and Tracking system), but it is a very interesting project because its context raises several problems: background modeling is a very difficult task due to the water (which is a periodically moving background entity) and due to the length of the monitored area (the Grand Canal in Venice). In fact, the ARGOS system controls a waterway about 4 km long and 80 to 150 meters wide, through 14 observation points (Survey Cells). The

system is based on the use of groups of IR/VIS cameras, installed just below

the roof of several buildings leaning over the Grand Canal. Each survey cell

is composed of 4 optical sensors: one center wide-angle (90 degree), orthog-

onal to the navigation axis, two side deep-field cameras (50-60 degree), and

a pan-tilt-zoom camera for high resolution acquisition of boat details (e.g.,

license plates).

The main ARGOS functions are:

1. optical detection and tracking of moving targets present in the field of

view (FOV);

2. computing position, speed and heading of any moving target within the

FOV of each camera;

3. elaboration at survey cell level of any event (target appears, exits, stops,

starts within the cells FOV) and transmission of any event to the Con-

trol Center;

4. connecting all the track segments related to the same target in the

different cameras FOV into a unique trajectory and track ID;

5. recording all the video frames together with the graphical information

related to track IDs and trajectories;

6. rectifying all the camera frames and stitching them into a composite

plain image so as to show a plan view of the whole Grand Canal;


7. allowing the operator to graphically select any target detected by the

system and automatically activating the nearest PTZ camera to track

the selected target.

3.4.4 RoboCare Project

The goal of the RoboCare project is to build a multi-agent system which gen-

erates user services for human assistance. The system is to be implemented

on a distributed and heterogeneous platform, consisting of a hardware and

software prototype.

Some of the results and publications (especially [23] and [25]) that this

project has originated have had a deep influence in the design of PLaTHEA.


Chapter 4

Our System and Related Works

This chapter represents a first introduction to our system called PLaTHEA

(People Localization and Tracking for HomE Automation). Here we

emphasize the lessons learned in Chapter 3, focusing on how our approach is positioned in the state of the art.

Contents

4.1 Background Modeling and Foreground Segmentation
4.1.1 The Background Model
4.1.2 Foreground Segmentation
4.1.3 Foreground Refinements
4.2 Plan View Projection and Tracking
4.2.1 Localization
4.2.2 Correspondence
4.2.3 Refinements
4.3 Face Recognition
4.3.1 Notes on Face Detection
4.4 Tracking and Face Recognition Combined


4.1 Background Modeling and Foreground Segmentation

The choice of the background modeling technique (and hence of the foreground segmentation approach too) is one of the most important choices in the design of a PLT system. In the first part of the work we carried out experimental studies on several background modeling algorithms, including those introduced in [23], [25], [11] and [14]; in the rest of this chapter we will discuss how our approach uses these techniques to fit our technological environment.

The stereo input to the system is given by a pair of Axis 207 network cameras. This kind of camera provides a video stream using MJPEG¹; the light compensation of this camera model is very aggressive, so the first constraint for our foreground segmentation method is to be insensitive to variations of image pixel intensity due to light compensation; we have noted that solving this problem is equivalent to using an algorithm insensitive to sudden illumination changes.

From this point of view our implementation of the approaches described in [23], [25] and [14] doesn't give us the desired performance: Axis 207 light compensation was too strong. The algorithm in [11], instead, showed very good results not only with respect to the camera's features but also with respect to hotspots and sudden illumination changes; however, as already stated in Section 3.2.2, this algorithm is not time adaptive, and this is an added constraint for PLaTHEA. With respect to this second constraint, the method based on pixel activity defined in [23] impressed us immediately; the concept of border activity is very natural and very effective, as shown by our tests.

The result of these considerations has been a hybrid solution using:

• intensity invariant background modeling and foreground segmentation

defined in [11];

¹Multipart JPEG (MJPEG) is a video format used in streaming applications. The stream is a sequence of JPEG images, each preceded by a header giving the file size and other information.


• the time adaptivity approach introduced in [23].

However, the algorithm structured in this way didn't satisfy us completely. As stated in Section 3.2.2, the solution in [11] is presented in two versions: the authors state that the first one has problems with dark foreground elements and the second one has problems with shadows; we have chosen the second version because for a human we need as many pixels as possible detected as foreground (due to the production of the plan view maps). The use of disparity described in [23] is very interesting, but during our experimental studies we noted that the flickering of the disparity map produced by the OpenCV library makes it harder to apply this particular technique. The solution to this issue came from reading [5]; the use of the HSV color model to detect shadows is simple, so the previous algorithm didn't need to be modified much.

In this section we will describe our proposed solution for background

modeling and foreground segmentation.

4.1.1 The Background Model

Let us start by summarizing the elements of our background model, which can be divided into the following submodels:

• the Edge Intensity Model (as seen in [23] and [25]) stores the average edge intensity and the absolute difference between the current edge intensity and the average one. Given the vertical (V) and horizontal (H) border matrices, computed by applying the Sobel filter to the current left frame, the value of the current edge intensity matrix E at pixel (X, Y) is:

E(X, Y) = \sqrt{V^2(X, Y) + H^2(X, Y)}    (4.1)

The average is, in fact, a running average whose sensitivity is given by the parameter β, so the value of the average matrix E^t_{avg} at time t and location (X, Y) is given by:

E^t_{avg}(X, Y) = (1 - β) E^{t-1}_{avg}(X, Y) + β E(X, Y)    (4.2)


The default value for β is 0.08. Due to the Axis 207 light compensation system, the difference matrix E_{diff}, obtained as the absolute difference between E and E^t_{avg}, is always noisy, so we fix a minimum value minRumour for the difference; below this value, for a specific pixel (X, Y), we set E_{diff}(X, Y) = 0;

• the Activity Model (as seen in [23]) stores the average activity of all the pixels. First we introduce the vertical (A_{vert}) and horizontal (A_{horz}) activities as follows:

A_{horz}(Y) = \sum_x E_{diff}(x, Y),    A_{vert}(X) = \sum_y E_{diff}(X, y)    (4.3)

The value at (X, Y) of the average activity matrix A^t_{avg} at time t is obtained as a running average with parameter λ:

A^t_{avg}(X, Y) = (1 - λ) A^{t-1}_{avg}(X, Y) + λ A_{horz}(Y) A_{vert}(X)    (4.4)

The default value for λ is 0.2;

• the Color Model is the most articulate submodel. For each color channel of the left frame, the model stores a running average (with learning parameter α), the difference between this average and the current frame, and the variance (whose learning factor is again α); for example, for the red channel the model stores the matrices C^R_{avg}, C^R_{diff} and C^R_{var}.

In addition, to support [11], we need to store the brightness and color distortion of the current frame (matrices C_{bd} and C_{cd} respectively) and the reference brightness and color running averages (matrices C^b_{avg} and C^c_{avg} respectively)2.

It is important to note that, according to [23], the value of α is calculated on a per-pixel basis and is inversely proportional to the pixel's activity: if a pixel presents a high activity value, its color model won't be updated; conversely, if the pixel activity is low, the color model is updated with a learning factor that depends on the activity of the pixel. The learning factor α_{mod} for the pixel (X, Y) is obtained as follows:

α_{mod}(X, Y) = α (1 - A^t_{avg}(X, Y) / η)    (4.5)

where η is an Activity Normalization Factor.

2For the mathematical details we refer the reader to the original paper.

Note, finally, that the use of the shadow detection algorithm in [5] implies no overhead in the model: during the foreground segmentation phase the HSV model is derived directly from the RGB model.
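To make the update concrete, the following is a minimal sketch of one per-pixel update step implementing equations (4.2), (4.4) and (4.5); it uses the modern OpenCV C++ API (cv::Mat) for brevity, whereas the actual implementation works on IplImage buffers, and the function name, matrix layout and the omission of the minRumour floor are our own simplifications.

    #include <opencv2/core.hpp>
    #include <algorithm>

    // One update step of the background model (a sketch of eqs. 4.2-4.5).
    // Single-channel matrices are CV_32F; frame and Cavg are CV_32FC3.
    void updateBackgroundModel(const cv::Mat& E,     // current edge intensity
                               cv::Mat& Eavg,        // running average of E
                               cv::Mat& Aavg,        // running average activity
                               const cv::Mat& frame, // current left frame (BGR)
                               cv::Mat& Cavg,        // per-channel color average
                               float beta, float lambda, float alpha, float eta)
    {
        // Edge intensity running average (eq. 4.2).
        Eavg = (1.0f - beta) * Eavg + beta * E;

        // Row and column activities from |E - Eavg| (eq. 4.3; the
        // minRumour floor on the difference is omitted here).
        cv::Mat Ediff = cv::abs(E - Eavg);
        cv::Mat Ahorz, Avert;
        cv::reduce(Ediff, Ahorz, 1, cv::REDUCE_SUM); // one sum per row (Y)
        cv::reduce(Ediff, Avert, 0, cv::REDUCE_SUM); // one sum per column (X)

        for (int y = 0; y < Aavg.rows; ++y) {
            for (int x = 0; x < Aavg.cols; ++x) {
                // Activity running average (eq. 4.4).
                float a = Ahorz.at<float>(y, 0) * Avert.at<float>(0, x);
                float& aavg = Aavg.at<float>(y, x);
                aavg = (1.0f - lambda) * aavg + lambda * a;

                // Activity-modulated learning factor (eq. 4.5): active
                // pixels barely update their color model, quiet ones
                // update quickly.
                float alphaMod = std::max(0.0f, alpha * (1.0f - aavg / eta));

                // Color running average with the per-pixel learning factor.
                cv::Vec3f& c = Cavg.at<cv::Vec3f>(y, x);
                c = (1.0f - alphaMod) * c
                    + alphaMod * frame.at<cv::Vec3f>(y, x);
            }
        }
    }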

4.1.2 Foreground Segmentation

Given the description of the background model, we now describe the procedure used to declare a pixel a foreground pixel. We apply in cascade the method described in [11] and then the method introduced in [5]:

1. given the algorithm parameters minCD, minBD, maxBD defined in [11], we first define for a pixel (X, Y) the following values:

brightnessRatio = (C_{bd}(X, Y) - 1) / C^b_{avg}(X, Y)
colorRatio = C_{cd}(X, Y) / C^c_{avg}(X, Y)

and then (X, Y) is a candidate foreground pixel if:

colorRatio > minCD or (brightnessRatio > minBD and brightnessRatio < maxBD)

2. given the algorithm parameters minDarkening, maxDarkening, t_s, t_h defined in [5], the pixel (X, Y) is not a foreground pixel if:

minDarkening < C^V(X, Y) / C^V_{avg}(X, Y) < maxDarkening and
|C^S(X, Y) - C^S_{avg}(X, Y)| < t_s and |C^H(X, Y) - C^H_{avg}(X, Y)| < t_h

In Fig. 4.1 we have a screenshot of the foreground segmentation.
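For a single pixel, the cascade just described can be condensed into a helper like the following sketch; the threshold names follow the text, while the helper itself is illustrative and assumes that the brightness/color distortions of [11] and the HSV values of the pixel and of the background model have already been computed.

    #include <cmath>

    // Sketch of the cascaded per-pixel foreground test of section 4.1.2.
    // bd/cd are the brightness and color distortion of [11]; bavg/cavg
    // their reference running averages; (h, s, v) is the current pixel in
    // HSV and (havg, savg, vavg) the corresponding background value.
    bool isForeground(float bd, float cd, float bavg, float cavg,
                      float h, float s, float v,
                      float havg, float savg, float vavg,
                      float minCD, float minBD, float maxBD,
                      float minDarkening, float maxDarkening,
                      float ts, float th)
    {
        // Step 1: candidate test from [11].
        float brightnessRatio = (bd - 1.0f) / bavg;
        float colorRatio      = cd / cavg;
        bool candidate = (colorRatio > minCD) ||
                         (brightnessRatio > minBD && brightnessRatio < maxBD);
        if (!candidate)
            return false;

        // Step 2: shadow veto in HSV space from [5]. A uniformly darkened
        // pixel whose hue and saturation stay close to the background is
        // a shadow, not foreground.
        float darkening = v / vavg;
        bool shadow = (darkening > minDarkening && darkening < maxDarkening)
                      && std::fabs(s - savg) < ts
                      && std::fabs(h - havg) < th;
        return !shadow;
    }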


Fig. 4.1: Background Modeling and Foreground Segmentation at work. In the "background" window we have the color model for the background. In the "connected components" window we have the currently detected foreground after the foreground refinements (see the next subsection). Finally, in the "Plan View Occupancy Map" window we have the occupancy map for the tracked subject (see section 4.2).


4.1.3 Foreground Refinements

Some PLT systems clean the foreground matrix: it is possible, for example, to find blobs in the foreground matrix and eliminate those too small to be people. Our experience shows that this operation is not strictly necessary; another possible solution is to clean the foreground matrix using only a Median Filter, which eliminates the so-called "salt and pepper" effect. Because this choice can be a "matter of taste", we leave it to the installer, providing the possibility of setting it (via the administration GUI).

If the installer chooses to use the foreground contour scanner, he has to choose only one parameter, namely the Filter Perimeter Scale factor p_s. This scanner deletes from the foreground maps all the contours whose perimeter P_{contour} is below the selected fraction of the image frame semi-perimeter (half of the perimeter) P_{image}; that is, a contour is deleted if:

P_{contour} < P_{image} / p_s    (4.6)
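A minimal sketch of such a contour scanner, assuming a binary CV_8U foreground mask and, again, the modern OpenCV C++ API (the names are ours):

    #include <opencv2/imgproc.hpp>
    #include <vector>

    // Erase from the foreground mask every contour whose perimeter is
    // below the fraction of the frame semi-perimeter given by eq. (4.6).
    void filterSmallContours(cv::Mat& foreground, double ps)
    {
        double semiPerimeter = foreground.cols + foreground.rows;
        std::vector<std::vector<cv::Point> > contours;
        cv::Mat work = foreground.clone(); // older OpenCV modified the input
        cv::findContours(work, contours, cv::RETR_EXTERNAL,
                         cv::CHAIN_APPROX_SIMPLE);

        for (size_t i = 0; i < contours.size(); ++i)
            if (cv::arcLength(contours[i], true) < semiPerimeter / ps)
                cv::drawContours(foreground, contours, (int)i,
                                 cv::Scalar(0), cv::FILLED); // paint it out
    }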

4.2 Plan View Projection and Tracking

The Plan View projection phase in PLaTHEA is identical to that described in 3.2.3. The Tracking phase, though inspired by [10] and [25], is innovative and the result of an iterative refinement process. The Kalman state for each tracked person is the same used in [25].

In the rest of this section we analyze the Tracking module used in PLaTHEA, following the same subdivision of section 3.2.4.

4.2.1 Localization

Instead of identifying blobs in the foreground matrix [25], our PLT system identifies candidate humans directly in the plan view occupancy map O. The localization problem is solved with the following steps:

1. we use a contour scanner on the plan view occupancy map to retrieve all the external contours, and for each of them we compute a bounding box. The dimension of these bounding boxes is normalized to a common template size, obtained by dividing twice the average width of a person's torso by the texel side chosen during the plan view projection;

2. we use the bounding boxes detected at the previous step to compute statistics on the corresponding areas of O and H. A bounding box contains a candidate for tracking if the following constraints are respected:

• the integral over the area subtended by the bounding box on O is greater than a certain threshold (in [10] a formula is given to obtain this threshold in a deterministic way);

• the maximum height in the area subtended by the bounding box in H is greater than a minimum height.

During the two steps just described, the templates for each candidate are produced, so that during the correspondence stage they are ready to be consumed (a sketch of the candidate test is given below).
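A sketch of the candidate test, assuming O and H are the CV_32F plan view occupancy and height maps and that the thresholds come from the configuration (names are illustrative):

    #include <opencv2/core.hpp>

    // A bounding box found on the occupancy map O is kept as a candidate
    // only if enough occupancy mass and a sufficient height fall inside it.
    bool isCandidate(const cv::Mat& O, const cv::Mat& H, const cv::Rect& box,
                     double minOccupancy, double minHeight)
    {
        // First constraint: integral of O over the area under the box.
        if (cv::sum(O(box))[0] <= minOccupancy)
            return false;

        // Second constraint: maximum height inside the box on H.
        double maxH;
        cv::minMaxLoc(H(box), 0, &maxH);
        return maxH > minHeight;
    }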

4.2.2 Correspondence

Now we have a set of tracked persons T = {t1, t2, ..., tn} and a set of candidate objects C = {c1, c2, ..., cm}; we can think of the elements of these two sets as nodes in a weighted complete bipartite graph3; see Fig. 4.2 for details.

Once a weight is defined for each edge, it is simple to find the best correspondence for each element; the problem to solve is in fact known as the minimum weighted bipartite matching problem (or, more simply, the assignment problem); this is a very well studied problem which has an efficient solution in the Hungarian Algorithm4.

3In the mathematical field of graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint sets T and C such that every edge connects a vertex in T to one in C; that is, T and C are independent sets. Such a graph is also complete if every node in T is connected to every node in C. Finally, the graph is weighted because a weight is associated with every edge.

4It was published by Kuhn in [17], who gave it the name "Hungarian method" because the algorithm was largely based on the earlier works of two Hungarian mathematicians. The time complexity of the original algorithm was O(n^4); however, Edmonds and Karp noticed that it can be modified to achieve an O(n^3) running time.


Fig. 4.2: An example of a weighted complete bipartite graph; a weight is associated with every edge.

Let us now discuss the distance measure. We initially thought of a composite measure including all the elements of the Kalman state of a tracked subject; it turns out that this solution is difficult, because every distance must be given a weight, and consistent weights are hard to obtain. So the elaboration of the distance proceeds by steps; for each element of T and each element of C:

1. in the first step we calculate the following distance measures:

• the difference between color templates, computed as proposed in [25];

• the Euclidean distance between the predicted position (via the Kalman filter) of the tracked object and the position of the candidate;

• the ratio between the average height of the tracked object and the average height of the candidate (we keep it always greater than 1 so that it works as an incremental factor);

2. each of these measures has its own maximum value; now we have two possibilities:

• if at least one of the measures exceeds the corresponding threshold, the edge's weight is set to the product of the three measures;

• in the other cases the edge's weight is set simply to the color difference.

The rationale behind this technique is that, if all the constraints are respected, there is a high probability that the candidate object corresponds to the tracked object; obviously this heuristic works better if people wear clothes of very different colours. The weighting is sketched below.
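The three measures are assumed to be already computed, and all names are ours:

    // Edge weight between a tracked object and a candidate (section 4.2.2).
    double edgeWeight(double colorDist, double posDist, double heightRatio,
                      double maxColor, double maxPos, double maxHeightRatio)
    {
        // heightRatio is kept >= 1 so that it acts as an incremental factor.
        if (heightRatio < 1.0)
            heightRatio = 1.0 / heightRatio;

        bool anyExceeded = colorDist > maxColor || posDist > maxPos ||
                           heightRatio > maxHeightRatio;

        // If some threshold is exceeded, the product of the three measures
        // is used (penalizing the pair); otherwise the color difference
        // alone decides the matching.
        return anyExceeded ? colorDist * posDist * heightRatio : colorDist;
    }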

After all the weights are defined, the weight matrix is given as input to the Hungarian algorithm, which finds the best matching. The result of the algorithm is analyzed in the following way:

• if the algorithm has found a correspondence and the weight associated with it is under a predefined threshold, then the tracked object is updated using the candidate associated with it by the Hungarian algorithm;

• if the algorithm has found a correspondence but the weight is too high, then the tracked object is updated with the position predicted by the Kalman filter, and all the templates are kept unvaried;

• if a tracked object has no correspondence (this can happen if |T| is greater than |C|), we follow the same behaviour as in the previous case;

• if a candidate object has no correspondence (this can happen if |C| is greater than |T|), we add a new tracked object using the candidate's templates.

4.2.3 Refinements

In PLaTHEA an object in T is always in exactly one of the following states:

• NEWOBJECT. It is the state assigned to a new entry in the tracked objects' database. In this state an object is not really tracked, and PLaTHEA doesn't provide the client any update about it;

• TRACKED. A NEWOBJECT enters this state if it is successfully tracked more than 5 times consecutively. Of such an object we know a certain position, assured by the correspondence found by the Hungarian algorithm;

• LOST. An object enters this state if, for a frame, no correspondence is found for it by the Hungarian algorithm;

• STALE. An object enters this state if it is a NEWOBJECT for which no correspondence is found before it becomes TRACKED, or if it has been LOST for more than 100 frames. An object in this state is deleted from the database.

Fig. 4.3 summarizes these transition policies.

Fig. 4.3: The state transition diagram for PLaTHEA tracked objects. Note that this transition diagram treats a tracked object as an anonymous entity; we will review this diagram later using identity information.
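The lifecycle of Fig. 4.3 can be sketched as a simple per-frame state update; the promotion threshold (more than 5 consecutive hits) and the expiry threshold (100 frames) follow the text, while the structure and counters are illustrative:

    // Per-frame state update for a tracked object (a sketch of Fig. 4.3).
    enum TrackState { NEWOBJECT, TRACKED, LOST, STALE };

    struct TrackedObject {
        TrackState state;
        int consecutiveHits; // consecutive successful correspondences
        int framesLost;      // frames spent in the LOST state
    };

    void updateState(TrackedObject& obj, bool matchedThisFrame)
    {
        if (matchedThisFrame) {
            if (obj.state == NEWOBJECT && ++obj.consecutiveHits > 5)
                obj.state = TRACKED;   // confirmed after the streak
            else if (obj.state == LOST)
                obj.state = TRACKED;   // re-acquired
            obj.framesLost = 0;
        } else {
            obj.consecutiveHits = 0;
            if (obj.state == NEWOBJECT)
                obj.state = STALE;     // never confirmed: to be deleted
            else if (obj.state == TRACKED) {
                obj.state = LOST;
                obj.framesLost = 1;
            } else if (obj.state == LOST && ++obj.framesLost > 100)
                obj.state = STALE;     // expired: to be deleted
        }
    }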

4.3 Face Recognition

The face recognition method used in PLaTHEA can be classified as feature-based. It uses SIFT features to create a database where, for each registered person, the system stores a set of images. When the system starts, it prepares itself for recognition with the following sequence of steps:

1. it loads, for each person, the corresponding set of images;


2. for each image it computes the SIFT features;

3. for each single image it sorts the SIFT features into a kd-tree [2]; this data structure allows a fast similarity computation during the recognition phase.

For an example of the SIFT database for a single person, see Fig. 4.4.

Fig. 4.4: The SIFT database for a single person. In each image the SIFT features are highlighted. In the deployed system we have at least ten images per person.

Recognition involves the following steps:

1. SIFT features are extracted from the test face;

2. we assign a similarity score to each person in the database with respect to the test face; this score is the sum of the scores assigned to each face related to the specific person. For each feature detected in the test face, the nearest feature in a specific face is found using the BBF - Best Bin First search on the face's kd-tree, computed during face database training. A database face's score is given by counting all the features, corresponding to test face features, which respect the following constraints:

• the distance from the corresponding feature is under a selected threshold;

• the ratio between this distance and the distance between the test feature and the second best matching is under a selected threshold;

3. if the highest person score is over a threshold, then the algorithm assigns the face to that person.
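As an illustration of the scoring step, the following sketch uses cv::flann as a stand-in for the kd-tree/BBF search of the actual implementation; names and parameters are ours, and the index is assumed to have been built at start-up on the SIFT descriptors of one database face.

    #include <opencv2/core.hpp>
    #include <opencv2/flann.hpp>

    // Score of one database face against a test face: count the test
    // features whose nearest neighbour passes both an absolute-distance
    // test and Lowe's ratio test against the second-best match.
    int faceScore(const cv::Mat& testDesc,      // one SIFT descriptor per row
                  cv::flann::Index& faceIndex,  // kd-tree on a database face
                  float maxDist, float maxRatio)
    {
        cv::Mat indices, dists;
        // Approximate 2-nearest-neighbour search (BBF-style, bounded checks).
        faceIndex.knnSearch(testDesc, indices, dists, 2,
                            cv::flann::SearchParams(64));
        int score = 0;
        for (int i = 0; i < testDesc.rows; ++i) {
            float best   = dists.at<float>(i, 0); // (squared) L2 distances
            float second = dists.at<float>(i, 1);
            if (best < maxDist && best / second < maxRatio)
                ++score;
        }
        return score;
    }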

SIFT performs at its best with rigid objects. From this consideration it follows that database training plays a prominent role in PLaTHEA's recognition performance: we want every person to store in the database expressions as varied as possible. This helps a SIFT-based face recognition algorithm give high scores to the right person. See Fig. 4.5 for an example of execution.

It is important to note in Fig. 4.4 and 4.5 the effects of non-diffuse illumination in the room. In Fig. 4.4 the number of features detected on the left side of the face (the most illuminated) is noticeably greater than on the right side; also in Fig. 4.5 it is simpler for the algorithm to match features on the best illuminated side of the face. This is a remarkable issue for PLaTHEA's installer.

4.3.1 Notes on Face Detection

Before ending the section we want to discuss the importance of face detection in PLaTHEA. Obviously the system has to support the presence of multiple people in a room. The left camera of the stereo rig doesn't supply close-ups of faces: we have an image of the whole room, from which the close-ups have to be extracted. To this aim the face detector presented in section 3.3.1 is useful. Our experience proves that face detection is the computationally most expensive operation (we will see in chapter 6 that this forces us to execute face detection and face recognition in a thread of execution parallel to the rest of the elaboration); the only way to speed up the process is to define a minimum face size.

The reader might think that this creates a problem for the face recognition phase, because the smaller faces aren't detected and the system will not even try to recognize them. This is not the case: if a face is too small, the face recognition system is unable to find correspondences anyway, because not enough features can be detected on the test face.

The faces stored in the database have a size of 150x150 pixels, and the face detection system is set to find faces with a minimum size of 75x75 pixels. Before recognition, if the detected face is bigger than the database faces, the system resizes it using cubic interpolation; conversely, if the face is smaller than the database size, the system doesn't try to zoom it, because enlarging cannot add real information.

Fig. 4.6: The Viola-Jones detector at work; two faces detected.

The reference Haar features for the Haar detector employed in PLaTHEA are provided by the OpenCV library. In particular we use the training set for frontal faces. It is possible to use multiple training sets (OpenCV also provides training sets for profiles), but our experience suggested using only the training set for frontal faces. This implies that it is useless to store people's profiles in the database.
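A sketch of the detection and normalization steps described above, using the OpenCV CascadeClassifier wrapper; the cascade file name is the standard frontal-face set shipped with OpenCV, and the sizes follow the text.

    #include <opencv2/objdetect.hpp>
    #include <opencv2/imgproc.hpp>
    #include <vector>

    // Detect frontal faces no smaller than 75x75 pixels in the left frame.
    std::vector<cv::Rect> detectFaces(const cv::Mat& grayFrame)
    {
        static cv::CascadeClassifier cascade(
            "haarcascade_frontalface_default.xml");
        std::vector<cv::Rect> faces;
        cascade.detectMultiScale(grayFrame, faces, 1.2, 3, 0,
                                 cv::Size(75, 75));
        return faces;
    }

    // Bring a detected face to the 150x150 database format: shrink bigger
    // faces with cubic interpolation, never enlarge smaller ones.
    cv::Mat normalizeFace(const cv::Mat& face)
    {
        if (face.cols > 150 && face.rows > 150) {
            cv::Mat resized;
            cv::resize(face, resized, cv::Size(150, 150), 0, 0,
                       cv::INTER_CUBIC);
            return resized;
        }
        return face; // zooming in would not add real information
    }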

4.4 Tracking and Face Recognition Combined

Until now we haven't faced the problem of how to combine the identity information provided by face recognition with the tracking information provided by the tracking module.

If a face is recognized at a time step, its center of mass is reprojected on the plan used in tracking, in the same way as if it were a foreground pixel. Then we find the tracked object present at that position and assign the identity to it (in fact, before assigning an identity to a tracked object, the same identity has to be recognized for three consecutive times). Unfortunately we had to face a problem: most of the time the stereo correspondence algorithm doesn't provide a disparity value for the face pixels5; so we move the aforementioned face's center of mass down to the chest (we use the face dimension to do this) and reproject this point to the floor.

Now the reader may have a doubt: why don't we track faces directly, instead of the pixels detected via foreground segmentation? We have several answers to this question:

1. users don't always look at the camera;

2. the face detector is not perfect at all: sometimes it successfully finds a face and sometimes, due to adverse light conditions or to obstructions, it is not able to do so;

3. like the face detector, the face recognition system is not perfect: sometimes a face doesn't match any person with a score high enough to be sure of the identity.

So, in our vision, the combination of tracking and face recognition gives better results than simple face tracking.

5This is due to the speed optimization techniques used in the OpenCV SAD-based stereo correspondence algorithm.


Chapter 5

System Requirements and Architecture

Contents

5.1 Overview on System Requirements . . . . . . . . . . . . 66

5.2 A Look at the Architecture . . . . . . . . . . . . 67

5.2.1 Embedding PLaTHEA . . . . . . . . . . . . . . . 67

5.2.2 The Components’ Architecture . . . . . . . . . . . 68

5.2.3 The Software Dependencies . . . . . . . . . . . . . 68

5.3 The Storage . . . . . . . . . . . . . . . . . . . . . . 70

5.3.1 The Camera Calibration Database . . . . . . . . . 71

5.3.2 The Face Database . . . . . . . . . . . . . . . . . . 72

5.4 The Elaboration Core . . . . . . . . . . . . . . . . 72

5.5 The UPnP Device . . . . . . . . . . . . . . . . . . 73

5.6 The External Entities . . . . . . . . . . . . . . . . 75

5.7 Use Cases . . . . . . . . . . . . . . . . . . . . . . . 76

5.7.1 Installation and Configuration . . . . . . . . . . . 76

5.7.2 Run Time Installation Refinements . . . . . . . . . 81

5.7.3 The Face Database Construction . . . . . . . . . . 82

5.7.4 Run Time Use Cases . . . . . . . . . . . . . . . . . 82


5.1 Overview on System Requirements

The initial chapters have given the theoretical basis of PLT systems in general and of PLaTHEA in particular. With this section we start to describe the system from a practical point of view, and the best way to do this is to describe the system requirements; some of these have already emerged in the first chapters, and here we face them with a more systematic approach.

In the first place, the system has to be as transparent as possible to the user. We have already stated that a marker-based PLT system produces a psychological effect which induces rigidity and lack of naturalness in the user; the use of cameras instead of markers thus gives us a first kind of transparency. In a second sense, by transparency we mean that users don't have to follow a particular behaviour to let the system work (for example, assuming particular poses or pronouncing some magic words). In our vision, the only interaction the user has with PLaTHEA is during the training phase; that is, the user only has to produce his photographic book for face recognition (of course we are talking of the interaction with PLaTHEA; if we refer to the SM4All system as a whole, the users also have to define the so-called "scenes").

In the second place, we wanted PLaTHEA to be easily integrable in a home automation system as well as in home building design. The first requirement demanded that the interaction with the system be loosely coupled, and a perfect tool for this is a service-based architecture: the client of the system (the pervasive layer in SM4All's slang) interacts with the system via services (in a synchronous or asynchronous fashion; we will return on this aspect later), and the infrastructure for this is offered by the UPnP standard. The second requirement ruled out the use of special kinds of data buses: video frames, as well as service requests and replies, travel on a simple Ethernet bus (in our vision of the futuristic home, the whole home is wired using Ethernet).

In the third place, we want our system to be as cheap as possible from an economical point of view. In our vision each room in the home should be equipped with a couple of off-the-shelf cameras1 (the 207 is the entry-level model of the network camera family produced by Axis) and with a computer (the system is for now deployed on a notebook, but we hope to deploy it on a simpler machine, possibly equipped with an embedded operating system).

We also need a robust system: PLaTHEA has to be started and, from that moment, never stopped (of course, this assumption may seem a little strong). So during the implementation a lot of attention has been dedicated to error handling, memory management and so on.

Finally, last but not least, we have given particular attention to the deployment phase. The administration interface allows one to easily make the system work in a short time; it provides an easy interface for all kinds of calibration and for the creation of the face database.

5.2 A Look at the Architecture

In this section we first describe the deployment of a set of PLaTHEA installations in a typical home, and then describe the component architecture of a single instance.

5.2.1 Embedding PLaTHEA

In Fig. 5.1 we explain our vision of how the system has to be deployed in a

home.

In our vision we have a router which represents the access point to the Internet. This router offers Wi-Fi networks as well as an Ethernet network, and is connected via Ethernet to a set of switches (one for each room in the home). Every room has a computer installed; this computer runs all the services for its particular room, including PLaTHEA. The cameras are connected to the room's switch. This is for us a good solution, because it isolates the traffic of the cameras' frames (we will see in the test chapter that this traffic may reach 10% of a LAN/100) and also offers the possibility to install other services on the computer.

Fig. 5.1: PLaTHEA embedded in a home. For each room we have the basic elements of the system.

1Note that there is a trade-off between cost and performance. For example, a good face recognition system requires high resolution cameras if possible.

5.2.2 The Components’ Architecture

Now that we have an idea of how the system is deployed, we can analyze the structure of a single instance. The principal components of the PLaTHEA architecture are shown in Fig. 5.2; in the rest of the chapter we will describe in detail the various components of this architecture.

We'll proceed from the lower layer up to the "presentation layer" (that is, the UPnP Device), and then we'll talk about the elements external to the system which interact with it.

5.2.3 The Software Dependencies

To operate, PLaTHEA needs the set of libraries indicated in Fig. 5.3.


Fig. 5.2: The components of PLaTHEA with their responsibilities and dependencies.

Fig. 5.3: The library dependencies of PLaTHEA.


We give a brief introduction to all these libraries:

• OpenCV is a computer vision library originally developed by Intel. It is free for use under the open source BSD license and is cross-platform. It focuses mainly on real-time image processing; as such, if it finds Intel's Integrated Performance Primitives on the system, it will use these commercial optimized routines to accelerate itself. The OpenCV library can be downloaded for free at http://sourceforge.net/projects/opencvlibrary/;

• LibJPEG 7 is a C library used for decompression and compression of JPEG images; the library has been recently updated to support C++ and is available for free at http://www.ijg.org/;

• CyberLink UPnP is a C++ library which follows version 1.0 of the UPnP standard; it is developed by Satoshi Konno for a wide variety of platforms; the C++ version is available at http://clinkcc.sourceforge.net/;

• Xerces is an XML library used by CyberLink UPnP (UPnP is inherently based on SOAP and hence on XML); it is available for C++ and Java; for the C++ version see http://xerces.apache.org/xerces-c/.

LibJPEG and CyberLink UPnP are linked into the application as static libraries, so they don't have to be installed simply to execute the system (of course they are needed to recompile the code). OpenCV and Xerces, instead, are linked as dynamic libraries, so they have to be installed on the disk and their bin directories have to be in the path in order to execute the software.

The application is developed using Microsoft Visual Studio 2008, but we have avoided using Microsoft extensions to the C++ language, so it is easy to compile the code under another compiler.

5.3 The Storage

The system has to manage two permanent repositories of data: the camera calibration database and the face database. These repositories aren't real databases, but rather collections of files necessary to the system during its run.

5.3.1 The Camera Calibration Database

The camera calibration database contains all the matrices and vectors introduced in chapter 2, plus other files.

We start with the files produced by Stereo Cameras Calibration; with this term we indicate the Camera Calibration together with the Stereo Calibration; this is due to the fact that during the installation phase PLaTHEA executes the two operations simultaneously. The results of this operation are:

• the intrinsics matrices for the left and the right camera, denoted respectively with Mleft and Mright and stored in "LeftIntrinsics.xml" and "RightIntrinsics.xml";

• the distortion vectors for the left and the right camera, denoted respectively with Dleft and Dright and stored in "LeftDistortion.xml" and "RightDistortion.xml";

• the rotation matrix R stored in "Rotation.xml" and the translation vector T stored in "Traslation.xml";

• the essential matrix E stored in "Essential.xml" and the fundamental matrix F stored in "Fundamental.xml";

• the reprojection matrix Q stored in "3DReprojection.xml";

• the pixel remapping matrices for undistortion and rectification of both cameras: for the left camera the files "mx LEFT.xml" and "my LEFT.xml", for the right camera "mx RIGHT.xml" and "my RIGHT.xml".

Now we analyze the files produced by the so-called external calibration. We will see later how this kind of calibration is done; for now we only say that it produces the means to transform 3D camera coordinates into 3D room coordinates; as we have already seen, this operation is necessary to create the plan view maps. The external calibration produces as output a translation vector Tworld (stored in "External Traslation.xml") and a rotation matrix Rworld (stored in "External Rotation.xml").

The last source for the camera calibration database is the room settings, stored in a file with extension ".rsf". This file contains the environment data and the persons' common data:

• the minimum and maximum values for the XW, YW and ZW coordinates, that is, the room size;

• the texel side for the plan view projection;

• the persons' minimum, maximum and average height and the persons' average width.

Before starting the elaboration, the system has to load all these files to perform the computations seen in the previous chapters; a sketch of this loading step is given below.
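As an illustration, the calibration files could be loaded at start-up with cv::FileStorage as sketched here; the idea of reading the first top-level node of each XML file is an assumption of ours (the files are written by the calibration tool).

    #include <opencv2/core.hpp>
    #include <string>

    // Load a subset of the stereo calibration files listed above.
    bool loadStereoCalibration(const std::string& folder,
                               cv::Mat& Mleft, cv::Mat& Dleft, cv::Mat& Q)
    {
        cv::FileStorage fs;
        if (!fs.open(folder + "/LeftIntrinsics.xml", cv::FileStorage::READ))
            return false;
        fs.getFirstTopLevelNode() >> Mleft; // 3x3 intrinsics matrix

        if (!fs.open(folder + "/LeftDistortion.xml", cv::FileStorage::READ))
            return false;
        fs.getFirstTopLevelNode() >> Dleft; // distortion coefficients

        if (!fs.open(folder + "/3DReprojection.xml", cv::FileStorage::READ))
            return false;
        fs.getFirstTopLevelNode() >> Q;     // 4x4 reprojection matrix

        return true;
    }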

5.3.2 The Face Database

The face database is not so complex. It contains a resume file with extension ".dof", holding each person's identification number and name together with the number of images stored in the database; these images have a name composed of the person's name and an incremental index.

5.4 The Elaboration Core

The Elaboration Core is the most important component of the system. This is the component that does the real work:

• it acquires the MJPEG streams from the cameras, synchronizes them (more on this later) and decompresses them into bitmap images (OpenCV uses a format known as IplImage);

• it models the background and performs foreground segmentation;

• it updates the plan view maps and tracks the objects on them;

• it performs face recognition and combines this information with the tracking information.

Fig. 5.4: The principal software modules in the Elaboration Core.

5.5 The UPnP Device

The UPnP Device represents the presentation layer of the system. It allows clients to interact with the system in a synchronous or asynchronous fashion.

Universal Plug and Play (UPnP) [29] is a set of networking protocols promulgated by the UPnP Forum. The goals of UPnP are to allow devices to connect seamlessly and to simplify the implementation of networks in the home (data sharing, communications, and entertainment) and in corporate environments for simplified installation of computer components. UPnP achieves this by defining and publishing UPnP device control protocols (DCP) built upon open, Internet-based communication standards.

The term UPnP is derived from plug-and-play, a technology for dynamically attaching devices directly to a computer, although UPnP is not directly related to the earlier plug-and-play technology. UPnP devices are "plug-and-play" in that, when connected to a network, they automatically announce their network address and supported device and service types, enabling clients that recognize those types to immediately begin using the device.

Fig. 5.5: The synchronous and asynchronous UPnP interfaces used in PLaTHEA.

A UPnP control point (that is, a UPnP client) can interact with a UPnP device mainly in two ways:

• it can call synchronous methods with blocking calls;

• it can subscribe to services; a service contains a series of variables that communicate their value changes to the subscriber; this is the UPnP asynchronous interaction.

In Fig. 5.5 we have inserted the subscription methods in the set of the synchronous methods. We now describe the use of each method and of each evented variable:

• the GetListIDRegistered method returns an XML string with a set of couples <id, name>, one for each registered user;

• the GetRoomInfo method returns an XML string containing information about the controlled room;

• the GetPositionFromPersonID method takes as input a person id and returns the position of the corresponding user, if he is present in the room;

• the GetPositionFromObjectID method takes as input a tracked object id (these IDs, differently from person IDs, are temporary) and returns the position of this object;

• the GetAllPositions method returns an XML string with the positions (and the eventual identities) of all the tracked objects;

• the notifyNewObject state variable changes its value if the system has detected one or more new objects; the variable is set to an XML string containing the tracked object's information; to receive this update a client has to subscribe to the mainService;

• the notifyNewRecognizedObject state variable changes its value if a previously tracked object has been recognized; to receive this update a client has to subscribe to the mainService;

• the notifyAllFrames state variable changes periodically, with a period depending on which periodic service the client is subscribed to.

5.6 The External Entities

We now analyze the remaining components of the architecture depicted in Fig. 5.2. An important feature of a home automation system is the identity database. When a new face database entry is registered, PLaTHEA has to advise that system of the event; this is done because a client obtains from PLaTHEA, via UPnP, an id valid only for our system. The identity database allows the client to complete the information about this ID.

In synthesis, the identity database gives the client a correspondence between the IDs registered in PLaTHEA and the real identities certified by the home automation system.


5.7 Use Cases

In this section we take a look at PLaTHEA at work. The operation of the system can be subdivided into two main phases: the Installation and Configuration Phase and the Elaboration Phase. It is remarkable that a part of the system behaviour (the parameters introduced in Chapter 4) is configurable at run time, allowing one to see the effect of a variable change on the overall system.

5.7.1 Installation and Configuration

The first installation step is the physical mounting of the stereo rig. During this phase it is useful to take into account the following considerations:

• as already stated, the cameras have to be mounted as close to frontal parallel as possible (see Fig. 2.2 for details);

• the depth resolution of the system depends on the chosen baseline; this means that this parameter has to be fixed according to the room size: the nearer the cameras are to each other, the worse the resolution at high distances;

• it is better to mount the stereo rig in the corner of the room opposite the entrance door, near the ceiling, to obtain the largest field of view;

• it is important to choose a camera model whose resolution is adequate to the room size. As we have seen, the face detection algorithm cuts off faces smaller than 75x75 pixels; with high resolution cameras we obtain bigger faces;

• the two cameras have to be identical.

The second step is to start an uncalibrated acquisition from the net cameras (see Fig. 5.6). Before continuing, we describe the information requested in the acquisition window of Fig. 5.6:


Fig. 5.6: The acquisition window; during this phase it is important to uncheck the calibrated option.

• the IP addresses and ports of the stereo cameras; it is important to note that left and right are defined looking from behind the cameras towards the cameras' direction;

• the user id and password for authentication to the cameras. Axis net cameras implement a form of open HTTP authentication which requires a base 64 conversion of these data;

• the acquisition resolution. Please note that this resolution refers to the one used for face recognition; for people localization and tracking the resolution is scaled down to 320x240;

• the acquisition frame rate. It is important to note that this rate has to be adequate to the room's computer; however, the system is robust with respect to low or excessive frame rates;

• the use calibration data option, which tells the system whether to undistort and rectify the images. This option requires that the stereo cameras calibration data be loaded.

At this moment it is useful to apply some corrections to the camera poses, to reach the best frontal parallel configuration possible.


The third step is what we call stereo cameras calibration. This is done by showing the system (already placed in the desired position) a sequence of 14 poses of a rigid chessboard pattern (see Fig. 5.7 for an example).

Fig. 5.7: The stereo calibration window; the system emits a sound to tell the installer to remain as still as possible.

After the calibration, the system estimates the error on the obtained data. A value less than or equal to 0.20 assures a good result; if the error is too high, the calibration has to be repeated.

It is important to note that the calibration data remain valid as long as the relative position between the two cameras doesn't change; it is possible to change the absolute position of the stereo rig as a whole, but not the position of a single camera; in such a case the calibration has to be redone from scratch.

We can test our calibration by stopping the acquisition and restarting it using the calibration data. If we are satisfied, we save the data in a folder (in this folder all the files described previously will be placed).


Now, with the acquisition started and the undistortion and rectification functions active, we can do the external calibration. This step requires a preliminary operation: we have to put a series of markers in the room and, for each of them, calculate the exact position in room coordinates (in millimeters), as in Fig. 5.8.

Fig. 5.8: The external calibration markers. The room coordinate system has to be obtainable by rotation and translation of the camera's coordinate system, which is right-handed and depicted in Fig. 5.9.

Now we can use the external calibration window tool to select each marker in the scene with a viewfinder and insert its world coordinates; we do this operation for all the markers (see Fig. 5.10). Just as after stereo cameras calibration we cannot move one camera relative to the other, after external calibration we cannot move the stereo rig.

After the external calibration we can save the data into a folder (we recommend using the same folder used for the stereo cameras calibration data).

The last installation step is the Room Settings phase. This action doesn't require acquisition from the cameras; we use the window in Fig. 5.11 to do all the work.


Fig. 5.9: The camera coordinate system.

Fig. 5.10: The external calibration tool. The installer moves the cursor over the snapshot taken from the left camera and, using a "zoom in" window, clicks on each marker and enters its world coordinates (in millimeters).


Fig. 5.11: The Room Settings window.

These settings can be saved in a file with extension ".rsf", as already stated.

5.7.2 Run Time Installation Refinements

The installer's work is not finished yet. We have already stated that the behaviour of the system can be changed by modifying a series of values (for example, the learning factor of a Gaussian). The administrator GUI provided with PLaTHEA allows changing these values at run time. After the elaboration is started, the tool window in Fig. 5.12 is available.

The window is divided into frames corresponding to the different elaboration tasks:

• the Background Learning frame, with all the settings described in section 4.1.1;

• the Disparity Map Settings frame; for details on the settings we invite the reader to study the OpenCV block matching stereo correspondence algorithm on the OpenCV online wiki or in [4];

• the Foreground Segmentation frame; for details on the parameters see [11] and [5]. Note that it is also possible to choose whether to use the shadow detector and the foreground contour scanner;

• the Plan View Map and Tracking frame; we have already analyzed how these settings modify the behavior of the system.

Fig. 5.12: The elaboration settings window.

5.7.3 The Face Database Construction

The administrator GUI allows the users to modify the face database. It is possible to add and delete users and, for each user, to add new faces to the database (it is not possible to delete single faces from the database; if a user needs this, he has to replace the whole set).

5.7.4 Run Time Use Cases

We now describe how the system behaves in different situations by defining a series of scenes. To start, we introduce a state diagram similar to that of Fig. 4.3, but with a change of point of view: here the subject isn't a tracked object (which, as stated in 4.3, has a limited life with an initial state and a final state) but the person, whose identity may be associated to a tracked object for a limited period of time; when a person exits a room and re-enters after some time, his identity will probably be associated to another tracked object.


Fig. 5.13: The person’s state transition diagram.

The use cases we define have as main actor the client system, which we indicate with UPnPControlPoint; in the SM4All middleware this control point is a component of the Pervasive Layer. The other actors are the instances of PLaTHEA.

We will spend the rest of the chapter describing the use of the system. We have already described the interfaces supported by a single instance of PLaTHEA, so we now deepen the composite use cases.


Fig. 5.14: Examples of use cases for a home served by PLaTHEA. Many other services are obtainable.


Main Success Scenario

1. UPnPControlPoint starts the search for person X

2. For each PLaTHEA installation, UPnPControlPoint queries for X, including the Find Person Position use case

3. Installation k of PLaTHEA returns the position of X

4. UPnPControlPoint performs some operation

Extension to 3: no installation has found X

a) failure

Fig. 5.15: The Person Search Use Case

Main Success Scenario

1. UPnPControlPoint wants to know the position of every person

2. For each PLaTHEA installation, UPnPControlPoint includes Give All Persons

3. UPnPControlPoint collects all the information and performs other tasks

Fig. 5.16: The Home Snapshot Use Case


Main Success Scenario

1. UPnPControlPoint wants an update about each room every 5 seconds

2. For each PLaTHEA installation, UPnPControlPoint includes Subscribe to All Tracked

3. UPnPControlPoint waits for an update

4. UPnPControlPoint receives an update from an installation k of PLaTHEA

5. UPnPControlPoint performs some operation

6. Back to 3

Extension to 6: surveillance ends

a) end of the use case

Fig. 5.17: The Home Periodic Surveillance Use Case


Chapter 6

Implementation Details

Contents

6.1 Technological Introduction . . . . . . . . . . . . . 87

6.2 The Elaboration Core Component . . . . . . . . 88

6.2.1 Video Acquisition and Synchronization . . . . . . . 89

6.2.2 The Elaboration and the Face Recognition Threads 92

6.3 The UPnP Device . . . . . . . . . . . . . . . . . . 95

6.3.1 The UPnP device descriptor . . . . . . . . . . . . . 96

6.1 Technological Introduction

In this introductory section we describe PLaTHEA's technological context. The system is implemented in native C++ exploiting the Win32 API (on a Windows 7 operating environment) and tested on an off-the-shelf laptop (a Toshiba Satellite A300 1GY equipped with an Intel Core 2 Duo CPU at 2.53 GHz and 4 GB of RAM).

As already stated, the cameras used during the implementation are a couple of Axis 207 net cameras; these cameras have a wired Ethernet interface. Using the HTTP interface it is possible to retrieve single images (in JPEG format) or video (following the MPEG-4 standard or as Multipart JPEG). The Axis 207 is the entry-level model of the network camera family produced by Axis; it has a maximum resolution of 640x480 at a maximum frame rate of 30 frames per second1.

6.2 The Elaboration Core Component

The goal of this section is to deepen the implementation details of the Elaboration Core component. This is a multithreaded component2 (it consists of five synchronized threads of execution) and its overall schema is given in Fig. 6.1.

Fig. 6.1: The elaboration core component in detail.

Each of the following subsections analyzes in detail a part of this figure.

1These cameras offer other features, not exploited here, such as a UPnP interface, motion and audio detection, email sending and so on.

2A very interesting guide to Win32 API and Windows architecture is [22].


6.2.1 Video Acquisition and Synchronization

Stereo Vision involves the simultaneous acquisition of two sequences of frames from the two cameras; asking the cameras for a single frame when necessary wasn't a good solution for the following reasons:

• requesting a frame from a net camera requires the establishment of a TCP connection, and this involves not only extra traffic (due to the connection setup) but also the problems introduced by TCP Slow Start;

• the establishment of a connection is a "slow" operation, so it is very difficult to request two simultaneous images from the two cameras; when computing disparity it is very important that the two images taken from the left and the right cameras are taken at the same instant.

So we chose to use Multipart JPEG (due to its simplicity with respect to MPEG-4); in this way we open a persistent TCP connection with both cameras. However, this choice doesn't remove the synchronization problem, for the following reasons:

• the cameras are independent devices; when an Axis camera does light compensation, it slows down the frame rate for elaboration (as stated in the camera's documentation), and this introduces asynchrony between the two sequences of frames;

• even if the cameras transmit at exactly the same frame rate, they could start transmitting at different instants (differences in microseconds).

The second issue is not a problem at all, because persons move slowly (with respect to computer time). The first issue, however, is a very important problem; Fig. 6.2 shows the problem and the solution.

Fig. 6.2: The problem of video sequence synchronization.

To solve this problem we have implemented a three-thread structure: we have an acquisition thread for each camera and a synchronization thread, coordinated via software events. When one of the acquisition threads has a ready image, it advises the synchronization thread; then we have the following options:

• if during the previous update we sent a stereo couple to the elaboration, the new frame is stored, waiting for a frame from the other camera to create a new stereo couple;

• if the previous frame was received from the other camera, a new stereo couple is ready for the elaboration;

• if the previous frame was received from the same camera, the older one is trashed.

The threads are synchronized using a couple of automatic-reset software events; the sequence diagram is shown in Fig. 6.3.
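A minimal sketch of this wait loop with Win32 auto-reset events; handle and function names are illustrative, and the emission of the stereo couple is left as a comment.

    #include <windows.h>

    // One auto-reset event per acquisition thread, signalled on a new frame.
    HANDLE frameReady[2] = {
        CreateEvent(NULL, FALSE, FALSE, NULL), // left camera
        CreateEvent(NULL, FALSE, FALSE, NULL)  // right camera
    };

    DWORD WINAPI synchronizationThread(LPVOID)
    {
        int pendingCamera = -1; // camera whose frame awaits a partner
        for (;;) {
            // Wake up as soon as either acquisition thread signals a frame.
            DWORD r = WaitForMultipleObjects(2, frameReady, FALSE, INFINITE);
            int camera = (int)(r - WAIT_OBJECT_0);

            if (pendingCamera == -1) {
                pendingCamera = camera;  // store it, wait for the other side
            } else if (camera != pendingCamera) {
                // emitStereoCouple();   // a fresh left/right pair is ready
                pendingCamera = -1;
            }
            // else: same camera twice, the older frame is simply trashed
            //       and the newer one replaces it as the pending frame.
        }
        return 0;
    }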

Our implementation shows very good synchronization results. The synchronization thread feeds the elaboration core thread, which does the hard work. The reader might think that these three threads produce a huge overhead; this is not true, because acquisition, decompression3 (done by the acquisition thread) and synchronization are very quick operations.

This solution also produces an advantage for the elaboration thread, because it can use its "time slot" only for elaboration (and, of course, visualization in the administrator interface).

3The JPEG decompression into a bitmap and the conversion into the IplImage format are done using LibJPEG 7.


Fig. 6.3: The sequence diagram for synchronization. The yellow block represents an ignored frame. The X represents a really short elaboration.


6.2.2 The Elaboration and the Face Recognition Threads

As already stated, the synchronization thread produces a sequence of stereo couples (at an average rate corresponding to the acquisition rate); this stereo sequence is the real input to the elaboration core. The synchronization thread provides the elaboration core with the full size stereo couple and a 320x240 copy, and signals the presence of a new stereo couple, again using an auto-reset software event.

The elaboration core executes the following operations in sequence (a sketch of the first steps follows the list):

1. convert the BGR stereo couple provided by the synchronization thread into a gray scale clone; the OpenCV stereo correspondence algorithm in fact works only on gray scale images;

2. apply the Sobel filter to the left image of the small sized stereo couple; the filter is applied for both horizontal and vertical edges, obtaining two matrices;

3. compute the stereo correspondence between the gray scale versions of the images composing the current stereo couple; OpenCV executes this task in a parallel thread to obtain a speed-up;

4. update the background; this update requires the color version of the small sized stereo couple and the two edge matrices computed at step 2;

5. apply the foreground segmentation algorithm; at this step we obtain a foreground matrix where white pixels represent the foreground;

6. execute the tracking;

7. signal to the face recognition thread the presence of a new full size left frame. This is an optional operation: if the face recognition thread is idle (the previous face recognition task was completed during the last time slot) it can start processing another image (details on this later);


8. if the face recognition thread has finished analyzing a full size left frame, obtaining a series of identities, we have to merge this information with that provided by tracking.
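A minimal sketch of steps 1-4, using the OpenCV 1.x C API employed by the system (the PLaTHEA-specific steps are only hinted at through hypothetical helper names in the comments):

#include <cv.h>

// One iteration of the elaboration core on the small sized stereo couple.
// The CvStereoBMState is created once with cvCreateStereoBMState and reused.
void ElaborationStep(IplImage* leftSmall, IplImage* rightSmall,
                     CvStereoBMState* bmState)
{
    // Step 1: gray scale clones, required by the OpenCV block-matching algorithm.
    IplImage* leftGray  = cvCreateImage(cvGetSize(leftSmall),  IPL_DEPTH_8U, 1);
    IplImage* rightGray = cvCreateImage(cvGetSize(rightSmall), IPL_DEPTH_8U, 1);
    cvCvtColor(leftSmall,  leftGray,  CV_BGR2GRAY);
    cvCvtColor(rightSmall, rightGray, CV_BGR2GRAY);

    // Step 2: Sobel filter on the left image, horizontal and vertical edges.
    IplImage* edgesX = cvCreateImage(cvGetSize(leftSmall), IPL_DEPTH_16S, 1);
    IplImage* edgesY = cvCreateImage(cvGetSize(leftSmall), IPL_DEPTH_16S, 1);
    cvSobel(leftGray, edgesX, 1, 0, 3);
    cvSobel(leftGray, edgesY, 0, 1, 3);

    // Step 3: stereo correspondence via block matching.
    IplImage* disparity = cvCreateImage(cvGetSize(leftSmall), IPL_DEPTH_16S, 1);
    cvFindStereoCorrespondenceBM(leftGray, rightGray, disparity, bmState);

    // Steps 4-6 (background update, foreground segmentation, tracking) are
    // PLaTHEA-specific; hypothetical calls:
    // UpdateBackground(leftSmall, rightSmall, edgesX, edgesY);
    // SegmentForeground();
    // Track(disparity);

    cvReleaseImage(&leftGray);  cvReleaseImage(&rightGray);
    cvReleaseImage(&edgesX);    cvReleaseImage(&edgesY);
    cvReleaseImage(&disparity);
}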

In the sequence above we have described two optional operations; face recognition is a slow task: it may take more than a single time slot to complete. This is a big problem, because good tracking requires high-rate sampling to capture all the small variations in a person's pose or height. The solution, shown in Fig. 6.1, is to compute tracking and face recognition in two separate threads of execution. Immediately after the elaboration core thread has finished tracking, it checks the state of the face recognition thread:

• if the face recognition thread has ended, the elaboration core thread acquires the results and performs the interpolation;

• if the face recognition thread has not ended, the interpolation is postponed to the next time slot.

We’ve already stated that, after recognition, the face recognition thread

reproject the face to the plan view used for tracking. This imply that when

face recognition receive a new frame, it needs to receive the information for

the current tracked persons; this is why in Fig. 6.1 the face recognition is

started immediatly after the tracking instead of the top of the sequence4.

The Fig. 6.4 can help in understanding the mechanism.

As depicted in Fig. 6.1, the face recognition thread performs the following operations in sequence (a sketch of the detection step follows the list):

1. it performs face detection on a gray scale version of the left frame of the full size stereo couple provided by the elaboration core thread;

2. for each face detected, it attempts face recognition;

3. for each face correctly identified, it reprojects the center of mass of the face onto the plan view and searches for a tracked object in that position.

⁴Of course this is another good reason to allow face recognition to take more than a time slot; otherwise we would have to wait immediately for it to finish.
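Step 1 relies on OpenCV's Haar cascade (Viola-Jones) detector; a minimal sketch follows, with an illustrative cascade file path and illustrative parameter values:

#include <cv.h>

// Returns a sequence of CvRect, one per detected (frontal) face.
CvSeq* DetectFaces(IplImage* grayLeftFrame, CvMemStorage* storage)
{
    static CvHaarClassifierCascade* cascade =
        (CvHaarClassifierCascade*)cvLoad("haarcascade_frontalface_default.xml");
    cvClearMemStorage(storage);
    return cvHaarDetectObjects(grayLeftFrame, cascade, storage,
                               1.1,  // scale step of the search pyramid
                               3,    // minimum neighbors to accept a detection
                               CV_HAAR_DO_CANNY_PRUNING,
                               cvSize(30, 30));  // minimum face size
}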


Fig. 6.4: The sequence diagram shows the synchronization between the elaboration core thread and the face recognition thread. The arrows represent event signaling. The red boxes represent the time slots in which the elaboration core thread provides the face recognition thread with new data. The yellow box represents the time slot in which the elaboration core thread performs the interpolation; because of this, that elaboration burst is a little longer than the others.


As already stated, the most expensive operation among those described in the sequence above is face detection (the cost of recognition of course depends on the number of faces in the database, but as long as the database contains fewer than 100 faces the claim holds).

Our experience proves that the resulting synchronization solution is very effective. The elaboration time of the elaboration core thread for each stereo couple increases only by a small amount; however, this solution makes the elaboration time more variable, with spikes at the interpolation time slots.

6.3 The UPnP Device

The UPnP Device represents, in some sense, the presentation layer of PLaTHEA for interaction with external entities (UPnP control points). UPnP works over SOAP (as other kinds of Web Services do); all the interactions are managed for free by the UPnP library. What our UPnP device has to add is support for periodic updates.

To this aim our UPnP device is implemented as a thread that periodically changes the value of the variables corresponding to the periodic publish/subscribe services. The basic update period is 0.333 seconds (corresponding to the service that publishes three updates per second) and the other periods are derived from it.
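A minimal sketch of this thread, assuming Windows threads; PublishUpdate is a hypothetical helper that bumps the corresponding evented variable, and the mapping of the three service identifiers to 3 per second, 1 per second and 1 per 5 seconds reflects our reading of the device descriptor:

#include <windows.h>

void PublishUpdate(const char* serviceId);  // hypothetical: updates the evented variable

DWORD WINAPI UpnpUpdateThread(LPVOID)
{
    int tick = 0;
    for (;;) {
        Sleep(333);                                        // basic period: 0.333 s
        PublishUpdate("threeFrames");                      // three updates per second
        if (tick % 3  == 0) PublishUpdate("tenFrames");    // about one per second
        if (tick % 15 == 0) PublishUpdate("fiftyFrames");  // about one per 5 seconds
        ++tick;
    }
}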

This thread runs independently of the elaboration core thread but depends on it for updates, as shown in Fig. 5.2; the UPnP device thread therefore owns a copy of the tracked objects' data (including identities), whose access is protected by a single-writer multiple-readers lock. We thus have several contenders for this lock:

• the UPnP device thread wants to acquire a reader lock to produce the XML updates for the subscribers;

• the elaboration core thread wants to acquire a writer lock to update the information;

• the UPnP device synchronous methods want to acquire a reader lock to fulfill incoming requests.

Once again, a sequence diagram allows us to clarify the concept.

Fig. 6.5: The sequence diagram shows the synchronization via a SWMR (Single Writer and Multiple Readers) lock. Initially the UPnP device thread acquires reader rights; when it finishes its task, writer rights are granted to the elaboration core thread, which had requested them in advance and was waiting for them; finally, the UPnP library can acquire reader rights to fulfill a synchronous request.

The use of a SWMR object (the request manager is in fact the operating system) allows us to avoid data inconsistencies.
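A minimal sketch of this protection scheme, assuming the Windows slim reader/writer lock (available since Windows Vista); the data structure and the serializer are hypothetical placeholders:

#include <windows.h>
#include <string>

struct TrackedObjectsSnapshot { /* positions, identities, ... */ };
std::string SerializeToXml(const TrackedObjectsSnapshot& s);  // hypothetical serializer

SRWLOCK trackedDataLock = SRWLOCK_INIT;
TrackedObjectsSnapshot snapshot;  // the copy owned by the UPnP device thread

// Elaboration core thread: the single writer.
void PublishTrackingResults(const TrackedObjectsSnapshot& fresh)
{
    AcquireSRWLockExclusive(&trackedDataLock);
    snapshot = fresh;                 // update the copy read by the UPnP side
    ReleaseSRWLockExclusive(&trackedDataLock);
}

// UPnP device thread and synchronous methods: the multiple readers.
std::string BuildXmlUpdate()
{
    AcquireSRWLockShared(&trackedDataLock);
    std::string xml = SerializeToXml(snapshot);
    ReleaseSRWLockShared(&trackedDataLock);
    return xml;
}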

6.3.1 The UPnP device descriptor

We’ve already introduced the interfacec exposed by the UPnP device. In fact

those interfaces were described from a client point of view. Here we want to

deepen how these methods are really described to feed to the UPnP library.

In Fig. 6.6 we have a device description following the standard described


in [29]. This device exposes four services; three of them are described by the same XML service description file (depicted in Fig. 6.8).

<?xml version="1.0" ?>
<root xmlns="urn:schemas-upnp-org:device-1-0">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <device>
    <deviceType>urn:sm4all-dis-sapienza:device:PLTSystem:1</deviceType>
    <friendlyName>People Localization Recognition and Tracking System</friendlyName>
    <manufacturer>Dipartimento di Informatica e Sistemistica Ruberti</manufacturer>
    <manufacturerURL>http://www.dis.uniroma1.it</manufacturerURL>
    <modelName>PLTSystem</modelName>
    <UDN>uuid:disPLTSystem</UDN>
    <serviceList>
      <service>
        <serviceType>urn:sm4all-dis-sapienza:service:mainService:1</serviceType>
        <serviceId>urn:sm4all-dis-sapienza:serviceId:mainService</serviceId>
        <SCPDURL>mainService.xml</SCPDURL>
        <controlURL>control</controlURL>
        <eventSubURL>mainServiceEvent</eventSubURL>
      </service>
      <service>
        <serviceType>urn:sm4all-dis-sapienza:service:genericAllDataService:1</serviceType>
        <serviceId>urn:sm4all-dis-sapienza:serviceId:threeFrames</serviceId>
        <SCPDURL>genericAllDataSubscription.xml</SCPDURL>
        <controlURL>control</controlURL>
        <eventSubURL>threeFramesServiceEvent</eventSubURL>
      </service>
      <service>
        <serviceType>urn:sm4all-dis-sapienza:service:genericAllDataService:1</serviceType>
        <serviceId>urn:sm4all-dis-sapienza:serviceId:tenFrames</serviceId>
        <SCPDURL>genericAllDataSubscription.xml</SCPDURL>
        <controlURL>control</controlURL>
        <eventSubURL>tenFramesServiceEvent</eventSubURL>
      </service>
      <service>
        <serviceType>urn:sm4all-dis-sapienza:service:genericAllDataService:1</serviceType>
        <serviceId>urn:sm4all-dis-sapienza:serviceId:fiftyFrames</serviceId>
        <SCPDURL>genericAllDataSubscription.xml</SCPDURL>
        <controlURL>control</controlURL>
        <eventSubURL>fiftyFramesServiceEvent</eventSubURL>
      </service>
    </serviceList>
    <presentationURL>/presentation</presentationURL>
  </device>
</root>

Fig. 6.6: The UPnP description of a device contains several pieces of vendor-specific information, definitions of all embedded devices, the URL for the presentation of the device, and listings for all services, including URLs for control and eventing.

The Main Service (depicted in Fig. 6.7) exposes all the synchronous methods (except the subscription methods) described at page 74. Moreover, it exposes two evented variables used in the publish/subscribe interaction with a UPnP control point. When a control point subscribes to this service it receives updates whenever these two variables change their values; in UPnP, in fact, it is possible to subscribe only to an entire service and not to the


<?xml version="1.0"?>
<scpd xmlns="urn:schemas-upnp-org:service-1-0">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <actionList>
    <action>
      <name>GetListIDRegistered</name>
      <argumentList>
        <argument>
          <name>Result</name>
          <relatedStateVariable>Result</relatedStateVariable>
          <direction>out</direction>
        </argument>
      </argumentList>
    </action>
    <action>
      <name>GetRoomInfo</name>
      <argumentList>
        <argument>
          <name>Result</name>
          <relatedStateVariable>Result</relatedStateVariable>
          <direction>out</direction>
        </argument>
      </argumentList>
    </action>
    <action>
      <name>GetPositionFromPersonID</name>
      <argumentList>
        <argument>
          <name>PersonID</name>
          <relatedStateVariable>ID</relatedStateVariable>
          <direction>in</direction>
        </argument>
        <argument>
          <name>Result</name>
          <relatedStateVariable>Result</relatedStateVariable>
          <direction>out</direction>
        </argument>
      </argumentList>
    </action>
    <action>
      <name>GetPositionFromObjectID</name>
      <argumentList>
        <argument>
          <name>ObjectID</name>
          <relatedStateVariable>ID</relatedStateVariable>
          <direction>in</direction>
        </argument>
        <argument>
          <name>Result</name>
          <relatedStateVariable>Result</relatedStateVariable>
          <direction>out</direction>
        </argument>
      </argumentList>
    </action>
    <action>
      <name>GetAllPositions</name>
      <argumentList>
        <argument>
          <name>Result</name>
          <relatedStateVariable>Result</relatedStateVariable>
          <direction>out</direction>
        </argument>
      </argumentList>
    </action>
  </actionList>
  <serviceStateTable>
    <stateVariable sendEvents="no">
      <name>Result</name>
      <dataType>string</dataType>
    </stateVariable>
    <stateVariable sendEvents="no">
      <name>ID</name>
      <dataType>int</dataType>
    </stateVariable>
    <stateVariable sendEvents="yes">
      <name>notifyNewObject</name>
      <dataType>string</dataType>
    </stateVariable>
    <stateVariable sendEvents="yes">
      <name>notifyNewRecognizedObject</name>
      <dataType>string</dataType>
    </stateVariable>
  </serviceStateTable>
</scpd>

Fig. 6.7: The PLaTHEA Main Service description.


single variables; moreover, it is not possible to exercise control over the messages received.

<?xml version="1.0"?>
<scpd xmlns="urn:schemas-upnp-org:service-1-0">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <serviceStateTable>
    <stateVariable sendEvents="yes">
      <name>notifyAllFrames</name>
      <dataType>string</dataType>
    </stateVariable>
  </serviceStateTable>
</scpd>

Fig. 6.8: The Periodical Updating Service description.

The remaining three services provide periodic updates at different frequencies. We selected the frequencies by considering the use a control point can make of the position information:

• for continuous monitoring of the positions in the room we suggest subscribing to the “three per second” or “one per second” service;

• for milder monitoring we suggest the “one per five seconds” service.

Obviously the control point can filter the updates according to other business rules, but this is out of the scope of the system.


Chapter 7

Tests and Performance Analysis

Contents
7.1 Tests on the PLT Sub-system . . . . . . . . . . . 101
7.1.1 Test Environment . . . . . . . . . . . . . . . . . . 102
7.1.2 Test Results . . . . . . . . . . . . . . . . . . . . . . 104
7.2 Tests on Face Recognition Sub-system . . . . . . 111
7.3 Computational Costs . . . . . . . . . . . . . . . . 114

7.1 Tests on the PLT Sub-system

In this section we analyze the performance of PLaTHEA's People Localization and Tracking subsystem. These tests satisfy two main needs:

• we want to measure the system's error on dynamic positions. The system shows very good performance on static measures, with an error of about 10 centimeters on all axes; we therefore want to analyze the response of the system when following a moving agent;

• the errors have to be derived from a client's point of view, so we use a UPnP client to gather the system's measurements; this client is shown in Fig. 7.1.


Fig. 7.1: The UPnP client used in our tests. It shows PLaTHEA's asynchronous interface from the client's point of view. The updates contain information about the id of the tracked subject, together with the identity when available, and the area containing the subject, defined by the two corner points of a rectangle. All measures are in plan view coordinates.

7.1.1 Test Environment

Let’s start with the test environment; we have to monitor an area of approx-

imately 16 m2 with maximum distance from the stereo rig of 6 m.

The stereo rig is placed at a height of about 2.5 from the floor and is

pointed towards the entry door of the room (the farthest point to monitor)

with a degree of about 45 ◦C with respect to the parallel to the floor.

For an installer, the first problem to solve is the choice of the baseline between the cameras: the bigger the area to monitor, the larger the baseline; a baseline of 19 cm is perfect to handle a range from 1 m to 6 m. Unfortunately, the larger the baseline, the bigger the chessboard needed for calibration¹, because in each stereo snapshot it has to be visible in both imagers and the cells have to be big enough to avoid precision errors during the computation.

The error measurements are taken with respect to three paths designed with an increasing degree of complexity. These paths are followed by three persons with different heights and builds; they intersect on these paths in

¹During our tests we used a chessboard printed on an A3 sheet. In our experience, the bigger the chessboard, the easier it is to achieve a good calibration.


Fig. 7.2: The stereo rig used in our tests. Note that we have oriented the cameras to obtain the best possible frontal parallel arrangement.

different ways, helping us to obtain information about the sensitivity of the system to partial and total occlusions and to the proximity of walls and inanimate objects.

            Height   Weight
Subject 1   1.67 m   50 kg
Subject 2   1.73 m   70 kg
Subject 3   1.80 m   85 kg

Fig. 7.3: Heights and builds of the subjects taking part in the tests.

The paths are obtained by connecting a series of points; obviously the points provided by the system have no exact correspondence with them. The error for a given point provided by the system is therefore defined as the Euclidean distance from the nearest point on the walk (not the nearest in absolute terms, but the nearest as judged by watching the video sequence). For each walk we provide the maximum and average error with respect to the real positions occupied by the subject, and the minimum, maximum and average height measured for the subject.
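A simplified sketch of this error metric (illustrative types; here the nearest point is taken over the whole path, whereas in the tests it was selected by inspecting the video sequence):

#include <cmath>
#include <vector>

struct Point { double x, y; };  // plan view coordinates, in mm

// Euclidean distance from a measured position to the nearest path point.
double PositionError(const Point& measured, const std::vector<Point>& path)
{
    double best = 1e18;
    for (size_t i = 0; i < path.size(); ++i) {
        double dx = measured.x - path[i].x;
        double dy = measured.y - path[i].y;
        double d = std::sqrt(dx * dx + dy * dy);
        if (d < best) best = d;
    }
    return best;
}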

The deployment of the test environment was a perfect example of a system installation: we specified the room's features, we performed


Fig. 7.4: The paths used for the tests, in increasing order of complexity. Each cell in the red grid corresponds to an area with a side of 50 cm. The stereo rig is placed at the bottom right corner of the monitored area.

the stereo and the external calibration (for the latter we placed special markers around the room, providing for each of them the exact hand-measured position), and we spent some time tuning the settings in order to obtain the best performance from the system. This last task is perhaps the hardest to carry out, because it requires a lot of experience with the system, but it can improve the system's performance considerably.

7.1.2 Test Results

We start with the simplest kind of test: each subject follows all the paths alone. Despite their simplicity, these tests give us a lot of information about the dependence of the system's performance on the quality of the disparity map.

As already stated in the previous chapters, we use the disparity map to reproject the foreground pixels onto the floor, obtaining the so-called plan view maps. The algorithm provided with the OpenCV library is very fast, but it outputs rather noisy disparity maps that make it harder to obtain high precision in a number of situations; for example, we can see from the results that the disparity map becomes more and more imprecise as the subject approaches the wall. The choice of the appropriate stereo correspondence algorithm is the result of a trade-off between quality and efficiency: we have


to find an algorithm that works in a real time fashion while providing precise disparity information at least for the moving objects (some stereo correspondence algorithms, such as [15], give very good results even on static objects with very low texture information, but are computationally very expensive).

(a) Subject 1 (b) Subject 2 (c) Subject 3

Fig. 7.5: The results for the ‘Blue’ Path.

(a) Subject 1 (b) Subject 2 (c) Subject 3

Fig. 7.6: The results for the ‘Green’ Path.

Before going ahead with the analysis, we describe the main elements of the graphics:

• the red grid is made up of cells, each representing a real floor area with a side of 50 cm;


(a) Subject 1 (b) Subject 2 (c) Subject 3

Fig. 7.7: The results for the ‘White’ Path.

• the green path is the real path that the subjects have to follow;

• the blue points represent the positions provided by the system; for each of these points we give the tolerance on the position, drawn as the extent of the rectangle centered on the corresponding point;

• in each figure we have drawn the height measurements, which the reader

can compare to table 7.3.

The following table reports the error measurements for these first tests:

                     MAX Error   AVG Error
Subject 1   Blue     533.67      201.04
            Green    305.94      189.31
            White    622.90      230.80
Subject 2   Blue     580.00      241.67
            Green    532.54      272.76
            White    710.21      241.03
Subject 3   Blue     449.45      217.05
            Green    556.05      215.50
            White    442.72      213.83

Fig. 7.8: Error measurements (in mm) for the walks shown in Figs. 7.5, 7.6 and 7.7.


From table 7.8 we can derive the following considerations:

• the average error stays roughly constant at approximately 20 cm;

• the system is subject to spikes in the measurements, due mainly to the disparity map problem mentioned above.

Taking into account the tolerance reported by the system itself, these first results make PLaTHEA suitable for domestic use.

We now make our tests a little harder: two subjects follow two paths at the same time. We performed four tests of this kind, with an increasing degree of occlusion between the two subjects.

(a) Subject 1 on the ‘Blue’ path (b) Subject 3 on the ‘Green’ path

Fig. 7.9: In this test the two subjects experience the smallest degree ofocclusion.

In Fig. 7.9, Subjects 1 and 3 follow two different paths with a small amount of occlusion; the results are thus not far from those of the solo walks shown earlier.

A little more complex is the scenario depicted in Fig. 7.10. Here, while Subject 3 performs the first part of the path, Subject 1 (who is a little bit


(a) Subject 1 on the ‘Green’ path (b) Subject 3 on the ‘Blue’ path

Fig. 7.10: In this test the two subjects experience a little more occlusion than in Fig. 7.9.

smaller) generates a small degree of occlusion; the result is a worsening of the disparity map that introduces an error in the “sense of distance” of the system. In Fig. 7.11 the reader can see not only that the two subjects experience the same kind of imprecision, but also that the system starts to provide a smaller set of measurements.

Finally, in Fig. 7.12 we have the worst case: a whole part of the path followed by Subject 1 is not detected at all. The main cause is that Subject 3 (who is much bulkier than Subject 1) generates an occlusion that completely hides Subject 1 from the system. This problem could be partly solved by placing the stereo rig near the ceiling (we placed the rig at a relatively small height)².

We summarize the error results in the following table. It is remarkable that the average errors shown in table 7.13 are very similar to those presented in table 7.8.

²We will see in the concluding chapter that another solution to the problem is the use of two stereo systems placed at opposite corners of the room.


(a) Subject 1 on the ‘Blue’ path (b) Subject 3 on the ‘Blue’ path

Fig. 7.11: In this test we have the third degree of occlusion; the system begins to lose precision.

(a) Subject 1 on the ‘Blue’ path (b) Subject 3 on the ‘Green’ path

Fig. 7.12: This is the greatest degree of occlusion; part of Subject 1's walk is not even detected.


                       MAX Error   AVG Error
Degree 1   Subject 1   624.82      310.82
           Subject 3   536.66      244.80
Degree 2   Subject 1   568.51      258.35
           Subject 3   565.69      226.78
Degree 3   Subject 1   796.24      238.32
           Subject 3   501.20      230.13
Degree 4   Subject 1   386.26      202.83
           Subject 3   695.70      313.28

Fig. 7.13: Error measurements (in mm) for the walks shown in Figs. 7.9, 7.10, 7.11 and 7.12. The data are ordered by increasing degree of occlusion. Note that the data for Degree 4 are somewhat misleading, because they do not account for the lost measurements.

We now take a look at the most complex test: the three subjects follow three different paths at the same time. The reader may object that there can be more than three persons in a room, but this test is useful because the paths are very intricate, so it represents a very complex human interaction scene.

(a) Subject 1 (b) Subject 2 (c) Subject 3

Fig. 7.14: All in the test room at the same time.

In Fig. 7.14 we note the same problem shown in Fig. 7.12; in particular, Subject 1 experiences the same occlusion problem already seen. The following table summarizes the error measurements:

Before closing this section we want to make some considerations on the height measurements. The data shown in the figures above indicate that the


            MAX Error   AVG Error
Subject 1   388.33      212.45
Subject 2   490.31      194.13
Subject 3   772.79      302.22

Fig. 7.15: Error measurements (in mm) for the walk shown in Fig. 7.14.

maximum detected height for each subject is close to the real height; unfortunately, the average detected heights show a relatively high error. The problem lies, as always, in the disparity map: during the walk, people assume poses that prevent the algorithm from detecting disparity information, merging the head with the background disparity. This is not the first time we face this problem: the face recognition thread, as already stated, has to associate an identity with a tracked object after recognition; this operation involves the use of disparity information for reprojection onto the plan view, but the pixels of the face often do not have an associated disparity, so in this case we move the y coordinate of the face down to the chest, which in the vast majority of cases does have an associated disparity.

7.2 Tests on Face Recognition Sub-system

The reader may wonder why we did not test the face recognition subsystem using the same UPnP client used for the PLT tests; the reason is, in some sense, a technological matter. The cameras used in PLaTHEA's development are entry level cameras with a maximum resolution of 640x480 pixels³. At this resolution, a face placed 4 meters from the stereo rig appears on the left imager in an area of about 40x40 pixels, which is not enough for face recognition. The system correctly associates the face with a tracked object but is not able to assign the correct identity to it; our test with a face database containing 5 users gives a hit rate of about 30%, which is only a little better than a random choice. We therefore postpone this test to the future, using high definition cameras

³The system acquires the video sequences from the stereo rig at this resolution, but they are downsampled to 320x240 for the PLT subsystem. The idea is to acquire from the cameras at the maximum available resolution to perform face recognition, resizing the frames for the tracking task.


(we need a horizontal resolution of approximately 1920 pixels).

However, we want to analyze the results that the face recognition technique based on SIFT features has shown as a standalone system. To this aim we developed a test application that uses a simple webcam as a video source. This application is shown in Fig. 7.16.

Fig. 7.16: The test application for the face recognition system.

The test application works exactly like the face recognition subsystem included in PLaTHEA, except that the video source is not the left camera of the stereo rig. Recapitulating, it performs the following actions (a sketch of the scoring scheme follows the list):

1. it performs face detection using the Viola-Jones classifier;

2. for each face detected, it matches the SIFT features of that face against the SIFT features of the images in the database;

3. for each user in the database, it assigns a score by summing the matches over all of that user's images;

4. the face is assigned to the user with the maximum score, using a threshold to declare the association valid.
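Steps 2-4 can be sketched as follows; CountMatches is a hypothetical helper implementing the nearest-neighbour feature matching of [18], and the container types are illustrative:

#include <map>
#include <string>
#include <vector>

struct SiftDescriptor { float v[128]; };
typedef std::vector<SiftDescriptor> FeatureSet;

// Hypothetical matcher: number of features of 'probe' that match 'model'.
int CountMatches(const FeatureSet& probe, const FeatureSet& model);

// Each user owns several training images, each with its own feature set.
std::string RecognizeFace(const FeatureSet& probe,
                          const std::map<std::string, std::vector<FeatureSet> >& db,
                          int threshold)  // e.g. 20 features
{
    std::string best;
    int bestScore = 0;
    for (std::map<std::string, std::vector<FeatureSet> >::const_iterator
             user = db.begin(); user != db.end(); ++user) {
        int score = 0;
        for (size_t i = 0; i < user->second.size(); ++i)
            score += CountMatches(probe, user->second[i]);  // sum over all images
        if (score > bestScore) { bestScore = score; best = user->first; }
    }
    // Below the threshold the face is declared unknown (empty string).
    return (bestScore >= threshold) ? best : std::string();
}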


The training set for the system is made up of 50 images (10 images for each of the 5 users; the training set for a single user is shown in Fig. 7.17). This training set may seem tiny, but in designing it we kept in mind the final use of the system: the domestic environment.

Fig. 7.17: The training set for a single user. We have a series of frontal poses; profile poses are not useful, because the Viola-Jones classifier is trained only for frontal faces.

The face recognition system has two main constraints, which correspond to two test sets:

• if the face detected in the current frame corresponds to a user in the database, the system has to correctly indicate it;

• if the face detected in the current frame does not correspond to any user in the database, the system should detect this situation.

The first test set contains a series of 60 images of the users contained in the database; the system (tuned with a threshold of only 20 features) has shown very good results, which we recapitulate in table 7.18. The results are all the more interesting because the test set contains all kinds of unusual poses, which explain the two non-hit cases.

The second test set is interesting because its results can be used to tune the threshold of the system; it is made up of 60 images of users not present in the database. By looking at the maximum and average scores we can choose the best threshold for the system.


              Number   AVG Score
Correct       58       69
No Answer     1        10
Not Correct   1        32

Fig. 7.18: The results for the first test set.

MAX Score   AVG Score
38          25

Fig. 7.19: The results for the second test set.
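For instance, with the data above, any threshold between the impostor maximum score (38) and the genuine average (69), say around 40-50, would have rejected every impostor in this test set while presumably preserving most correct matches; the chosen value must of course balance false accepts against false rejects.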

While running the tests we noted the importance of the way the training set is built. It is important that every user stores the same number of images in the database and that the whole training set is shot in good light conditions. We found that in low light conditions the system is in fact not able to extract enough SIFT features for the recognition. The worst case is when users store training sets with different illumination conditions: in this case the performance of the system degrades, with frequent mismatches.

7.3 Computational Costs

We close the chapter with some considerations about the computational cost of the system. Vision systems are computationally expensive, particularly in the stereo correspondence algorithm and in the face recognition subsystem as a whole.

For good tracking, as already stated, we need to process at least 10 frames per second; this means that the system has 100 ms to analyze a stereo snapshot. We have already seen that to this aim we need a second elaboration thread, working in parallel with the main elaboration thread, for face recognition; this parallel thread occupies 2-3 time slots for a single computation. In Fig. 7.20 we show a graph that illustrates the situation; the elaboration unit is a Toshiba Satellite laptop equipped with an Intel Core 2 Duo CPU at 2.53 GHz and 4 GB of RAM.

We have already stated that isolating the network traffic generated by the cameras is very important; in fact, the two Axis 207 cameras, with two video


Fig. 7.20: The computation times of PLaTHEA. The horizontal axis indicates the stereo frame number; the vertical axis unit is ms.

sequences composed of 640x480 frames at a rate of 10 frames per second, generate traffic that uses about 8% of the Fast Ethernet (100 Mbit/s) bandwidth. The traffic will increase in the future when high definition cameras are used for face recognition.
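As a rough sanity check (assuming an average Motion JPEG frame of about 50 KB, a plausible figure for 640x480): 2 cameras × 10 frames/s × 50 KB ≈ 1 MB/s ≈ 8 Mbit/s, which is indeed about 8% of a 100 Mbit/s link.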


Chapter 8

Conclusions and Future Works

Contents
8.1 Considerations on Vision Systems . . . . . . . . 117
8.2 Future Works . . . . . . . . . . . . . . . . . . . . . 119

8.1 Considerations on Vision Systems

The development of PLaTHEA has been an interesting exploration of techniques in the field of computer vision; the result of this effort is the belief that vision-based systems have very good chances of becoming the standard for people localization, recognition and tracking in domestic environments.

However, PLaTHEA itself suffers from the typical problems of vision systems; we list the most painful drawbacks:

• as we saw in the test chapter, occlusions make it harder to detect the presence of a person; in some cases, if the occluding and the occluded persons are not too close, this problem may be solved by placing the stereo rig at an adequate height; problems related to occlusions become more and more noticeable as the room gets crowded;

• using the illumination and chromaticity components of pixel colours we have solved some of the problems deriving from changes in light conditions; the system is however very sensitive to strong illumination changes, such as those produced by hotspots;

• deeply related to the previous item, the system works badly (or does not work at all) in low light conditions; as already stated in the first chapter, one of SM4All's goals is exploring the potential of infrared for night vision; however, the techniques explored during the design of PLaTHEA rely deeply on color information, while infrared vision is inherently grayscale;

• last but not least, the system is robust enough not to include in the background model relatively static persons (for example, someone seated studying), but it has shown problems with particularly static body parts (such as legs) and particularly static bodies (such as a sleeping human); in these situations, inclusion in the background is only a matter of time; tuning the system to support such cases is very difficult, because it would cause a loss in the adaptivity properties of the background model.

The good news is that PLaTHEA's code is highly modular, so it is easily extensible; this was a necessary feature, because during development we changed our strategy many times.

In spite of “popular belief”, face recognition, using appropriate cameras, is the easiest task in a domestic environment; in fact, unlike other scenarios where face recognition may be useful (banks, airports, stations or markets), a set of constraints may be neglected:

• the training set is very limited; only the house's inhabitants have to store their faces in the repository, and this allows the use of simple algorithms such as the SIFT-based one;

• in more delicate scenarios, the humans to be recognized do not want to be identified (think of criminals, terrorists, dishonest employees and so on).


As we’ve seen the other main family of PLT systems is that of markers

based systems; in this area the most present technology is RFID (Radio

Frequency IDentificator); this is a very good alternative to computer vision; it

delets some of the problems above mentioned but it introduces new problems

related to radio transmissions. A good example of such a system is given in

[21].

8.2 Future Works

We should rather speak of ‘immediate future' works; in the previous section we introduced a set of problems to solve. However, PLaTHEA is not a research project in the strict sense of the term; rather, we aim to obtain the best from the tools produced by the research world.

Following this intent, the first goal to reach is the improvement and completion of the face recognition subsystem; as we have already stated, the problem with face recognition is only a matter of resolution, which means that we have roughly three alternatives:

• we can use a pair of high resolution cameras for stereo vision, replacing the Axis 207 cameras;

• we can keep only one Axis 207 camera, coupling it with a high resolution camera; obviously this solution introduces a set of problems related to calibration, rectification and stereo correspondence, but it reduces the economic cost of the system (remember that, in our opinion, in the not so distant future we will have a PLaTHEA peer in each room of every house);

• we can add a third camera equipped with optical zoom, pointed at a strategic area of the room (the door area); this solution creates problems in associating the faces with the tracked subjects; moreover, it seems to be a less attractive idea.

The second main goal is the choice and implementation of a better stereo correspondence algorithm; the vast majority of the system's precision problems derive from the lack of precision of the disparity map. The


choice of the stereo correspondence algorithm is a trade-off between quality and speed: the algorithm has to be fast enough to allow real time processing of stereo frames, but precise enough to avoid excessive errors during the localization phase.

Third, we want to improve the tracking algorithm by adding a Bayesian network to aid the state changes of the tracked objects.

The previous goals can be considered “immediate future” works. One problem of stereo vision systems that we have ignored until now is that the stereo camera's field of view does not cover a whole room; this is not a trivial problem. We do not face it here, but in the future we plan to use two pairs of cameras, with the two stereo rigs at opposite corners of a room; this solution can also help in resolving some occlusion situations.


Bibliography

[1] Home automation Wikipedia entry. http://en.wikipedia.org/wiki/Home_automation.

[2] Jeffrey S. Beis and David G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Proc. IEEE Conf. Comp. Vision Patt. Recog., pages 1000-1006, 1997.

[3] J.-Y. Bouguet. Camera calibration toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/index.html, 2008.

[4] Gary Bradski and Adrian Kaehler. Learning OpenCV. O'Reilly, 2009.

[5] Rita Cucchiara, Costantino Grana, Massimo Piccardi, and Andrea Prati. Detecting moving objects, ghosts and shadows in video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1337-1342, 2003.

[6] Mecella M. et al. SM4All - Architecture. http://www.sm4all-project.eu/index.php/activities/deliverables.html, November 2009.

[7] J. G. Fryer and D. C. Brown. Lens distortion for close-range photogrammetry. Photogrammetric Engineering and Remote Sensing, 52:51-58, 1986.

[8] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2006.

[9] Michael Harville. A framework for high-level feedback to adaptive, per-pixel, mixture-of-gaussian background models. In European Conference on Computer Vision, 2002.


[10] Michael Harville. Stereo person tracking with adaptive plan-view templates of height and occupancy statistics. Image and Vision Computing, 22:127-142, 2004.

[11] Thanarat Horprasert, David Harwood, and Larry S. Davis. A robust background subtraction and shadow detection. In Proceedings of the Asian Conference on Computer Vision, 2000.

[12] Jun Luo, Y. Ma, E. Takikawa, S. Lao, M. Kawade, and Bao-Liang Lu. Person-specific SIFT features for face recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2007.

[13] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82, 1960.

[14] Kyungnam Kim, Thanarat H. Chalidabhongse, David Harwood, and Larry Davis. Background modeling and subtraction by codebook construction. In International Conference on Image Processing, pages 3061-3064, 2004.

[15] Vladimir Kolmogorov, Ramin Zabih, and Steven Gortler. Generalized multi-camera scene reconstruction using graph cuts. In Proceedings of the International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 501-516, 2003.

[16] K. Konolige. Small vision system: Hardware and implementation. In Proceedings of the International Symposium on Robotics Research, pages 111-116, 1997.

[17] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955.

[18] David G. Lowe. Distinctive image features from scale-invariant keypoints, 2003.

[19] M. Bicego, A. Lagorio, E. Grosso, and M. Tistarelli. On the use of SIFT features for face authentication. In Computer Vision and Pattern Recognition Workshop, 2006.


[20] Ara V. Nefian and Monson H. Hayes III. A hidden Markov model-based approach for face detection and recognition, 1998.

[21] L.M. Ni, Y. Liu, Y.C. Lau, and A.P. Patil. LANDMARC: indoor location sensing using active RFID. In Proc. of PerCom, pages 407-415, 2003.

[22] Jeffrey Richter and Christophe Nasarre. Windows via C/C++. Microsoft Press, 2008.

[23] S. Bahadori, L. Iocchi, G.R. Leone, D. Nardi, and L. Scozzafava. Real-time people localization and tracking through fixed stereo vision. In International Conference on Industrial & Engineering Applications of Artificial Intelligence & Expert Systems, 2005.

[24] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47:7-42, 2002.

[25] Luigi Scozzafava. Localizzazione e tracciamento di persone e robot attraverso la stereo visione. Master's thesis, University of Rome Sapienza, 2003.

[26] SM4All Partners. SM4All - Description of Work, March 2008.

[27] R. Y. Tsai. A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 3:323-344, 1987.

[28] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.

[29] UPnP Forum. UPnP Device Architecture 1.0, 2008.

[30] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57, 2004.

[31] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1330-1334, 2000.


[32] W. Zhao, R. Chellappa, A. Rosenfeld, and P.J. Phillips. Face recognition: A literature survey. ACM Computing Surveys, 2003.


Acknowledgments

Writing this thesis occupied my mind and body for many months, during which I neglected many people; my first thanks therefore go to my family, who patiently endured my absences and my delays, and above all to grandpa Ciccio and grandma Titina: I promise you a more present grandson in the future.

From the moment I enrolled at university, if there is one person who waited for my calls every evening (and who had to scold me about it countless times), it is grandma Maria: how many times in my life have I called you “mom”?

No professor has ever judged me as harshly as uncle Gianfranco (and no employer, I think, will ever be as demanding); I owe to you my tinkerer's inclinations (and the thousand childhood photos).

And how could I forget my lifelong friends: Andrea, first a fellow student, then a housemate and companion in Roman explorations; Marcello, whom I have known since elementary school (and who remembers everything about that time); Ettore, who was the first to get me to stay out until late in the evening and who had the honor of giving my system its name; Asish, whose nickname I cannot report; Luigi, who earned my affection in a short time; Giuseppe, Adriano and Emilio, who have always put up with this not-so-present friend.

My university studies, moreover, would not have been the same without the company of Enzo, Pasquale and Valerio; how many exams did we prepare together? How many laughs did we share? With you I shared these unforgettable years.

And finally...


...the closing credits are for my Donatella; you have been a friend, a colleague and now a partner; in these months you endured the moods of a soul in torment, giving me the strength to react in difficult moments; I want to write all the pages that follow together with you...

