
Research Collection

Doctoral Thesis

Large Scale Learning for Automatic Image Organization

Author(s): Bossard, Lukas

Publication Date: 2014

Permanent Link: https://doi.org/10.3929/ethz-a-010279063

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library


Diss. ETH No. 22202

LARGE SCALE LEARNING FOR AUTOMATIC IMAGE ORGANIZATION

A thesis submitted to attain the degree of

DOCTOR OF SCIENCES of ETH ZURICH

(Dr. sc. ETH Zurich)

presented by

LUKAS ANTON BOSSARD

Dipl. El.-Ing. ETH Zurich

born on September 9, 1983

citizen of Willisau-Stadt, Switzerland

accepted on the recommendation of

Prof. Dr. Luc Van Gool, examiner
Prof. Dr. Christoph H. Lampert, co-examiner

2014


In memory of Simon.


ABSTRACT

More than 1.8 billion photos are uploaded and shared on the Internet every day, and certainly many more are taken and stored offline. This avalanche of images increases the need for automatic filtering and organization. Fortunately, a large number of those images come with a rich context, and this creates a great opportunity for computer vision: learning from this massive and rich pool of data opens the road for solving the problem of automatic organization.

This dissertation studies three methods to improve classification by exploiting weakly labeled data and applies them in the context of automatic photo organization by means of three different applications: the recognition of events in photo collections, the recognition and description of upper-body apparel, and the recognition of food images. Two different ideas are followed: firstly, to use weakly labeled data to assess and improve the generalization ability of a learner, as studied in the context of apparel recognition; and secondly, to automatically mine discriminative sub-structures, which may be latent and/or tedious to annotate, to improve classification performance, as examined in the context of event and food recognition.

Most photographs are taken to preserve memories and are thus often taken in the specific context of a social event, at salient moments. Photographers take pictures in a rather sporadic, but bursty fashion. Such bursts embody the importance of specific moments for the photographers. This thesis introduces a large data set of approximately 61 000 images in 14 classes and presents the Stopwatch Hidden Markov Model (SHMM) for recognizing events in personal photo collections. The SHMM automatically learns latent sub-events by exploiting bursts of images and models events as a sequence of these sub-events.

Social events influence what people wear. This thesis further presents a fully automatic and complete pipeline for recognizing and classifying upper-body clothing in natural scenes. The pipeline combines a number of state-of-the-art building blocks such as upper-body detectors, various visual attribute classifiers and a multi-class learner based on a Random Forest. Classification performance is improved by extending the Random Forest to use automatically downloaded and weakly labeled data from the Internet.

Food is another important part of everyone’s life, which is evidenced by its strong presence on the Internet. To this end, this thesis examines the recognition of pictured dishes, presents a method for mining discriminative components using a Random Forest and introduces a data set of 101 000 images in 101 food classes. Since Random Forests are naturally multi-class learners, discriminative parts can be mined simultaneously for all classes, and their hierarchical structure allows them to share knowledge among classes while mining.


ZUSAMMENFASSUNG

More than 1.8 billion photos are uploaded to the Internet every day, and certainly many more are taken with digital cameras. Such a flood of images makes the automatic organization and selection of pictures urgently necessary. Fortunately, many images on the Internet come with a rich context of data. Applying machine learning methods to this overwhelmingly large pool of image and meta data enables the development of automatic photo management systems.

This dissertation studies three methods that improve the accuracy of image classification with the help of incompletely annotated data and applies them in three different applications in the context of automatic photo management: the automatic recognition of events in personal photo collections, the recognition and description of clothing worn on the upper body, and the recognition of dishes in images. Two paradigms are applied. Either a large number of incompletely annotated images is used to assess and improve the generalization ability of a learning method, which is investigated in the context of clothing recognition, or sub-structures that are latent and/or tedious to annotate are learned automatically and used to improve the classification rate in the context of event and food recognition.

Most pictures are taken in order to remember something special; they therefore often arise during special moments. As a consequence, photos are taken rather sporadically, but during such a special moment the shutter is pressed several times, so that photos appear in bursts. This thesis introduces the Stopwatch Hidden Markov Model, which automatically learns such moments and represents photo sequences with the learned moments in order to recognize the event type. To study this problem, a data set of 61 000 images in the context of 14 event classes is presented.

Our social context influences which clothes we wear. This thesis additionally presents a fully automatic system for recognizing and describing clothing worn on the upper body. For this purpose, our system combines various state-of-the-art computer vision algorithms such as upper-body detectors, classifiers for visual attributes and Random Forests. With the help of a modification of the Random Forest training procedure, weakly annotated data from the Internet could be integrated into the learning process, which improved the classification rate.

Food is another important aspect of our lives, which is clearly evident from the amount of photographed food on the Internet. This thesis therefore also addresses the automatic recognition of pictured dishes and presents a method for automatically finding distinctive image elements with the help of a Random Forest. A new data set with 101 000 images in 101 food categories is also introduced. We show that Random Forests are very well suited to finding distinctive image elements.


ACKNOWLEDGMENTS

First and foremost I would like to thank my adviser Luc Van Gool for giving me the opportunity to conduct my studies at his lab and for providing such a stimulating environment for doing research. Also a word of gratitude to Christoph H. Lampert for acting as the co-referee of this thesis.

I am very grateful to Till Quack – without whom I would not have started a PhD in Computer Vision – for being a great mentor, motivator and for always being optimistic. Special thanks to Matthieu Guillaumin, Christian Wengert and Christian Leistner for all the support, fruitful discussions and night shifts before deadlines.

Thanks go to all my partners in crime a.k.a. lab mates. Stephan Gammeter for team awesome, Andreas Ess for the song of the day, Matthias Schneider for being a great lunch pal, David Wisti for keeping the crossmatcher running, Jürgen Gall for the emery board, Danfeng Qin for all the interesting discussions, Gabriele Fanelli and Andrea Fossati for being great travel mates, Daniel Küttel for keeping the niveau, Fabian Nater for the moral support, Helmut Grabner for staying classy, Severin Stalder for the introduction to Korean photo booths, Valeria De Luca for the grandezza, Angela Yao for the dry humor, Marko Ristin for keeping the perspective, Dengxin Dai for being cheerful, Mukta Prasad for the introduction to Indian cooking, and all other members of BIWI and CVL. I would also like to speak a word of gratitude to Christina Krüger, Barbara Widmer and Fiona Matthew for handling all the administrative work in such a relaxed way, as well as to Salvatore Bonaccos, Edwin Thaler and the whole ISG for providing us with such a stable IT infrastructure.

Special thanks go to Matthias Dantone and Michael Emmersberger for all the support during deadlines, for being great office mates and – last but not least – for starting our next endeavour together.

I am deeply thankful to my family and friends for their continuous support and for reminding me that there are other things in life besides sitting in front of a computer. And to Salome, thank you for all your love and for being who you are.


CONTENTS

1 Introduction
  1.1 The Avalanche of Images
  1.2 Contributions
  1.3 Publications
  1.4 Organisation

2 Tools of Computer Vision
  2.1 Feature Extraction
  2.2 Mid-Level Feature Encoding
    2.2.1 Bag of Words
    2.2.2 Fisher Kernel
    2.2.3 Pooling
    2.2.4 Improvements
  2.3 A Selection of Machine Learning Algorithms for Classification
    2.3.1 Support Vector Machines
    2.3.2 Random Forest

3 Upper Body Apparel Classification
  3.1 Introduction
  3.2 Related Work
  3.3 Classifying Upper Body Clothing
    3.3.1 Pre-Processing
    3.3.2 Features
    3.3.3 Apparel Type Learning
    3.3.4 Clothing Attribute Learning
  3.4 Data Sets
    3.4.1 Apparel Type
    3.4.2 Transfer Forest
    3.4.3 Attributes
  3.5 Experiments
    3.5.1 Apparel Types
    3.5.2 Attributes
    3.5.3 Qualitative Results
  3.6 Conclusion

4 Event Recognition in Photo Collections
  4.1 Introduction
  4.2 Related Work
  4.3 Data Set
  4.4 The Stopwatch Hidden Markov Model
    4.4.1 Model
    4.4.2 Inference
    4.4.3 Training
    4.4.4 Initial Sub-events
  4.5 Features and Potentials
    4.5.1 Global Temporal Features
    4.5.2 Low-level Visual Features
    4.5.3 Higher-level Visual Features
    4.5.4 Reducing the Dimensionality
  4.6 Experiments
    4.6.1 Protocol
    4.6.2 Approaches for Event Recognition
    4.6.3 Results
  4.7 Conclusion

5 Food Recognition Using Discriminative Components
  5.1 Introduction
  5.2 Related Work
  5.3 Data Set: Food-101
  5.4 Random Forest Component Mining
    5.4.1 Candidate Component Generation
    5.4.2 Mining Components
    5.4.3 Training Component Models
    5.4.4 Recognition from Mined Components
  5.5 Experimental Evaluation
    5.5.1 Implementation Details
    5.5.2 Influence of Parameters for Component Mining
    5.5.3 Comparison on Food-101
    5.5.4 Results on MIT-Indoor
  5.6 Conclusion

6 Conclusions & Outlooks
  6.1 Contributions
  6.2 Perspectives

Bibliography

Acronyms
LIST OF FIGURES

Figure 2.1   Archetypal classification pipeline
Figure 2.2   Feature representation
Figure 2.3   Feature extraction methods
Figure 2.4   Spatial Encoding
Figure 2.5   Decision functions
Figure 2.6   Support Vector Machine
Figure 2.7   A decision tree
Figure 3.1   Apparel classification pipeline
Figure 3.2   Apparel classification performance
Figure 3.3   Class co-occurrence
Figure 3.4   λ evaluation
Figure 3.5   Feature selection
Figure 3.6   Attribute classifier confusion matrices
Figure 3.7   Attribute classifier confusion matrices
Figure 3.8   Qualitative apparel classification results
Figure 4.1   Examples of personal photo collections
Figure 4.2   Examples from the PEC data set
Figure 4.3   Graphical model for event recognition
Figure 4.4   Illustration of the Stopwatch HMM
Figure 4.5   Factor graph with fixed class
Figure 4.6   Confusion matrices of different event classification methods
Figure 4.7   Recall@K for event classification
Figure 4.8   Average images of different sub-events
Figure 4.9   Qualitative event classification examples
Figure 5.1   Examples of discriminant components
Figure 5.2   Examples of the Food-101 data set
Figure 5.3   Illustration of the component mining method
Figure 5.4   Food classification method
Figure 5.5   Component mining parameter evaluation
Figure 5.6   Recall@K comparison
Figure 5.7   Classification accuracies
Figure 5.8   Examples of discovered components
Figure 5.9   Visualisation of correctly classified examples
Figure 5.10  Visualisation of wrongly classified examples

LIST OF TABLES

Table 3.1   Apparel attributes
Table 3.2   Apparel data set statistics
Table 3.3   Apparel classification performance
Table 3.4   Feature combinations for apparel attributes
Table 4.1   PEC data set statistics
Table 4.2   Performance on PEC
Table 5.1   List of all Food-101 classes
Table 5.2   Feature evaluation on Food-101
Table 5.3   Different classification methods on Food-101
Table 5.4   Evaluation on MIT-Indoor


1 INTRODUCTION

1.1 The Avalanche of Images

It has been a long journey — it took almost a thousand years from the discovery of the pinhole principle with its ephemeral light pictures to the creation of persistent photographs thanks to photochemical effects. Lenses and better chemical techniques progressively reduced the effort and price to create a picture. While it used to take hours to capture an image in the beginning of photography, the disruptive impact of information technology reduced the effort and cost to practically nothing.

Nowadays, many people in the world carry a smart phone with a camera in their pockets and use it on various occasions. Astonishingly, 1.8 billion photos are uploaded to Flickr, SnapChat, Instagram, Facebook and WhatsApp – each and every day [92]. While capturing funny, serious or other worthwhile moments is still one of the main reasons for shooting an image, the paradigms of taking pictures are slightly changing.

Nowadays, cameras are progressively used to create a broader visual memory. People are taking pictures of products they want to buy, articles in magazines, business cards, maps, meals, receipts or other things they want to remember or find interesting in that moment. Most of those images, however, are just stored and destined to never be looked at again. The amount of photographs people re-examine is fairly small compared to the massive amount they create. Reasons for that are manifold: many images are repetitive, do not reach a given aesthetic level or are simply not interesting in retrospect – even for the creator. Most strikingly, even if an image contains relevant information for the creator, it is almost impossible to find it in the mass of images. Most people are overwhelmed by the avalanche of images they have generated, since creating an image is much less of an effort than actually (re-)viewing them all. And it is not getting better: recent trends such as Google Glass envision deeply and ubiquitously embedded cameras in our lives, consequently generating even more visual data.

Imagine now being able to find a photograph just by typing “The person I met at my last birthday party wearing a striped sweater” or “Dinners, where I ate ‘moules et frites’” or even “Pants I photographed while shopping last Saturday.” A system that could answer such queries does not necessarily need to understand the images on a semantically deep level; it would be sufficient to automatically recognize and index important entities in the corpus of images to answer such queries.

The avalanche of visual data is a double-edged sword: on one hand it raises the urge for automatic organization and filtering to make the relevant information accessible to users again. On the other hand, however, this massive amount of image data accompanied with rich context also creates a striking opportunity for machine learning and – in the context of this thesis – computer vision. The necessity of automatic organization, coupled with the arising opportunities and underlying characteristics of user-generated visual data available on the Internet, constitutes the foundation and motivation of this thesis.

1.2 Contributions

In recent years, cameras have shrunk unbelievably in size, while computational resources like storage, memory and processing power have figuratively exploded and at the same time become very cheap. All this fostered breakthroughs in machine learning and computer vision on which this thesis builds. Nevertheless, automatic photo organization is far from being solved. Certainly, one reason for this is that not many application-specific and publicly available data sets exist, which is one point this thesis addresses.

Additionally, this thesis introduces three methods to improve classification by exploiting not fully labeled data and studies them in the context of automatic photo organization by means of three different applications. Two paradigms are being followed. First, to use weakly labeled data to assess and improve the generalization ability of a learner, which is demonstrated in the context of apparel recognition. Secondly, to automatically mine discriminative sub-structures to improve classification performance, as studied in the context of event and food recognition. Such discriminant sub-structures can often be found in visual data, but may be latent and tedious to annotate. In the following, a more detailed synopsis of the contributions of this thesis is given.

As a first contribution this thesis introduces a fully automatic and complete pipeline for recognizing and classifying people’s upper body clothing in natural scenes. This has several interesting applications, including e-commerce, event and activity recognition, online advertising, etc. The stages of the pipeline combine a number of state-of-the-art building blocks such as upper body detectors, various feature channels, visual attributes and an apparel type classifier based on Random Forests. By extending Random Forests to transfer learning, a mass of weakly labeled and noisy data can be used to train a more robust classifier.

As a second contribution, this work introduces the Stopwatch Hidden Markov Model (SHMM) for event recognition in personal photo collections. In contrast to video, the photos of such collections are sparsely sampled over time and often come in bursts. Such bursts convey the importance of specific moments for the photographers. The proposed SHMM treats such collections as sequences of latent sub-events, while the transition between sub-events is a function of the time gap between consecutive images. Also, the SHMM automatically learns the model parameters and sub-events jointly while training.

As a third and final contribution, this work contributes a novel large scale data set of real world food images with 101 classes and 101 000 images in total. Also, a novel method to mine discriminative image components for each food category using Random Forests is introduced, which is not specific to the data set. Random Forest based discriminative component mining has multiple benefits compared to existing mining methods. First, Random Forests are naturally multi-class, hence components are simultaneously mined for all classes. Second, their hierarchical structure yields non-linear classifiers and allows classes to share knowledge during the mining process.


1.3 Publications

In particular, this thesis discusses and contains extended or modified versions of the following publications:

• Lukas Bossard, Matthias Dantone, Christian Leistner, Christian Wengert, Till Quack, and Luc Van Gool. “Apparel Classification with Style.” In: Asian Conference on Computer Vision. Springer Berlin Heidelberg, 2012, pp. 321–335

• Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. “Event Recognition in Photo Collections with a Stopwatch HMM.” In: IEEE International Conference on Computer Vision. 2013, pp. 1193–1200

• Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. “Food-101 – Mining Discriminative Components with Random Forests.” In: European Conference on Computer Vision. Springer International Publishing, 2014, pp. 446–461

During the time of my Ph.D. studies, I also worked together with fellow colleagues on problems related to this thesis. They are mostly related to instance recognition using large scale data mining and image retrieval or visual recognition and tracking on mobile devices. These works are not further discussed in this thesis, but listed here for the sake of completeness.

• Stephan Gammeter, Lukas Bossard, Till Quack, and Luc Van Gool. “I Know What You Did Last Summer: Object-Level Auto-Annotation of Holiday Snaps.” In: IEEE International Conference on Computer Vision. 2009, pp. 614–621

• Stephan Gammeter, Alexander Gassmann, Lukas Bossard, Till Quack, and Luc Van Gool. “Server-Side Object Recognition and Client-Side Object Tracking for Mobile Augmented Reality.” In: IEEE Computer Vision and Pattern Recognition Workshops. 2010, pp. 1–8

• Danfeng Qin, Stephan Gammeter, Lukas Bossard, Till Quack, and Luc Van Gool. “Hello Neighbor: Accurate Object Retrieval with K-Reciprocal Nearest Neighbors.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2011, pp. 777–784

• Matthias Dantone, Lukas Bossard, Till Quack, and Luc Van Gool. “Augmented Faces.” In: IEEE International Conference on Computer Vision Workshops. 2011, pp. 24–31

1.4 Organisation

Fundamental basics of computer vision are introduced in Chapter 2, which is thus a good starting point for a (short) introduction to image classification. Chapters 3 to 5 can be read independently from each other. They study different applications in the context of automatic image organization and are written to be as self-contained as possible. A more detailed overview of the remaining chapters of this thesis is given next.

Chapter 2 elaborates fundamental tools of computer vision and machine learning. This includes a short outline of local visual descriptors, feature encoding schemes and basic algorithms for machine learning.

Chapter 3 introduces the apparel classification and description pipeline and details its core method, a multi-class learner based on a Random Forest that uses strong discriminative learners as decision nodes. It is also shown how automatically crawled and thus noisy training data from the web can be used to improve the learning process.

Chapter 4 presents a method for recognizing events in personal photo collections. In such collections, images are typically sparsely sampled in the time domain and often come in bursts. The Stopwatch Hidden Markov Model (SHMM) represents such sequences by latent sub-events which are jointly learned with the SHMM parameters.


Chapter 5 discusses food recognition using discriminative components. A Random Forest is used to perform weakly supervised component mining. The most discriminative component models are then used for classification. The approach is evaluated on a large scale data set of 101 food classes.

Chapter 6 finally concludes this thesis by discussing its results and contributions and gives an outlook on future directions of research.


2 TOOLS OF COMPUTER VISION

Figure 2.1: Archetypal classification pipeline. The input image is represented through a single, multi-dimensional feature. Features of images from different classes are preferably separable in the feature space, which enables learning a decision boundary for classification.

This thesis builds upon recent advances and breakthroughs in the fields of computer vision and machine learning. In particular, the appearance of local low level features in conjunction with powerful encoding methods marked a big leap forward for computer vision.

In the following, a brief overview of current state-of-the-art tools for image classification is given. A first and fundamental step of all machine learning tasks is to define and represent the input and output domain in an adequate form. In the case of image classification, the domain definition is straightforward: the input is one or more images and the output is simply a class label. While the representation of class labels is a simple task, the representation of image(s) is much more complex. This step has emerged to be essential not only for computer vision but for machine learning in general.

Image classification essentially consists of two stages as visualized in Figure 2.1: feature representation and classification.


Figure 2.2: Local descriptor extraction with subsequent mid-level feature encoding.

Feature representation refers to the transformation of raw input data into a common domain (often also called feature space). A relevant body of work in computer vision dealt with and still deals with developing methods for extracting invariant and descriptive features. While early methods focused on heavily engineered and hand-tuned features, recent methods directly learn how to transform raw pixel data into discriminative features. These features can be interpreted as points in a multi-dimensional space. Preferably, features of images of different classes are separable in this space. In this case a decision boundary can be learned that can be used to predict the class of an image whose class is unknown.

In this thesis, we follow the archetypal image classification pipeline using local visual features as shown in Figure 2.2. In such a framework, the feature representation step is split into (i) the extraction of local low level features as discussed in Section 2.1 and (ii) a subsequent encoding and (iii) pooling of these features into a mid-level representation as detailed in Section 2.2. The chapter closes with the discussion of different classification models in Section 2.3.
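To make the two stages of Figure 2.1 concrete, the following is a minimal, illustrative sketch of such a pipeline on synthetic descriptors, assuming NumPy and scikit-learn are available; it is not the implementation used in this thesis, and the codebook size, the synthetic data and the linear SVM are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def fake_local_descriptors(offset, n=200, d=64):
    """Stand-in for local feature extraction (e.g. densely sampled SIFT)."""
    return rng.normal(loc=offset, size=(n, d))

# Toy training set: 20 "images", two classes with slightly shifted descriptors.
train_imgs = [fake_local_descriptors(i % 2) for i in range(20)]
labels = np.array([i % 2 for i in range(20)])

# Feature representation, step 1: learn a codebook M with k-means.
K = 32
codebook = KMeans(n_clusters=K, n_init=4, random_state=0).fit(np.vstack(train_imgs))

# Feature representation, steps 2+3: encode (hard assignment to the nearest
# codeword) and pool (normalized histogram) the descriptors of one image.
def bow_histogram(descriptors):
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()

X_train = np.array([bow_histogram(d) for d in train_imgs])

# Classification: learn a (linear) decision boundary in the mid-level feature space.
clf = LinearSVC().fit(X_train, labels)
print(clf.predict([bow_histogram(fake_local_descriptors(1))]))
```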


2.1 Feature Extraction

In the context of computer vision, the extracted features ideally meet a set of requirements: while being invariant to geometric or photometric image transformations, they should capture the relevant characteristics of the input data and describe it in a compact and discriminative way.

Features are generally extracted from image patches, but different extraction strategies are found in the literature, as visualized in Figure 2.3. Global features like Gist [98], Histogram of Oriented Gradients (HOG) [29] or color histograms, for example, are extracted from the full image or a larger patch¹ and yield one fixed length descriptor for this patch. Local features, on the other hand, are extracted from a small but fixed neighborhood around points in an image and produce one fixed size feature descriptor per patch location. Since images vary in size, the number of points and thus the number of local features varies for different images.

A subsequent encoding step transforms the set of interest point descriptors into a feature vector of fixed length (mid-level feature), since machine learning algorithms generally require a well defined (and fixed) input space. This strategy has proved advantageous over global feature descriptors in many computer vision tasks, since it yields a more robust representation. Reasons for this are twofold: changes caused by deformations or geometrical transformations mostly affect the spatial configuration between local features and not their global statistics, while in the cases of clutter and partial occlusion, only a small subset of the extracted local features is affected, which does not change the global statistics significantly.

Depending on the application, two methods for choosing the locations of extraction are commonly encountered in the literature: for classification tasks, features are extracted on a regular grid [41], while for retrieval [112], image matching or instance recognition, detectors are typically used to identify robustly re-detectable and discriminative locations in an image, so called interest points. In the context of retrieval and instance recognition, detectors help to reduce the memory footprint since far fewer features are extracted (typically a few thousand), while for classification often tens to hundreds of thousands of interest points are considered. Since this thesis mostly focuses on classification, interest point detectors are not further discussed in this section.

¹ Note that conceptually every global feature descriptor could also be used as a local descriptor by applying it to a small image patch around an interest point.

Figure 2.3: Different feature extraction methods. (a) illustrates a possible area for a global feature, while (b) visualizes detected interest points and (c) grid-sampled interest points on multiple scales.

There exists a vast amount of local feature descriptors and most of them capture how an image changes within a small patch while being invariant to a set of photometric transformations like variation in scale, rotation, viewpoint, or illumination. A popular choice to quantify this change is to aggregate the oriented gradients (or one of their approximations) over smaller sub-regions into a histogram. Scale-Invariant Feature Transform (SIFT) [85], Speeded Up Robust Features (SURF) [6] or Gradient Location and Orientation Histogram (GLOH) [93] accumulate such histograms over rectangular (SIFT, SURF) or log-polar (GLOH) regions, while DAISY [119] or FREAK [1] learn the spatial configuration and size of such sub-regions from training images.
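As an illustration of grid-based extraction, the following hedged sketch computes SIFT descriptors on a regular grid with OpenCV; it assumes an OpenCV build that includes SIFT (e.g. opencv-python 4.4 or newer), and the file name, grid step and patch size are placeholders.

```python
import cv2

img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path
sift = cv2.SIFT_create()

step, size = 8, 16  # grid spacing and patch size in pixels (illustrative values)
keypoints = [cv2.KeyPoint(float(x), float(y), size)
             for y in range(0, img.shape[0], step)
             for x in range(0, img.shape[1], step)]

# compute() keeps the given grid locations and returns one 128-d SIFT descriptor each.
keypoints, descriptors = sift.compute(img, keypoints)
print(descriptors.shape)  # (number of grid points, 128)
```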


2.2 Mid-Level Feature Encoding

While local features describe small neighborhoods in images, they cannot be used directly to describe a full image or a larger region thereof. Recently, mid-level feature coding has been shown to be a good solution to overcome this problem. Conceptually, it consists of two stages, namely: (i) encoding, which transforms each local descriptor into a different, higher dimensional domain, and (ii) pooling, which aggregates a set of encoded descriptors into a single, fixed-length representation.

In the past few years, two (related) encoding families have shown to be practical. Both of them encode feature descriptors using a pre-learned model of the descriptor space: histogram encoding schemes, commonly referred to as Bag of Words (BoW) [28], accumulate the assignment statistics of local feature descriptors to a previously learned codebook into a sparse histogram and are further studied in Section 2.2.1. A second popular and powerful family of encoding schemes is loosely grouped around the Fisher Kernel principle [107], which is further detailed in Section 2.2.2. They encode higher order statistical moments of the extracted local feature descriptors relative to the pre-learned model of the feature space, typically producing a dense vector representation. The subsequent pooling step,² discussed in Section 2.2.3, brings a set of encoded descriptors into a fixed-length representation while preserving informative and discriminative information. Common to all these methods is that substantial gains in performance can be made by employing a few best practices like spatial pooling or different normalization methods, as shortly reviewed in Section 2.2.4. For a further (also quantitative) analysis of mid-level feature coding, we refer to the studies of Boureau et al. [14], Boureau, Ponce, and LeCun [15], Chatfield et al. [22], and Koniusz, Yan, and Mikolajczyk [73].

² Note that in case of Fisher encoding, average pooling follows naturally from its formulation and thus the discussion of pooling methods mostly concerns the BoW model.


2.2.1 Bag of Words

Bag of Words (BoW) – sometimes also known as Bag of Visual Words or Bag of Features – refers to a method that originated in text retrieval. There, a text document is represented with a (sparse) histogram over the occurring words, effectively discarding the structure of the document and only representing it by very simple statistics. The similarity between documents is then defined as a vector distance measure. In the context of document retrieval, the cosine similarity turned out to be a robust measure.

With the advent of local visual features, the BoW model emerged in computer vision. Sivic and Zisserman [112] set the corner stone in image retrieval and subsequently, Csurka et al. [28] used the Bag of Words encoding for image classification.

Let us describe the encoding step more formally: starting with a set of $N$ feature descriptors $\mathcal{X} = \{x_i\}$ and a codebook representation $M$ with a total of $K$ entries, each descriptor $x$ is encoded with an assignment function

$$\alpha(x) : \mathbb{R}^d \to \mathbb{R}^K \qquad (2.1)$$

that assigns $x$ to one or multiple entries of prototype vectors in the codebook $M$. To represent an assignment of the vector $x \in \mathbb{R}^d$ to the $k$th entry of the codebook, the corresponding dimension is set to a non-zero value $\alpha_k(x)$.

Over time, BoW evolved into more complex encoding schemes that yield improved performance, like Soft Assignment, Sparse Coding or Locality-constrained Linear Coding. These encoding methods aim at lowering the quantization error introduced by representing a feature descriptor through a single codebook entry, by assigning it to multiple ones. Since such a multi-assignment is not unique, codebook vectors can be chosen according to different constraints (e.g., sparsity or locality).

Hard Quantization

Hard vector quantization is the simplest, yet surprisingly effective encoding method, introduced by Sivic and Zisserman [112] for image retrieval and subsequently adapted to classification by Csurka et al. [28]. In a first step, a set of prototype vectors $M = \{\mu_j\}$ is learned by a clustering algorithm, e.g., k-means. To encode a feature descriptor, it is simply assigned to its nearest neighbor in the codebook, i.e., a feature descriptor $x$ is represented by a vector full of zeros except for a one at a single location:

$$\alpha_k(x) = \begin{cases} 1 & \text{if } k = \operatorname*{argmin}_{k'} \lVert x - \mu_{k'} \rVert_2^2 \\ 0 & \text{else} \end{cases} \qquad (2.2)$$
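A minimal NumPy sketch of Equation 2.2 (one possible way to implement it, not the thesis code): every descriptor receives a one-hot assignment to its nearest codebook entry; summing or averaging these assignments over all descriptors of an image directly yields the BoW histogram.

```python
import numpy as np

def hard_assign(x, codebook):
    """x: one descriptor of shape (d,), codebook: prototype vectors of shape (K, d)."""
    dists = np.sum((codebook - x) ** 2, axis=1)  # squared L2 distances to all mu_k
    alpha = np.zeros(len(codebook))
    alpha[np.argmin(dists)] = 1.0                # Eq. 2.2: one-hot assignment
    return alpha
```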

Soft Assignment

Soft quantization – often also called soft assignment – was introduced by Gemert et al. [50] and ameliorates the possibly large quantization errors of hard quantization by assigning the feature vector to multiple codebook entries

$$\alpha_k(x) = \frac{\exp\left(-\beta \lVert x - \mu_k \rVert_2^2\right)}{\sum_{i=1}^{K} \exp\left(-\beta \lVert x - \mu_i \rVert_2^2\right)}. \qquad (2.3)$$

The assignment weight $\alpha_k(x)$ decreases exponentially with increasing distance between feature vector and codebook entry. Depending on the parameter $\beta$, the number of significant non-zero elements can be controlled, with the special case of hard quantization for $\beta \to \infty$.

In practice, a variant of soft assignment, coined locality-constrained soft assignment (LcSA), has shown to be a simple, yet effective alternative to more complex encoding schemes. For this variant, only a small and fixed number of nearest codebook entries are considered in Equation 2.3. As studied by Liu, Wang, and Liu [82] and Chatfield et al. [22], soft quantization with a restricted number of nearest neighbors performs comparably to or even outperforms the more advanced local linear or sparse coding methods.
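The following sketch implements Equation 2.3 and, optionally, the locality-constrained variant (LcSA) by keeping only the n nearest codebook entries; NumPy is assumed and the default values of beta and n_neighbors are arbitrary.

```python
import numpy as np

def soft_assign(x, codebook, beta=10.0, n_neighbors=None):
    """x: descriptor (d,), codebook: (K, d); returns the assignment vector alpha(x)."""
    d2 = np.sum((codebook - x) ** 2, axis=1)   # ||x - mu_k||_2^2
    alpha = np.exp(-beta * d2)                 # numerator of Eq. 2.3
    if n_neighbors is not None:                # LcSA: keep only the nearest entries
        mask = np.zeros_like(alpha)
        mask[np.argsort(d2)[:n_neighbors]] = 1.0
        alpha *= mask
    return alpha / alpha.sum()                 # normalization of Eq. 2.3
```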

Sparse Coding

Both hard and soft quantization require that all of the assignments sum up to $\lVert \alpha(x) \rVert_1 = 1$. For Sparse Coding (SC) [132], this constraint is replaced with a sparsity constraint $\lVert u \rVert_1$ on the assignment vector. This results in the following optimization problem:

$$\alpha(x) = \operatorname*{argmin}_{u \in \mathbb{R}^K} \; \lVert x - Mu \rVert_2^2 + \lambda \lVert u \rVert_1 \qquad (2.4)$$


Hence the feature vector $x$ is not required to be represented through its immediate nearest neighbors in the codebook $M = \{\mu_k\}$, but with the entries that result in the sparsest assignment vector.
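One possible way to obtain such sparse codes is scikit-learn's SparseCoder, shown below as a hedged sketch; treating the learned codebook M as a fixed dictionary, using the lasso_lars solver, and the value of transform_alpha (which plays the role of λ in Equation 2.4) are assumptions of this example, not choices made in this thesis.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

K, d = 256, 128
M = np.random.randn(K, d)                      # codebook entries as dictionary atoms
M /= np.linalg.norm(M, axis=1, keepdims=True)  # unit-norm atoms

coder = SparseCoder(dictionary=M, transform_algorithm="lasso_lars",
                    transform_alpha=0.1)
x = np.random.randn(1, d)
alpha = coder.transform(x)                     # sparse assignment vector of length K
print(np.count_nonzero(alpha))
```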

Locality-constrained Linear Coding

Locality-constrained Linear Coding (LLC) [125] does not constrain feature encoding on sparsity, but requires the assigned codebook entries to be local, i.e., to stem from the neighborhood of $x$. This is enforced with the second term in the following equation:

$$\begin{aligned} \alpha(x) = \operatorname*{argmin}_{u \in \mathbb{R}^K} \;& \lVert x - Mu \rVert_2^2 + \lambda \sum_{k=1}^{K} \left( u_k \, e^{\frac{1}{\sigma} \lVert x - \mu_k \rVert_2} \right)^2 \\ \text{s.t.} \;& \mathbf{1}^T u = 1 \end{aligned} \qquad (2.5)$$

In practice, the following – much simpler – optimization problem is solved. It considers only a local codebook (or basis) $\tilde{M}$ consisting of a limited number of nearest neighbors of $x$:

$$\alpha(x) = \operatorname*{argmin}_{u \in \mathbb{R}^K} \; \lVert x - \tilde{M}u \rVert_2^2 + \lambda \lVert u \rVert_2^2 \quad \text{s.t. } \mathbf{1}^T u = 1 \qquad (2.6)$$

In this respect, LLC is very similar to soft assignment with a restricted set of neighbors.


2.2.2 Fisher Kernel

Compared to the previously introduced Bag of Words representation, encoding methods related to the Fisher Kernel principle from Jaakkola and Haussler [58] encode the individual descriptors differently. Instead of employing vector quantization, i.e., assigning descriptors to codebook entries and aggregating the assignment statistics in a histogram, Fisher encoding schemes accumulate first and sometimes second order statistics about the difference between the descriptor and a model³ $M$ of the descriptor space. While this might be unintuitive at first sight, Fisher encoding can be interpreted as a generalization of BoW encoding, as pointed out by Sánchez et al. [107].

In the following, we briefly study the different encoding methods that evolved around this principle. To this end, we interpret the subsequently discussed super vector and VLAD encoding methods as variants or approximations of Fisher encoding.

Fisher Encoding

Fisher encoding – originally introduced to the computer vision community by Perronnin and Dance [101] – slightly shifted the paradigm of feature encoding. Instead of encoding assignments of a vector to a codebook, first and second order statistics of the difference of a vector relative to a previously learned model are encoded. In an offline stage, a Gaussian Mixture Model $M = \{\mu_k, \sigma_k\}$ is learned on training data, similar to the case of Bag of Words encoding. To encode a feature $x$, each assignment factor $\alpha_k(x)$ is given by the probability of being generated by the $k$th mode, similarly as in the case of soft assignment

$$\alpha_k(x) = \frac{p(x \mid \mu_k, \sigma_k)\, \pi_k}{\sum_{j=1}^{K} p(x \mid \mu_j, \sigma_j)\, \pi_j}. \qquad (2.7)$$

³ Conceptually, the codebook of BoW encoding and the model of Fisher encoding are strongly related. Consequently, we use the same symbol $M$ to represent them in both cases.


To encode a set of descriptors $\mathcal{X} = \{x_j \mid x_j \in \mathbb{R}^d\}$, the first and second order moments of the deviation of each descriptor with respect to one component of the GMM are accumulated as

$$\phi_k^{(1)} = \frac{1}{|\mathcal{X}|\,\sqrt{\pi_k}} \sum_{x_j \in \mathcal{X}} \alpha_k(x_j) \left( \frac{x_j - \mu_k}{\sigma_k} \right) \qquad (2.8)$$

$$\phi_k^{(2)} = \frac{1}{|\mathcal{X}|\,\sqrt{2\pi_k}} \sum_{x_j \in \mathcal{X}} \alpha_k(x_j) \left( \frac{(x_j - \mu_k)^2}{\sigma_k^2} - 1 \right) \qquad (2.9)$$

while the assignment factors $\alpha_k(x_j)$ weight the influence of each descriptor with respect to this component. Finally, the different moments are concatenated,

$$\Phi = \left[ \phi_1^{(1)}, \phi_1^{(2)}, \ldots, \phi_K^{(1)}, \phi_K^{(2)} \right] \qquad (2.10)$$

yielding an encoding with a dimension of $\Phi \in \mathbb{R}^{2Kd}$, with $K$ being the number of components of the GMM and $d$ the dimensionality of the descriptor $x$.
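The following is a hedged NumPy sketch of Equations 2.7 to 2.10 on top of a diagonal-covariance GMM fitted with scikit-learn; it is a simplified illustration (e.g. the two moment blocks are concatenated per order rather than interleaved as in Equation 2.10, which only permutes the dimensions) and not the implementation used in the thesis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """X: (N, d) local descriptors; gmm: fitted GaussianMixture(covariance_type='diag')."""
    alpha = gmm.predict_proba(X)                          # (N, K) soft assignments, Eq. 2.7
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    N = X.shape[0]
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]               # (N, K, d)
    phi1 = (alpha[:, :, None] * diff).sum(0) / (N * np.sqrt(pi)[:, None])             # Eq. 2.8
    phi2 = (alpha[:, :, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * pi)[:, None])  # Eq. 2.9
    return np.hstack([phi1.ravel(), phi2.ravel()])        # 2*K*d dimensions, cf. Eq. 2.10

gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(np.random.randn(1000, 64))                        # model M learned offline
print(fisher_vector(np.random.randn(200, 64), gmm).shape) # (2 * 8 * 64,)
```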

Fisher encoding initially performed worse than BoW encoding. As studied in Perronnin, Sánchez, and Mensink [102] and Sánchez et al. [107], a few improvements are necessary to allow Fisher encoding to (significantly) outperform BoW. First and foremost, the descriptors need to be normalized using PCA such that they are better represented through the Gaussian Mixture Model. Second, power- and L2 normalization (as discussed in Section 2.2.4) of the final Fisher vector are also of importance. This combination of Fisher encoding and normalization methods has proven successful for large scale image retrieval and classification and is often referred to as Improved Fisher Vectors (IFV).

As is the case with soft assignment, versions of Fisher encoding only consider one or a few nearest neighbors to compute Equations 2.8 and 2.9, which yields a significant computational speed-up at the cost of a moderate decrease in (classification) performance.

Vector of Locally Aggregated Descriptors

Vector of Locally Aggregated Descriptors (VLAD), as introduced by Jégou et al. [63] and Jégou et al. [64], can be interpreted as an approximation of Fisher encoding, as discussed in [63, 64, 107]. Compared to IFV, VLAD has the advantage of a slightly lower dimensionality of $Kd$ and simple computation.

VLAD only encodes the first order moment statistics and thus the model $M = \{\mu_k\}$ is simpler. It requires only prototype vectors, as in the case of Bag of Words, which are also learned by k-means. To encode a set of descriptors $\mathcal{X} = \{x_j \mid x_j \in \mathbb{R}^d\}$, VLAD simply aggregates the deviation of each feature vector from its prototype vector:

$$\phi_k = \sum_{x \in \mathcal{X}} \alpha_k(x)\, (x - \mu_k). \qquad (2.11)$$

In the original formulation [63], hard assignment (Equation 2.2) is used. In principle, any other assignment function $\alpha_k(x)$ can be used. Again, the encoded feature vector $\Phi \in \mathbb{R}^{Kd}$ is obtained by concatenating all individual components $\phi_k$:

$$\Phi = [\phi_1, \ldots, \phi_K]. \qquad (2.12)$$

As studied in other works by Jégou et al. [63], Jégou and Chum [60], and Arandjelovic and Zisserman [3], power- and L2-normalization improve performance significantly.
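A minimal NumPy sketch of Equations 2.11 and 2.12 with hard assignment, followed by the power- and L2-normalization mentioned above; this is an illustrative reading of the formulas, not the thesis implementation.

```python
import numpy as np

def vlad(X, codebook):
    """X: (N, d) descriptors, codebook: (K, d) k-means centers; returns a K*d vector."""
    K, d = codebook.shape
    nearest = np.argmin(((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
    phi = np.zeros((K, d))
    for k in range(K):                                   # Eq. 2.11: sum of residuals per center
        if np.any(nearest == k):
            phi[k] = (X[nearest == k] - codebook[k]).sum(axis=0)
    phi = phi.ravel()                                    # Eq. 2.12: concatenation
    phi = np.sign(phi) * np.sqrt(np.abs(phi))            # power normalization
    return phi / (np.linalg.norm(phi) + 1e-12)           # L2 normalization

print(vlad(np.random.randn(500, 64), np.random.randn(16, 64)).shape)  # (16 * 64,)
```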

Super Vector Coding

Super vector coding combines the ideas of BoW and Fisher encoding and was introduced by Zhou et al. [139]. Like BoW, it encodes the assignment statistics in a first component $\phi_k^{(1)} \in \mathbb{R}$, which is a single value computed as

$$p_k = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \alpha_k(x) \qquad (2.13)$$

$$\phi_k^{(1)} = s\, \sqrt{p_k}. \qquad (2.14)$$

The second component $\phi_k^{(2)} \in \mathbb{R}^d$ aggregates the differences according to Fisher encoding and is computed similarly as in the case of VLAD, except for a normalization factor

$$\phi_k^{(2)} = \frac{1}{\sqrt{p_k}} \sum_{x \in \mathcal{X}} \alpha_k(x)\, (x - \mu_k). \qquad (2.15)$$


To obtain the final representation, both components are again concatenated

$$\Phi = \left[ \phi_1^{(1)}, \phi_1^{(2)}, \ldots, \phi_K^{(1)}, \phi_K^{(2)} \right]. \qquad (2.16)$$

As in the case of VLAD, different methods for the assignment function $\alpha_k(x)$ can be used, and often only nearest neighbor entries are considered. Since the magnitudes of $\phi_k^{(1)}$ and $\phi_k^{(2)}$ can differ significantly, they are balanced using a fixed parameter $s$. This version of super vector coding is reported in [22] to perform competitively with other encoding methods, albeit being sensitive to the choice of parameters.

2.2.3 Pooling

Pooling refers to the process of aggregating a set of (encoded) descriptors into a single, fixed size representation. Favourably, this aggregation process retains the discriminative information inherent to the set of encoded features, while reducing irrelevant, possibly noisy signals.

There exist a few different variants, loosely grouped around the two most prevalent pooling schemes: average pooling and max pooling, as studied by Boureau, Ponce, and LeCun [15] and Koniusz, Yan, and Mikolajczyk [73].

More formally, pooling can be defined as a function

$$P : \mathcal{V} \to \mathbb{R}^K \qquad (2.17)$$

that aggregates a set of encoded descriptors $\mathcal{V} = \{v_i \mid v_i \in \mathbb{R}^K\}$ into the fixed size representation $\Phi \in \mathbb{R}^K$. With average and max pooling, the pooling function operates on each dimension $k$ independently, and the $k$th dimension of $\Phi$ for average pooling can be described as

$$\phi_k = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} v_k \qquad (2.18)$$

and in the case of max pooling as

$$\phi_k = \max_{v \in \mathcal{V}} (v_k). \qquad (2.19)$$


Figure 2.4: A graphical representation of spatial pooling with a spatial pyramid of three levels and the Bag of Words histogram for each region.

Despite its simplicity, the choice of the pooling function has shown to be important, especially in conjunction with the selected encoding method.

For Fisher encoding schemes, average pooling falls naturally into place, as can be seen when comparing Equation 2.18 with Equations 2.8 and 2.9.
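Equations 2.18 and 2.19 amount to a single NumPy reduction over the set of encoded descriptors, as in this trivial sketch:

```python
import numpy as np

V = np.random.rand(500, 1024)   # 500 encoded descriptors of dimension K = 1024
phi_avg = V.mean(axis=0)        # average pooling, Eq. 2.18
phi_max = V.max(axis=0)         # max pooling, Eq. 2.19
```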

2.2.4 Improvements

In addition to the aforementioned methods, there are a few further techniques which yield significant and consistent improvements when applied to the classification pipeline.

Spatial Pooling

One drawback of BoW and Fisher encoding is the loss of spatial information in the pooling step. Spatial Pyramid Matching (SPM) was introduced by Lazebnik, Schmid, and Ponce [78] to ameliorate this drawback at the cost of a heavily increased feature dimension. As visualized in Figure 2.4 for the case of a BoW encoding, the local features are encoded and pooled separately in each region. The resulting mid-level features are then concatenated.


Despite its simplicity, spatial pyramid pooling yields impressive improvements for image classification, and in the case of sparse mid-level features (as for BoW), the increase in memory and computational complexity is not substantial for a few pyramid levels (e.g., three). However, for dense mid-level features like Fisher vectors, the exponential increase in dimensionality with each pyramid level is prohibitive for many applications. In such cases, dimensionality reduction like PCA or Product Quantization (PQ) [62] is applied and/or fewer pooling regions are chosen.
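As an illustration, the sketch below pools a BoW encoding over a three-level spatial pyramid (1x1, 2x2 and 4x4 grids) and concatenates the per-region histograms; the function and the grid layout are illustrative assumptions, not the exact configuration used later in this thesis.

```python
import numpy as np

def spatial_pyramid_bow(words, xy, image_size, K, levels=(1, 2, 4)):
    """words: (N,) codebook indices, xy: (N, 2) keypoint positions,
    image_size: (width, height); returns a K * (1 + 4 + 16) dimensional feature."""
    w, h = image_size
    hists = []
    for g in levels:                                          # g x g grid per pyramid level
        cx = np.minimum((xy[:, 0] / w * g).astype(int), g - 1)
        cy = np.minimum((xy[:, 1] / h * g).astype(int), g - 1)
        for i in range(g):
            for j in range(g):
                in_cell = (cx == i) & (cy == j)
                hists.append(np.bincount(words[in_cell], minlength=K))
    phi = np.concatenate(hists).astype(float)
    return phi / max(phi.sum(), 1.0)
```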

Power Law Normalization

A simple and recently more widely known form of normalization is the so-called power law or Signed Square Root (SSR) normalization. Low-level features as well as encoded mid-level features are often compared by Euclidean distance. Empirically, this distance measure is dominated by large values in single dimensions. Such large values often occur for BoW encoding, as studied by Jégou, Douze, and Schmid [61], Perronnin, Sánchez, and Mensink [102], Vedaldi and Zisserman [124], and Jégou et al. [64], as well as for low level descriptors, as studied by Arandjelovic and Zisserman [4]. This effect of dominant large values can be alleviated by applying the following function to each element of a given low level or encoded feature

f(x) = sign(x) |x|^λ    (2.20)

with 0 ≤ λ ≤ 1 and subsequent L2 normalization. A popular choice in the literature is to set λ = 0.5, for which the canonical term Signed Square Root (SSR) has been established, or to tune λ with the help of a validation set. Despite its simplicity, power law normalization yields consistent empirical improvements in image retrieval and classification. In the literature, different explanations for this improvement have been given, ranging from the Hellinger kernel [124, 4], over bursty features [61], to the variance stabilizing transformation [129].
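As an illustration of Equation 2.20, the following minimal sketch (Python/NumPy, assumed for illustration and not the thesis code) applies power law normalization followed by L2 normalization to a feature vector.

    import numpy as np

    def power_law_normalize(x, lam=0.5, eps=1e-12):
        # Element-wise sign(x) * |x|^lam (Equation 2.20); lam = 0.5 is the SSR variant.
        x = np.sign(x) * np.abs(x) ** lam
        # Subsequent L2 normalization.
        return x / (np.linalg.norm(x) + eps)

    # Example: normalize an encoded mid-level feature before feeding it to a classifier.
    feature = np.random.randn(4096)
    feature_ssr = power_law_normalize(feature, lam=0.5)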

Vector Normalization

Another source of improvement, which is widely known in the statistics and machine learning community, is feature scaling and normalization.


Tools range from Lp normalization to PCA and whitening, which are applied to low-level descriptors and mid-level features. While L1 and L2 normalization are quite common for all kinds of features, PCA and whitening are mostly used for low-level descriptors because of their low dimensionality (e.g., as a required step before Fisher encoding [107]).

For mid-level features, PCA and whitening were seldom employed until recently [60], either because, as for BoW encodings, they transform the sparse histogram into a dense representation, implying considerably higher memory usage and slower performance, or because of the high dimensionality of the encoded features.
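The following short sketch (Python with NumPy and scikit-learn; an assumed setup, not the original implementation) illustrates L2 normalization of mid-level features and PCA with whitening on low-level descriptors, e.g., as a step before Fisher encoding.

    import numpy as np
    from sklearn.decomposition import PCA

    def l2_normalize(features, eps=1e-12):
        # L2-normalize each row; one mid-level feature per row.
        norms = np.linalg.norm(features, axis=1, keepdims=True)
        return features / (norms + eps)

    # PCA with whitening applied to low-level descriptors (dimensions are examples).
    descriptors = np.random.randn(10000, 128)
    pca = PCA(n_components=64, whiten=True).fit(descriptors)
    descriptors_reduced = pca.transform(descriptors)

    # L2 normalization of mid-level features, e.g., BoW histograms.
    histograms = np.random.rand(500, 4096)
    histograms_l2 = l2_normalize(histograms)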


Figure 2.5: Different strategies for representing a classification function: (a) generative model, (b) discriminative model, (c) discriminative function. While generative models (a) learn the statistics of the input data and the class priors, discriminative models (b) only model the posterior probabilities. Discriminative functions (c) directly model the decision boundary.

2.3 a selection of machine learning algorithms for classification

In the context of machine learning, classification refers to a function f : X → Y that assigns a single (class-) label y ∈ Y to an input x ∈ X, while in the case of structured prediction the label y is a structured object. In contrast to clustering, where the labels are discovered during learning, classification and structured prediction models need supervision at the training stage, i.e., they require fully (sometimes also only partially or weakly) annotated training data.

In the machine learning literature [8], three different approaches for learning such functions f are common (cf. Figure 2.5). They are briefly described here:

discriminative functions are functions that directly map the input x to the class labels y ∈ Y. Examples of such discriminative functions include decision trees or ensembles thereof, k-Nearest Neighbors classifiers, Artificial Neural Networks, Support Vector Machines and Random Forests.

discriminative models directly estimate the posterior probabilities p(y|x). This is often simpler than inferring the full probability density function p(x), as is the case for generative models, since there are typically fewer parameters to estimate. On the downside, no synthetic data can be generated. Examples of probabilistic discriminative models are logistic regression or, in the case of structured prediction, Conditional Random Fields.


Figure 2.6: Conceptual visualization of an SVM in two dimensions.

generative models estimate both the class-conditional probability density functions p(x|y) and the class priors p(y) for all classes. Generative models thus implicitly or explicitly estimate p(x) and can therefore be used to generate synthetic data by sampling from this distribution. Examples of generative models include Gaussian Mixture Models or, in the case of structured prediction, Hidden Markov Models.

In this thesis, we mostly rely on discriminative functions and models, which are discussed in more detail in this section. We discuss Support Vector Machines and Random Forests in Sections 2.3.1 and 2.3.2, respectively.


2.3.1 Support Vector Machines

A Support Vector Machine is a binary classifier defined by a hyperplane parametrized by its normal vector w and an offset b

〈w, x〉 + b = 0    (2.21)

where 〈·, ·〉 denotes the inner product (see Figure 2.6). A sample x is classified into one of the two classes y ∈ {−1, +1} by evaluating

f(x) = sign(〈w, x〉 + b) = sign(f_{w,b}(x)).    (2.22)

For a given set of linearly separable training data {(x_i, y_i)}, the problem of finding a separating hyperplane does not have a unique solution, i.e., there exist multiple separating hyperplanes. In the case of SVMs, the parameters w and b are chosen such that the margin between the training data and the hyperplane is maximal. This can be formalized as follows:

argmin_{w,b}  ½ ‖w‖² + C ∑_i ℓ(y_i, f_{w,b}(x_i))    (2.23)

with the hinge loss function ℓ defined as

ℓ(y, f_{w,b}(x)) = max(0, 1 − y · f_{w,b}(x)).    (2.24)

The hinge loss serves two purposes: (i) it maximizes the margin between the hyperplane and the training data, and (ii) it penalizes misclassified training samples under the chosen parameters w and b.
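To illustrate the objective of Equations 2.23 and 2.24, the following minimal sketch (Python/NumPy; assumed for illustration only) evaluates the regularized hinge loss objective of a linear SVM for given parameters w and b.

    import numpy as np

    def hinge_loss(w, b, X, y):
        # Hinge loss of Equation 2.24 for labels y in {-1, +1}.
        scores = X @ w + b                  # f_{w,b}(x) for every sample
        return np.maximum(0.0, 1.0 - y * scores)

    def svm_objective(w, b, X, y, C=1.0):
        # Regularized objective of Equation 2.23.
        return 0.5 * np.dot(w, w) + C * hinge_loss(w, b, X, y).sum()

    # Toy usage with random data.
    X = np.random.randn(200, 16)
    y = np.sign(np.random.randn(200))
    print(svm_objective(np.zeros(16), 0.0, X, y, C=1.0))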

Additionally, SVMs can also handle non-linearly separable data using a feature map function ϕ : X → X′ that transforms the data into a different space, where the data points can be linearly separated.

f_{w,b}(x) = 〈w, ϕ(x)〉 + b    (2.25)

Using the representer theorem by Schölkopf, Herbrich, and Smola [108], w can be represented through a sparse set of vectors x_i from the training data for which α_i > 0 (the so-called support vectors):

w = ∑_i α_i y_i ϕ(x_i).    (2.26)


Thus Equation 2.25 can be rewritten as

f_{w,b}(x) = ∑_i α_i y_i 〈ϕ(x_i), ϕ(x)〉 + b    (2.27)

From this formulation in Equation 2.27, the so-called kernel trick can be explained: instead of explicitly defining the feature mapping function ϕ(x), it is sufficient to directly specify the kernel function

K(x_i, x_j) = 〈ϕ(x_i), ϕ(x_j)〉,    (2.28)

which defines the inner product in the transformed space, and thus Equation 2.27 becomes

f_{w,b}(x) = ∑_i α_i y_i K(x_i, x) + b.    (2.29)

This has two main advantages: first, the feature mapping ϕ(x) might be expensive to compute, and second, it is not required to be explicitly known, since it is implicitly included in the kernel function K(x_i, x_j).
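The following minimal sketch (Python/NumPy; the RBF kernel and the trained quantities are assumptions for illustration) evaluates the kernelized decision function of Equation 2.29 given the support vectors, their coefficients α_i, labels y_i and the offset b of a previously trained SVM.

    import numpy as np

    def rbf_kernel(a, b, gamma=0.5):
        # One possible kernel function K(a, b) in the sense of Equation 2.28.
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
        # Kernelized decision function of Equation 2.29.
        return sum(a_i * y_i * kernel(x_i, x)
                   for a_i, y_i, x_i in zip(alphas, labels, support_vectors)) + b

    # Toy usage with made-up support vectors; the class label is the sign of the decision value.
    svs = np.random.randn(5, 16)
    alphas, labels = np.abs(np.random.randn(5)), np.sign(np.random.randn(5))
    prediction = np.sign(svm_decision(np.random.randn(16), svs, alphas, labels, b=0.1))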

In practice, linear SVMs are used more often than kernelized SVMs, because they are considerably faster to train and evaluate while yielding good results. Recently, explicit approximate feature maps, which make kernels accessible to linear SVMs, have been shown to offer a good trade-off between the speed of linear SVMs and the flexibility of kernel SVMs, as studied by Vedaldi and Zisserman [124].

2.3.2 Random Forest

Similar to Support Vector Machines, Random Forests (RFs) also directly learn decision boundaries. They are inherently multi-class learners and partition the input space hierarchically, as discussed in great detail by Criminisi, Shotton, and Konukoglu [27]. In the context of this thesis, we mainly discuss Random Forests for classification.

Decision Trees

The fundamental building blocks of a Random Forest are decision trees, which have been studied by Breiman et al. [18]. As visualized in Figure 2.7, a decision tree resembles a hierarchical composition of simple binary decision functions, which constitute the tree's non-linear decision boundaries, and leaf nodes, which store application-specific model parameters. In the case of classification, each leaf stores the empirical probability distribution over the class labels of the training samples that reached it. For regression, each leaf stores the empirical distribution over the application-specific continuous model parameters. The root node and the inner nodes constitute the split nodes and have an associated binary split function φ(s) → {0, 1} that is evaluated on a sample s at test time. Depending on the result, the sample is sent down to either the left or the right sub-tree until a leaf node is eventually reached.


Figure 2.7: Schematic of a decision tree. A decision tree contains two types of nodes: split nodes (including the root and inner nodes) and terminal nodes (leaf nodes).

Classification Forest

A single tree alone heavily over-fits to the training data. Inspired by ensemble methods, Ho [54] and Amit and Geman [2] proposed the combination of several decision trees with randomized splitting functions, while Breiman [17] coined the term Random Forests for such ensembles and further randomized the learning by training each of the trees on a random subset of the training data.


As shown in [17], the upper bound of the generalization error PE* of an RF is a function of two properties of the forest: the strength ς of each individual tree (i.e., its accuracy), which should be high, and the mean correlation ρ between the trees in the forest, which should be low, for a low generalization error:

PE* ≤ ρ (1 − ς²) / ς².    (2.30)

The correlation of the different trees can be influenced in two ways: (i) randomization of the split functions φ (e.g., the kind of function or the feature channels), and (ii) random sampling of the training data for each tree (commonly referred to as bagging [17]).

A classification forest T = {T_t}, consisting of a set of decision trees T_t with low correlation, can now be trained as follows. Given a training set S = {(s_i, y_i)} with samples s_i and their accompanying class labels y_i, each tree of the forest is trained on a randomly sampled subset S_t ⊂ S. For every node in each tree, a number of binary decision functions {φ_i} are randomly sampled out of the set of all possible functions. The definition of the set of possible functions is an important design decision and offers many possibilities, such as the choice of families of decision functions, feature channels, the number of considered feature dimensions, decision thresholds, etc.

Each such function

φ(s) → {0, 1}    (2.31)

maps a sample s that reaches a particular node to either the left or the right sub-tree, effectively splitting the set of samples S that reach that node into two disjoint sets S_l and S_r. The decision function φ* that maximizes the information gain criterion

I(S, φ) = H(S) − ( |S_l|/|S| · H(S_l) + |S_r|/|S| · H(S_r) )    (2.32)

is then chosen, with H(S) the Shannon entropy, i.e.,

H(S) = − ∑_{y∈Y} p(y) log(p(y))    (2.33)

and p(y) being the empirical class distribution. The training of a single tree continues until a stopping criterion is met. This includes


reaching a maximal depth, falling below a minimum number of samples per node, or having only samples of a single class. Finally, the empirical distribution p(y|l) of the leaf l is stored.
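The following minimal sketch (Python/NumPy; assumed for illustration, not the thesis code) computes the Shannon entropy of Equation 2.33 and the information gain of Equation 2.32 for a candidate split, and selects the best of several random candidates.

    import numpy as np

    def entropy(labels):
        # Shannon entropy of the empirical class distribution (Equation 2.33).
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    def information_gain(labels, go_left):
        # Information gain of a split (Equation 2.32); go_left is a boolean mask.
        left, right = labels[go_left], labels[~go_left]
        if len(left) == 0 or len(right) == 0:
            return 0.0          # degenerate split, no gain
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        return entropy(labels) - weighted

    # Toy usage: pick the best of 50 random axis-aligned threshold splits.
    X = np.random.rand(100, 8)
    y = np.random.randint(0, 3, size=100)
    candidates = [(np.random.randint(8), np.random.rand()) for _ in range(50)]
    best_dim, best_thr = max(candidates, key=lambda c: information_gain(y, X[:, c[0]] < c[1]))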

To predict the class label of a given sample at test time, it is passed down each tree. Beginning with the root node, split functions are evaluated until the sample reaches a leaf node. The empirical distributions of all reached leaves {l_T} are then averaged and the class with the maximum confidence constitutes the prediction

y* = argmax_y (1/|T|) ∑_{l_T} p(y|l_T).    (2.34)
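As a small illustration of Equation 2.34, the sketch below (Python/NumPy, with an assumed per-tree interface) averages the leaf class distributions of all trees and returns the most confident class.

    import numpy as np

    def forest_predict(trees, sample):
        # Each tree is assumed to expose leaf_distribution(sample), returning the
        # empirical class distribution p(y|l) of the leaf the sample ends up in.
        distributions = np.stack([t.leaf_distribution(sample) for t in trees])
        mean_distribution = distributions.mean(axis=0)   # average over the forest
        return int(np.argmax(mean_distribution)), mean_distribution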


3 UPPER BODY APPAREL CLASSIFICATION

In this chapter, we introduce a complete pipeline comprising a number of state-of-the-art building blocks for recognizing and describing people's upper body clothing, as illustrated in Figure 3.1. By extending Random Forests to transfer learning, a mass of weakly labeled data that is automatically crawled from the Internet is used to assess and improve the generalization of Random Forests at training time. This chapter is based on previous work that was published in [11].

3.1 introduction

Clothing serves for much more than covering and protection. It is a means of communication to reflect social status, lifestyle, or membership of a particular (sub-)culture.

Figure 3.1: Overview of our classification pipeline. First, an upper body detection algorithm is applied to the image. Then we densely extract a number of features. Histograms over the extracted features are used as input for a Random Forest (type classification) and for SVMs (attribute classification).


The apparel is also an important cue for describing other people. For example: “the man with the black coat” or “the girl with the red bikini” are statements you may hear. The objective of this work is to detect, classify, and describe clothes appearing in natural scenes in order to generate such descriptions, with a focus on upper body clothing. Typically this means not only recognizing the type of clothing a person is wearing, but also the style, color, patterns, materials, etc. An example of a desired outcome would be to label the clothing in Figure 3.1 as “girl wearing a summer dress with a floral pattern”. Only such a combination of type and attributes comes close to the descriptions we use as humans.

Such a system has many potential applications, ranging from automatic labeling in private or professional photo collections, over applications in e-commerce and contextual online advertising, up to surveillance. Hence, systems for analyzing visual content may benefit significantly from autonomous apparel classification.

Enabling such robust classification of clothing in natural scenes is a non-trivial task that demands the combination of several computer vision fields. We propose a fully automated pipeline that proceeds in several stages (see Figure 3.1): First, a state-of-the-art face and upper body detector is used to locate humans in natural scenes. The identified relevant image parts are then fed into two higher-level classifiers, namely a Random Forest for classifying the type of clothing and several Support Vector Machines (SVMs) for characterizing the style of the apparel. In the case of the Random Forest, SVMs are also used as split nodes to yield robust classifications at an acceptable speed.

Since the learning of classifiers demands large amounts of data for good generalization, but human annotation can be tedious, costly, and inflexible, we also provide an extension of our algorithm that allows for the transfer of knowledge from corresponding data in other domains. For example, knowledge from crawled web data may be transferred to manually curated data from a clothing retail chain. We demonstrate this approach on 15 common types (classes) of clothing and 78 attributes. The benchmark data set for clothing classification consists of over 80 000 images.

In summary, the contributions of this work are:


• a pipeline for the detection, classification and description of upper body clothes in real-world images

• a benchmark data set for clothing classification

• an extension of Random Forests to transfer learning from related domains

The remainder of this chapter is organized as follows. Section 3.2 discusses related work. An overview of our method is given in Section 3.3. In Section 3.4, the benchmark data set is introduced and in Section 3.5 our algorithms are evaluated. The chapter ends with concluding remarks in Section 3.6.

3.2 related work

Classifying apparel or clothing is part of the wider task of classifying scenes. It is also related to detecting and describing persons in images or videos. Interestingly, in the past there has been little work on classifying clothing. Chen et al. [23] manually built a tree of composite clothing templates and matched those to the image. Another strand of work specifically focuses on segmentation of garments covering the upper body [56]. More recently, Wang and Ai [126] also investigated segmentation of upper bodies, where the individuals occlude each other.

Retrieving similar clothes given a query image was initially addressed by Liu et al. [84] and Wang and Zhang [127]. In the latter work, the authors use attribute classifiers for re-ranking the search results. Subsequently, Di et al. [34] and Kalantidis, Kennedy, and Li [70] extended these approaches by incorporating more fine-grained type and attribute recognition.

Song et al. [113] predict people's occupation by incorporating information on their clothing. Information extracted from clothing has also been used successfully to improve face recognition results, as by Gallagher and Chen [47].

Recently, detection and classification of apparel has gained some momentum in the computer vision community. For instance, Yamaguchi et al. [131] show impressive results, relying strongly on state-of-the-art body pose estimation and superpixel segmentation.


Their work focuses on pixel-wise annotation and was extended to parsing of multiple people by Jammalamadaka et al. [59], parsing multiple images jointly by Yang, Luo, and Lin [134], and parsing in the presence of color and apparel type labels by Liu et al. [83].

A somewhat limiting factor of these works is that the occurring labels are assumed to be known beforehand. This was subsequently addressed in the follow-up work of Yamaguchi, Kiapour, and Berg [130], where the unknown labels are estimated via retrieval of similar looks in a corpus of partially annotated images, or by Dong et al. [36], in which possible combinations of apparel are encoded in a hierarchical and-or graph.

In this work, we do not focus on clothing segmentation or similarity search, but on classification, i.e., the problem of describing what type of clothing is worn in an image. To do so, we build on top of existing work [56, 47, 127] for clothing segmentation as described in Section 3.3.1, to then fully focus on the classification task.

Our work is also related to learning visual attributes, which has gained importance in recent years. Attributes have been applied in color and pattern naming [44], object description [40], and face verification [75]. Within the context of our proposed task, attributes are obviously suited for describing the visual properties of clothing. To this end, we follow the algorithm by Farhadi et al. [40] for semantic attributes and extend it with state-of-the-art techniques as described in the following section.

3.3 classifying upper body clothing

In this work we focus on identifying clothing that people wear on their upper bodies, in the context of natural scenes. This demands the combination of several robust computer vision building blocks, which we will describe in the sequel.

Our apparel classification mechanism consists of two parts: one part describes the overall type/style of clothing, e.g., suit, dress, sweater. The other part describes the attributes of the style, such as blue or wool. By combining the outputs of these parts, the system can come up with detailed descriptions of the clothing style, such as blue dress. This combination is crucial for real-world applications, because the labeling with either only the type (dress) or only


its attributes (blue) would be quite incomplete. The combination is also important for higher-level tasks, such as event detection. For instance, the knowledge that a dress is white may hint at a wedding.

More specifically, our method carries out the following steps: the first stage consists of state-of-the-art upper body detection as will be described in Section 3.3.1. After identification of upper bodies, we extract a number of different features from this region with dense sampling as explained in Section 3.3.2. These features are then transformed into a histogram representation by applying feature coding and pooling.

These features form the basis for classifying the type of apparel (part 1 of the system, Section 3.3.3) and for the classification of apparel attributes (part 2 of the system, Section 3.3.4).

3.3.1 Pre-Processing

Throughout this work we deal with real-world consumer images as they are found on the Internet. This entails multiple challenges concerning image quality, e.g., varying lighting conditions, various image scales, etc. In a first pre-processing step, we address these variations by normalization of image dimensions and color distributions. This is achieved by resizing each image to 320 pixels maximum side length and by normalizing the histogram of each color channel.
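The following minimal sketch (Python with OpenCV; an assumed implementation, with per-channel histogram equalization as one possible reading of the normalization step) illustrates this pre-processing.

    import cv2

    def preprocess(image, max_side=320):
        # Resize so that the longest side is at most max_side pixels.
        scale = max_side / max(image.shape[:2])
        if scale < 1.0:
            image = cv2.resize(image, None, fx=scale, fy=scale,
                               interpolation=cv2.INTER_AREA)
        # Normalize the histogram of each color channel (equalization assumed here).
        channels = [cv2.equalizeHist(c) for c in cv2.split(image)]
        return cv2.merge(channels)

    # Example usage; the file name is a placeholder.
    img = preprocess(cv2.imread("example.jpg"))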

As mentioned earlier, in order to identify clothing we need to identify persons first. One straightforward way to localize persons is to parametrize the upper body bounding box based on the position and scale of a detected face. In addition to this simple method, we also use the more sophisticated Calvin upper body detector [37] to generate additional bounding box hypotheses. All generated hypotheses are then combined through non-maximum suppression, in which hypotheses originating from the Calvin upper body detector are scored higher than hypotheses coming only from the face position.


3.3.2 Features

In terms of feature extraction and coding, we follow a state-of-the-art image classification pipeline:

feature extraction Within the bounding box of an upper body found in the previous step, we extract a number of features including SURF [6], HOG [29], LBP [97], the Self-Similarity Descriptor (SSD) [109], as well as color information in the L*a*b space. All of those features are densely sampled on a grid.

coding For each of the feature types except LBP, a code book is learned using K-Means¹. Subsequently, all features are vector quantized using this code book.

pooling Finally, the quantized features are spatially pooled using spatial pyramids [78], with max-pooling applied to the histograms.

For each feature type this results in a sparse, high-dimensional histogram.

3.3.3 Apparel Type Learning

After person detection and feature extraction, we use a classifier for the final clothing type label prediction. Since we face a multi-class learning problem with high-dimensional input and many training samples, we use Random Forests (RFs) [17] as our classification method. Random Forests are fast, noise-tolerant, and inherently multi-class classifiers that can easily handle high-dimensional data, making them the ideal choice for our task.

An RF is an ensemble of T decision trees, where each tree is trained to maximize the information gain at each node level, quantified as

I(S, φ) = H(S) − ( |S_l|/|S| · H(S_l) + |S_r|/|S| · H(S_r) )    (3.1)

¹ We used 1 024 words for SURF and HOG, 128 words for color, and 256 words for SSD, respectively.


where H(S) is the entropy of the set of samples S and φ is a binary test to split S into the subsets S_l and S_r. Class predictions are performed by averaging over the empirical class distributions of the leaves as

p(y|L) = (1/T) ∑_{t=1}^{T} p(y|l_t)    (3.2)

with L = (l_1, . . . , l_T) being the leaf nodes where a given sample ended up. The term random stems from the fact that at training time only a random subset of the input space is considered for the split tests φ, and each tree uses only a random subset of the training samples. This de-correlates the trees and leads to a lower generalization error [17].

Besides the de-correlation of the trees, according to Breiman [17] also their individual strength contributes to the overall classification performance. Therefore, we follow the idea of Yao, Khosla, and Fei-Fei [135] and use strong discriminative learners in the form of binary SVMs as split decision functions φ. In particular, if x ∈ R^d is a d-dimensional input vector and w the trained SVM weight vector, an SVM node sends all samples with w^T x < 0 to the left and all other samples to the right child node, respectively.

To enable the binary classifier to handle multiple classes, we randomly partition these classes into two groups. During training, several of those binary class partitions are randomly generated. For each grouping, a linear SVM is trained on a randomly chosen feature channel. Finally, the split that maximizes the multi-class information gain I(S, w), measured on the real labels, is chosen as the splitting function, i.e.,

w = argmax_w I(S, w)    (3.3)
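The node training just described can be sketched as follows (Python with NumPy and scikit-learn; names are illustrative assumptions, and the information_gain helper is the one sketched in Section 2.3.2, not the thesis implementation).

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_split_node(features_by_channel, labels, n_candidates=50):
        # features_by_channel: list of (N, D_c) arrays, one per feature channel.
        # labels: (N,) array of multi-class type labels.
        classes = np.unique(labels)
        best = None
        for _ in range(n_candidates):
            channel = np.random.randint(len(features_by_channel))
            X = features_by_channel[channel]
            # Random binary partition of the class labels into two groups.
            left_classes = np.random.choice(classes, size=len(classes) // 2, replace=False)
            y_binary = np.isin(labels, left_classes).astype(int)
            svm = LinearSVC(C=1.0).fit(X, y_binary)
            go_left = svm.decision_function(X) < 0
            # The split quality is measured on the real multi-class labels (Equation 3.3).
            gain = information_gain(labels, go_left)
            if best is None or gain > best[0]:
                best = (gain, channel, svm)
        return best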

Random Forests are highly discriminative learners, but they can also easily overfit to the training data if too few training samples are available, as shown by Caruana, Karampatziakis, and Yessenalina [21], an effect that tends to intensify if SVMs are used as split nodes. Therefore, in the following, we propose two extensions to the Random Forest algorithms of [17] and [135] that improve the generalization accuracy while keeping the discriminative power.


Large Margin

During training, different split functions often yield the same information gain. Breaking such ties is often done by randomly selecting one split function out of the best performing splits. In this work, we introduce an additional selection criterion to make better decisions.

It is inspired by the Transductive Support Vector Machine (TSVM) studied by Joachims [66], where the density of the feature space around the decision boundary is taken into account while solving the optimization problem for w. In contrast to TSVMs, however, we do not use this information while optimizing w, but use minimal feature density (or largest margin) as an additional optimality criterion for the split selection. In other words, if several split functions perform equally well, the density of the feature space within the margin is taken into account and estimated as:

I_m(S, w) = (1/|S|) ∑_{x∈S} max(0, 1 − |w^T x|)    (3.4)

with the decision boundary w and training examples S. Then the optimal split can be chosen by minimizing the above equation with respect to w, i.e., the optimal split function is given by

w = argmin_w I_m(S, w).    (3.5)
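A minimal sketch of this tie-breaking criterion (Python/NumPy; assumed names, not the thesis code) estimates the margin density of Equation 3.4 and selects, among splits with equal information gain, the one whose boundary lies in the sparsest region.

    import numpy as np

    def margin_density(w, X):
        # Density of training samples inside the margin (Equation 3.4).
        return np.mean(np.maximum(0.0, 1.0 - np.abs(X @ w)))

    def break_ties(tied_ws, X):
        # Among equally good splits, pick the one with minimal margin density (Equation 3.5).
        return min(tied_ws, key=lambda w: margin_density(w, X))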

Transfer Forests

Another option to improve the generalization power of Random Forests is to use more training samples. However, it is often not easy to acquire more training samples along with good quality annotations. One way to achieve this is to outsource the labeling task to crowd-sourcing platforms such as Mechanical Turk [114]. Yet, this demands careful planning of an effectively designed task and an adequate strategy for quality control. It can also not be used to annotate confidential data. Therefore, previous work also studied the extension of RFs to semi-supervised learning [79, 27] in order to benefit from additional unlabeled data, which is usually cheap to collect.


For our task, we can use text-based image search engines to gather large amounts of images, such that the returned images already come with prior labels ỹ. For instance, we can type cotton, black, pastel, etc. to get clothing images that probably exhibit these attributes. Similarly, we can type jacket, t-shirt, blouse, etc. to get images containing our target type classes. On the downside, these images may contain high levels of noise and originate from variable source domains. Thus, not all samples might fit our task, and ỹ cannot be considered to flawlessly correspond to the real label y.

Therefore, we extend Random Forests to Transfer Learning (TL), which tries to improve the classification accuracy in scenarios where the training and test distributions differ. In particular, we assume to have access to M samples from the labeled target domain S^l along with their labels Y (e.g., from a manually labeled and quality controlled data set). Additionally, in TL one has access to N samples from an auxiliary domain S^a together with their (noisy) labels Ỹ (e.g., from Google image search). The task of TL is to train a function f : S → Y that performs better on the target domain via training on S^l ∪ S^a than by solely relying on S^l.

There exist many approaches to Transfer Learning, as discussed in Pan and Yang [100], and its usefulness has also been demonstrated in various vision domains, e.g., [115, 77]. We present here a novel variant of Transfer Learning for Random Forests, as this is our main learner.

To this end, we exploit the idea that although the source and target distributions might be different, some of the source samples x_i ∈ S^a can still be useful for the task and should thus be incorporated during learning, while samples that may harm the learner should be eliminated. In order to accomplish such an instance-transfer approach for Random Forests, we augment the information gain of Equation 3.1 to become

I*(S, w) = (1 − λ) · I(S^l, w) + λ · I(S^a, w),    (3.6)

where the first term corresponds to Equation 3.1 and I(S^a, w) measures the information gain over the auxiliary data. The overall influence of S^a is controlled via the steering parameter λ ∈ [0, 1].
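A minimal sketch of the combined criterion of Equation 3.6 (Python; again reusing the assumed information_gain helper from the earlier sketch, not the original implementation) could look as follows.

    def transfer_information_gain(labels_target, go_left_target,
                                  labels_aux, go_left_aux, lam=0.5):
        # Combined information gain of Equation 3.6: the split is scored on the labeled
        # target data and, weighted by lam in [0, 1], on the noisy auxiliary data.
        gain_target = information_gain(labels_target, go_left_target)
        gain_aux = information_gain(labels_aux, go_left_aux)
        return (1.0 - lam) * gain_target + lam * gain_aux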


The information gain I relies on the standard entropy measure

H(S) = − ∑_y p_y log(p_y)    (3.7)

with

p_y = (1/|S|) ∑_i 1_[x_i ∈ S_y]    (3.8)

where 1_[x_i ∈ S_y] is the indicator function, defined as

1_[x_i ∈ S_y] = { 1  if x_i ∈ S_y
                  0  else,    (3.9)

with S_y representing the set of samples of class y. Note how the auxiliary data set influences only the selection of the trained SVM for each node, but is not used during the actual training of the SVM.

Instead of the previously mentioned hard indicator function, we also experimented with a soft indicator function for the auxiliary data S^a, defined as

ϕ_a(x_i) = { ν_i  if x_i ∈ S^a_y
             0    else,    (3.10)

with ν_i being a weight for sample x_i to steer its influence on the information gain. The weight ν_i was estimated using an additional one-vs-all SVM trained on S_y to measure how well the sample fits into the target domain. However, our experiments did not show a clear advantage of this soft indicator function. Hence, we do not use it in the further discussion.

3.3.4 Clothing Attribute Learning

The slight differences in the appearance of apparel are often orthogonal to the type of clothing, i.e., the composition of colors, patterns, materials and/or cuts often matters more than the information that a particular piece of clothing is, e.g., a sweater. A common way to include such information is to represent it by semantic attributes. We define eight attribute categories with in total 78 attributes, as shown in Table 3.1. The training of the attributes happens for each of the


eight attribute categories separately. Within each of those, the different attributes are considered mutually exclusive. Thus, within a category, we train for each attribute a one-vs-all linear SVM on the features described in Section 3.3.2.

3.4 data sets

For both tasks – classification of clothes and attribute detection – we collected two distinct data sets. Additionally, an auxiliary data set S^a was automatically crawled to be used for our transfer learning extension of Random Forests.

3.4.1 Apparel Type

To the best of our knowledge, there is no publicly available data set for the task of classifying apparel or clothing. The large variety of different clothing types and, additionally, the large variance in appearance in terms of colors, patterns, cuts, etc. necessitate that a large data set be used for training a robust classifier. However, assembling a comprehensive and high quality data set is a daunting task.

Luckily, ImageNet [33], a quality controlled and human annotated image database that is hierarchically organized according to WordNet, contains many categories (so-called synsets) related to clothes. Nevertheless, a closer look at ImageNet's (or rather WordNet's) structure reveals that clothing synsets often do not correspond to the hierarchy a human would expect. Therefore, we hand-picked 15 categories and reorganized ImageNet's synsets accordingly. Due to how ImageNet is built, some images are ambiguous and quite a few are very small. As a cleaning step, we pre-process each image as described in Section 3.3.1. If no face or upper body can be detected, a centered bounding box is assumed, as ImageNet also contains web shop images that show pieces of clothing alone. Resulting bounding boxes smaller than 91 pixels were discarded.

An overview of the categories can be found in Table 3.2.


Table 3.1: List of attribute categories (Colors, Patterns, Materials, Structures, Looks, Persons, Sleeves, Styles) and the attributes therein.


Category Images Boxes Category Images Boxes

Long dress 22 372 12 622 Sweater 8 393 6 515

Coat 18 782 11 338 Short dress 7 547 5 360

Jacket 17 848 11 719 Shirt 3 140 1 784

Cloak 15 444 9 371 T-shirt 2 339 1 784

Robe 13 327 7 262 Blouses 1 344 1 121

Suit 12 971 7 573 Vest 1 261 938

Undergarment 10 881 6 927 Polo shirt 1 239 976

Uniform 8 830 4 194

Total 145 718 89 484

Table 3.2: Main classes and number of images per class of the benchmark data set.

As a contribution of this work, we make the details of the data set publicly available so that the community can use this subset of ImageNet as a benchmark for clothing classification.

3.4.2 Transfer Forest

For each of the clothing type classes, we collected the auxiliary data set S^a by querying Google image search multiple times with different word combinations for the same category (e.g., sweater women, sweater men or long dress formal, long dress casual), such that the retrieved data contains some variation. We again restricted the results to photos of a minimum size of 300 × 400 pixels and performed no further supervision on the 42 624 downloaded images.

3.4.3 Attributes

In order to train classifiers for visual attributes, we need a separate training data set just for this task. While ImageNet provides images with attribute annotations, it only covers a small part of our defined attributes (cf. Table 3.1). Moreover, ImageNet provides attribute annotations only for a subset of its synsets, making this data source inappropriate for learning our selection of attributes. Therefore, we construct a third distinct data set by automatically


crawling the Web. For each attribute, we let an automated script download at least 200 images using Google image search and restricted results to photos of a minimum size of 300 × 400 pixels. For each attribute, the script generates a query composed of the attribute label and one of the words clothing, clothes or look as query keyword. No further supervision was applied to those 25 002 images after downloading.


3.5 experiments

In this section we present experiments to evaluate our algorithm quantitatively. First we show the results for the apparel type part, then the results for the attribute part. An overview of the relevant results can be found in Table 3.3.

3.5.1 Apparel Types

We present three sets of numerical evaluations. First, using the apparel type data set introduced in Section 3.4.1, we trained an SVM as a baseline. Then, the results for the Random Forest with SVMs as split nodes are shown. Finally, the effectiveness of the proposed Transfer Forest is demonstrated.

SVM Baseline

As a baseline experiment, we train a one-vs-all linear SVM for each clothing type category. We evaluated all possible feature combinations, as well as L2-regularized hinge loss and L1-regularized logistic loss. For the evaluation of the feature combinations, the histograms of the different extracted features (cf. Section 3.3.2) were simply concatenated. We used 80 % of the bounding boxes of each class for training and the remaining part for testing. L1-regularized logistic loss using all available features yielded the best performance with 35.03 % average accuracy. The confusion matrix is shown in Figure 3.2a. There is a clear bias towards overrepresented classes.

Random Forest

To evaluate the performance of the Random Forest framework we define the following protocol: again we use 80 % of the images of each type class of the data set for training and the remainder for testing. Each tree has been trained on a random subset of the training set, which contains 500 images for each class, thus 7 500 images in total.

During training, we generate at each node 50 linear SVMs, with the feature type and the binary partition of the class labels chosen at random.


Figure 3.2: Confusion matrix of our clothing classes for the best performing SVM classifier on the left side and the proposed Transfer Forest on the right side.


Figure 3.3: Percentage of co-occurring classes in the deepest split nodes. Note how semantically similar classes often occur together.

In contrast to Yao, Khosla, and Fei-Fei [135], we do not randomly sample subregions within the bounding boxes, but use the spatially pooled histograms (cf. Section 3.3.2) as input for the SVMs. Each tree is then recursively split until either the information gain stops increasing, the number of training examples drops below 10, or a maximum depth of 10 is reached. In total, we trained 100 trees, out of which we created 5 forests by randomly choosing 50 trees. The final result is then averaged over those 5 forests to reduce the variance of the results.

baseline With 38.29 % average accuracy, our Random Forest classifier with SVMs as split functions significantly outperforms the plain SVM baseline (35.03 %). It handles the uneven class distribution much better, as can be seen in Figure 3.2b. These results confirm our expectation that a Random Forest is a suitable learner for our task. Figure 3.3 shows the co-occurrences of the different classes at the deepest levels of the trees. Interestingly, semantically similar classes often occur together.


Figure 3.4: Evaluation of the parameter λ of Equation 3.6 for the Transfer Forest approach (average accuracy [%] plotted over λ).

large margin Having strong discriminative learners as decision nodes often renders the information gain too weak an optimization criterion: several different splits have the same information gain. In this case, choosing the split with the largest margin amongst the splits with the same information gain on the training data is beneficial, as performance increases by about 1 % compared to the Random Forest baseline.

Transfer Forest

To assess the performance of our approach, we follow the protocol defined in the baseline Random Forest evaluation. The parameter λ of Equation 3.6 was varied in the range 0 < λ < 1 in steps of 0.05. Unfortunately, no distinct best choice for λ is obvious, as visible in Figure 3.4. Yet, our approach yields minimum and maximum improvements of 2.18 % and 3.09 % over the baseline Random Forest, respectively. On average, any choice of λ in that interval increased the performance by about 2.45 %.

To validate our assumption that transfer learning is beneficial in this case, we also trained a forest on the union of S^a and S^l, thus treating the auxiliary images as if they stemmed from the regular data set. In this case, the performance drops significantly below that of the baseline Random Forest.


Learner                              Avg. Acc. [%]
One vs. all SVM                      35.03
RF                                   38.29
RF + large margin on S^l             39.31
RF on S^l ∪ S^a, naïve               36.27
RF on S^l ∪ S^a, Daumé [32]          35.00
Transfer Forest                      41.36

Table 3.3: Classification performance measured as average accuracy over all classes on our benchmark data set for different methods.

As a sanity check, we also compared to another domain adaptation method, presented by Daumé [32], which comes at the cost of tripled memory requirements and substantially longer training times, as the feature vectors that are passed on to the SVM are three times as large. Moreover, this approach also does not improve the performance over that of the baseline Random Forest (see Table 3.3). This (i) highlights the importance of using transfer learning when incorporating data from different domains for our task and (ii) also shows that Random Forests are well suited for transfer learning.

Feature Selection

The way we generate our splitting functions allows each node in the tree to perform feature selection. In addition to the features described in Section 3.3.2, we use the attribute classifiers introduced in Section 3.3.4 as a further feature channel. In Figure 3.5 we show which feature type is chosen at which level of the tree. Features that are known to capture general appearance, like SURF and HOG, are chosen at the top of the trees. With increasing depth, other feature types are favoured to make more fine-grained distinctions between semantically similar classes. This is in accordance with Figure 3.3, which shows the co-occurrences of the different classes at the deepest levels of the trees.


Figure 3.5: Statistics of the selected feature channels (LBP, SURF, HOG, SSD, color, attributes) at different tree depths.

3.5.2 Attributes

For training and testing, we assume that within a given attribute category (e.g., colors or patterns) the attributes (e.g., red, white or zebra, dotted) are mutually exclusive. Furthermore, the attribute with the fewest samples constrains the number of samples for all other attributes in the same category. With this, out of the 25 002 downloaded images, 16 155 were used for testing and training the attributes. The data set was split into 75 % of the samples for training and 25 % for testing.

We extract the features as described in Section 3.3.2 and train several linear one-vs-all SVMs [39] with all possible feature combinations, as well as with L1-regularized logistic loss and L2-regularized hinge loss. For the experiments, the cost was set to C = 1, as the classification performance stayed invariant in combination with max pooling. Results are shown in Table 3.4, and we also show the full confusion matrices in Figures 3.6 and 3.7.

The classification accuracy ranges between about 38 % and 71 %, depending on the category. It is of course expected that attribute categories with fewer possible values (e.g., sleeves) perform better than those with many (e.g., patterns). Nevertheless, a classification task such as the sleeve length is not trivial and performs surprisingly well.


Category (# attributes)    Acc. [%]    Reg. Loss
Looks (4)                  71.63       L2 hinge
Sleeves (3)                71.52       L1 logistic
Persons (5)                64.07       L2 hinge
Materials (8)              63.97       L1 logistic
Structure (4)              63.64       L1 logistic
Colors (13)                60.50       L2 hinge
Patterns (15)              49.75       L1 logistic
Styles (25)                37.53       L2 hinge

Table 3.4: Best average accuracy for each attribute category with the corresponding regularized loss. The number of attributes per category is denoted in parentheses.


On the other hand, color and pattern classification could probably be improved. It appears the classifier is distracted too much by background data present within the bounding box. A simple fix would be to sample data only from a small region at the center of the bounding box for categories such as colors or patterns. A large category such as styles, with many fuzzy or semantic attribute values such as punk or nerd, of course poses a challenge even to an advanced classifier.

3.5.3 Qualitative Results

In Figure 3.8, some example outputs of our full pipeline are shown. Note how we are able to correctly classify both style and attributes in many situations. This would allow a system to come up with the desired description combining attributes and style. For instance, for the first example in the middle row, a description such as “girl wearing a pastel spring short dress without sleeves” could be generated. Also note how the Random Forest robustly handles slight variations in body pose for clothing classification (e.g., in the top right example). Of course, accurate detection of the upper body is crucial for our method, and many of the failure cases are due to false upper body detections (example in the 3rd row, 3rd image). Another source of confusion are ambiguities in the ground truth (3rd row, 1st and 2nd example). For attributes, performance is mainly challenged by distracting background within the bounding box or a lack of context in the bounding box (e.g., 2nd row, 2nd example).

3.6 conclusion

We presented a complete system capable of classifying and describing upper body apparel in natural scenes. Our algorithm first identifies relevant image regions with state-of-the-art upper body detectors. Then, multiple features such as SURF, HOG, LBP, SSD and color features are densely extracted, vector quantized, pooled into histograms, and fed into two higher-level classifiers, one for classifying the type and one for determining the style of the apparel. We could show that the Random Forest framework is a very suitable tool for this task, outperforming other methods such as the SVM.


Figure 3.6: Confusion matrices for the different attribute categories: (a) colors, (b) materials, (c) patterns, (d) styles.


(a) Looks

                   black & white  colored  gaudy  pastel
    black & white       .90         .06     .00     .04
    colored             .13         .54     .15     .17
    gaudy               .10         .13     .60     .17
    pastel              .12         .00     .06     .83

(b) Persons

              boy   child  female  girl  male
    boy       .61    .08    .07    .12   .12
    child     .15    .53    .12    .19   .02
    female    .02    .08    .66    .08   .15
    girl      .14    .07    .03    .71   .05
    male      .14    .03    .12    .02   .69

(c) Structures

                frilly  knitted  ruffled  wrinkled
    frilly       .45      .16      .25      .13
    knitted      .16      .60      .11      .13
    ruffled      .16      .07      .69      .07
    wrinkled     .05      .05      .11      .78

(d) Sleeves

                    long sleeves  no sleeves  short sleeves
    long sleeves        .69          .16          .15
    no sleeves          .15          .67          .18
    short sleeves       .09          .13          .78

Figure 3.7: Here we show the confusion matrices for the different attribute categories (continued).


3.6 conclusion

Figure 3.8: Some example output of our pipeline. The header denotes the ground truth class. Each example shows the detected bounding box and the output of the type classifier. On the left side of each example, the output of the most confident attribute classifier for each attribute group is shown.


this task, outperforming other methods such as SVM. Since there are many apparel images available on the web but they often come with noise or unrelated content, we extended Random Forests to transfer learning. While this improved the accuracy for the task at hand, we believe that also other vision applications using Random Forests might benefit from this algorithmic extension. We also introduced a challenging benchmark data set for the community, comprising more than 80 000 images for the 15 clothing type classes. On this data set, our Transfer Forest algorithm yielded an accuracy of 41.36 %, when averaged over the type classes. This represents an improvement of 3.08 % compared to the baseline Random Forest approach and an improvement of 6.3 % over the SVM baseline.


4 EVENT RECOGNITION IN PHOTO COLLECTIONS

In this chapter we demonstrate how modeling latent sub-structures in sequential visual data helps to improve classification performance. We study this in the context of recognizing events in personal photo collections, which is central for automatic image organization. To this end, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event model. Like in videos, most images of an event are not informative and differences between classes can be very subtle for single ones. However, images in photo collections are sparsely sampled in the time domain and often come in bursts, indicating moments with some level of importance to the photographer. Therefore we represent events by means of such latent sub-events and introduce the Stopwatch Hidden Markov Model (SHMM) to model photo collections as a sequence of these. The work presented in this chapter appeared in [12].

4.1 introduction

With the advent of digital photography, we have witnessed the explosion of personal and professional photo collections, both online and offline. The vast amount of pictures that users accumulate raises the need for automatic photo organization. This has initiated extensive research on content-based image retrieval systems such as image indexing based on objects [103], faces [52] or tags [5]. In addition to visual content, EXIF meta data [16], GPS tracks [137] and captions [7] provide excellent cues to reduce the complexity of these tasks. However, these works seldom exploit the simple fact that online and offline images frequently come in collections: People organize their personal photos in directories, either corresponding to particular contents (persons, things of interest) or particular events. Online photo sharing websites such as Flickr, Panoramio or Facebook adopted this scheme and are organized in albums (examples shown in Figure 4.1).


[Figure 4.1 rows: Boat Cruise, Graduation, Wedding, Birthday]

Figure 4.1: Eight examples of photo collections from four event classes in our data set. The difficulty in predicting the correct class comes from the sparse sampling of images in time, shown with the histograms (number of images within a small time frame), and from the high semantic ambiguity of images (e.g., portraits appear in many event classes).


The benefits from recognizing event types are evident: Automatic organization helps users keep order in their photo collections and also enables the retrieval of similar event types in large photo repositories.

Event and action classification has recently received a lot of attention in the computer vision community when it comes to video [45, 87, 118]. The availability of data sets like the one from Marszalek, Laptev, and Schmid [87] and difficult challenges [99] explain such profusion. As in videos, discriminative features in photo collections are often outnumbered by many diverse and semantically ambiguous frames that contribute little to the understanding of an event class: portraits, group photos and landscapes all occur in multiple types of events. In contrast to videos where images are sampled at a fixed frame rate, photo collections instead present a very sparse sampling of visual data, such that relating consecutive images is typically a harder task, c.f. Figure 4.1. A great benefit of photo collections, however, is that the frequency of sampling is itself a measure of the relative importance of photos [26], and that we can exploit this information to distinguish between event classes.

Unfortunately, there is no standard benchmark data set for studying the challenging problem of event recognition for photo collections. In the literature on classifying photo collections [90, 120, 137], only small and private data sets are used. This eventually limits the possibilities to compare different approaches and research new ideas. As a contribution of this work, we have collected a large data set of more than 61 000 images in 807 collections from Flickr and manually annotated it with 14 event classes as we describe in Section 4.3. These collections correspond to real-world personal photo collections taken by individual photographers. The diversity of depicted events is large: Birthday party, Boat Cruise, Concert, etc. as shown in Figure 4.2. This data set is available for download with the intention to establish a solid benchmark. As a second contribution, we propose to modify a recent state-of-the-art model introduced by Tang, Fei-Fei, and Koller [118] – initially designed for videos – for event recognition in photo collections. This includes a proper multi-class formulation and a modified Hidden Markov Model where the transition probabilities depend on observed temporal gaps between images. Hence, we coin this model a Stopwatch Hidden Markov Model.


We present it and show how to perform inference and learning in Section 4.4. Thirdly, we combine cues from multiple modalities to form image-level and collection-level features (Section 4.5). These cues include low-level visual channels and temporal frequency, as well as higher-level visual information such as scene and human attributes.

We show in our experiments (Section 4.6) that our model outperforms alternative event classification schemes for photo collections based on feature or score pooling or simple Hidden Markov Models, and present our conclusions in Section 4.7. We first discuss related work in Section 4.2 below.

4.2 related work

Our work is related to a large literature on the automatic organization of images. For instance, from an unlabelled set of images, various image similarity measures have been proposed to cluster images based on the objects [121], people [52] or sub-sequences [26] they contain. While these algorithms focus on finding structure in unorganized data, our goal is to exploit the collection structure that is often found in personal and professional photo archives.

Cao, Luo, and Huang [19] exploit photo collections to reduce the complexity of propagating labels between images by observing that (i) images in different collections might depict similar scenes even if they are visually dissimilar, and (ii) images within a collection are more likely to depict similar scenes. The authors use a data set of 100 collections and label each image with an event and a scene label. Cao et al. [20] further extend this idea towards a hierarchical model where a photo collection is split in a sub-sequence of so-called events, composed of images from similar scenes, and exploits additional information such as GPS tracks. GPS tracks make it simpler to distinguish between events such as backyard parties, hikes and road trips [137] because of the difference of their geographical extent, but are still not very common in photo collections. Mattivi et al. [90] propose a simple scheme to aggregate the SVM scores of each photo in a collection, and use it for classification into 8 social classes.

Event classification has also been considered for single static photos.


For instance, the generative model of Li and Li [80] allows its authors to integrate cues such as scene, object categories and people to segment and recover the event category in a single image. However, because of the ambiguity between events, a generative approach might not lead to optimal predictive performance. Experiments were performed on a small-scale data set of 8 sport activities with up to 250 images each. Many other works also integrate additional higher-level cues, most often for image classification in a wider sense. McAuley and Leskovec [91] exploit user context, location and user-provided tags and comments on a photo sharing website to improve automatic image annotation.

The most related works to ours deal with event classification in videos [57, 118]. Although not directly applicable to photo collections, these models share several aspects with our work. Both of them consider the use of latent sub-events in a discriminative learning framework, to maximize predictive performance. However, [57] relies on known sub-events and uses them as an intermediate representation of collections for event classification. Time information is discarded in favor of co-occurrence of sub-events. Instead, we build upon the recent work of Tang, Fei-Fei, and Koller [118] and treat sub-events as unobserved latent variables. In [118], these sub-events are associated with explicit durations, and transitions between sub-events can only occur when the previous sub-event has expired. This requires that sub-events and the sub-event boundaries are fully observed. Because of the sparsely sampled photos in our collections, we need to adapt this model. Inspired by discretely observed Markov jump processes [9], we propose a Markov model where transition probabilities are functions of the temporal gap between images, as if it were measured by a stopwatch (c.f. Section 4.4).

4.3 data set

In this section, we describe our efforts to collect and annotate a large data set of personal photo collections for use as an event recognition benchmark. Previous data sets either were not public, did not include meta information such as EXIF tags, or did not contain complete photo sets but rather single images.


Class               Collections   Photos     Class                Collections   Photos
Birthday                 60        3 227     Graduation                51        2 532
Children Birthday        64        3 714     Halloween                 40        2 403
Christmas                75        4 118     Hiking                    49        2 812
Concert                  43        2 565     Road Trip                 55       10 469
Boat Cruise              45        4 983     St. Patrick's Day         55        5 082
Easter                   84        3 962     Skiing                    44        2 512
Exhibition               70        3 032     Wedding                   69        9 953
Total                   807       61 364

Table 4.1: Statistics of our data set. For each of the 14 classes, we detail the number of photo collections and the total number of images that they contain.

We first defined event classes of interest by using the most popular tags on Flickr and Picasa as well as Wikipedia categories that correspond to social events. Because we did not have direct access to large private photo collections, we formulated different keyword queries by using variations of the event's name or by adding year numbers to retrieve single images from Flickr, for lack of an appropriate API for photo album search. If a returned image was contained in a Flickr set and if we could access the original image and its EXIF meta data, we downloaded the whole photo set. As these sets only loosely correspond to collections, we manually reviewed and discarded those sets that did not consist of a personal album or one single event, had wrong or missing meta data or were heavily retouched. About 60 % of the downloaded photo sets had to be discarded.

This led to the selection of 14 event classes as shown in Table 4.1, with in total 807 photo collections which together contain 61 364 photos with EXIF data. We show examples of the resulting data set in Figure 4.1 and 4.2. The data set is available at http://www.vision.ee.ethz.ch/datasets/pec/.


Figure 4.2: Unordered samples from our data set, where each row corresponds to a class. From top to bottom: Children's birthday, Easter, Christmas, Halloween, Hiking, Road Trip, Skiing. Note the high intra and sometimes low inter class variations.


4.4 the stopwatch hidden markov model

People usually do not take pictures at fixed intervals when photographing at an event they attend. More often, photos are taken when something interesting happens and thus show a bursty distribution when looking at the time domain. Secondly, events are often composed of different sub-events: At Easter, eggs are hunted and there is often also a joint meal. Weddings also often contain some sort of meal and afterwards people might be dancing. Other events might even expose a more subtle and thus latent sub-structure. In this work, we assume that the photo bursts act as a proxy for this sub-structure.

Since events of the same type show a very large variety in their temporal composition, it can be difficult even for humans to identify and thus annotate sub-events. This is why we treat the sub-events as latent in this work and learn them while training the event classifier.

Given a photo collection X = {x_0, . . . , x_T} of T + 1 time ordered images originating from a single event, our goal is to predict the correct event class label y in a set Y of K possible labels.

We cast this prediction task in the framework of structured-output SVM with latent variables [96, 136], where the output is a multi-class prediction y* parametrized by Θ:

    y* = f_Θ(X) = argmax_y max_Z ⟨Θ, Φ(X, y, Z)⟩ ∈ Y        (4.1)

and where the latent variables Z = {z_0, . . . , z_T} that are associated with the images form a chain.

In the next sections, we first describe our model in detail, explaining the factor graph of the prediction function (Section 4.4.1). Then we discuss in Section 4.4.2 how the solution of Equation 4.1 can be efficiently inferred given known parameters Θ. In Section 4.4.3, we detail how we learn the parameters given a set of training photo collections with manual annotations.

4.4.1 Model

As visible in Figure 4.1, events can be described as a series of smaller (visually diverse) sub-events.


In this work we model these sub-events explicitly to improve the classification performance. Our model to represent photo collections for classification is based on a Hidden Markov Model, as commonly done for modelling sequences, e.g., as by Lafferty, McCallum, and Pereira [76], Quattoni et al. [106], and Tang, Fei-Fei, and Koller [118]. Each observed image x_t in the collection is associated with an unobserved latent variable z_t representing its state among S possible ones. In the specific context of event recognition, those latent states are often called sub-events, to stress their intended semantics. In our model, the prediction function decomposes as:

    ⟨Θ, Φ(X, y, Z)⟩ = ⟨Θ_g, Φ_g(X, y)⟩
                     + 1/(T+1) · Σ_{t=0}^{T} ⟨Θ_l, Φ_l(x_t, z_t, y)⟩
                     + 1/T · Σ_{t=0}^{T−1} θ_{p,z_t,z_{t+1},y} · φ_p(x_t, x_{t+1}, z_t, z_{t+1}, y)        (4.2)

The feature map Φ_g(X, y) allows the integration of global cues from the full sequence into the event prediction, while the map Φ_l(x_t, z_t, y) represents images x_t and their assignments to latent sub-events z_t for a particular event class y. Finally, the pairwise features denoted by φ_p(x_t, x_{t+1}, z_t, z_{t+1}, y) encode the sub-event transition costs between consecutive images. Defining the following shorthands

    φ_g         = ⟨Θ_g, Φ_g(X, y)⟩                                                   (4.3)
    φ_{l,t}     = ⟨Θ_l, Φ_l(x_t, z_t, y)⟩                                            (4.4)
    φ_{p,t→t+1} = θ_{p,z_t,z_{t+1},y} · φ_p(x_t, x_{t+1}, z_t, z_{t+1}, y)            (4.5)

Figure 4.3 shows the factor graph corresponding to a photo collection.

Unlike most previous modelling of sequential visual data, all these terms depend on the unobserved variable y. This allows us to learn sub-events that help discriminate between events in a multi-class setting, whereas Tang, Fei-Fei, and Koller [118] only consider binary CRFs. In essence, our CRFs are calibrated to maximize multi-class prediction accuracy.


Figure 4.3: Factor graph corresponding to our photo collection event recognition model. Please refer to the text for notations and explanations.

Note also how the pairwise terms depend on observed data x_t and x_{t+1}. Indeed, inspired by Markov Jump Processes [9], we use the observed time gap

    δ_{t→t+1} = τ(x_{t+1}) − τ(x_t)        (4.6)

between two consecutive images x_t and x_{t+1} to influence the transition probabilities.

Our Stopwatch Hidden Markov Model can model the intuition that the transition matrices for short temporal gaps should typically be close to the identity matrix (i.e., prefer not to change state) while transition matrices for longer temporal gaps should be more distributed, as illustrated in Figure 4.4.

The model seamlessly integrates the information of the temporal gap δ_{t→t+1} between two consecutive images by making the energies for changing sub-event assignments dependent on the probability that an observed δ_{t→t+1} originates from this class. In this particular work, we used

    φ_p(x_t, x_{t+1}, z_t, z_{t+1}, y) = − log(p(δ_{t→t+1} | y)) · 1_{[z_t ≠ z_{t+1}]}        (4.7)

where p(δ_{t→t+1} | y) is estimated by Kernel Density Estimation using a Gaussian kernel. Intuitively, the model trusts a transition more if the observed time gap is consistent with time gaps observed for class y.
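To make Equation 4.7 concrete, here is a minimal sketch (not the original implementation; the class names, gap values and the `eps` floor are assumptions made purely for illustration) that estimates p(δ_{t→t+1} | y) with a Gaussian kernel density and turns it into the pairwise potential applied whenever two consecutive images are assigned different sub-events.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_gap_densities(gaps_per_class):
    """Fit one Gaussian KDE per event class on observed time gaps (in seconds)."""
    return {y: gaussian_kde(np.asarray(g, dtype=float)) for y, g in gaps_per_class.items()}

def pairwise_potential(delta, z_t, z_t1, y, kdes, eps=1e-12):
    """Eq. 4.7: -log p(delta | y) if the sub-event changes, 0 otherwise."""
    if z_t == z_t1:
        return 0.0
    p = kdes[y](delta).item() + eps      # density of the observed gap under class y
    return -np.log(p)

# toy usage with two hypothetical classes and gaps in seconds
kdes = fit_gap_densities({"wedding": [30, 45, 600, 1200], "hiking": [120, 300, 900]})
print(pairwise_potential(40.0, 0, 1, "wedding", kdes))
```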


Figure 4.4: Illustration of our Stopwatch Hidden Markov Model. The transition matrix between two consecutive images depends on the temporal gap δ_{t→t+1}. This allows to model bursts of photos and the typical durations of sub-events.

Inference, i.e. estimating the event and sub-event label, can be simply done as shown in Section 4.4.2, using the forward-backward algorithm. The learning of the parameters of the structured output SVM with latent variables [136] resembles the Expectation-Maximization (EM) algorithm, alternating between assigning images to sub-events (using fixed parameters) and optimizing the parameters (under fixed assignments) as we describe in the subsequent Section 4.4.3.

4.4.2 Inference

Given a photo collection, inferring the event class label and the latent sub-events means to jointly maximize over the latent variables and the class labels as in Equation 4.1.

This can be done efficiently by observing that, for a fixed event label y, the problem of inferring over the latent variables Z, i.e. solving

    Z*_y = argmax_Z ⟨Θ, Φ(X, y, Z)⟩ ,        (4.8)


Figure 4.5: Factor graph corresponding to our photo collection classification model when the event label y is fixed. As the global term becomes constant, it is omitted. The superscripts are added to acknowledge the dependency on y. The Viterbi algorithm can be used directly to infer the latent sub-event variables.

consists of inferring a chain model since states are not shared between different classes. We show such a chain in Figure 4.5.

To perform inference in the full model, we therefore simply apply the Viterbi algorithm to infer the latent variables Z*_y for each choice of event label y, and then maximize the corresponding prediction function over y:

    y* = argmax_y ⟨Θ, Φ(X, y, Z*_y)⟩ .        (4.9)

In essence, our model is therefore equivalent to having one chain model per event class, and predicting the class with highest confidence.

The Viterbi algorithm has a complexity of O(TS²), therefore the complexity of inferring our full model is O(KTS²), i.e. linear in the number of event classes and size of the photo collection, but quadratic in the number of sub-events.
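As an illustration of this per-class decoding (Equations 4.8 and 4.9), the following sketch assumes that the unary scores ⟨Θ_l, Φ_l(x_t, z_t, y)⟩, the pairwise scores and the global term have already been evaluated and stored as arrays per class; the variable names are hypothetical and not taken from the actual implementation.

```python
import numpy as np

def viterbi(unary, pairwise):
    """unary: (T+1, S) scores, pairwise: (T, S, S) scores between consecutive images.
    Returns the best total score and the maximizing sub-event sequence."""
    T1, S = unary.shape
    score = unary[0].copy()
    backptr = np.zeros((T1, S), dtype=int)
    for t in range(1, T1):
        cand = score[:, None] + pairwise[t - 1] + unary[t][None, :]   # (S, S)
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    z = [int(score.argmax())]
    for t in range(T1 - 1, 0, -1):
        z.append(int(backptr[t, z[-1]]))
    return float(score.max()), z[::-1]

def predict_event(unaries, pairwises, global_scores):
    """Eq. 4.9: decode one chain per event class and keep the best-scoring class."""
    best = None
    for y in unaries:
        s, z = viterbi(unaries[y], pairwises[y])
        s += global_scores.get(y, 0.0)            # add the global term <Theta_g, Phi_g>
        if best is None or s > best[1]:
            best = (y, s, z)
    return best                                    # (label, score, sub-event sequence)
```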

4.4.3 Training

In this section, we aim at learning the parameters Θ given a training set D = {(X_0, y_0), . . . , (X_N, y_N)} of N photo collections X_i with their class labels y_i ∈ Y.


We adopt the Latent Structural SVM framework [136]. The objective function is:

    min_Θ  λ/2 ‖Θ‖² + Σ_{i=0}^{N} max_{y,Z} ( ⟨Θ, Φ(X_i, y, Z)⟩ + Δ(y_i, y) )
                    − Σ_{i=0}^{N} max_{Z_i} ⟨Θ, Φ(X_i, y_i, Z_i)⟩ ,        (4.10)

where Δ is the 0/1 margin function that represents the mis-classification cost

    Δ(y, y′) = 1_{[y ≠ y′]} .        (4.11)

Minimizing Equation 4.10 consists of finding the best parameters Θ such that the correct class y_i is the minimizer of the margin-augmented prediction function. This is equivalent to wanting y_i to have the most confident score by a margin of 1.

We apply the Concave-Convex Procedure (CCCP) [138], which iterates between the following two optimization problems until convergence, like in [136]:

1. Infer the latent sub-event labels Z*_i for the ground-truth labels y_i for fixed parameters Θ. This is precisely solving Equation 4.8.

2. Solve the convex problem in Equation 4.12 below, which is Equation 4.10 with fixed latent sub-events Z*_i.

    min_Θ  λ/2 ‖Θ‖² + Σ_{i=0}^{N} max_{y,Z} ( ⟨Θ, Φ(X_i, y, Z)⟩ + Δ(y_i, y) )
                    − Σ_{i=0}^{N} ⟨Θ, Φ(X_i, y_i, Z*_i)⟩        (4.12)

We optimize the convex objective in Equation 4.12 using the Optimized Cutting Plane Algorithm as implemented in the Dlib C++ Library [72]. This implies performing margin-augmented inference, which we do simply by adding Δ(y_i, y) to Equation 4.9.
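Schematically, the training loop looks as follows. This is only a sketch: `impute_latent` stands for latent inference with Equation 4.8 and `solve_convex_ssvm` for a cutting-plane solver of the convex problem in Equation 4.12; neither name is taken from the actual implementation.

```python
def train_latent_ssvm(collections, labels, theta_init, n_outer_iters,
                      impute_latent, solve_convex_ssvm):
    """CCCP-style alternation between latent imputation and the convex SSVM problem."""
    theta = theta_init
    for _ in range(n_outer_iters):
        # step 1: infer sub-events for the ground-truth labels, parameters fixed (Eq. 4.8)
        latents = [impute_latent(theta, X, y) for X, y in zip(collections, labels)]
        # step 2: solve Eq. 4.12 with the latent assignments held fixed; the inner
        # margin-augmented inference adds Delta(y_i, y) to the prediction score
        theta = solve_convex_ssvm(collections, labels, latents)
    return theta
```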


4.4.4 Initial Sub-events

In this section, we describe how we initialize the sub-event labels. We found this initialization to be much more robust than initializing CCCP with random sub-event assignments. The key is to again take advantage of the photo bursts in the time domain. In this way, we exploit the relative importance given to those photos by the photographer. Our assumption is again that such bursts act as a proxy to latent sub-events. To do so, we segment each photo collection using Hierarchical Agglomerative Clustering using a Gaussian kernel in the time domain

    d(t_1, t_2) = exp( −(t_1 − t_2)² / (2σ²) )        (4.13)

with smoothing window σ as distance function. For each event class, the averaged visual features of each segment are then clustered using K-Means. We use the resulting clusters as initial sub-event assignments.
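A minimal sketch of this initialization is given below; it assumes one timestamp (in seconds) and one averaged feature vector per image, uses SciPy/scikit-learn purely for illustration, and approximates the Gaussian-kernel grouping of Equation 4.13 by a simple cut-off on temporal distance governed by σ. In the chapter, the segments of all training collections of one class are pooled before K-Means; here, for brevity, a single collection is clustered.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

def initial_subevents(timestamps, features, sigma=90.0, n_subevents=5):
    """Segment one collection into photo bursts, average the features of each
    segment and cluster the segments into initial sub-event labels."""
    times = np.asarray(timestamps, dtype=float)[:, None]
    seg = fcluster(linkage(times, method="average"), t=sigma, criterion="distance")
    seg_ids = np.unique(seg)
    seg_feats = np.stack([features[seg == s].mean(axis=0) for s in seg_ids])
    k = min(n_subevents, len(seg_ids))
    km = KMeans(n_clusters=k, n_init=10).fit(seg_feats)
    seg_to_sub = dict(zip(seg_ids, km.labels_))          # burst id -> sub-event id
    return np.array([seg_to_sub[s] for s in seg])        # one sub-event per image
```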

4.5 features and potentials

In this section, we provide specific details about the global feature vectors φ_g and the image-level ones φ_l that we use later in our experiments. Global features are functions of the whole photo collection and help to capture holistic properties, while sub-event features help to capture properties of single photos. Having access to EXIF data, we also include non-visual cues into our model.

4.5.1 Global Temporal Features

We define different cues based on time and aggregate them over the photo collection in different histograms. Those cues include time of day, day of week, month and the duration, to help recognize events that show specific patterns in the time domain.
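As an illustration (not the exact feature code), such histograms can be computed directly from the EXIF timestamps of a collection:

```python
import numpy as np
from datetime import datetime

def global_temporal_features(timestamps):
    """Histograms over hour of day, day of week and month, plus the collection
    duration, computed from a list of datetime objects."""
    n = len(timestamps)
    h_hour = np.bincount([t.hour for t in timestamps], minlength=24) / n
    h_day = np.bincount([t.weekday() for t in timestamps], minlength=7) / n
    h_month = np.bincount([t.month - 1 for t in timestamps], minlength=12) / n
    duration = (max(timestamps) - min(timestamps)).total_seconds()
    return np.concatenate([h_hour, h_day, h_month, [duration]])

# toy usage
ts = [datetime(2012, 12, 24, 18, 5), datetime(2012, 12, 24, 21, 40)]
print(global_temporal_features(ts).shape)   # (44,)
```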


4.5.2 Low-level Visual Features

As visual features, we sample SURF [6] descriptors densely on a grid and code them into a Bag of Words (BoW) representation using a vocabulary of 1 024 words, which is then max pooled. The vocabulary is previously learned using K-Means.
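For illustration, given densely extracted descriptors and a K-Means vocabulary, the encoding could look as follows; this is a hard-assignment sketch (the descriptor extraction itself is omitted and the vocabulary size is a parameter, not a detail of the original code).

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_vocabulary(training_descriptors, n_words=1024):
    """K-means vocabulary learned on a (sub)sample of training descriptors."""
    return KMeans(n_clusters=n_words, n_init=3).fit(training_descriptors)

def encode_bow_max(descriptors, vocab):
    """Hard-assign each descriptor to its nearest visual word and max-pool the
    one-hot assignments into a binary bag-of-words vector."""
    words = vocab.predict(descriptors)
    bow = np.zeros(vocab.n_clusters)
    bow[np.unique(words)] = 1.0
    return bow
```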

4.5.3 Higher-level Visual Features

For simplicity and speed reasons, we treat sub-events to be class specific, i.e., sub-events are not shared among different classes. To obtain a richer representation of images and to improve the semantics of sub-events, we use a number of attribute predictions which have been shown to help classification (e.g., [80]). These attributes consist of the type of scene and type of indoor scene, the number of faces, whether the image is a portrait, and a histogram of facial attributes over detected faces. To compute the attributes, we pre-trained a set of classifiers on external data. For scene and indoor attributes, a multi-class SVM was trained on the 15 Scenes [78] and the MIT-Indoor [105] data set, respectively. For facial detection and attributes, we use the code of [31] to predict age, gender and presence of sunglasses.

4.5.4 Reducing the Dimensionality

The high-dimensional BoW vectors are not directly used in the feature map in Equation 4.10, as this would make it prohibitively large. Instead we use multi-class SVMs to linearly project the BoW to a space of dimensionality equal to the number of sub-events. To further improve the robustness of these intermediate features, we add a negative sub-event containing random images from other classes while training the multi-class SVMs.
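One way to realize this projection is sketched below (illustrative only): a linear multi-class SVM is trained on BoW vectors labeled with the current sub-event assignments (plus the additional negative sub-event), and its per-sub-event decision values replace the high-dimensional BoW.

```python
from sklearn.svm import LinearSVC

def fit_bow_projection(bow_vectors, subevent_labels):
    """Multi-class linear SVM over (initial) sub-event labels, including a
    'negative' label assigned to random images of other classes."""
    return LinearSVC(C=1.0).fit(bow_vectors, subevent_labels)

def project_bow(projection, bow_vectors):
    """Low-dimensional image feature: one SVM score per sub-event."""
    return projection.decision_function(bow_vectors)   # (n_images, n_subevents[+1])
```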


4.6 experiments

In this section, we evaluate our approach on our novel data set. First we define the experimental protocol in Section 4.6.1. Then we explain in Section 4.6.2 the different baselines and variants of our approach that we compare. In Section 4.6.3, we report and analyze the results.

4.6.1 Protocol

We start by defining a training, a validation and a testing set. Out of the pool of 807 photo collections, we randomly selected 10 collections for each of the 14 classes as test set, which we use to report our evaluations. We also sampled 6 random collections per class to validate the hyper-parameters. All the remaining collections can be used for learning the parameters of the algorithms for event recognition. Each event class has at least 24 training collections.

We report different performance measures for all the evaluated methods: Average accuracy, recall@K and the F1-score to illustrate the precision/recall characteristics of the evaluated methods. In particular, recall@K is the fraction of test data samples for which the correct class is among the top-K scores.
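For reference, recall@K can be computed directly from the per-class scores of the test collections; a small sketch:

```python
import numpy as np

def recall_at_k(scores, labels, k):
    """scores: (n_samples, n_classes) classifier scores, labels: true class indices.
    Returns the fraction of samples whose true class is among the top-k scores."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return (topk == np.asarray(labels)[:, None]).any(axis=1).mean()

# toy example with two collections and three classes
print(recall_at_k(np.array([[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]), [1, 2], k=2))  # 0.5
```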

In the experiments that we report below, we have balanced our training data and used 24 random collections for each event class. For mining the initial sub-events, we set the smoothing window σ = 90 of Equation 4.13 and clustered them into 5 sub-events for each class. In the subsequent iterations, sub-events that are not assigned to any image are removed. We used the validation data to set the number of outer iterations in our training procedure (c.f. Section 4.4.3).

4.6.2 Approaches for Event Recognition

We now describe the different baselines and variants that we have compared in our experiments.


Aggregated Multi-class SVM

Here, we employ a method inspired by Mattivi et al. [90]. We train a multi-class SVM to recognize events in single images, based on visual features alone. At test time, each image is classified on its own and the confidence scores within each collection are averaged to predict the class of the collection.
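In code, this baseline amounts to simple score averaging over the collection (a sketch; `clf` stands for any multi-class classifier exposing per-class decision values, not the classifier actually used):

```python
import numpy as np

def classify_collection_by_averaging(clf, image_features):
    """Score each image of a collection individually, average the per-class
    confidences and predict the class with the highest mean score."""
    scores = clf.decision_function(image_features)    # (n_images, n_classes)
    mean_scores = scores.mean(axis=0)
    return int(mean_scores.argmax()), mean_scores
```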

Bag of Sub-events

In this variant, we adopt a Bag of Sub-events view and drop the pairwise connections of our model by setting them to 0, i.e., φ_p = 0. This way, the model has no information about transitions and ordering of the photos in the collection. Instead, the latent sub-events are independently assigned to each image to maximize the prediction on the training set. We use the same procedure as described in Section 4.4.3 to learn Θ. In this model, inference becomes trivial.

Hidden Markov Model (HMM)

To obtain a discriminative Hidden Markov Model (HMM), we build on the previous approach and incorporate sub-event transitions. However, we adopt the classical definition of transition matrices in Hidden Markov Models, which corresponds to:

    φ_p(x_t, x_{t+1}, z_t, z_{t+1}, y) = 1 .        (4.14)

Stopwatch Hidden Markov Model (SHMM)

We use our full model as described in Section 4.4, including the time-dependent sub-event transition features.

4.6.3 Results

In Table 4.2, we present the different performance measures obtained on the test set by our four approaches. The corresponding confusion matrices are shown in Figure 4.6. As we see, the baseline method of Aggregated SVM scores reaches an average accuracy of 41.43 %.


Method               Avg. Acc. [%]   Recall@2 [%]   F1-Score
Aggregated SVM           41.43           63.57        38.87
Bag of Sub-events        51.43           70.00        50.63
HMM                      53.57           68.57        54.61
SHMM (this work)         55.71           72.86        56.16

Table 4.2: Different performance measures for the evaluated methods.

Note how events taking place in different scene types can be discriminated properly, but events that have a similar scenery are confused (e.g. Hiking vs. Skiing, Figure 4.6a).

Switching to the Bag of Sub-events model leads to a significant improvement: 51.43 %. This demonstrates how the latent sub-event model can handle the variability within single event classes much better than a single SVM. This is also reflected in the F1 scores, which increase in the same order. As shown in Figure 4.6b, classes that contain different scenes like boat cruise are better captured. The HMM learned in a discriminative fashion increases the average accuracy further to 53.57 %. As transitions between sub-events can be captured by the model, dependencies between sub-events can be learned, which helps to recognize classes that failed before (e.g. Halloween or Road Trip, see Fig. 4.6b and 4.6a).

Finally, our approach gives the best results both in terms of accuracy and F1 score. With 55.71 % average accuracy, it is 2.14 % more accurate than HMMs, 4.28 % better than Bags of sub-events, and the performance is as much as 14.28 % higher than Aggregated SVM scores. In terms of F1 scores, the improvements are 1.55 %, 5.53 % and 17.29 %, respectively. Accordingly, we see in Figure 4.6b how the confusion was reduced for most classes.

The performances of the different methods as measured by the recall@K are shown in Figure 4.7. Our method consistently performs as well as, or outperforms, all other considered approaches. For instance, the correct event is among the top two predictions for 72.86 % of the collections.

Looking at the average of images assigned to sub-events by our SHMM shown in Figure 4.8, we can sometimes clearly identify semantic concepts: outdoor view for the Hiking class, a typical photo setting for Graduation, painting frames for Exhibitions.


[Figure 4.6 panels: Aggregated SVM (41.43 %), Bag of Sub-events (51.43 %), HMM (53.57 %), SHMM (55.71 %); 14 × 14 confusion matrices over the classes Birthday, Children's Birthday, Christmas, Concert, Cruise, Easter, Exhibition, Graduation, Halloween, Hiking, Road Trip, Saint Patrick's Day, Skiing, Wedding.]

Figure 4.6: Confusion matrices for the approaches we compare. For each confusion matrix we also show the average accuracy. Please refer to the text for explanations.


Figure 4.7: Recall of the different methods (Aggregated SVM, Bag of Sub-events, HMM, SHMM) when looking at the top K classification scores for each collection.


This highlights the benefits of using a latent model for event recognition, as it can provide some additional semantic knowledge that eventually increases the ability to automatically understand, organize and exploit images in photo collections.

We also show in Figure 4.9 some examples of photo collections that our approach correctly and incorrectly classified. As can be seen, visually and semantically very similar classes such as Birthday, Children's Birthday, Graduation, Halloween etc. are still confused to some extent. This highlights the difficulty of our new data set and the challenge it represents for the future.

4.7 conclusion

In this chapter, we have introduced a novel data set for event recognition in photo collections. We have proposed a model based on Hidden Markov Models that takes into account the time gap between images to estimate the probability to change state. Our model outperforms several other approaches based on previously published works. The final accuracy of 56 % highlights the sheer difficulty of the data set, which we hope will foster research in this domain.

We believe that semantic hierarchies would help model events as well as complex sub-events, while scaling sub-linearly with the number of event classes and sub-events.


(a) Graduation (b) Hiking

(c) Christmas (d) Concert

(e) Easter (f) Exhibition

Figure 4.8: Average images corresponding to sub-events learned by our model for different classes (best viewed in color on a computer screen).


[Figure 4.9 examples – correctly classified: Children's birthday, Christmas, St. Patrick's Day, Exhibition, Wedding, Concert; misclassified: predicted Hiking (Road Trip), Exhibition (Christmas), Children's birthday (Easter).]

Figure 4.9: Some classification examples. On the right side, the predicted event class labels are shown and the color indicates if the SHMM correctly predicted it (correct labels shown in braces, only a selected subset of images is shown).


5 FOOD RECOGNITION USING DISCRIMINATIVE COMPONENTS

As in Chapter 4 – where latent sub-events are automatically learned to represent photo collections – this chapter automatically mines discriminative components from a set of single images for recognition. We study a novel method for the mining of such discriminative components based on Random Forest (RF). They allow us to mine components simultaneously for all classes and to share knowledge among them. To improve efficiency, only patches that are aligned with image superpixels are considered. The method is motivated from the application of recognizing pictured dishes, for which we also introduce a novel, large-scale data set consisting of 101 000 images in 101 classes. But it also achieves competitive performance on the task of scene classification. This chapter is based on the work that was published in [13].

5.1 introduction

Food is an important part of everyday life. This clearly ripples through into digital life, as illustrated by the abundance of food photography in social networks, dedicated photo sharing sites and mobile applications.1 Automatic recognition of dishes would not only help users effortlessly organize their extensive photo collections but would also help online photo repositories make their content more accessible. Additionally, mobile food photography is now used to help patients estimate and track their daily calorie intake, outside of any constraining clinical environment. However, current systems resort to nutrition experts [88] or Amazon Mechanical Turk [95] to label food items.

Despite these numerous applications, the problem of recognizing dishes and the composition of their ingredients has not been fully addressed by the computer vision community.

1 E.g.: foodspotting.com, sharedappetite.com, foodgawker.com, etc.


(a) Baby back ribs (b) Chocolate cake (c) Hot & sour soup

(d) Caesar salad (e) Eggs benedict (f) Mussels

Figure 5.1: Typical examples of our data set and corresponding mined components.

This is not due to the lack of challenges. In contrast to scene classification or object detection, food typically does not exhibit any distinctive spatial layout: while we can decompose an outdoor scene with a ground plane, a horizon and a sky region, or a human as a trunk with a head and limbs, we cannot find similar patterns relating ingredients of a mixed salad. The point of view, the lighting conditions, but also (and not least) the very realization of a recipe are among the sources of high intra-class variations. On the bright side, the nature of dishes is often defined by the different colors and textures of its different local components, such that humans can identify them reasonably well from a single image, regardless of the above variations. Hence, food recognition is a specific classification problem calling for models that can exploit local information.

As a consequence, we aim at identifying discriminative image regions which help distinguish each type of dish from the others. We refer to those as components and show a few examples in Figure 5.1.


To mine for such components, we introduce a weakly-supervised mining method which relies on Random Forests [54, 17]. It is similar in spirit to previously proposed mid-level discriminative patch mining work [35, 117, 128, 81, 38, 69, 111, 135]. Our Random Forest mining framework differs from all these works in the following points: First, it mines for discriminative components simultaneously for all classes, instead of independently. This speeds up the training process and allows knowledge to be shared between classes. Second, we restrict the search space for discriminative parts to patches aligned with superpixels, instead of sampling random image patches, in a spirit similar to what has been successfully proposed in the context of object detection [122, 51]. As a consequence, not only do we manipulate regions that are consistent in color and texture, but we can afford extracting stronger visual features to improve classification. This also dramatically reduces the classification complexity on test images, as the number of component classifiers/detectors can be fairly large (hundreds to several ten thousands): we typically use only a few dozens of superpixels per image, compared to tens of thousands of sliding windows.

This work also introduces a new, publicly available data set for real-world food recognition with 101 000 images. We coin this data set Food-101, as it consists of 101 categories. To the best of our knowledge, this is the first public database of its kind. So far, research on food recognition has been either performed on closed, proprietary data sets [55] or on small-scale image sets taken in a controlled laboratory environment [24, 133].

In summary, this work makes the following contributions:

• A novel discriminative part mining method based on Random Forests.

• A superpixel-based patch sampling strategy that prevents running many detectors on sliding windows.

• A novel, large scale and publicly available data set for food recognition.

• Experiments showing that our approach outperforms the state-of-the-art Improved Fisher Vectors (IFV) classifier [107] and the part-based mining approach of [111] on Food-101. On the MIT-Indoor data set, our method compares nicely to very recent mining methods and is competitive with IFV.

We discuss related work in the next section. Our novel data set is described in Section 5.3. In Section 5.4, we introduce our component mining and classification framework. Our method is then evaluated in Section 5.5, and we conclude in Section 5.6.

5.2 related work

Image classification is a core problem for computer vision, with many recent advances coming from object recognition. Classical approaches exploit interest point descriptors, extracted locally or on a dense grid, then pooled into a vectorial representation to use SVM for classification. Recent advances highlight the importance of nonlinear feature encoding, e.g., Fisher Vectors [107] or Super-Vectors [139], and spatial pooling [78].

A very recent and successful trend in classification is to try and identify discriminative object (or scene) parts (or patches) [35, 117, 128, 81, 38, 69, 111, 135], drawing on the success of Deformable Part-based Models (DPM) for object detection [42]. This can consist of (a) finding prototypes for regions of interest [105, 135], (b) mining patches whose associated binary SVM obtains good classification accuracy on a validation set [111], (c) clustering patches with a multi-instance Support Vector Machine (mi-SVM) [128] on an external data set [81], (d) optimizing part detectors in a latent SVM framework [117], (e) evaluating many exemplar-SVMs [38, 69] on sliding windows, exploiting discriminative decorrelation [53] to speed up the process, or (f) identifying discriminative modes in the HOG feature space [35].

While this work represents a variant of discriminative part mining, it differs in various ways from previous work. In contrast to all other discriminative part mining methods, we efficiently and simultaneously mine for discriminative parts for all the categories in our data set thanks to the multi-class nature of Random Forests.


Secondly, while all other methods employ a computationally expensive (often multi scale) sliding window detection approach to produce the part score maps for the final classification step, our approach employs a simple yet effective window selection by exploiting image superpixels.

Concerning food recognition, most works follow a classical recognition pipeline, focusing on feature combination and on specialized data sets. [68] uses a private data set of Japanese food, later augmented with more features and classes [71]. Similarly, [25] jointly classifies and estimates quantity of 50 Chinese food categories using private data. [89] uses DPM to locally pool features. Food images obtained in a controlled environment are also popular in the literature. The Pittsburgh food data set (PFID) [24] contains 101 classes, but with only 3 instances per class and 8 images per instance. Yang et al. [133] propose to learn spatial relationships between ingredients using pairwise features. This approach is bound to work only for standardized meals.

We resort to Random Forests (RFs) [54, 17] for mining discriminative regions in images. They are a well-established clustering and classification framework and proved successful for many vision applications, including object recognition [10, 94, 110], object detection [46] and semantic segmentation [116, 110].

Our use of RF is different compared to those works. Instead of directly using a RF for classification of patches [10] or learning specific locations of interest in images [135], we are using RF to discriminatively cluster superpixels into groups (leaves), and then use the leaf statistics to select the most promising groups (i.e., mine for parts). For this key step, we have developed a distinctiveness measure for leaves, and ensure that distinctive but near-duplicate leaves are merged. Once parts are mined, the RF is entirely discarded and is not used at classification time (in contrast to [10, 94, 135]). Instead we model the mined components explicitly and directly using SVMs. At test time, only those SVMs need to be evaluated on the image regions.

5.3 data set : food-101

As noted above, to date, only the PFID data set [24] is publicly available. However, it contains only standardized fast food images taken under laboratory conditions.


Figure 5.2: Here we show one example of each class in the data set, in the same order as in Table 5.1. Note the high variance in food type, color, exposure and level of detail, but also visually and semantically similar food types.


Apple Pie, Chocolate Cake, French Toast, Macarons, Risotto,
Baby Back Ribs, Chocolate Mousse, Fried Calamari, Miso Soup, Samosa,
Baklava, Churros, Fried Rice, Mussels, Sashimi,
Beef Carpaccio, Clam Chowder, Frozen Yogurt, Nachos, Scallops,
Beef Tartare, Club Sandwich, Garlic Bread, Omelette, Seaweed Salad,
Beet Salad, Crab Cakes, Gnocchi, Onion Rings, Shrimp And Grits,
Beignets, Creme Brulee, Greek Salad, Oysters, Spaghetti Bolognese,
Bibimbap, Croque Madame, Grilled Cheese Sandwich, Pad Thai, Spaghetti Carbonara,
Bread Pudding, Cup Cakes, Grilled Salmon, Paella, Spring Rolls,
Breakfast Burrito, Deviled Eggs, Guacamole, Pancakes, Steak,
Bruschetta, Donuts, Gyoza, Panna Cotta, Strawberry Shortcake,
Caesar Salad, Dumplings, Hamburger, Peking Duck, Sushi,
Cannoli, Edamame, Hot And Sour Soup, Pho, Tacos,
Caprese Salad, Eggs Benedict, Hot Dog, Pizza, Takoyaki,
Carrot Cake, Escargots, Huevos Rancheros, Pork Chop, Tiramisu,
Ceviche, Falafel, Hummus, Poutine, Tuna Tartare,
Cheesecake, Filet Mignon, Ice Cream, Prime Rib, Waffles,
Cheese Plate, Fish And Chips, Lasagna, Pulled Pork Sandwich,
Chicken Curry, Foie Gras, Lobster Bisque, Ramen,
Chicken Quesadilla, French Fries, Lobster Roll Sandwich, Ravioli,
Chicken Wings, French Onion Soup, Macaroni And Cheese, Red Velvet Cake

Table 5.1: Full list of all classes in the Food-101 data set.


Therefore, we have collected a novel real-world food data set by downloading images from foodspotting.com. The site allows users to take images of what they are eating, annotate place and type of food and upload this information online. We chose the top 101 most popular and consistently named dishes and randomly sampled 750 training images. Additionally, 250 test images were collected for each class, and were manually cleaned. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. We believe that real-world computer vision algorithms should be able to cope with such weakly labeled data if they are meant to scale well with the number of classes to recognize.

All images were rescaled to have a maximum side length of 512 pixels and smaller ones were excluded from the whole process. This leaves us with a data set of 101 000 real-world images in total, including very diverse but also visually and semantically similar food classes such as Apple pie, Waffles, Escargots, Sashimi, Onion rings, Mussels, Edamame, Paella, Risotto, Omelette, Bibimbap, Lobster bisque, Eggs benedict, Macarons to name a few (see Table 5.1 for a full list). Examples are shown in Figure 5.2. The data set is available for download at http://www.vision.ee.ethz.ch/datasets/food-101/.


Figure 5.3: Overview of our component mining. A Random Forest is used to hierarchically cluster superpixels of the training set (Section 5.4.1). Then, discriminative clusters of superpixels in the leaves are selected (Section 5.4.2) and used to train the component models (Section 5.4.3). After mining, the RF is not used anymore.

5.4 random forest component mining

In this section we show how we mine discriminative components using Random Forests [54, 17] as visualized in Figure 5.3. This has two benefits: In contrast to [35, 117, 128, 81, 111], components can be mined for all classes jointly because Random Forests are inherently multi-class learners. Compared to [35, 38, 69], which follow a bottom-up approach and thus need to evaluate all of the several thousands of candidate component SVMs to assess how discriminant they are, Random Forest mining instead employs top-down clustering to generate a set of candidate components (Section 5.4.1). Thanks to the class-entropy criterion for choosing split functions, the generation of components is directly related to their discriminative power. We refine the selection of robust discriminative components in a second step (Section 5.4.2) by looking at consistent clusters across the trees of the forest and train robust component models afterwards (Section 5.4.3). The final classification step is then detailed in Section 5.4.4.


5.4.1 Candidate Component Generation

For generating candidate clusters, we train a weakly supervised Random Forest on superpixels associated with the class label of the image they stem from. By maximizing the information gain in each node, the forest will eventually separate discriminative superpixels from ambiguous ones that occur in several classes. Discriminative superpixels likely end up in the same leaf while non-discriminative ones are scattered.

Let a forest T = {T_t} be a set of trees T_t, each one trained on a random selection of samples (superpixels)

\mathcal{S} = \{ s_i = (\mathbf{x}_i, y) \} \qquad (5.1)

where x_i ∈ R^d is the feature vector of the sample s_i and y the class label of the corresponding image. For each node n we train a binary decision function

\phi_n : \mathbb{R}^d \rightarrow \{0, 1\} \qquad (5.2)

that sends each sample to either the left or right sub-tree and splits S into two sets, S_l and S_r.

While training, at each node, the decision function φ is chosen out of a set of randomly generated decision functions {φ_n} so as to maximise the information gain criterion

I(\mathcal{S}, \phi) = H(\mathcal{S}) - \left( \frac{|\mathcal{S}_l|}{|\mathcal{S}|} H(\mathcal{S}_l) + \frac{|\mathcal{S}_r|}{|\mathcal{S}|} H(\mathcal{S}_r) \right), \qquad (5.3)

where H(·) is the class entropy of a set of samples. The training continues to split the samples until either a maximum depth is reached, or when too few samples, or samples of a single class, are left.

In this work we use linear classifiers [10] as decision functions, and more specifically resort to training binary SVMs:

\phi(\mathbf{x}) = \mathbb{1}\left[ \mathbf{w}^\top \mathbf{x} + b > 0 \right]. \qquad (5.4)

We generate different φ(x) by training them on randomly generated binary class partitions of the class labels in S.
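To make this node-training step concrete, the following minimal Python sketch (not the thesis implementation; the helper names, parameters and the use of scikit-learn's LinearSVC are illustrative assumptions) shows how a split function could be chosen by sampling random binary class partitions, fitting a linear SVM for each and keeping the one that maximizes the information gain of Equation 5.3.

```python
# Minimal sketch (illustrative, not the thesis code): choosing a node's split
# function from random binary class partitions via the information gain (5.3).
import numpy as np
from sklearn.svm import LinearSVC


def class_entropy(labels):
    """Empirical class entropy H(S) of a set of sample labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def information_gain(labels, go_right):
    """I(S, phi) = H(S) - (|Sl|/|S| H(Sl) + |Sr|/|S| H(Sr))."""
    left, right = labels[~go_right], labels[go_right]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    return class_entropy(labels) - (
        len(left) / len(labels) * class_entropy(left)
        + len(right) / len(labels) * class_entropy(right)
    )


def sample_split_function(X, y, n_partitions=100, rng=np.random.default_rng(0)):
    """Return the SVM whose induced split maximizes the information gain."""
    classes = np.unique(y)
    best_gain, best_svm = -np.inf, None
    for _ in range(n_partitions):
        # Assign a random binary label to every class present at this node.
        side = rng.integers(0, 2, size=len(classes))
        if side.min() == side.max():        # degenerate partition, skip it
            continue
        binary_targets = side[np.searchsorted(classes, y)]
        svm = LinearSVC(C=1.0, max_iter=2000).fit(X, binary_targets)
        gain = information_gain(y, svm.decision_function(X) > 0)
        if gain > best_gain:
            best_gain, best_svm = gain, svm
    return best_svm, best_gain


if __name__ == "__main__":
    # Toy data standing in for superpixel feature vectors with image labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 16))
    y = rng.integers(0, 5, size=400)
    svm, gain = sample_split_function(X, y, n_partitions=10, rng=rng)
    print("best information gain:", gain)
```

In the actual pipeline this selection would be repeated recursively for every node until the stopping criteria described in Section 5.5.1 are met.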

After training the forest, each tree T_t has a set of leaves L_t = {l}.


In the sequel, we denote by L = ∪_t L_t the set of all leaves in the forest. They constitute the set of candidates for discriminative components. In the next section, we describe how we select the most discriminative ones.

5.4.2 Mining Components

After training the forest as described in Section 5.4.1, the input space has been partitioned into a set L of leaves. However, not all leaves have the same discriminative power and several leaves may carry similar information as they were trained independently. In this section, we propose a simple yet effective method to identify a diverse set of discriminative leaves for each class.

Based on the training data, each leaf l is associated with an empirical distribution of class labels p(y|l). Using a validation set, we classify each sample s using the forest, and we define δ_{l,s} = 1 if the sample has reached the leaf l, and 0 otherwise. For each sample, we can easily derive its class confidence score p(y|s) from the statistics of the leaves it reached:

p(y \,|\, s) = \frac{1}{|\mathcal{T}|} \sum_{l \in \mathcal{L}} \delta_{l,s} \, p(y \,|\, l). \qquad (5.5)

Note that \sum_l \delta_{l,s} is equal to the number of trees in the forest, i.e., |T|, as a sample reaches a single leaf in each tree.

A high class confidence score implies that most trees were able to separate the sample well from the other classes. To obtain components, we could use these discriminative samples directly in the spirit of exemplar SVMs [86]. However, many discriminative samples are very similar. For efficiency, i.e., to reduce the number of component models, it makes sense to identify consistent clusters of discriminative samples instead and train a single, more robust model for each cluster.

This is readily possible by exploiting the leaves again. For a single class y, we can evaluate how many discriminative samples are located in each leaf l by considering the following measure:

\mathrm{distinctiveness}(l \,|\, y) = \sum_{s} \delta_{l,s} \, p(y \,|\, s). \qquad (5.6)

Leaves with high distinctiveness are those which collect many discriminative samples (i.e., samples that have a high class confidence score), thus forming different clusters of discriminative samples.


Note that discriminative clusters that are identified by different trees can easily be filtered out by a variation of non-maxima suppression: After sorting the leaves based on their distinctiveness, we ignore models that consist of more than half of the same superpixels as any better scoring leaf. This way, we increase the diversity of components while retaining the strongest ones. Although models with a very similar set of superpixels indicate a very strong component, diversity is more beneficial for classification as this provides richer input to the final classifier.
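The sketch below illustrates how Equations 5.5 and 5.6 and the suppression step could be combined for one target class; the data structures (per-sample leaf lists and per-leaf class distributions) are hypothetical stand-ins for the forest statistics, not the original code.

```python
# Minimal sketch (illustrative data structures): ranking leaves by the
# distinctiveness of Eq. (5.6) using the class confidences of Eq. (5.5),
# followed by overlap-based suppression of near-duplicate leaves.
from collections import defaultdict


def mine_components(leaf_ids_per_sample, leaf_class_dist, target_class, top_n=20):
    """leaf_ids_per_sample[s] : list of leaf ids reached by sample s (one per tree).
    leaf_class_dist[l][y]     : empirical p(y | l) estimated on the training data."""
    n_trees = len(next(iter(leaf_ids_per_sample.values())))

    # Eq. (5.5)/(5.6): p(y|s) of every validation sample for the target class,
    # accumulated into the leaves that the sample reached.
    distinctiveness, members = defaultdict(float), defaultdict(set)
    for s, leaves in leaf_ids_per_sample.items():
        p_y_s = sum(leaf_class_dist[l].get(target_class, 0.0) for l in leaves) / n_trees
        for l in leaves:
            distinctiveness[l] += p_y_s
            members[l].add(s)

    # Non-maxima suppression: drop leaves that share more than half of their
    # superpixels with an already selected, better scoring leaf.
    selected = []
    for l in sorted(distinctiveness, key=distinctiveness.get, reverse=True):
        if all(len(members[l] & members[k]) <= len(members[l]) / 2 for k in selected):
            selected.append(l)
        if len(selected) == top_n:
            break
    return selected
```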

In Figs. 5.1 and 5.8, we show such examples of mined components. The influence of the number of trees and their depth, but also of the number N of discriminative components kept for each food category, is studied in Section 5.5.2.

5.4.3 Training Component Models

For each class, we then select the top N leaves and train for each one a linear binary SVM to act as a component model. For training, the most confident samples of class y of a selected leaf act as the positive set while a large repository of samples acts as the negative set. To speed up this process, we perform iterative hard-negative mining. Note that nothing prevents a single leaf from being selected by several classes. This is not a problem at all, since only samples of a single class are used as positives for training a single model.
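A generic sketch of such an iterative hard-negative mining loop is given below; the number of rounds, the cache size and the margin threshold are illustrative values and not taken from the thesis.

```python
# Generic sketch of iterative hard-negative mining for one component model.
import numpy as np
from sklearn.svm import LinearSVC


def train_component_model(X_pos, X_neg_pool, n_rounds=3, cache_size=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a random subset of the large negative repository.
    neg_idx = rng.choice(len(X_neg_pool),
                         size=min(cache_size, len(X_neg_pool)), replace=False)
    X_neg = X_neg_pool[neg_idx]
    svm = None
    for _ in range(n_rounds):
        X = np.vstack([X_pos, X_neg])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
        svm = LinearSVC(C=0.1, max_iter=2000).fit(X, y)
        # Hard negatives: repository samples scored close to or above the margin.
        scores = svm.decision_function(X_neg_pool)
        hard = X_neg_pool[scores > -1.0]
        if len(hard) == 0:
            break
        keep = min(cache_size, len(hard))
        X_neg = hard[rng.choice(len(hard), size=keep, replace=False)]
    return svm
```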

5.4.4 Recognition from Mined Components

For classifying an image, we only need to score all of its superpixels using the previously trained component models, instead of applying multi-scale sliding window detectors [35, 117, 128, 81, 38, 69, 111, 135]. This leaves us with a score vector of K×N component confidence scores for K classes and N components for each superpixel, as illustrated in Figure 5.4. In the case of a sliding window detector, a standard approach is to max pool scores spatially and then use this representation to train an SVM. We use a spatial pyramid with 3 levels and adopt a slightly different approach for our superpixels:


Figure 5.4: At classification time, all superpixels of an input image are scored using the component models; afterwards a multi-class SVM with spatial pooling predicts the final class. In this visualisation, we show the confidence scores of edamame, french fries, beignets and bruschetta.

Each superpixel fully contributes to each spatial region it is part of. The scores are then averaged within each region. This loose spatial assignment has proved significantly beneficial for the task of food recognition compared to more elaborate aggregation methods like soft-assignment of superpixels to regions. For final classification, we train a structured-output multi-class SVM using the optimized cutting plane algorithm [67], namely using DLib's [72] implementation.
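The sketch below illustrates this pooling under the assumption that superpixels are represented by their bounding boxes (all names and sizes are illustrative); with a 1×1 + 2×2 + 4×4 pyramid there are 21 regions, which for K = 101 classes and N = 20 components matches the 42 420-dimensional representation mentioned in Section 5.5.2.

```python
# Sketch of the test-time representation: every superpixel contributes fully
# to each spatial-pyramid cell it overlaps, scores are averaged per cell, and
# the result is the fixed-length vector fed to the final multi-class SVM.
import numpy as np


def pyramid_pool(scores, boxes, image_size, levels=(1, 2, 4)):
    """scores: (S, K*N) component scores per superpixel.
    boxes:  (S, 4) superpixel bounding boxes as (x0, y0, x1, y1).
    image_size: (width, height)."""
    w, h = image_size
    pooled = []
    for g in levels:                      # 3-level pyramid: 1x1, 2x2, 4x4
        for i in range(g):
            for j in range(g):
                cell = (i * w / g, j * h / g, (i + 1) * w / g, (j + 1) * h / g)
                hit = np.asarray([(b[0] < cell[2] and b[2] > cell[0] and
                                   b[1] < cell[3] and b[3] > cell[1]) for b in boxes])
                cell_scores = (scores[hit].mean(axis=0) if hit.any()
                               else np.zeros(scores.shape[1]))
                pooled.append(cell_scores)
    return np.concatenate(pooled)


# Toy example: 30 superpixels, 101 classes x 20 components = 2020 scores each.
scores = np.random.rand(30, 101 * 20)
boxes = np.random.randint(0, 200, size=(30, 4))
boxes = np.sort(boxes.reshape(30, 2, 2), axis=1).reshape(30, 4)  # x0<x1, y0<y1
vec = pyramid_pool(scores, boxes, image_size=(512, 384))
print(vec.shape)  # (42420,)
```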

5.5 experimental evaluation

In the following, we refer to our approach as Random Forest Discriminative Components (RFDC) and evaluate it against various methods. For our novel Food-101 data set (Section 5.3), 750 images of each class are used for training and the remaining 250 are test images. We measure performance with average accuracy, i.e., the fraction of test images that are correctly classified.

We first give details of our implementation in Section 5.5.1 and then analyze the robustness of our approach with respect to its different parameters in Section 5.5.2. In Section 5.5.3, we compare to baselines and alternative state-of-the-art component-mining algorithms for classification.


As our approach is generic and can be directly applied to other classification problems as well, we also evaluate on the MIT-Indoor data set [105] in Section 5.5.4.

5.5.1 Implementation Details

We first describe the parameters that we held constant during the evaluation and which had empirically little influence on the overall classification performance.

Superpixels and Features.

In this work, we have used the graph-based superpixels of [43]. In practice, setting σ = 0.1, k = 300 and a minimum superpixel size of 1 % of the image area yields around 30 superpixels per image, and a total of about 2.4 million superpixels in the training set. Changes in those parameters had limited impact on the classification performance.

For each superpixel, two feature types are extracted: densely sampled SURF descriptors [6], which are transformed using signed square-rooting [4], and L*a*b color values. In our experiments it has proved beneficial to also extract features around the superpixel, namely within its bounding box, to include more context. Both SURF and color values are encoded using Improved Fisher Vectors (IFV) [107] as implemented in VLFeat [123] and a GMM with 64 modes. We perform PCA-whitening on both feature channels. In the end, the two encoded feature vectors are concatenated, producing a dense vector with 8 576 values.
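As a rough illustration of the steps around the encoding, the sketch below applies signed square-rooting to raw descriptors and PCA-whitening to the two encoded channels before concatenation; the Fisher vector encoding itself (done with VLFeat in our pipeline) is replaced by random stand-in data, and all dimensions are toy values.

```python
# Sketch of the normalization steps around the encoding (the IFV encoding
# itself is only stubbed out with random data here).
import numpy as np


def signed_sqrt(x):
    """Signed square-rooting of descriptor values: sign(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))


def fit_pca_whitening(X, n_components):
    """Return (mean, projection) so that projected data has identity covariance."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (len(X) - 1)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_components]
    proj = eigvec[:, order] / np.sqrt(eigval[order] + 1e-8)
    return mean, proj


def whiten(X, mean, proj):
    return (X - mean) @ proj


surf_desc = signed_sqrt(np.random.rand(5000, 64))  # raw SURF, would be IFV-encoded next

# Toy stand-ins for the per-superpixel IFV encodings of the two channels.
surf_ifv = np.random.randn(1000, 256)
color_ifv = np.random.randn(1000, 128)
m1, p1 = fit_pca_whitening(surf_ifv, 128)
m2, p2 = fit_pca_whitening(color_ifv, 64)
features = np.hstack([whiten(surf_ifv, m1, p1), whiten(color_ifv, m2, p2)])
print(features.shape)  # (1000, 192) in this toy example
```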

Component Mining.

For component mining, we randomly sample 200 000 superpixels from the 2.4 million to use as a validation set. Each tree is then grown on 200 000 randomly sampled superpixels from the remaining 2.2 million samples. At each node, we sample 100 binary partitions by assigning a random binary label to each present class. For each partition, a binary SVM is learned, and the SVM that maximizes the criterion in Equation 5.3 is kept.


The training of SVMs is performed using at most 20 000 superpixels, and the splitting is stopped if a node contains fewer than 25 samples.

5.5.2 Influence of Parameters for Component Mining

To measure the influence of the parameters of RFDC, we proceed by fixing the values of the parameters and varying one dimension at a time. By default, we trained 30 trees and mined the parts at depth 7. We then used the top 20 scored component models per class and trained each of them using their top 100 most confident samples as positive set.

Forest Parameters.

Figure 5.5 shows the influence of the number of trees, tree depth, number of samples per model and number of components per class on classification accuracy. RFDC is very robust with respect to those parameters. For instance, increasing the number of trees from 10 to 30 does not make a big difference in accuracy (see Figure 5.5a), and tree depth also has little influence beyond 4 levels (Figure 5.5b). Using more positive samples to train the component models (Section 5.4.3) improves classification performance of the system, but a plateau is reached beyond 200 samples (Figure 5.5c). However, using only 200 positive samples results in significant speed-ups in training.

Similar to other approaches [111], Figure 5.5d shows that classification performance improves as the number of components per class grows. Also for this parameter the performance saturates. Moreover, the modest improvement in classification accuracy beyond 20 components per class comes with a dramatic increase in feature dimensionality (only worsened by spatial pooling): from 42 420 for 20 components, the dimensionality reaches 106 050 for 50 components and thus heavily impacts memory usage and speed.

In conclusion, our RFDC method shows a very strong robustness with respect to its (hyper-)parameters. Fine-tuning of these parameters is therefore not necessary in order to achieve good classification accuracy.


Figure 5.5: Influence of different parameters of RFDC on classification performance on the Food-101 data set, measured in average accuracy [%]: (a) number of trees, (b) tree depth, (c) number of samples per model, (d) number of components per class.


Encoding & Features Avg. Acc. [%]

- HOG 8.85

BoW SURF@1024 33.47

BoW SURF@1024 + Color@256 38.83

IFV SURF@64 44.79

IFV Color@64 14.24

IFV SURF@64 + Color@64 49.40

Table 5.2: Classification performance for different feature types forRFDC. @K refers to the code book size.

On Features.

Using the standard settings as in the previous experiment, we compared different feature types for RFDC. For extracting HOG, we resize the superpixel patches to 64 × 64 pixels. For BoW and IFV encoding, we use the dictionary sizes as shown in Table 5.2. Unsurprisingly, HOG is not well suited for describing food parts, as their patterns are rather specific. SURF with BoW encoding yields a significant improvement, only superseded by IFV encoding.

5.5.3 Comparison on Food-101

To compare our RFDC approach to different methods, we use 30 trees with a max depth of 5. For mining, we keep 500 positive samples per component and 20 components per class. We compare against the following methods:

Bag-of-Words Histogram (BoW).

As a baseline, we follow a classical classification approach using Bag of Words (BoW) histograms of densely-sampled SURF features, combined with a spatial pyramid [78]. We use 1 024 clusters learned with k-means as the visual vocabulary, and 3 levels of spatial pyramid. A structured-output multi-class SVM is then used for classification (as described in Section 5.4.4).


Improved Fisher Vectors (IFV).

To compare against a state-of-the-art classification method, we apply Improved Fisher Vector encoding and spatial pyramids [107] to our problem. For this we employ the same parameters as in [69]. We also use a multi-class SVM for classification.

Random Forest Classification (RF).

The Random Forest used for component mining (Section 5.4) can be used directly to predict the food categories, as it is a multi-class classifier. As in [10], we obtain the final classification by aggregating the class confidence score (Equation 5.5) of each superpixel s_i and then classify an image I = {s_i} using

y^* = \operatorname{argmax}_{y} \sum_{s_i \in I} p(y \,|\, s_i). \qquad (5.7)

This will highlight the benefit and importance of component mining (Section 5.4.2) and of having another SVM for final classification.
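A minimal sketch of this aggregation (Equation 5.7), with toy confidence values standing in for the forest output:

```python
# Sketch of the RF baseline: per-superpixel class confidences (Eq. 5.5) are
# summed over the image and the argmax is taken (Eq. 5.7).
import numpy as np


def classify_with_forest(superpixel_confidences):
    """superpixel_confidences: (S, K) array with p(y | s_i) for each of the
    S superpixels of an image and the K classes."""
    return int(np.argmax(superpixel_confidences.sum(axis=0)))


# Toy example with 30 superpixels and 101 classes.
conf = np.random.rand(30, 101)
conf /= conf.sum(axis=1, keepdims=True)
print(classify_with_forest(conf))
```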

Randomized Clustering Forests (RCF).

The extremely randomized clustering forest approach of [94] can also be adapted to our problem. The trained RF for component mining can again be used to generate the feature vectors as in [94]. To obtain the final classification, a multi-class SVM is trained on top of these features. This comparison will also show the importance of dedicated component models.

Mid-Level Discriminative Superpixels (MLDS).

We implemented the recent Mid-Level Discriminative Patches approach of [111] for comparison and replaced sliding HOG patches with superpixels. The negative set consists of 500 000 random superpixels and all the superpixels from one class (around 22 500) form the discovery set. We clustered the samples with k-means using a clusters/samples ratio of 1⁄3. For each one of the 101 classes, we discovered discriminative superpixels by letting their algorithm iterate at most 10 times and trained each SVM on the top 10 members.


Method                        Avg. Acc. [%]
Global
  BoW [78]                            28.51
  IFV [107]                           38.88
  CNN [74]                            56.40
Local
  RF [10]                             32.72
  RCF [94]                            28.46
  MLDS (≈ [111])                      42.63
  RFDC (this work)                    50.76

Table 5.3: Classification performance measured for the evaluated methods. All component mining approaches use 20 components per class.

For selecting the 20 components per class, we used the discriminativeness measure as in [111]. Again, we use the classification procedure of Section 5.4.4. This comparison will demonstrate the benefit of RF component mining.

Convolutional Neural Networks (CNN).

We also compare our approach with convolutional neural networks. To this end, we train a deep CNN on our data set using the architecture of [74] as provided by the Caffe [65] library until it converged (450 000 iterations).

Quantitative Results.

We report in Table 5.3 the classification accuracies obtained by the different methods discussed above on the Food-101 data set. Among global classifiers, Improved Fisher Vectors significantly outperform the standard BoW approach by 10 %.

Switching to local classification is beneficial for the Food-101 data set. The MLDS approach [111] using strong features on superpixels already gives an improvement of 3.75 % with respect to IFV. Looking at the results of Random Forests, we first observe that using them directly for classification performs similarly to BoW (about 33 % accuracy).


Figure 5.6: Recall@K [%] for the different algorithms: IFV [107], MLDS (≈ [111]), RFDC (this work) and CNN.

The bagging of the random trees is not able to recover from the potentially noisy leaves. Also Randomized Clustering Forests perform at a similar accuracy level. As the number of samples is very limited, the intermediate binary representation is probably too sparse. When using the discriminative component mining together with multi-class SVM classification, we measure an accuracy of 50.76 %, an improvement of 8.13 % and 11.88 % compared to MLDS and IFV, respectively. Also on this data set, CNNs set the state of the art and RFDC is outperformed by a margin of 5.64 %. This is paid for by a considerably longer training time of six days on an NVIDIA Tesla K20X.

In Figure 5.6, we present results in terms of Recall@K, which is the fraction of test images where the true class is among the top K scores. As can be seen, our method consistently outperforms IFV and MLDS for a wide range of K.
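For reference, Recall@K can be computed as in the following short sketch (array shapes and values are toy examples):

```python
# Sketch of the Recall@K measure used in Figure 5.6: the fraction of test
# images whose true class is among the K highest-scoring classes.
import numpy as np


def recall_at_k(scores, true_labels, k):
    """scores: (N, C) classifier scores, true_labels: (N,) ground-truth ids."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == np.asarray(true_labels)[:, None]).any(axis=1)
    return hits.mean()


# Toy example: 5 test images, 101 classes.
scores = np.random.rand(5, 101)
labels = [3, 10, 42, 7, 99]
print(recall_at_k(scores, labels, k=5))
```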

We also show classification accuracies for all classes in Figure 5.7. We obtain significant improvements for all classes compared to MLDS. Compared to the IFV baseline, our method outperforms this approach for 89 classes and performs worse for 11 classes. Even compared to CNN, the RFDC approach performs better in 25 classes, while being outperformed in 74 cases.


Figure 5.7: Average accuracies for all the classes in the data set, sorted by decreasing RFDC performance. The improvements for single classes are visibly more pronounced than the degradations compared to IFV and MLDS.


Qualitative Results.

In Figs. 5.1 and 5.8 we show a few examples of classes and their corresponding mined components. Note how the algorithm is able to find subtle visual components like the fruit compote for the Cheese cake, single dumplings, or the strawberries of the Strawberry shortcake. For other classes, the discriminative visual components show more distinct textures, like in the case of Spaghetti carbonara, Fried rice or meat texture.

An interesting visualization is also possible thanks to superpixels. For each class, one can aggregate the component scores and therefore observe which regions in the images are responsible for the final classification. We illustrate such heat maps in Figures 5.9 and 5.10. Again, we observe a strong correlation between the most confident regions and the actual distinctive elements of each dish. Confusions are often due to visual similarity (onion rings vs. french fries, carrot cake vs. chocolate cake), clutter (prime rib vs. spring roll) or ambiguous examples (steak vs. risotto).


Figure 5.8: Examples of discovered components. For each row, an example for the particular dish and examples of discovered components are shown. From top left to bottom right: cheesecake, spaghetti carbonara, strawberry shortcake, bibimbap, beef carpaccio, pho, prime rib, sashimi, dumplings, fried rice, seaweed salad and pizza.


Figure 5.9: Examples of the final output of our method for correctly classified images. We show the confidence heat map of the true class and of the second most confident class.


Figure 5.10: Examples of the final output of our method for misclassified examples. The confidence map of the wrongly predicted class and of the true class are shown.


5.5.4 Results on MIT-Indoor

For running the experiments on the MIT-Indoor data set, we use the same settings as for Food-101, except that we sample 100 000 samples per bag. Additionally, we horizontally flip the images in the training set to generate a higher number of samples. For conducting the experiments, we follow the original protocol of [105] with approximately 80 training and 20 testing images per class (restricted train set). As this is a rather low number of training examples, we also report the performance on the original test set, but with training on all available training images (full train set).

As summarized in Table 5.4, using 50 components per class our method yields 54.40 % and 58.36 % average accuracy for the restricted and full training set, respectively. While our approach does not match [35] on the restricted train set, the gap gets considerably smaller when training on the full train set. While [35] achieves their impressive results with 200 components per class, HOG features and multi-scale sliding window detectors, our method evaluates only 50 components on typically 30 superpixels per image. It is also worth mentioning that [35] reports 2 000 CPU hours for the training on MIT-Indoor in their rebuttal,² while our full pipeline uses around 250 CPU hours (including vocabulary training, network i/o etc.) with many parallelizable tasks (segmentation, feature extraction and encoding, training of single trees). Approximately 55 % of the time is spent on training the forest, 15 % on training the component models and 20 % on the training of the final classifier.

Compared to other recent approaches, RFDC significantly outperforms [111] as well as all the other very recent sliding window methods of [69, 81, 128, 117]. Note that some of them train their components on external data [81] or have a higher number of components (e.g., Sun and Ponce [117] use 73 components per class). Clearly, one of the reasons for the achieved performance is the use of stronger features. On the other hand, stronger features can be used here only because our approach needs to evaluate only a small number of superpixels compared to thousands of sliding windows.

2 http://papers.nips.cc/paper/5202-/


Method                              Avg. Acc. [%]
Part based
  HOG Patches [111]                       38.10
  BoP [69]                                46.10
  mi-SVM [81]                             46.40
  MMDL [128]                              50.15
  D-Parts [117]                           51.40
  DMS [35]                                64.03
Part based (this work)
  RFDC (restricted train set)             54.40
  RFDC (full train set)                   58.36
Global or mixed
  IFV [69]                                60.77
  IFV + BoP [69]                          63.10
  IFV + DMS [35]                          66.87

Table 5.4: Recent results of discriminative part mining approaches and global approaches on the MIT-Indoor data set. Our method outperforms all other recent methods except for [35].

Still, the full classification time (including feature extraction and Fisher encoding) of one image is around 0.8 seconds using 8 cores, where 70 % of the time is spent on Fisher encoding and 25 % on evaluating the part models.

Interestingly, most previously proposed part-based classification approaches based on sliding windows (or patches) and HOG features typically did not outperform IFV on other data sets until very recently [35]. Our Food-101 data set (where RFDC outperforms IFV) therefore presents a bias significantly different from available sets, highlighting its interest as a novel benchmark.

5.6 conclusion

In this chapter, we have introduced a novel large-scale benchmark data set for the recognition of food.


We have also presented a novel method based on Random Forests to mine discriminative visual components and use them for efficient classification. We have shown it to outperform state-of-the-art methods on food recognition, except for Convolutional Neural Networks (CNN), and to obtain competitive results compared to alternative recent part-based classification approaches on the challenging MIT-Indoor data set.


6 conclusions &amp; outlooks

This thesis presented three applications in the context of large scale learning for automatic image organization of consumer photos: particularly the recognition and description of upper body apparel, the recognition of events in social photo collections and the recognition of food images. In the context of these applications, weakly labeled data is used to learn discriminative sub-structures to improve classification performance, or a mass of such data is exploited to train a more robust learner.

In this chapter, a summary of the specific contributions is given in Section 6.1, while directions of future research in these three areas are laid out in Section 6.2.

6.1 contributions

This thesis makes contributions in three fields of large scale learning for automatic image organization by studying three methods for improving classification with the help of not fully labeled or noisy data. By releasing challenging data sets, we hope to foster broader interest of the computer vision community in these particular problems.

Upper Body Apparel Recognition

In Chapter 3, we presented a complete pipeline for recognizing and describing upper body apparel and proposed a benchmark data set. The pipeline combines different state-of-the-art building blocks: Using an upper body detector, 78 attributes are recognized in the detected image regions with Support Vector Machine classifiers trained on various feature channels. To recognize the 15 defined types of upper body apparel, a Random Forest with SVMs as decision functions is evaluated on the upper body image region.


To robustly classify upper body apparel, the large variety of different clothing types and the large variance in appearance call for a large data set. Assembling such large data sets in adequate quality is generally very tedious and expensive. For this purpose, we presented an extension to Random Forests for transfer learning from weakly labeled and possibly noisy data from different domains. Such data can be cheaply acquired with the help of image search engines. The proposed extension has been shown to improve classification performance.

Event Recognition in Personal Photo Collections

In Chapter 4, we curated a novel data set consisting of 807 real world photo collections annotated with 14 classes, with a total of 61 000 images with complete EXIF meta-information. Photos in such collections are typically sparsely sampled over time but often come in bursts. To this end, we proposed the Stopwatch Hidden Markov Model (SHMM), which interprets photo collections as sequential data and models them as sequences of latent sub-events. To account for the non-uniform sampling in the time domain, the transition probabilities between sub-events are modeled as a function of the time gap between consecutive images.

Sub-events are jointly and automatically learned together with the SHMM's parameters without requiring any further supervision. The SHMM integrates different multi-modal cues, including low-level visual features as well as temporal information and higher level visual information such as visual scene and human attributes. The experimental evaluation has shown that the proposed model outperforms approaches based only on feature pooling or a classical Hidden Markov Model.

Recognition of Food Images

In Chapter 5 we presented a novel method for discriminative part mining based on a Random Forest and used the mined parts for efficient classification of food images.


Since Random Forests are inherently multi-class learners, the proposed method mines for discriminative components jointly for all classes and allows for sharing knowledge between classes while mining. Instead of randomly sampling image patches, as is common to other recent part mining methods, the introduced method only considers image regions aligned with superpixels. As a consequence, the drastically reduced number of regions allows for the extraction of more expensive and descriptive features. Additionally, this also greatly reduces the computational complexity of classifying images at test time.

To establish a solid benchmark for food recognition, we introduced a novel, large-scale and challenging data set consisting of 101 food categories with in total 101 000 images, which is coined Food-101. The presented method outperforms other state-of-the-art methods on Food-101 and obtains competitive results on the challenging MIT-Indoor data set compared to alternative part-based classification approaches.

6.2 perspectives

In this section we elaborate on future directions of research in the context of the different applications.

Apparel Recognition

clothing parsing. An obvious direction of future research is to recognize and describe all worn garments on the full body. This problem is often referred to as clothing parsing and was recently addressed by Yamaguchi, Kiapour, and Berg [130]. This problem is heavily tied to human pose estimation: All approaches currently use a human pose estimator to segment the image and recognize the worn clothing independently. Since apparel strongly defines the appearance of a human, we believe that pose estimation and clothing parsing should be addressed jointly.

apparel taxonomy. Another future direction is to incorporate the semantic hierarchy of apparel into recognition. Currently, each apparel type is recognized separately without exploiting that a polo shirt is semantically more similar to a t-shirt than to a cardigan or a pair of jeans.


The same applies to visual attributes, which are often classified independently from the type of apparel. Including information on semantic relationships into classification would allow us to share knowledge between semantically similar classes, ultimately making classification more robust.

fine grained fashion recognition. A more fine grained type and attribute classification opens the road to exciting applications. Being able to describe and recognize subtle details would yield the recognition of different clothing styles and looks and ultimately enable numerous applications. Styles could automatically be learned from online sources such as fashion blogs, celebrity pictures or even movie pictures, which would enable novel recommendation systems.

apparel retrieval. Another area of research is the retrieval of similar looking fashion items given an image. In combination with the fine grained recognition, the same or similar looking fashion products could be retrieved from a larger repository, e.g., from an online fashion shop. Applications for such a system are numerous, ranging from e-commerce to broader analytics. By analyzing different fashion brands, fashion styles could automatically be related, ultimately generating a comprehensive fashion graph. Such a graph would allow for novel visual style recommendation and would surely have its application in fashion commerce.

Event Recognition in Personal Photo Collections

semantic knowledge. While many events such as a day at the beach or a hiking trip can be distinguished by the context they take place in, differences for social events like a birthday or a party are more subtle. The difference between a party and a birthday is on a semantic level and often only evidenced by the occurrence of a birthday cake (if at all) or the social roles of the different participants (e.g., the birthday person is the center of attention). Clearly, integrating richer semantic knowledge into the model (e.g., person recognition) is needed to robustly recognize such kinds of events.


event composition and hierarchies. Another direction of future work is to include composition of events into the model, since events are often composed of other events. For example, summer holidays in Italy might include days at the beach, hiking trips, dinners, sightseeing trips, etc. Also, when speaking of events, modeling semantic hierarchies of events would be useful for recognition performance. However, training such rich models would require a much larger training set, given the big variance of particular event groups.

automatic summarization. From a user perspective, recognition of events is of limited value without summarization. While this area is more popular for videos in the computer vision community, the summarization of photo collections, automatic identification of key persons etc. would be a nice feature of the next generation's photo management software.

collaborative collections. Photo collections are more and more shared online. Although a few companies and startups already try to provide solutions for collaborative photo collections, their product offerings focus mostly on simple sharing and aggregation. A deeper understanding of images through computer vision would help to address the filtering, summarization and organization aspects of such a product, e.g., allowing us to treat and summarize a rock concert differently from a child's birthday party.

Food Image Recognition

food parsing. The method presented in Chapter 5 labels each image with one label. While this is justified to some extent since the label identifies the main dish on the image (e.g., hamburger), multiple labels are required for a more accurate description (e.g., hamburger with french fries). As visible in Figures 5.9 and 5.10, our classification model already provides a rich structure, which could allow for such more detailed food parsing, ultimately yielding a superpixel level annotation.


nutritional value estimation. Accurate food parsing in combination with a few heuristics would allow for estimating the volume of food and thus for automatic estimation of the nutritional value of a pictured dish. Applications for this from a consumer perspective would be numerous, as evidenced by the number of smart phone applications which achieve the same functionality with the help of human annotators.

food taxonomy. Currently, all classes are assumed to be independent at test time. Including possibly hierarchical relationships between classes would allow for more robust classifiers that could, e.g., recognize spaghetti bolognese and carbonara as sub-classes of spaghetti. Consequently, such a step is needed to allow for fine grained recognition of different dishes.

diversity of cuisines. Currently the defined 101 classes are specific to a rather small subset of European and American dishes, since they dominated the image repository used for selecting the data set. While compiling the Food-101 data set it became apparent that this number of classes is not enough to adequately represent western cuisine in its entirety, let alone all the different kinds of African, Asian, Indian and South American cuisines. Clearly, a better representation of the different cuisines would increase the impact of the data set and also open new problems.


bibliography

[1] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. “Freak: Fast retina keypoint.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2012, pp. 510–517 (cit. on p. 10).

[2] Yali Amit and Donald Geman. “Shape Quantization And Recognition With Randomized Trees.” In: Neural Computation 9.7 (1997), pp. 1545–1588 (cit. on p. 26).

[3] Relja Arandjelovic and Andrew Zisserman. “All About VLAD.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 1578–1585 (cit. on p. 17).

[4] Relja Arandjelovic and Andrew Zisserman. “Three things everyone should know to improve object retrieval.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2012, pp. 2911–2918 (cit. on pp. 20, 92).

[5] Kobus Barnard, Pinar Duygulu, David A. Forsyth, Nando de Freitas, David M. Blei, and Michael I. Jordan. “Matching Words and Pictures.” In: Journal of Machine Learning Research 3 (2003), pp. 1107–1135 (cit. on p. 55).

[6] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. “Speeded-Up Robust Features (SURF).” In: Computer Vision and Image Understanding 110.3 (2008), pp. 346–359 (cit. on pp. 10, 34, 69, 92).

[7] Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik G. Learned-Miller, and David A. Forsyth. “Names and Faces in the News.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2004, pp. 848–854 (cit. on p. 55).

[8] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006 (cit. on p. 22).


[9] Mogens Bladt and Michael Sørensen. “Statistical inference for discretely observed Markov jump processes.” In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.3 (2005), pp. 395–410 (cit. on pp. 59, 64).

[10] Anna Bosch, Andrew Zisserman, and Xavier Munoz. “Image Classification using Random Forests and Ferns.” In: IEEE International Conference on Computer Vision. 2007, pp. 1–8 (cit. on pp. 83, 88, 96, 97).

[11] Lukas Bossard, Matthias Dantone, Christian Leistner, Christian Wengert, Till Quack, and Luc Van Gool. “Apparel Classification with Style.” In: Asian Conference on Computer Vision. Springer Berlin Heidelberg, 2012, pp. 321–335 (cit. on pp. 4, 29).

[12] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. “Event Recognition in Photo Collections with a Stopwatch HMM.” In: IEEE International Conference on Computer Vision. 2013, pp. 1193–1200 (cit. on pp. 4, 55).

[13] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. “Food-101 – Mining Discriminative Components with Random Forests.” In: European Conference on Computer Vision. Springer International Publishing, 2014, pp. 446–461 (cit. on pp. 4, 79).

[14] Y-Lan Boureau, Francis Bach, Yann LeCun, and Jean Ponce. “Learning mid-level features for recognition.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2010, pp. 2559–2566 (cit. on p. 11).

[15] Y-Lan Boureau, Jean Ponce, and Yann LeCun. “A Theoretical Analysis of Feature Pooling in Visual Recognition.” In: International Conference on Machine Learning. 2010, pp. 111–118 (cit. on pp. 11, 18).

[16] Matthew Boutell and Jiebo Luo. “Bayesian fusion of camera metadata cues in semantic scene classification.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2004, pp. 623–630 (cit. on p. 55).

[17] Leo Breiman. “Random Forests.” In: Machine Learning. 2001, pp. 5–32 (cit. on pp. 26, 27, 34, 35, 81, 83, 87).


[18] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Chapman &amp; Hall, 1984 (cit. on p. 25).

[19] Liangliang Cao, Jiebo Luo, and Thomas S. Huang. “Annotating photo collections by label propagation according to multiple similarity cues.” In: ACM International Conference on Multimedia. 2008, pp. 121–130 (cit. on p. 58).

[20] Liangliang Cao, Jiebo Luo, Henry Kautz, and Thomas S. Huang. “Image Annotation Within the Context of Personal Photo Collections Using Hierarchical Event and Scene Models.” In: IEEE Transactions on Multimedia 11.2 (Feb. 2009), pp. 208–219 (cit. on p. 58).

[21] Rich Caruana, Nikolaos Karampatziakis, and Ainur Yessenalina. “An empirical Evaluation of Supervised Learning Methods in High Dimensions.” In: International Conference on Machine Learning. 2008, pp. 96–103 (cit. on p. 35).

[22] Ken Chatfield, Victor S. Lempitsky, Andrea Vedaldi, and Andrew Zisserman. “The devil is in the details: an evaluation of recent feature encoding methods.” In: British Machine Vision Conference. 2011, pp. 1–12 (cit. on pp. 11, 13, 18).

[23] Hong Chen, Zi Jian Xu, Zi Qiang Liu, and Song Chun Zhu. “Composite Templates for Cloth Modeling and Sketching.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2006, pp. 943–950 (cit. on p. 31).

[24] Mei Chen, Kapil Dhingra, Wen Wu, Lei Yang, Rahul Sukthankar, and Jie Yang. “PFID: Pittsburgh Fast-Food Image Dataset.” In: International Conference on Image Processing. 2009, pp. 289–292 (cit. on pp. 81, 83).

[25] Mei-yun Chen, Yung-hsiang Yang, Chia-ju Ho, Shih-han Wang, Shane-ming Liu, Eugene Chang, Che-hua Yeh, and Ming Ouhyoung. “Automatic Chinese food identification and quantity estimation.” In: SIGGRAPH Asia 2012 Technical Briefs. 2012, pp. 1–4 (cit. on p. 83).


[26] Matthew Cooper, Jonathan Foote, Andreas Girgensohn, and Lynn Wilcox. “Temporal event clustering for digital photo collections.” In: ACM Transactions on Multimedia Computing, Communications, and Applications 1.3 (Aug. 2005), pp. 269–288 (cit. on pp. 57, 58).

[27] Antonio Criminisi, Jamie Shotton, and Ender Konukoglu. “Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning.” In: Foundations and Trends in Computer Graphics and Vision 7.2-3 (2012), pp. 81–227 (cit. on pp. 25, 36).

[28] Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. “Visual categorization with bags of keypoints.” In: Workshop on Statistical Learning in Computer Vision, ECCV. 2004, pp. 1–22 (cit. on pp. 11, 12).

[29] Navneet Dalal and Bill Triggs. “Histograms of Oriented Gradients for Human Detection.” In: IEEE Conference on Computer Vision and Pattern Recognition. Vol. 1. 2005, pp. 886–893 (cit. on pp. 9, 34).

[30] Matthias Dantone, Lukas Bossard, Till Quack, and Luc Van Gool. “Augmented Faces.” In: IEEE International Conference on Computer Vision Workshops. 2011, pp. 24–31 (cit. on p. 5).

[31] Matthias Dantone, Juergen Gall, Gabriele Fanelli, and Luc Van Gool. “Real-time facial feature detection using conditional regression forests.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2012 (cit. on p. 69).

[32] H. Daumé. “Frustratingly easy domain adaptation.” In: Annual meeting-association for computational linguistics. Vol. 45. 2007, p. 256 (cit. on p. 47).

[33] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. “ImageNet: A large-scale hierarchical image database.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255 (cit. on p. 39).


[34] Wei Di, Catherine Wah, Anurag Bhardwaj, Robinson Piramuthu, and Neel Sundaresan. “Style Finder: Fine-Grained Clothing Style Detection and Retrieval.” In: IEEE International Conference on Computer Vision Workshops. 2013, pp. 8–13 (cit. on p. 31).

[35] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. “Mid-level Visual Element Discovery as Discriminative Mode Seeking.” In: Advances in Neural Information Processing Systems. 2013 (cit. on pp. 81, 82, 87, 90, 104, 105).

[36] Jian Dong, Qiang Chen, Wei Xia, ZhongYang Huang, and Shuicheng Yan. “A Deformable Mixture Parsing Model with Parselets.” In: IEEE International Conference on Computer Vision. 2013, pp. 3408–3415 (cit. on p. 32).

[37] Marcin Eichner and Vittorio Ferrari. CALVIN Upper-body detector for detection in still images. url: http://www.vision.ee.ethz.ch/~calvin/calvin_upperbody_detector/ (cit. on p. 33).

[38] Ian Endres, Kevin Shih, Johnston Jiaa, and Derek Hoiem. “Learning Collections of Part Models for Object Recognition.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 939–946 (cit. on pp. 81, 82, 87, 90).

[39] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. “LIBLINEAR: A Library for Large Linear Classification.” In: Journal of Machine Learning Research 9 (2008) (cit. on p. 48).

[40] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. “Describing objects by their attributes.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2009 (cit. on p. 32).

[41] Li Fei-Fei and Pietro Perona. “A Bayesian hierarchical model for learning natural scene categories.” In: IEEE Conference on Computer Vision and Pattern Recognition. Vol. 2. 2005, pp. 524–531 (cit. on p. 9).


[42] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. “Object detection with discriminatively trained part-based models.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 32.9 (Sept. 2010) (cit. on p. 82).

[43] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. “Efficient Graph-Based Image Segmentation.” In: International Journal of Computer Vision 59.2 (Sept. 2004), pp. 167–181 (cit. on p. 92).

[44] Vittorio Ferrari and Andrew Zisserman. “Learning visual attributes.” In: Advances in Neural Information Processing Systems. 2008, pp. 433–440 (cit. on p. 32).

[45] Adrien Gaidon, Zaid Harchaoui, and Cordelia Schmid. “Recognizing activities with cluster-trees of tracklets.” In: British Machine Vision Conference. 2012 (cit. on p. 57).

[46] Juergen Gall, Angela Yao, Nima Razavi, Luc Van Gool, and Victor Lempitsky. “Hough forests for object detection, tracking, and action recognition.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 33.11 (2011), pp. 2188–2202 (cit. on p. 83).

[47] Andrew C. Gallagher and Tsuhan Chen. “Clothing cosegmentation for recognizing people.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2008 (cit. on pp. 31, 32).

[48] Stephan Gammeter, Lukas Bossard, Till Quack, and Luc Van Gool. “I Know What You Did Last Summer: Object-Level Auto-Annotation of holiday snaps.” In: IEEE International Conference on Computer Vision. 2009, pp. 614–621 (cit. on p. 4).

[49] Stephan Gammeter, Alexander Gassmann, Lukas Bossard, Till Quack, and Luc Van Gool. “Server-Side Object Recognition and Client-Side Object Tracking for Mobile Augmented Reality.” In: IEEE Computer Vision and Pattern Recognition Workshops. 2010, pp. 1–8 (cit. on p. 4).

[50] Jan van Gemert, Jan-Mark Geusebroek, Cor J. Veenman, and Arnold W. M. Smeulders. “Kernel Codebooks for Scene Categorization.” In: European Conference on Computer Vision. 2008, pp. 696–709 (cit. on p. 13).


[51] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and JitendraMalik. “Rich Feature Hierarchies for Accurate Object Detec-tion and Semantic Segmentation.” In: IEEE Conference on Com-puter Vision and Pattern Recognition. 2014, pp. 580–587 (cit. onp. 81).

[52] Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, andCordelia Schmid. “Face recognition from caption-based su-pervision.” In: International Journal of Computer Vision 96.1(Jan. 2012), pp. 64–82 (cit. on pp. 55, 58).

[53] Bharath Hariharan, Jitendra Malik, and Deva Ramanan. “Dis-criminative Decorrelation for Clustering and Classification.”In: European Conference on Computer Vision. 2012, pp. 459–472

(cit. on p. 82).

[54] Tin Kam Ho. “Random decision forests.” In: InternationalConference on Document Analysis and Recognition. Vol. 1. 1995,pp. 278–282 (cit. on pp. 26, 81, 83, 87).

[55] Hajime Hoashi, Taichi Joutou, and Keiji Yanai. “Image Recog-nition of 85 Food Categories by Feature Fusion.” In: IEEE In-ternational Symposium on Multimedia. 2010, pp. 296–301 (cit.on p. 81).

[56] Zhilan Hu, Hong Yan, and Xinggang Lin. “Clothing segmen-tation using foreground and background estimation basedon the constrained Delaunay triangulation.” In: Pattern Recog-nition 41.5 (May 2008), pp. 1581–1592 (cit. on pp. 31, 32).

[57] Hamid Izadinia and Mubarak Shah. “Recognizing Complex Events using Large Margin Joint Low-Level Event Model.” In: European Conference on Computer Vision. 2012, pp. 430–444 (cit. on p. 59).

[58] Tommi Jaakkola and David Haussler. “Exploiting Generative Models in Discriminative Classifiers.” In: Advances in Neural Information Processing Systems. 1998, pp. 487–493 (cit. on p. 15).

[59] Nataraj Jammalamadaka, Ayush Minocha, Digvijay Singh, and C.V. Jawahar. “Parsing Clothes in Unrestricted Images.” In: British Machine Vision Conference. 2013 (cit. on p. 32).

[60] Hervé Jégou and Ondrej Chum. “Negative Evidences and Co-occurences in Image Retrieval: The Benefit of PCA and Whitening.” In: European Conference on Computer Vision. 2012, pp. 774–787 (cit. on pp. 17, 21).

[61] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. “On the burstiness of visual elements.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 1169–1176 (cit. on p. 20).

[62] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. “Product Quantization for Nearest Neighbor Search.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 33.1 (Jan. 2011), pp. 117–128 (cit. on p. 20).

[63] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. “Aggregating local descriptors into a compact image representation.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2010, pp. 3304–3311 (cit. on pp. 16, 17).

[64] Hervé Jégou, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Pérez, and Cordelia Schmid. “Aggregating Local Image Descriptors into Compact Codes.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 34.9 (2012), pp. 1704–1716 (cit. on pp. 16, 20).

[65] Yangqing Jia. Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding. http://caffe.berkeleyvision.org/. 2013 (cit. on p. 97).

[66] Thorsten Joachims. “Transductive Inference for Text Classification using Support Vector Machines.” In: International Conference on Machine Learning. 1999, pp. 200–209 (cit. on p. 36).

[67] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. “Cutting-Plane Training of Structural SVMs.” In: Machine Learning 77.1 (2009), pp. 27–59 (cit. on p. 91).

[68] Taichi Joutou and Keiji Yanai. “A food image recognition system with Multiple Kernel Learning.” In: International Conference on Image Processing. 2009, pp. 285–288 (cit. on p. 83).

[69] Mayank Juneja, Andrea Vedaldi, C.V. Jawahar, and Andrew Zisserman. “Blocks That Shout: Distinctive Parts for Scene Classification.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 923–930 (cit. on pp. 81, 82, 87, 90, 96, 104, 105).

[70] Yannis Kalantidis, Lyndon Kennedy, and Li-Jia Li. “Getting the look: clothing recognition and segmentation for automatic product suggestions in everyday photos.” In: ACM International Conference on Multimedia Retrieval. 2013, pp. 105–112 (cit. on p. 31).

[71] Yoshiyuki Kawano and Keiji Yanai. “Real-Time Mobile Food Recognition System.” In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013, pp. 1–7 (cit. on p. 83).

[72] Davis E. King. “Dlib-ml: A Machine Learning Toolkit.” In: Journal of Machine Learning Research 10 (2009), pp. 1755–1758 (cit. on pp. 67, 91).

[73] Piotr Koniusz, Fei Yan, and Krystian Mikolajczyk. “Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection.” In: Computer Vision and Image Understanding 117.5 (2013), pp. 479–492 (cit. on pp. 11, 18).

[74] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105 (cit. on p. 97).

[75] Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar. “Attribute and simile classifiers for face verification.” In: IEEE International Conference on Computer Vision. 2009 (cit. on p. 32).

[76] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.” In: International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., 2001, pp. 282–289 (cit. on p. 63).

[77] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. “Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 951–958 (cit. on p. 37).

[78] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2006, pp. 2169–2178 (cit. on pp. 19, 34, 69, 82, 95, 97).

[79] Christian Leistner, Amir Saffari, Jakob Santner, and Horst Bischof. “Semi-Supervised Random Forests.” In: IEEE International Conference on Computer Vision. 2009, pp. 506–513 (cit. on p. 36).

[80] Li-Jia Li and Fei-Fei Li. “What, where and who? Classifying events by scene and object recognition.” In: IEEE International Conference on Computer Vision. 2007, pp. 1–8 (cit. on pp. 58, 69).

[81] Quannan Li, Jiajun Wu, and Zhuowen Tu. “Harvesting Mid-level Visual Concepts from Large-Scale Internet Images.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 851–858 (cit. on pp. 81, 82, 87, 90, 104, 105).

[82] Lingqiao Liu, Lei Wang, and Xinwang Liu. “In defense of soft-assignment coding.” In: IEEE International Conference on Computer Vision. 2011, pp. 2486–2493 (cit. on p. 13).

[83] Si Liu, Jiashi Feng, Csaba Domokos, Hui Xu, Junshi Huang, Zhenzhen Hu, and Shuicheng Yan. “Fashion Parsing With Weak Color-Category Labels.” In: IEEE Transactions on Multimedia 16.1 (2014), pp. 253–265 (cit. on p. 32).

[84] Si Liu, Zheng Song, Guangcan Liu, Changsheng Xu, Hanqing Lu, and Shuicheng Yan. “Street-to-Shop: Cross-Scenario Clothing Retrieval via Parts Alignment and Auxiliary Set.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2012, pp. 3330–3337 (cit. on p. 31).

[85] David G. Lowe. “Distinctive Image Features from Scale-Invariant Keypoints.” In: International Journal of Computer Vision 60.2 (Nov. 2004), pp. 91–110 (cit. on p. 10).

[86] Tomasz Malisiewicz, Abhinav Gupta, and Alexei A. Efros. “Ensemble of Exemplar-SVMs for Object Detection and Beyond.” In: IEEE International Conference on Computer Vision. 2011, pp. 89–96 (cit. on p. 89).

[87] Marcin Marszalek, Ivan Laptev, and Cordelia Schmid. “Actions in context.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 2929–2936 (cit. on p. 57).

[88] Corby K. Martin, John B. Correa, Hongmei Han, H. Raymond Allen, Jennifer C. Rood, Catherine M. Champagne, Bahadir K. Gunturk, and George A. Bray. “Validity of the Remote Food Photography Method (RFPM) for estimating energy and nutrient intake in near real-time.” In: Obesity 20.4 (2011), pp. 891–899 (cit. on p. 79).

[89] Yuji Matsuda, Hajime Hoashi, and Keiji Yanai. “Multiple-Food Recognition Considering Co-occurrence Employing Manifold Ranking.” In: International Conference on Pattern Recognition. 2012, pp. 2017–2020 (cit. on p. 83).

[90] Riccardo Mattivi, Jasper Uijlings, Francesco G.B. De Natale, and Nicu Sebe. “Exploitation of time constraints for (sub-)event recognition.” In: Joint ACM workshop on Modeling and representing events. 2011, pp. 7–12 (cit. on pp. 57, 58, 71).

[91] Julian McAuley and Jure Leskovec. “Image labeling on a network: using social-network metadata for image classification.” In: European Conference on Computer Vision. 2012, pp. 828–841 (cit. on p. 59).

[92] Mary Meeker. Internet Trends 2014. Kleiner Perkins Caufield & Byers. May 2014. url: http://www.kpcb.com/insights/2014-internet-trends (visited on 05/28/2014) (cit. on p. 1).

[93] Krystian Mikolajczyk and Cordelia Schmid. “A performance evaluation of local descriptors.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 27.10 (Oct. 2005), pp. 1615–1630 (cit. on p. 10).

[94] Frank Moosmann, Eric Nowak, and Frederic Jurie. “Randomized clustering forests for image classification.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 30.9 (2008), pp. 1632–1646 (cit. on pp. 83, 96, 97).

[95] Jon Noronha, Eric Hysen, Haoqi Zhang, and Krzysztof Z. Gajos. “Platemate: crowdsourcing nutritional analysis from food photographs.” In: ACM Symposium on UI Software and Technology. 2011, pp. 1–12 (cit. on p. 79).

[96] Sebastian Nowozin and Christoph H. Lampert. “Structured Learning and Prediction in Computer Vision.” In: Foundations and Trends® in Computer Graphics and Vision 6.3-4 (2011), pp. 185–365 (cit. on p. 62).

[97] Tim Ojala, M. Pietikainen, and David Harwood. “Performance evaluation of texture measures with classification based on Kullback discrimination of distributions.” In: International Conference on Pattern Recognition. 1994, pp. 582–585 (cit. on p. 34).

[98] Aude Oliva and Antonio Torralba. “Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope.” In: International Journal of Computer Vision 42.3 (May 2001), pp. 145–175 (cit. on p. 9).

[99] Paul Over, George Awad, Jonathan Fiscus, Brian Antonishek, Alan F. Smeaton, Wessel Kraaij, and Georges Quenot. “TRECVID 2010 – An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics.” In: Proceedings of TRECVID 2010. National Institute of Standards and Technology, 2011 (cit. on p. 57).

[100] Sinno Jialin Pan and Qiang Yang. “A Survey on Transfer Learning.” In: IEEE Transactions on Knowledge and Data Engineering 22.10 (2010), pp. 1345–1359 (cit. on p. 37).

[101] Florent Perronnin and Christopher R. Dance. “Fisher Kernels on Visual Vocabularies for Image Categorization.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2007 (cit. on p. 15).

[102] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. “Improving the Fisher Kernel for Large-Scale Image Classification.” In: European Conference on Computer Vision. 2010, pp. 143–156 (cit. on pp. 16, 20).

[103] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. “Object retrieval with large vocabularies and fast spatial matching.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2007 (cit. on p. 55).

[104] Danfeng Qin, Stephan Gammeter, Lukas Bossard, Till Quack, and Luc Van Gool. “Hello Neighbor: Accurate Object Retrieval with K-Reciprocal Nearest Neighbors.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2011, pp. 777–784 (cit. on p. 4).

[105] Ariadna Quattoni and Antonio Torralba. “Recognizing indoor scenes.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 413–420 (cit. on pp. 69, 82, 92, 104).

[106] Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. “Hidden conditional random fields.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 29.10 (2007), pp. 1848–1852 (cit. on p. 63).

[107] Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob J. Verbeek. “Image Classification with the Fisher Vector: Theory and Practice.” In: International Journal of Computer Vision 105.3 (2013), pp. 222–245 (cit. on pp. 11, 15, 16, 21, 81, 82, 92, 96–98).

[108] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. “A Generalized Representer Theorem.” In: Computational Learning Theory. Vol. 2111. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2001, pp. 416–426 (cit. on p. 24).

[109] Eli Shechtman and Michal Irani. “Matching Local Self-Similarities across Images and Videos.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2007 (cit. on p. 34).

[110] Jamie Shotton, Matthew Johnson, and Roberto Cipolla. “Semantic texton forests for image categorization and segmentation.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2008 (cit. on p. 83).

[111] Saurabh Singh, Abhinav Gupta, and Alexei A. Efros. “Unsupervised Discovery of Mid-Level Discriminative Patches.” In: European Conference on Computer Vision. 2012, pp. 73–86 (cit. on pp. 81, 82, 87, 90, 93, 96–98, 104, 105).

[112] Josef Sivic and Andrew Zisserman. “Video Google: A Text Retrieval Approach to Object Matching in Videos.” In: IEEE International Conference on Computer Vision. 2003, pp. 1470–1477 (cit. on pp. 9, 12).

[113] Zheng Song, Meng Wang, Xian-sheng Hua, and Shuicheng Yan. “Predicting occupation via human clothing and contexts.” In: IEEE International Conference on Computer Vision. 2011, pp. 1084–1091 (cit. on p. 31).

[114] Alexander Sorokin and David Forsyth. “Utility data annotation with Amazon Mechanical Turk.” In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. 2008, pp. 1–8 (cit. on p. 36).

[115] Michael Stark, Michael Goesele, and Bernt Schiele. “A Shape-Based Object Class Model for Knowledge Transfer.” In: IEEE International Conference on Computer Vision. 2009, pp. 373–380 (cit. on p. 37).

[116] Peter Kontschieder, Samuel Rota Bulò, Horst Bischof, and Marcello Pelillo. “Structured class-labels in random forests for semantic image labelling.” In: IEEE International Conference on Computer Vision. 2011, pp. 2190–2197 (cit. on p. 83).

[117] Jian Sun and Jean Ponce. “Learning Discriminative Part Detectors for Image Classification and Cosegmentation.” In: IEEE International Conference on Computer Vision. 2013, pp. 3400–3407 (cit. on pp. 81, 82, 87, 90, 104, 105).

[118] Kevin Tang, Li Fei-Fei, and Daphne Koller. “Learning latent temporal structure for complex event detection.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2012, pp. 1250–1257 (cit. on pp. 57, 59, 63).

[119] Engin Tola, Vincent Lepetit, and Pascal Fua. “DAISY: An Efficient Dense Descriptor Applied to Wide Baseline Stereo.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 32.5 (2010), pp. 815–830 (cit. on p. 10).

[120] Shen-Fu Tsai, Liangliang Cao, Feng Tang, and Thomas S. Huang. “Compositional object pattern: a new model for album event recognition.” In: ACM International Conference on Multimedia. 2011, pp. 1361–1364 (cit. on p. 57).

[121] Tinne Tuytelaars, Christoph Lampert, Matthew Blaschko, and Wray Buntine. “Unsupervised Object Discovery: A Comparison.” In: International Journal of Computer Vision 88.2 (2009), pp. 284–302 (cit. on p. 58).

[122] Jasper R. R. Uijlings, Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders. “Selective Search for Object Recognition.” In: International Journal of Computer Vision 104.2 (2013), pp. 154–171 (cit. on p. 81).

[123] Andrea Vedaldi and Brian Fulkerson. “Vlfeat: an open and portable library of computer vision algorithms.” In: ACM International Conference on Multimedia. 2010, pp. 1469–1472. url: http://www.vlfeat.org/ (cit. on p. 92).

[124] Andrea Vedaldi and Andrew Zisserman. “Efficient Additive Kernels via Explicit Feature Maps.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 34.3 (2011) (cit. on pp. 20, 25).

[125] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. “Locality-constrained linear coding for image classification.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2010, pp. 3360–3367 (cit. on p. 14).

[126] Nan Wang and Haizhou Ai. “Who Blocks Who: Simultaneous clothing segmentation for grouping images.” In: IEEE International Conference on Computer Vision. 2011, pp. 1535–1542 (cit. on p. 31).

[127] Xianwang Wang and Tong Zhang. “Clothes search in consumer photos via color matching and attribute learning.” In: ACM International Conference on Multimedia. 2011, pp. 1353–1356 (cit. on pp. 31, 32).

[128] Xinggang Wang, Baoyuan Wang, Xiang Bai, Wenyu Liu, and Zhuowen Tu. “Max-margin multiple-instance dictionary learning.” In: Advances in Neural Information Processing Systems. 2013, pp. 846–854 (cit. on pp. 81, 82, 87, 90, 104, 105).

[129] John M. Winn, Antonio Criminisi, and Thomas P. Minka. “Object Categorization by Learned Universal Visual Dictionary.” In: IEEE International Conference on Computer Vision. 2005, pp. 1800–1807 (cit. on p. 20).

[130] Kota Yamaguchi, M. Hadi Kiapour, and Tamara L. Berg. “Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items.” In: IEEE International Conference on Computer Vision. 2013, pp. 3519–3526 (cit. on pp. 32, 109).

[131] Kota Yamaguchi, M. Hadi Kiapour, Luis Ortiz, and Tamara L. Berg. “Parsing Clothing in Fashion Photographs.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2012, pp. 3570–3577 (cit. on p. 31).

[132] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. “Linear spatial pyramid matching using sparse coding for image classification.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 1794–1801 (cit. on p. 13).

[133] Shulin Lynn Yang, Mei Chen, Dean Pomerleau, and Rahul Sukthankar. “Food recognition using statistics of pairwise local features.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2010, pp. 2249–2256 (cit. on pp. 81, 83).

[134] Wei Yang, Ping Luo, and Liang Lin. “Clothing Co-Parsing by Joint Image Segmentation and Labeling.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 3182–3189 (cit. on p. 32).

[135] Bangpeng Yao, Aditya Khosla, and Li Fei-Fei. “Combining randomization and discrimination for fine-grained image categorization.” In: IEEE Conference on Computer Vision and Pattern Recognition. 2011, pp. 1577–1584 (cit. on pp. 35, 45, 81–83, 90).

[136] Chun-Nam John Yu and Thorsten Joachims. “Learning Structural SVMs with Latent Variables.” In: International Conference on Machine Learning. 2009, pp. 1169–1176 (cit. on pp. 62, 65, 67).

[137] Junsong Yuan, Jiebo Luo, Henry Kautz, and Ying Wu. “Mining GPS traces and visual words for event classification.” In: ACM International Conference on Multimedia Information Retrieval. 2008, pp. 2–9 (cit. on pp. 55, 57, 58).

[138] Alan L. Yuille and Anand Rangarajan. “The Concave-Convex Procedure.” In: Neural Computation 15.4 (Apr. 2003), pp. 915–936 (cit. on p. 67).

[139] Xi Zhou, Kai Yu, Tong Zhang, and Thomas S. Huang. “Image classification using super-vector coding of local image descriptors.” In: European Conference on Computer Vision. 2010 (cit. on pp. 17, 82).


ACRONYMS

BoW Bag of Words

CCCP Concave-Convex Procedure

CNN Convolutional Neural Network

CRF Conditional Random Field

DPM Deformable Part-based Models

EM Expectation-Maximization

GLOH Gradient Location and Orientation Histogram

GMM Gaussian Mixture Model

HOG Histogram of Oriented Gradients

HMM Hidden Markov Model

IFV Improved Fisher Vectors

LLC Locality-constrained Linear Coding

LBP Local Binary Pattern

PCA Principal Component Analysis

PQ Product Quantization

RF Random Forest

RFDC Random Forest Discriminative Components

SC Sparse Coding

SIFT Scale-Invariant Feature Transform

SHMM Stopwatch Hidden Markov Model

SPM Spatial Pyramid Matching

SSD Self-Similarity Descriptor

SURF Speeded Up Robust Features

SVM Support Vector Machine

TSVM Transductive Support Vector Machine

mi-SVM multi-instance Support Vector Machine

TL Transfer Learning

VLAD Vector of Locally Aggregated Descriptors
