Learning Deep Features for Scene Recognition using Places Database
Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, Aude Oliva (NIPS 2014)
Bora Çelikkale
INTRODUCTION
Human Visual Recognition
Samples the world several times / sec
~millions of images within a year
INTRODUCTION
Primate Brain
Hierarchical organization in layers of increasing processing complexity
Inspired CNNs
PROBLEM & MOTIVATION
Object classification has achieved astonishing performance with large databases (ImageNet)
Iconic images do not contain the richness and diversity of visual info in scenes
CONTRIBUTIONS
Scene-centric database 60x larger than SUN
Comparison metrics for scene datasets: Density, Diversity
SCENE DATASETS
Scene15 (Lazebnik et al. 2006)
15 categories
~3000 imgs
MIT Indoor67 (Quattoni & Torralba 2009)
67 categories of indoor places
15,620 imgs
SUN (Xiao et al. 2010)
397 (well-sampled) categories
130,519 imgs
Places (Zhou et al. 2014)
476 categories
7,076,580 imgs
PLACES DATASET
Same categories from SUN
696 popular adjectives in English
Google Images
Bing Images
Flickr
1. >40M imgs are downloaded
PLACES DATASET
2. PCA-based duplicate removal across SUN
Places & SUN have different images
Allows combining Places & SUN
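The slides do not spell out the removal procedure; a minimal sketch, assuming each image is summarized by a feature vector, could project both datasets' features with PCA and flag cross-dataset pairs that fall within a small distance threshold (function name, component count, and threshold here are illustrative, not the paper's):

```python
import numpy as np

def near_duplicate_pairs(feats_a, feats_b, n_components=2, threshold=0.5):
    """Flag (i, j) pairs where feats_b[j] is a near-duplicate of feats_a[i].

    Both inputs are (n_images, dim) feature matrices. PCA is fit on their
    union via SVD; pairs closer than `threshold` in the projected space
    are reported. All hyperparameters are illustrative.
    """
    X = np.vstack([feats_a, feats_b]).astype(float)
    mean = X.mean(axis=0)
    # PCA via SVD of the centered data; rows of vt are principal axes
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    proj = vt[:n_components].T
    pa = (feats_a - mean) @ proj
    pb = (feats_b - mean) @ proj
    # all pairwise distances between the two projected sets
    dists = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=2)
    i, j = np.where(dists < threshold)
    return list(zip(i.tolist(), j.tolist()))
```

Pairs returned by this sketch would be dropped from one of the two sets, leaving Places and SUN disjoint.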
PLACES DATASET
3. Annotations (with AMT)
Questions (e.g., is this a living room?)
Two-round setup:
1. Default answer is NO
2. Default answer is YES
Imgs shown / round: 750 + 60 from SUN for control
Keep only workers with >90% accuracy on the control images
COMPARISON METRICS
Relative Density
Images have more similar neighbors
[Figure: nearest neighbors of example images a1 and b1]
COMPARISON METRICS
Relative Diversity
Simpson index: probability that two randomly drawn individuals belong to the same species
[Figure: nearest neighbors of a1 and of b1]
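The Simpson index behind the relative-diversity measure is simple to compute; a minimal sketch (with-replacement variant, not necessarily the paper's exact estimator):

```python
from collections import Counter

def simpson_index(labels):
    """Probability that two individuals drawn at random (with replacement)
    belong to the same species; 1 - D then serves as a diversity score."""
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) ** 2 for c in counts.values())
```

A single-species population gives an index of 1 (zero diversity); the more evenly individuals spread over species, the lower the index and the higher the diversity.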
EXPERIMENTS
1. Density & Diversity Comparison (AMT)
Relative diversity vs. relative density for each category and dataset
Show 12 pairs of images
Workers select the most similar pair
Diversity: pairs are chosen at random from each db
Density: the 5th NN by GIST distance (to avoid near duplicates) is chosen as the pair
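The density probe can be sketched as a k-th nearest neighbor lookup over GIST descriptors (the descriptor computation itself is assumed to be given; k=5 as on the slide):

```python
import numpy as np

def kth_nearest_neighbor(descriptors, query_idx, k=5):
    """Index of the k-th nearest neighbor of descriptors[query_idx]
    (excluding the query itself), by Euclidean distance."""
    dists = np.linalg.norm(descriptors - descriptors[query_idx], axis=1)
    order = np.argsort(dists)  # order[0] is the query itself (distance 0)
    return int(order[k])
```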
EXPERIMENTS
2. Cross-Dataset Generalization
Training and testing across different datasets
ImageNet-CNN and linear SVM
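A minimal stand-in for that pipeline: assume fc7-style CNN features have already been extracted into a matrix X, and train a tiny hinge-loss linear classifier on them (a didactic replacement for an off-the-shelf linear SVM solver; all hyperparameters are illustrative):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM (hinge loss + L2 penalty) via full-batch gradient descent.

    X: (n, d) feature matrix (e.g. CNN activations), y: labels in {-1, +1}.
    Returns weights w and bias b; predict with sign(X @ w + b).
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1  # only samples inside the margin contribute
        gw = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        gb = -y[viol].sum() / n
        w -= lr * gw
        b -= lr * gb
    return w, b
```

Training on features from one dataset and testing on another then measures how well the representation generalizes across datasets.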
EXPERIMENTS
3. Comparison with Hand-designed Features
EXPERIMENTS
4. Training a CNN for Scene Recognition
2.5M imgs from 205 categories, on AlexNet
PLACES-CNNs
Hybrid-AlexNet
Places + ImageNet: 3.5M imgs, 1,183 categories
Accuracy = 0.5230 on validation set
Places205-GoogLeNet (on 205 categories)
Accuracy: top1 = 0.5567, top5 = 0.8541 on validation set
Places205-VGG16 (on 205 categories)
Accuracy: top1 = 0.5890, top5 = 0.8770 on validation set
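The top-1/top-5 numbers above are computed by checking whether the true label appears among the k highest-scoring classes; a minimal sketch:

```python
import numpy as np

def topk_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest scores.

    scores: (n_samples, n_classes) class scores, labels: (n_samples,) ints.
    """
    topk = np.argsort(-scores, axis=1)[:, :k]  # best k classes per sample
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))
```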
PLACES2 DATASET
400+ unique scene categories
>10M images
AlexNet top1 accuracy: 43.0%
VGG16 top1 accuracy: 47.6%
DEMO
http://places.csail.mit.edu/demo.html
http://places2.csail.mit.edu/demo.html
THANK YOU