1 Proximate Sensing: Inferring What-Is-Where From Georeferenced Photo Collections Daniel Leung and...


Proximate Sensing: Inferring What-Is-Where From Georeferenced Photo Collections

Daniel Leung and Shawn Newsam

Electrical Engineering & Computer Science

University of California at Merced

CVPR 2010, June 17th, 2010

Remote sensing: using overhead images of distant scenes to derive geographic information.

[Images: satellite image (Google Maps); National Land Cover Database (USGS)]

Proximate sensing: use ground-level images of close-by objects and scenes.

[Images: Land Cover Map 2000 (UK Centre for Ecology & Hydrology); community-contributed photos (Geograph Britain and Ireland project)]

Study area: 100x100 km region in southeastern UK (region TQ in National Grid).


Proximate Sensing

• We conjecture that the visual content of georeferenced images can be used to derive maps of what-is-where on the surface of the earth.

• Motivation:

– Such collections are becoming increasingly available, e.g. Flickr (100+ million geotagged images), Panoramio, Picasa, Geograph, TrekEarth.

– Derive geographic information not possible through other means, e.g. land-use classification.

– Exciting new application of CV that not only provides another context to apply/revisit standard techniques but stands to motivate novel problems.

Proximate Sensing: Context

• Volunteered Geographic Information (Wikipedia):

– VGI is the harnessing of tools to create, assemble, and disseminate geographic data provided voluntarily by individuals (Goodchild, 2007).

– Goodchild, M. 2007. Citizens as Sensors: The World of Volunteered Geography.

[Diagram relating proximate sensing, citizen science, and volunteered geographic information]

VGI: Flickr

• 103,679,986 geotagged items

• 2.8 million things geotagged this month

VGI: Geograph

• “The Geograph Britain and Ireland project aims to collect geographically representative photographs and information for every square kilometre of Great Britain and Ireland, and you can be part of it.”

• 9,973 users have contributed 1,897,042 images covering 255,904 grid squares, or 77.1% of the total.

“Railway bridge crossing R. Rother

This is now a dismantled railway, further east it becomes the Kent & East Sussex Railway.”

Objective

• Eventual goal is to use the visual content of georeferenced photos to produce land use/cover maps.

• Initial focus on simpler problem of binary classification into developed and undeveloped regions.

Related Work

• Other researchers have leveraged location information in georeferenced photo collections:

– To annotate novel images [Quack et al., CIVR 2008; Moxley et al., MIR 2008].

– To geolocate novel images [Hays and Efros, CVPR 2008].

– To organize the collections themselves [Crandall et al., WWW 2009].

• However, ours is the first work (to the best of our knowledge) to use the collections to infer what-is-where on the surface of the earth on a large scale.

Overview

[Pipeline diagram. Training branch: training images, label images, feature extraction, train classifier. Target branch: target images, feature extraction, classify target images, aggregate labels in 1x1 km tiles. Outputs: fraction developed map, binary classification map.]

Ground Truth (1)

• Land Cover Map 2000 (UK Centre for Ecology & Hydrology)

LCM AC 10: Oceanic Seas

LCM AC 8: Standing open water

LCM AC 4: Improved grassland

LCM AC 7: Built up areas and gardens

LCM AC 3: Arable and horticulture

LCM AC 1: Broad-leaved / mixed woodland

LCM AC 9: Coastal

LCM AC 2: Coniferous woodland

LCM AC 5: Semi-natural grass

LCM AC 6: Mountain, heath, bog

Ground Truth (2)

• Aggregate 10 land cover classes into 2 superclasses:

– Developed: LCM AC 7: Built up areas and gardens

– Undeveloped: other 9 classes

• Derive 2 ground truth maps:

– Fraction map: percent developed for each 1x1 km tile.

– Binary classification map: apply 50% threshold to fraction map.
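The two ground truth maps described above could be derived roughly as follows. This is only a sketch under stated assumptions: each tile is represented as a plain list of LCM aggregate class numbers for its land cover cells, and a tile at exactly 50% is counted as developed (the slides do not say which way ties go). The function names are hypothetical.

```python
# Hypothetical sketch: derive the ground truth fraction and binary maps for one tile.
DEVELOPED_CLASS = 7  # LCM AC 7: Built up areas and gardens


def fraction_developed(tile_classes):
    """Fraction of a tile's land cover cells that fall in the developed class."""
    if not tile_classes:
        raise ValueError("tile has no land cover cells")
    developed = sum(1 for c in tile_classes if c == DEVELOPED_CLASS)
    return developed / len(tile_classes)


def binary_label(fraction, threshold=0.5):
    """Return 1 (developed) or 0 (undeveloped).

    Assumption: a tile exactly at the threshold counts as developed.
    """
    return 1 if fraction >= threshold else 0
```

For example, a tile whose cells are `[7, 7, 3, 4]` is half developed and would be labelled developed under this tie rule.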

Ground truth fraction map indicating percent developed for each 1x1 km tile.

Ground truth binary classification map indicating tiles labelled as developed (white) or undeveloped (black).

Datasets (1)

• Downloaded 920K Flickr images for the TQ region.

• Distribution for 1x1 km tiles shown to left (log10 scale).

• 5,420 tiles contain no Flickr images.

• 4,580 tiles contain average of 200, median of 10, and maximum of 53,840 images.

[Figure: Flickr images per 1x1 km tile]
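Per-tile statistics like those on the dataset slides could be computed along the following lines. This is a sketch, not the authors' code. It assumes image positions are already given as National Grid eastings and northings in metres, and that the south-west corner of square TQ lies at easting 500,000 and northing 100,000. All names are hypothetical.

```python
import statistics
from collections import Counter

# Assumed origin (south-west corner) of National Grid square TQ, in metres.
TQ_ORIGIN_E = 500_000
TQ_ORIGIN_N = 100_000
TILE_SIZE_M = 1_000      # 1x1 km tiles
TILES_PER_SIDE = 100     # 100x100 km study area


def tile_index(easting, northing):
    """Return the (column, row) of the tile containing a point, or None if outside TQ."""
    col = int((easting - TQ_ORIGIN_E) // TILE_SIZE_M)
    row = int((northing - TQ_ORIGIN_N) // TILE_SIZE_M)
    if 0 <= col < TILES_PER_SIDE and 0 <= row < TILES_PER_SIDE:
        return (col, row)
    return None


def tile_count_summary(positions):
    """Summarise how many images fall in each tile.

    The mean, median and maximum are taken over occupied tiles only.
    """
    counts = Counter()
    for easting, northing in positions:
        idx = tile_index(easting, northing)
        if idx is not None:
            counts[idx] += 1
    occupied = list(counts.values())
    return {
        "empty_tiles": TILES_PER_SIDE ** 2 - len(occupied),
        "occupied_tiles": len(occupied),
        "mean": statistics.mean(occupied) if occupied else 0,
        "median": statistics.median(occupied) if occupied else 0,
        "max": max(occupied, default=0),
    }
```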

Datasets (2)

• Downloaded 120K images from the Geograph Britain and Ireland project

• Distribution for 1x1 km tiles shown to left (log10 scale).

• Only 614 tiles without images.

• 9,386 tiles contain average of 13, median of 5, and maximum of 1,458 images.

[Figure: Geograph images per 1x1 km tile]

Image Features

• Extract simple five-dimensional edge histogram features for each image.

• Motivated by the observation that images of developed scenes typically have a higher proportion of horizontal and vertical edges than images of undeveloped scenes.
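A five-bin edge histogram can be sketched as below. The slides do not say which five bins are used, so this example assumes the common choice of vertical, horizontal, 45-degree, 135-degree and non-directional edges, detected with 2x2 block filters. The threshold value and all names are hypothetical, and the image is assumed to be a grayscale list of rows with values in 0 to 255.

```python
import math

SQRT2 = math.sqrt(2.0)

# Assumed 2x2 filters, applied to (top-left, top-right, bottom-left, bottom-right).
EDGE_FILTERS = {
    "vertical": (1.0, -1.0, 1.0, -1.0),
    "horizontal": (1.0, 1.0, -1.0, -1.0),
    "diagonal_45": (SQRT2, 0.0, 0.0, -SQRT2),
    "diagonal_135": (0.0, SQRT2, -SQRT2, 0.0),
    "non_directional": (2.0, -2.0, -2.0, 2.0),
}
EDGE_TYPES = tuple(EDGE_FILTERS)


def edge_histogram(gray, threshold=11.0):
    """Return five values: the fraction of 2x2 blocks assigned to each edge type.

    A block is assigned to the edge type with the strongest filter response,
    provided that response reaches the (hypothetical) threshold.
    """
    rows, cols = len(gray), len(gray[0])
    counts = [0] * len(EDGE_TYPES)
    n_blocks = 0
    for r in range(0, rows - 1, 2):
        for c in range(0, cols - 1, 2):
            block = (gray[r][c], gray[r][c + 1], gray[r + 1][c], gray[r + 1][c + 1])
            responses = [
                abs(sum(w * p for w, p in zip(EDGE_FILTERS[name], block)))
                for name in EDGE_TYPES
            ]
            n_blocks += 1
            best = max(range(len(EDGE_TYPES)), key=responses.__getitem__)
            if responses[best] >= threshold:
                counts[best] += 1
    if n_blocks == 0:
        raise ValueError("image must be at least 2x2 pixels")
    return [count / n_blocks for count in counts]
```

Under these assumptions, an image of vertical stripes puts all of its blocks in the first bin, while a flat image yields an all-zero histogram.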

Image Classification

• Perform image-level binary classification:

– Developed.

– Undeveloped.

• SVM classifier with Gaussian RBF kernel, five-fold cross validation, and grid search for optimal parameter selection.
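The model selection step can be outlined in plain Python. The sketch below shows the Gaussian RBF kernel and a five-fold cross-validated grid search over two parameters, here assumed to be the SVM penalty `C` and the kernel width `gamma`. SVM training itself is left to a caller-supplied callback, `fit_and_score`, which is a hypothetical hook and not part of any library. Samples are assumed to be shuffled beforehand.

```python
import itertools
import math


def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices); fold i holds out samples i, i+k, i+2k, ..."""
    for fold in range(k):
        validation = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, validation


def grid_search(n_samples, fit_and_score, c_values, gamma_values, k=5):
    """Return ((C, gamma), mean_score) with the best mean validation score.

    fit_and_score(train_idx, val_idx, C, gamma) is a hypothetical callback that
    trains a classifier on the training fold and returns its validation accuracy.
    """
    best_params, best_score = None, -math.inf
    for c, gamma in itertools.product(c_values, gamma_values):
        scores = [
            fit_and_score(train, validation, c, gamma)
            for train, validation in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = (c, gamma), mean_score
    return best_params, best_score
```

In practice the callback would wrap an SVM library. Any callable with the same signature can be dropped in for experimentation.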

Experiments (1)

[Pipeline diagram. Training branch: training images, label images, feature extraction, train classifier. Target branch: target images, feature extraction, classify target images, aggregate labels in 1x1 km tiles. Outputs: fraction developed map, binary classification map.]

Experiments (2)

• Fraction developed map: the fraction of images classified as developed in each tile.

• Binary classification map: threshold applied to fraction map.

• Explore two types of thresholds:

– Fixed at 0.5.

– Adaptive so that 38.9% of the tiles are labelled as developed (this represents prior knowledge on the distribution of developed vs. undeveloped regions).
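The aggregation and the two thresholding schemes can be sketched as follows. This is an illustration only. It assumes per-image labels of 1 (developed) or 0 (undeveloped) grouped by tile, and it implements the adaptive threshold by ranking tiles and labelling the top share as developed, which is one possible reading of the slide. Function names are hypothetical.

```python
def fraction_map(image_labels_by_tile):
    """Map each tile to the fraction of its images classified as developed (label 1)."""
    return {
        tile: sum(labels) / len(labels)
        for tile, labels in image_labels_by_tile.items()
        if labels  # tiles without images get no prediction
    }


def fixed_threshold_map(fractions, threshold=0.5):
    """Label a tile developed (1) when its fraction reaches the fixed threshold."""
    return {tile: int(f >= threshold) for tile, f in fractions.items()}


def adaptive_threshold_map(fractions, developed_share=0.389):
    """Label the top `developed_share` of tiles, ranked by fraction, as developed.

    Assumption: ties at the cut-off are broken by the existing tile order.
    """
    n_developed = round(developed_share * len(fractions))
    ranked = sorted(fractions, key=fractions.get, reverse=True)
    developed = set(ranked[:n_developed])
    return {tile: int(tile in developed) for tile in fractions}
```

With the fixed threshold the number of developed tiles depends on the classifier's bias. With the adaptive one it is pinned to the prior share.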

Experiments (3)

• Results are qualitatively evaluated by visually comparing predicted maps with ground truth maps.

• Results are quantitatively evaluated using ground truth:

– Binary classification: number of tiles with same label.

– Fraction developed: correlation coefficient over tiles. Also, mean absolute difference (MAD) and root mean squared difference (RMSD).

• Quantitative results computed over 4,553 tiles for which there are both Flickr and Geograph images.

– 38.9% of these tiles are developed in the ground truth, so a chance binary classification rate of 61.1% is achievable by labelling all tiles as undeveloped.
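The quantitative measures listed above can be written out as a short sketch. It assumes predicted and ground truth values are given as equal-length lists aligned by tile, and that the correlation coefficient is Pearson's (the slides say only "correlation coefficient"). Function names are hypothetical.

```python
import math


def classification_rate(predicted, truth):
    """Fraction of tiles whose predicted binary label matches the ground truth."""
    matches = sum(1 for p, t in zip(predicted, truth) if p == t)
    return matches / len(truth)


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_difference(x, y):
    """MAD: mean of |x_i - y_i| over tiles."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def root_mean_squared_difference(x, y):
    """RMSD: square root of the mean of (x_i - y_i)^2 over tiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

The chance rate quoted on the slide follows directly from the class balance: labelling every tile undeveloped is correct for 100% - 38.9% = 61.1% of tiles.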

Experiments (4)

• Manual vs. weakly-supervised labelling of training set.

• Effect of photographer intent.

• Relative importance of training vs. target set.

• Filtering out non-informative images.

• Training set size.

• Training set quality.

Results—Manually Labelled Training Set (1)

• Training set contains 2,740 Flickr images which have been manually labelled as depicting a scene that is developed or undeveloped.

• Developed: containing constructed materials such as those used in houses, buildings, etc.

[Maps: ground truth maps; maps generated using Flickr images]

                                                      Binary Maps                                         Fraction Maps
Training Set     Target Set        Training Set Size  Overall Class. Rate (%)   Avg. Class. Rate (%)
                                   (frac. developed)  Fixed       Adaptive      Fixed       Adaptive      Corr.   MAD     RMSD
Manual (Flickr)  Flickr            2740 (0.51)        66.4        64.9          68.8        63            0.374   0.287   0.383

Fixed and Adaptive refer to the threshold type. Corr. is the correlation coefficient. The value in parentheses is the fraction of images labelled as developed in the training set.

• Performance is better than chance (61.1%)

Results—Weakly-Supervised Training (1)

• Labelled training set constructed in fully automated fashion:

– Select 2 images at random from tiles with 4 or more images.

– Label them with the majority label of the tile in the ground truth map.
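The automated construction of the weakly-labelled training set (two random images from each tile holding four or more, labelled with the tile's ground truth label) might look like the following sketch. The data layout is assumed: `images_by_tile` maps a tile to a list of image identifiers and `tile_truth` maps a tile to its binary ground truth label. Both names, and the fixed seed, are hypothetical.

```python
import random


def build_weak_training_set(images_by_tile, tile_truth, per_tile=2, min_images=4, seed=0):
    """Return (image, label) pairs sampled from sufficiently populated tiles.

    Each sampled image inherits the ground truth label of its tile, so
    individual labels may be wrong; that is what makes the supervision weak.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    training = []
    for tile in sorted(images_by_tile):
        images = images_by_tile[tile]
        if len(images) < min_images or tile not in tile_truth:
            continue
        for image in rng.sample(images, per_tile):
            training.append((image, tile_truth[tile]))
    return training
```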

Results—Weakly-Supervised Training (2)

                                                      Binary Maps                                         Fraction Maps
Training Set     Target Set        Training Set Size  Overall Class. Rate (%)   Avg. Class. Rate (%)
                                   (frac. developed)  Fixed       Adaptive      Fixed       Adaptive      Corr.   MAD     RMSD
Manual (Flickr)  Flickr            2740 (0.51)        66.4        64.9          68.8        63            0.374   0.287   0.383
Weakly (Flickr)  Flickr            5872 (0.52)        67.2        66.9          68.7        65.2          0.380   0.279   0.373

• Weakly-labelled training set outperforms manually-labelled one.

– Suggests training sets can be generated from regions for which maps exist and then used to train classifiers for mapping unmapped regions.

Results—Photographer Intent (1)

• Compare Flickr vs. Geograph results.

[Maps: ground truth maps; maps generated using Flickr images; maps generated using Geograph images]

                                                      Binary Maps                                         Fraction Maps
Training Set     Target Set        Training Set Size  Overall Class. Rate (%)   Avg. Class. Rate (%)
                                   (frac. developed)  Fixed       Adaptive      Fixed       Adaptive      Corr.   MAD     RMSD
Flickr           Flickr            5872 (0.52)        67.2        66.9          68.7        65.2          0.380   0.279   0.373
Geograph         Geograph          10576 (0.26)       68.2        74.0          60.8        72.6          0.520   0.271   0.358

• Photographer intent is a significant factor.

Results—Importance of Training vs. Target Set (1)

• Geograph training+target set outperforms Flickr training+target set.

• Investigate whether improvement is due to training or target set.

• Training and target sets from different collections.

                                                      Binary Maps                                         Fraction Maps
Training Set     Target Set        Training Set Size  Overall Class. Rate (%)   Avg. Class. Rate (%)
                                   (frac. developed)  Fixed       Adaptive      Fixed       Adaptive      Corr.   MAD     RMSD
Flickr_good      Flickr            5070 (0.49)        67.0        68.1          67.4        66.6          0.329   0.285   0.374
Geograph_good    Flickr            5603 (0.47)        60.7        68.3          53.8        66.6          0.330   0.294   0.381
Geograph_good    Geograph          5603 (0.47)        74.2        74.6          71.5        73.1          0.551   0.231   0.308
Flickr_good      Geograph          5070 (0.49)        69.9        73.1          71.5        71.7          0.496   0.254   0.331

• Photographer intent is more important for target than training set.

Results—Importance of Training vs. Target Set (2)

Results—Filtering Out Non-informative Images (1)

• Investigate whether removing images with faces improves results.

• Motivation: photographs of people are less likely to be geographically informative, especially close-in portraits.

Results—Filtering Out Non-informative Images (2)

                                                      Binary Maps                                         Fraction Maps
Training Set     Target Set        Training Set Size  Overall Class. Rate (%)   Avg. Class. Rate (%)
                                   (frac. developed)  Fixed       Adaptive      Fixed       Adaptive      Corr.   MAD     RMSD
Flickr           Flickr            5872 (0.52)        67.2        66.9          68.7        65.2          0.380   0.279   0.373
Flickr           Flickr no faces   5872 (0.52)        66.8        66.7          66.8        64.2          0.367   0.301   0.414
Geograph         Flickr            5603 (0.47)        60.7        68.3          53.8        66.6          0.330   0.294   0.381
Geograph         Flickr no faces   5603 (0.47)        59.9        68.0          52.0        65.2          0.312   0.321   0.428

• Filtering out images with faces from the target set does not result in improved performance.

Discussion (1)

• Demonstrated that georeferenced community-contributed photo collections can be considered as a form of VGI.

• Maps of developed/undeveloped regions automatically generated using Flickr and Geograph images shown to be similar to ground truth maps.

– Despite simple image features.

Discussion (2)

• Weakly-labelled training set outperforms manually-labelled training set.

– Clear benefits for training classifiers.

• Photographer intent is significant, especially for target set.

– Restricts what can be used as target sets.

– Poses interesting research challenges such as how to use the Geograph dataset to filter the “noisy” Flickr dataset.

• Initial results on filtering out images with faces inconclusive.

Extensions

• Improved image features.

– Gist.

• Integrate textual annotations.

– Flickr tags.

– Geograph descriptive text.

• Additional land-cover/use classes.

• Spatial models:

– Tobler’s first law of geography: all things are related, but nearby things are more related than distant things.

Come to our poster this afternoon

Thank you! Questions?

Acknowledgements:

• This work was funded in part by the following grants:

– DOE Early Career Scientist and Engineer Award/PECASE

– NSF 0917069: IIS Core

• Thanks to Nathan Graves for implementing the edge histogram descriptors.
