Hands On: Multimedia Methods for Large Scale Video Analysis
Dr. Gerald Friedland, [email protected]
Multimedia on the Internet is Growing
Raphaël Troncy: Linked Media: Weaving non-textual content into the Semantic Web, MozCamp, 03/2009.
Multimedia on the Internet is Growing
• YouTube claims 100k video uploads per day (previously 65k), or 72 hours of video uploaded every minute (previously 48)
• Flickr claims 1M image uploads per day
• Twitter: up to 120M messages per day
Why do we care?
• Consumer-produced multimedia allows empirical studies at a never-before-seen scale in various research disciplines, such as sociology, medicine, economics, environmental sciences, computer science...
• Recent buzzword: BIGDATA
Problem
How can YOU effectively work on large-scale multimedia data (without working at Google)?
What is this class about?
• Introduction to large-scale video processing from a CS point of view (application driven)
• Covers different modalities: Visual, Audio, Tags, Sensor Data
• Covers different side topics, such as Mechanical Turk for annotation
Content of this Class
• Visual methods for video analysis
• Acoustic methods for video analysis
• Meta-data and tag-based methods for video analysis
• Inferring from the social graph and collaborative filtering
• Information fusion and multimodal integration
• Coping with memory and computational issues
• Crowdsourcing for ground truth annotation
• Privacy issues and societal impact of video retrieval
Why you should be interested in the Topic
• Processing of large-scale, consumer-produced videos is a brand-new emerging field driven by:
– Massive government funding (e.g., IARPA Aladdin, IARPA Finder, NSF BIGDATA)
– Industry's need to make consumer videos retrievable and searchable
– Interesting academic questions
Some Examples of Research at ICSI
• Video Concept Detection
• Forensic User Matching
• Location Estimation
• Fast Approaches in ParLab
Video2GPS: Multimodal Location Estimation of Flickr Videos
Gerald Friedland
International Computer Science Institute
Berkeley, CA
[email protected]
Definition
• Location Estimation = estimating the geo-coordinates of the origin of the content recorded in digital media.
• Here: regression task
– Where was the media recorded, in latitude, longitude, and altitude?
G. Friedland, O. Vinyals, and T. Darrell: "Multimodal Location Estimation", pp. 1245-1251, ACM Multimedia, Florence, Italy, October 2010.
Motivation 1
Training data comes for free on a never-before-seen scale!

Portal               %      Total
YouTube (estimate)   3.0      3M
Flickr               4.5    180M

Allows tackling tasks of never-before-seen difficulty?
Motivation 2
Location-based services are awesome!
Geocoordinates + Time = Unique ID
De-Motivation
Geo-Location enables Cybercasing!
G. Friedland and R. Sommer: "Cybercasing the Joint: On the Privacy Implications of Geotagging", Proceedings of the Fifth USENIX Workshop on Hot Topics in Security (HotSec 10), Washington, D.C., August 2010.
Placing Task
Automatically guess the location of a Flickr video:
• i.e., assign geo-coordinates (latitude and longitude)
• Using one or more of:
– Visual/audio content
– Metadata (title, tags, description, etc.)
– Social information
Consumer-Produced, Unfiltered Videos...
Data Description (2011)
• Training Data
– 10k Flickr videos / metadata / visual keyframes (+features) / geo-tags
– 6M Flickr photos / metadata / visual features
• Test Data
– 5k Flickr videos / metadata / visual keyframes (+features)
– no geotags
• Test/Training split: by UserID
Your Best Guess???
Our Approach
  Tag-based Approach
+ Visual Approach
+ Acoustic Approach
-------------------
  Multimodal Approach
Intuition for the Approach
Investigate the 'spatial variance' of a feature:
– Spatial variance is small: feature is likely location-indicative
– Spatial variance is high: feature is likely not indicative
Example Revisited
Tags: ['pavement', 'ucberkeley', 'berkeley', 'greek', 'greektheatre', 'spitonastranger', 'live', 'video']
Tag-based Approach
• 'greektheatre' would be the correct answer; however, it was not seen in the development dataset
• 'ucberkeley' is the 2nd-best answer: estimation error: 0.332 km
Tag               Matches in Training Set   Spatial Variance
pavement                    2                     5.739
ucberkeley                  4                     0.132
berkeley                   14                    68.138
greek                       0                     N/A
greektheatre                0                     N/A
spitonastranger             0                     N/A
live                       91                  6453.109
video                    2967                  6735.844
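To make the mechanics concrete, here is a minimal sketch of how such a table could be produced and used. All helper names are illustrative, and the spatial variance is taken here as the variance of great-circle distances of a tag's training occurrences from their centroid, which may differ from the exact formulation used in the actual system:

```python
import math
from collections import defaultdict

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def tag_statistics(training_set):
    """training_set: iterable of (tags, (lat, lon)) pairs.
    Returns {tag: (match_count, centroid, spatial_variance)}."""
    locations = defaultdict(list)
    for tags, coord in training_set:
        for tag in tags:
            locations[tag].append(coord)
    stats = {}
    for tag, coords in locations.items():
        centroid = (sum(c[0] for c in coords) / len(coords),
                    sum(c[1] for c in coords) / len(coords))
        variance = sum(haversine_km(c, centroid) ** 2 for c in coords) / len(coords)
        stats[tag] = (len(coords), centroid, variance)
    return stats

def estimate_location(video_tags, stats):
    """Pick the centroid of the seen tag with the lowest spatial variance."""
    seen = [t for t in video_tags if t in stats]
    if not seen:
        return None
    best = min(seen, key=lambda t: stats[t][2])
    return stats[best][1]
```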
Visual Approach
• Assumption: similar images imply similar locations
• Based on related work: reduce location estimation to an image retrieval problem
• Use the median frame of the video as keyframe
Visual Features
• Color Histograms
– L*a*b* (4, 14, 14 bins, respectively)
– Chi-square distance
• GIST
– 5x5 spatial resolution, 6 orientations, and 4 scales
– Euclidean distance
See: A. Oliva and A. Torralba: Building the Gist of a Scene: The Role of Global Image Features in Recognition, Progress in Brain Research, vol. 155, pp. 23-36, Elsevier, 2006.
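As an illustration, the two distance measures listed above might look as follows, assuming precomputed histogram and GIST vectors as NumPy arrays (the feature extraction itself is omitted here):

```python
import numpy as np

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two (normalized) color histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def euclidean_distance(g1, g2):
    """Euclidean distance between two GIST descriptors."""
    return np.linalg.norm(g1 - g2)

def nearest_neighbor(query, database):
    """1-NN retrieval: database is a list of (features, (lat, lon)) pairs.
    Returns the coordinates of the visually closest training image."""
    best = min(database, key=lambda item: euclidean_distance(query, item[0]))
    return best[1]
```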
Audio Approach
• Idea: Different places have different acoustic "signatures"
• Due to data sparsity: initially only used for major cities
• Classification system similar to a GMM/UBM speaker ID system, using MFCC19 (19-dimensional MFCC) features
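A rough sketch of such a city classifier, substituting scikit-learn's GaussianMixture for a full GMM/UBM system with MAP adaptation, so this is a simplification rather than the actual ICSI system:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_city_models(training_data, n_components=64):
    """training_data: {city: [mfcc_matrix, ...]} where each mfcc_matrix is
    a frames-by-19 array of MFCC features. Trains one GMM per city."""
    models = {}
    for city, clips in training_data.items():
        frames = np.vstack(clips)  # pool all MFCC frames for this city
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
        gmm.fit(frames)
        models[city] = gmm
    return models

def classify_city(mfcc, models):
    """Return the city whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda city: models[city].score(mfcc))
```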
An Experiment
Listen!
• Which city was this recorded in? Pick one of: Amsterdam, Bangkok, Barcelona, Beijing, Berlin, Cairo, Cape Town, Chicago, Dallas, Denver, Duesseldorf, Fukuoka, Houston, London, Los Angeles, Lower Hutt, Melbourne, Moscow, New Delhi, New York, Orlando, Paris, Phoenix, Prague, Puerto Rico, Rio de Janeiro, Rome, San Francisco, Seattle, Seoul, Siem Reap, Sydney, Taipei, Tel Aviv, Tokyo, Washington DC, Zuerich
• Solution: Tokyo, with the highest confidence score!
Audio-Only Results
[Figure: DET curve for common-user city identification with 540 videos and 9,700 trials. Axes: False Alarm probability (in %) vs. Miss probability (in %). EER = 22%]
Problem: Bias
Geo-tagging: an estimation-theoretic approach
Observations:
Images: [Figure: four example photos]
Tags: $\{t_1^k\}, \{t_2^k\}, \{t_3^k\}, \{t_4^k\}$, here {berkeley, sathergate, campanile}, {berkeley, haas}, {campanile}, {campanile, haas}

Estimate:
Geolocations: $x_1, x_2, x_3, x_4$
Interpreting traditional approaches

Locations are random variables: $\{x_1, x_2, \ldots, x_N\}$

Traditional approaches estimate
$$p(x_i \mid \{t_i^k\}) \propto \prod_k p(x_i \mid t_i^k)$$
where $p(x_i \mid t_i^k)$ is obtained from the training set.

Example: the distribution for the tag "washington" is depicted here. [Figure: probability of location given tags]

Location estimate:
$$\int x_i \, p(x_i \mid \{t_i^k\}) \, dx_i$$
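A discretized sketch of this traditional estimator: each per-tag distribution is a histogram over a coarse location grid learned from training data, the product is taken in log space, and the estimate is the posterior mean, the discrete analogue of the integral above. All helper names are illustrative:

```python
import numpy as np

def tag_posterior(tags, tag_histograms, grid):
    """tag_histograms: {tag: array of p(cell | tag) over the grid cells};
    grid: (M, 2) array of (lat, lon) cell centers.
    Returns p(x | {tags}) over the grid cells."""
    log_p = np.zeros(len(grid))
    for t in tags:
        if t in tag_histograms:           # skip tags unseen in training
            log_p += np.log(tag_histograms[t] + 1e-12)
    p = np.exp(log_p - log_p.max())       # stabilize before normalizing
    return p / p.sum()

def location_estimate(tags, tag_histograms, grid):
    """Posterior mean: the discrete analogue of the integral on the slide."""
    p = tag_posterior(tags, tag_histograms, grid)
    return p @ grid                       # probability-weighted mean of cell centers
```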
Drawbacks

Data sparsity: Not all tags in the test set are available in the training set. Hence the estimate of $p(x_i \mid t_i^k)$ can be bad.

Sub-optimality: The approaches are suboptimal given the data. What we ideally want:
$$p(x_1, x_2, \ldots, x_N \mid \{t_1^k\}, \{t_2^k\}, \ldots, \{t_N^k\})$$
The mean of the above distribution gives the best estimate of the locations, i.e., for each image we want
$$p(x_i \mid \{t_1^k\}, \{t_2^k\}, \ldots, \{t_N^k\})$$
Traditional algorithms only give:
$$p(x_i \mid \{t_i^k\})$$
Cooperative geo-tagging

Intuition: Images in the training set having common tags have correlated geo-locations, captured by the joint distribution.

Joint probability modeling:
$$p(x_1, \ldots, x_N \mid \{t_1^k\}, \ldots, \{t_N^k\}) \propto \prod_i p(x_i \mid \{t_i^k\}) \prod_{(i,j)} p(x_i, x_j \mid \{t_i^k\} \cap \{t_j^k\})$$
$p(x_i \mid \{t_i^k\})$ is obtained from the training set as before.

The pairwise distribution given at least one common tag, $p(x_i, x_j \mid \{t_i^k\} \cap \{t_j^k\})$, is modeled as an indicator function $I(x_i = x_j)$: if the common tag has low spatial variance or occurs infrequently (e.g., if the common tag is "haas"), it is very likely that the locations are the same.

Question: How do we estimate the optimal marginal distribution $p(x_i \mid \{t_1^k\}, \ldots, \{t_N^k\})$?
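One plausible way to set up this pairwise structure in code: connect two test videos whenever they share a tag, and keep the indicator-style potential only when the shared tag is rare or has low spatial variance. The threshold values below are invented for illustration, and `stats` reuses the hypothetical tag-statistics helper from the earlier sketch:

```python
def build_edges(videos, stats, max_variance=1.0, max_count=5):
    """videos: list of tag sets; stats: {tag: (count, centroid, variance)}.
    Returns list of (i, j) pairs whose locations we constrain to be equal."""
    edges = []
    for i in range(len(videos)):
        for j in range(i + 1, len(videos)):
            for tag in videos[i] & videos[j]:
                count, _, variance = stats.get(tag, (0, None, float('inf')))
                # Indicator potential I(x_i = x_j): only trust rare or
                # location-specific common tags (e.g. 'haas', not 'video').
                if count and (variance <= max_variance or count <= max_count):
                    edges.append((i, j))
                    break
    return edges
```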
Bayesian graphical framework

[Figure: graphical model over four images with tag sets {berkeley, sathergate, campanile}, {berkeley, haas}, {campanile}, {campanile, haas}]

Node: geolocation of the image, with node potential $p(x_i \mid \{t_i^k\})$ (and likewise $p(x_j \mid \{t_j^k\})$)

Edge: correlated locations (e.g., common tag)

Edge potential: strength of an edge, e.g., the posterior distribution of locations given common tags, $p(x_i, x_j \mid \{t_i^k\} \cap \{t_j^k\})$
Belief propagation updates

Iterative algorithm to approximate the posterior distribution $p(x_i \mid \{t_1^k\}, \{t_2^k\}, \ldots, \{t_N^k\})$.

Gaussian modeling: $p(x_i \mid \{t_i^k\}) \sim \mathcal{N}(\mu_i, \sigma_i^2)$

At iteration 0, each node calculates $(\mu_i, \sigma_i^2)$.

At iteration $t$, each node updates its location as a weighted mean of its previous location and that of its neighbors:
$$\mu_i^{(t)} = \left(\sigma_i^{(t)}\right)^2 \left( \frac{\mu_i^{(t-1)}}{\left(\sigma_i^{(t-1)}\right)^2} + \sum_{k \in N(i)} \frac{\mu_k^{(t-1)}}{\left(\sigma_k^{(t-1)}\right)^2} \right)$$
$$\frac{1}{\left(\sigma_i^{(t)}\right)^2} = \frac{1}{\left(\sigma_i^{(t-1)}\right)^2} + \sum_{k \in N(i)} \frac{1}{\left(\sigma_k^{(t-1)}\right)^2}$$
The weights reflect the confidence in the measurements, i.e., the higher the spatial variance, the lower the weight.
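Written out as code, one sweep of these updates could look like this. The slides leave the update schedule open, so the synchronous variant below is an assumption:

```python
import numpy as np

def gaussian_bp_step(mu, var, neighbors):
    """One synchronous belief-propagation update.
    mu[i]: np.array (lat, lon); var[i]: scalar variance of node i's belief;
    neighbors[i]: indices of nodes sharing an edge with node i."""
    new_mu, new_var = [], []
    for i in range(len(mu)):
        # 1/sigma_i(t)^2 = 1/sigma_i(t-1)^2 + sum_{k in N(i)} 1/sigma_k(t-1)^2
        precision = 1.0 / var[i] + sum(1.0 / var[k] for k in neighbors[i])
        # mu_i(t): precision-weighted mean of own and neighbor means
        weighted = mu[i] / var[i] + sum(mu[k] / var[k] for k in neighbors[i])
        new_var.append(1.0 / precision)
        new_mu.append(weighted / precision)
    return new_mu, new_var
```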
Belief propagation

[Figure: message passing between nodes carrying Gaussian beliefs $(\mu_1, \sigma_1^2)$, $(\mu_2, \sigma_2^2)$, $(\mu_3, \sigma_3^2)$]

Posterior mean and variance assuming Gaussian beliefs.

Audio-visual features are incorporated in modeling the edge and node potentials.
Incorporating Audio-Visual Features

• GIST features are extracted from the images.
• MFCC features are extracted from the audio.
• These are incorporated into the node and edge potentials as exponential distributions:
$$p(x_i, x_j \mid a_i, a_j) \propto \exp\left(-\frac{\|x_i - x_j\|}{\lambda \, \|a_i - a_j\|}\right)$$
where $a_i$ are the audio features associated with image $i$.

The intuition is that the closer the audio features are, the higher the probability that the geo-locations are close. This can be included in the node potentials as well, and likewise for the visual features.
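A direct transcription of this edge potential; the scale parameter `lam` and the placement of the audio distance in the denominator follow the reconstruction above and should be read as illustrative:

```python
import numpy as np

def audio_edge_potential(x_i, x_j, a_i, a_j, lam=1.0, eps=1e-9):
    """Unnormalized edge potential: the closer the audio features a_i, a_j,
    the more strongly locations x_i, x_j are pulled together.
    lam is a free scale parameter (illustrative, not from the slides)."""
    geo_dist = np.linalg.norm(np.asarray(x_i) - np.asarray(x_j))
    audio_dist = np.linalg.norm(np.asarray(a_i) - np.asarray(a_j))
    return np.exp(-geo_dist / (lam * audio_dist + eps))
```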
Multimodal Results: MediaEval 2011 Dataset
0"
10"
20"
30"
40"
50"
60"
70"
80"
90"
10m" 100"m" 1"km" 5"km" 10"km" 50"km" 100"km"
Coun
t"[%]"
Distance"between"es>ma>on"and"ground"truth"
Figure 3: The resulting accuracy of the algorithm
as described in Section 5.
used as the final distance between frames. The scaling of the weights was learned by using a small sample of the training set and normalizing the individual distance distributions so that the standard deviation of each of them would be similar. We use 1-NN matching between the test frame and all the images in a 100 km radius around the 1 to 3 coordinates from the previous step. We pick the match with the smallest distance and output its coordinates as the final result.

This multimodal algorithm is less complex than previous algorithms (see Section 3), yet produces more accurate results on less training data. The following section analyzes the accuracy of the algorithm and discusses experiments to support individual design decisions.
6. RESULTS AND ANALYSIS
The evaluation of our results is performed by applying the same rules and using the same metric as in the MediaEval 2010 evaluation. In MediaEval 2010, participants were to build systems to automatically guess the location of a video, i.e., assign geo-coordinates (latitude and longitude) to videos using one or more of: video metadata (tags, titles), visual content, audio content, and social information. Even though training data was provided (see Section 4), any "use of open resources, such as gazetteers, or geo-tagged articles in Wikipedia was encouraged" [16]. The goal of the task was to come as close as possible to the geo-coordinates of the videos as provided by users or their GPS devices. The systems were evaluated by calculating the geographical distance from the actual geo-location of the video (assigned by a Flickr user, the creator of the video) to the predicted geo-location (assigned by the system). While it was important to minimize the distances over all test videos, runs were compared by finding how many videos were placed within a threshold distance of 1 km, 5 km, 10 km, 50 km, and 100 km. For analyzing the algorithm in greater detail, here we also show distances of below 100 m and below 10 m. The lowest distance category is about the accuracy of a typical GPS localization system in a camera or smartphone.
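The threshold-based metric described here is straightforward to reproduce. A sketch, reusing the hypothetical haversine_km helper from the earlier tag-variance example:

```python
def accuracy_within(predictions, ground_truth,
                    thresholds_km=(0.01, 0.1, 1, 5, 10, 50, 100)):
    """Fraction of videos placed within each distance threshold of the
    user-provided geo-location. Coordinates are (lat, lon) pairs;
    distances are computed with haversine_km, defined earlier."""
    distances = [haversine_km(p, g) for p, g in zip(predictions, ground_truth)]
    return {t: sum(d <= t for d in distances) / len(distances)
            for t in thresholds_km}
```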
First we discuss the results as generated by the algorithm described in Section 5. The results are visualized in Figure 3. The results shown are superior in accuracy to any system presented in MediaEval 2010.
0"10"20"30"40"50"60"70"80"90"
10m" 100"m" 1"km" 5"km" 10"km" 50"km" 100"km"
Coun
t"[%]"
Distance"between"es>ma>on"and"ground"truth"
Visual"Only" Tags"Only" Visual+Tags"
Figure 4: The resulting accuracy when comparing
tags-only, visual-only, and multimodal location esti-
mation as discussed in Section 6.1.
Also, although we added additional data to the MediaEval training set, which was legal under the rules explained above, we added less data than other systems in the evaluation, e.g., [21]. Compared to any other system, including our own, the system presented here is the least complex.
6.1 About the Visual Modality
Probably one of the most obvious questions is the impact of the visual modality. As a comparison, the image-matching-based location estimation algorithm in [8] started reporting accuracy at the granularity of 200 km. As can be seen in Figure 4, this is consistent with our results: using the location of the 1-best nearest neighbor in the entire database compared to the test frame results in a minimum accuracy of 10 km. In contrast to that, tag-based localization reaches accuracies of below 10 m. For the tags-only localization we modified the algorithm from Section 5 to output only the 1-best geo-coordinate centroid of the matching tag with the lowest spatial variance and to skip the visual matching step. While the tags-only variant of the algorithm already performs well, using visual matching on top of the algorithm decreases the accuracy in the finer-granularity ranges but increases overall accuracy, as in total more videos can be classified below 100 km. Out of the 5091 test videos, using only tags, 3774 videos can be estimated correctly with an accuracy better than 100 km. The multimodal approach estimates 3989 correctly in the range below 100 km.
6.2 The Influence of Non-Disjoint User Sets
Each individual person has his or her own idiosyncratic method of choosing keywords for certain events and locations when uploading videos to Flickr. Furthermore, the spatial variance of the videos uploaded by one user is low on average. At the same time, a certain number of users upload many videos. Therefore, taking into account which user a video belongs to seems to offer a higher chance of finding geographically related videos. For this reason, videos in the MediaEval 2010 test dataset were chosen to have a set of users disjoint from the training dataset. However, the additional training images provided for MediaEval 2010 are not user-disjoint
How is the Class Organized?
• Once a week: lectures, a fundamental introduction (from me)
• Once a week: project meetings
• Guest lectures by YouTube, Yahoo!, Intel, TU Delft
Hands-On Part
• Do a project on a large-scale data set
• Data accessible through ICSI accounts
• Additional computation: $100 in Amazon EC2 money
Lecture Material
• Background material for lectures: G. Friedland, R. Jain: Introduction to Multimedia Computing, to appear at Cambridge University Press, 2012.
• (Constantly changing) draft available at http://www.mm-creole.org
How do you receive Credit?
• 3 credits (graded or ungraded) based on:
– 2 quizzes
– a project