A Multimodal Approach for Video Geocoding
(UNICAMP at Placing Task MediaEval 2012)
Lin Tzy Li, Jurandy Almeida, Daniel Carlos Guimaraes Pedronette, Otavio A. B. Penatti, and Ricardo da S. Torres
Institute of Computing, University of Campinas (UNICAMP), Brazil
Multimodal geocoding proposal
Textual features
• Similarity functions: Okapi & Dice (a minimal Dice sketch follows this list)
• Video metadata (run 1)
– Title + description + keywords (Okapi_all)
– Only description: Okapi_desc & Dice_desc
– Combined result in run 1:
• Okapi_all + Okapi_desc + Dice_desc
• Photo tags (run 5)
– Okapi function: keywords (test video) × tags (3,185,258 Flickr photos)
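As a concrete reference for the simpler of the two similarity functions, below is a minimal sketch of the Dice coefficient between two term sets, e.g., a test video's keywords against a photo's tags. It is illustrative only: the actual runs score indexed metadata fields with Okapi (BM25) and Dice, and the function name here is hypothetical.

```python
def dice_similarity(terms_a, terms_b):
    """Dice coefficient: 2*|A & B| / (|A| + |B|), in [0, 1]."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

# Example: a test video's keywords vs. a Flickr photo's tags.
print(dice_similarity({"beach", "rio", "sunset"}, {"rio", "carnival", "beach"}))  # ~0.667
```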
[Pipeline figure: geocoded videos (with lat/long, tags, title, description, etc.) feed video feature extraction; video similarity computation then returns candidate lat/long coordinates with match scores for the test video.]
The location (lat/long) of the most similar video is used as the candidate location for the test video.
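To make that last step concrete, here is a hypothetical sketch: rank the geocoded development videos by similarity to the test video and propagate the top result's coordinates. All names (geocode, dev_set, similarity) are illustrative, not the authors' code.

```python
def geocode(test_video, dev_set, similarity):
    """Assign the lat/long of the most similar geocoded video."""
    ranked = sorted(dev_set, key=lambda v: similarity(test_video, v), reverse=True)
    best = ranked[0]  # highest match score
    return best["lat"], best["lon"]
```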
Geocoding Visual Content
[Figure: a test video is compared against the development set (15,563 videos V1 ... Vk), yielding a ranked list of similar videos; the visual features are Bag of Scenes (built from photos) and Histograms of Motion Patterns.]
Visual Features (HMP): Extracting
• Histograms of Motion Patterns
• Keyframes: not used
• Video sequences are compared by an algorithm with three steps: (1) partial decoding; (2) feature extraction; (3) signature generation.
"Comparison of video sequences with histograms of motion patterns", J. Almeida et al., ICIP 2011.
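As an illustration of step (3), signature generation, the sketch below builds an L1-normalized histogram from a sequence of motion-pattern codes assumed to come from steps (1) and (2); this code representation is an assumption, not the paper's exact scheme.

```python
from collections import Counter

def hmp_signature(pattern_codes, num_patterns):
    """L1-normalized histogram of motion-pattern occurrences."""
    counts = Counter(pattern_codes)
    total = max(len(pattern_codes), 1)  # avoid division by zero
    return [counts[p] / total for p in range(num_patterns)]
```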
Visual Features (HMP): overview
[Almeida et al., Comparison of video sequences with histograms of motion patterns. ICIP 2011]
HMP: Comparing Videos
• Comparison of histograms can be performed by any vectorial distance function
– like Manhattan (L1) or Euclidean (L2)
• Video sequences are compared with the histogram intersection, defined as:

$d(v_1, v_2) = \sum_i \min(H_{v_1}^i, H_{v_2}^i)$

where $H_v^i$ is the $i$-th bin of the histogram extracted from video $v$. Output range: $[0, 1]$; 0 = dissimilar histograms, 1 = identical histograms.
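The intersection above is a one-liner; this direct implementation assumes both histograms are L1-normalized and of equal length:

```python
def histogram_intersection(h1, h2):
    """d(v1, v2) = sum_i min(H_v1[i], H_v2[i]); 1.0 for identical histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

print(histogram_intersection([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # 1.0
```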
[Figure: a bag-of-visual-words pipeline builds the feature vector from a dictionary of local descriptions; bag-of-scenes replaces it with a dictionary of scenes.]
[Penatti et al., A Visual Approach for Video Geocoding using Bag-of-Scenes. ACM ICMR 2012]
Visual Features: Bag-of-Scenes (BoS)
Creating the dictionary: scenes → feature vectors → visual word selection → dictionary of scenes.
Using the dictionary: video → frames → feature vectors → assignment and pooling against the dictionary → video feature vector (bag-of-scenes).
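A minimal sketch of the "using the dictionary" path, assuming hard assignment of each frame feature to its nearest scene and max pooling; the actual BoS runs (e.g., BoS_CEDD5000) may use different assignment and pooling choices.

```python
import numpy as np

def bag_of_scenes(frame_features, scene_dictionary):
    """frame_features: (n_frames, d); scene_dictionary: (n_scenes, d)."""
    bos = np.zeros(len(scene_dictionary))
    for f in frame_features:
        dists = np.linalg.norm(scene_dictionary - f, axis=1)
        bos[np.argmin(dists)] = 1.0  # hard assignment + max pooling
    return bos
```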
Data Fusion – Rank Aggregation
• In multimedia applications, fusing information from different modalities is essential for achieving high effectiveness
• Rank Aggregation:
– Unsupervised approach for data fusion
– Combines different features using a multiplication approach inspired by Naive Bayes classifiers (assuming conditional independence among features)
Data Fusion – Rank Aggregation
[Figure: ranked lists from two visual features and one textual feature are fused into a single combined ranked list.]
Data Fusion – Rank Aggregation
Let $q$ be the query video, $d$ a dataset video, and $\rho_k(q, d)$ the similarity score between them. Given the set of similarity functions $\{\rho_1, \ldots, \rho_m\}$ defined by the different features, the new aggregated score is computed by:

$\rho(q, d) = \prod_{k=1}^{m} \rho_k(q, d)$
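In code, the multiplicative aggregation is a simple product over the per-feature scores; this sketch assumes scores in (0, 1] and uses illustrative names.

```python
def aggregated_score(q, d, similarity_functions):
    """Naive-Bayes-style fusion: multiply per-feature similarity scores."""
    score = 1.0
    for sim in similarity_functions:
        score *= sim(q, d)
    return score
```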
Runs Summary
Run    Description                                  Descriptors used
Run 1  Combine 3 textual                            Okapi_all + Okapi_desc + Dice_desc
Run 2  Combine 2 textual & 2 visual                 Okapi_all + Okapi_desc + HMP + BoS_CEDD5000
Run 3  Single visual: HMP                           HMP (last year's visual approach)
Run 4  Combine 3 visual                             HMP + BoS5000 + BoS500
Run 5  Textual: Flickr photo tags as geo-profile    Okapi on keywords
Results for Test Set
Radius (km) Run 1 Run 2 Run 3 Run 4 Run 5
1 21.40% 22.29% 15.81% 15.93% 9.28%
10 30.68% 31.25% 16.07% 16.09% 19.44%
100 35.39% 36.42% 16.62% 17.07% 24.13%
200 37.37% 38.40% 17.58% 17.86% 25.85%
500 41.77% 43.35% 19.68% 19.97% 29.29%
1,000 45.38% 47.68% 24.77% 25.47% 33.91%
2,000 53.32% 56.03% 33.48% 33.31% 46.05%
5,000 62.29% 66.91% 45.34% 45.34% 65.73%
10,000 85.27% 87.95% 81.95% 81.73% 87.69%
15,000 95.89% 96.80% 95.79% 95.70% 96.17%
The result of the combined classical text vector space models (run 1) is roughly twice as good as using visual cues alone (run 4).
Run 5 (Flickr photo tags used as a geo-profile) is worse than run 4 (visual information only) at the 1 km precision level. For all other radii, however, run 5 outperforms runs 3 and 4.
Combining different textual and visual descriptors (run 2) leads to statistically significant improvements (confidence ≥ 0.99) over using only textual clues (run 1).
[Figure: 99% confidence intervals of the distance (km) for run 1 and run 2.]
Results for Test Set: Run 3 (HMP) vs. 2011

Radius (km)  Run 3 (HMP, 2012)  HMP (2011)
1            15.81%             0.21%
10           16.07%             1.12%
100          16.62%             2.71%
200          17.58%             3.33%
500          19.68%             6.08%
1,000        24.77%             12.16%
2,000        33.48%             22.11%
5,000        45.34%             37.78%
10,000       81.95%             79.45%
Run 3, using only HMP (our approach from last year), performs much better on this year's data set. Why? The larger development set (about 5,000 more videos in 2012) may yield a richer geo-profile.
Conclusion
• Textual features: Okapi & Dice
• Visual features: HMP & BoS
– HMP results: better in 2012 than in 2011. Is this due to the bigger development set?
• Combined textual information (video metadata) and visual features via ranked lists
• Promising results: combining modalities yields better results than any single modality
• Future improvements:
– other strategies for combining different modalities
– other information sources to filter out noisy data from ranked lists (e.g., GeoNames and Wikipedia)
Acknowledgements & contacts
• RECOD Lab @ Institute of Computing, UNICAMP
(University of Campinas)
• VoD Lab @ UFMG
(Universidade Federal de Minas Gerais)
• Organizers of Placing Task and MediaEval 2012
• Brazilian funding agencies: CAPES, FAPESP, CNPq
Contact email: {lintzyli, jurandy.almeida, dcarlos, penatti, rtorres}@ic.unicamp.br