University Politehnica of Bucharest
CBMI 2013, Monday, April 29, 2013

An In-Depth Evaluation of Multimodal Video Genre Categorization

Ionuț MIRONICĂ¹, Bogdan IONESCU¹,², Peter KNEES³, Patrick LAMBERT² (patrick.lambert@univ-savoie.fr)

11th International Workshop on Content-Based Multimedia Indexing, CBMI 2013, Veszprém, Hungary, June 17-19, 2013.
Presentation outline
• Introduction
• Video Content Description
• Fusion Techniques
• Experimental results
• Conclusions
Problem Statement
Concepts:
• Content-Based Video Retrieval
• Genre Retrieval
[Diagram: a genre query is run against the video database, returning the query results]
Global Approach
> challenge: find a way to assign (genre) tags to unknown videos;
> approach: machine learning paradigm;
[Diagram: labeled data (e.g. web, food, autos) trains a classifier, which then labels the unlabeled data from the video database, producing a tagged video database]
Global Approach
• the entire process relies on the concept of "similarity" computed between content annotations (numeric features);
• we focus on:
  objective 1: go (truly) multimodal: visual, audio, text & metadata;
  objective 2: test a broad range of classifiers;
  objective 3: test a broad range of fusion techniques.
Video Content Description - audio
Standard audio features (audio frame-based) [B. Mathieu et al., Yaafe toolbox, ISMIR'10, Netherlands]:
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• Zero-Crossing Rate,
• spectral centroid, flux, rolloff, and kurtosis,
+ variance of each feature over a certain window.
Global feature = mean & variance of the frame-level features f1, f2, …, fn over time.
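The temporal pooling above can be sketched in a few lines of numpy; the function name and toy frame values are illustrative, not from the authors' code:

```python
import numpy as np

def pool_audio_frames(frames):
    """Pool frame-level audio feature vectors (e.g. MFCC, ZCR, LSP)
    into one clip-level descriptor: per-dimension mean and variance."""
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.var(axis=0)])

# Toy input: 4 frames of a 2-dimensional frame feature.
g = pool_audio_frames([[1.0, 2.0], [3.0, 2.0], [1.0, 2.0], [3.0, 2.0]])
# g = [mean_1, mean_2, var_1, var_2]
```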
Video Content Description - visual
MPEG-7 & color/texture descriptors (visual frame-based) [OpenCV toolbox, http://opencv.willowgarage.com]:
• Local Binary Pattern,
• Autocorrelogram,
• Color Coherence Vector,
• Color Layout Pattern,
• Edge Histogram,
• Structure Color Descriptor,
• Classic color histogram,
• Color moments.
Global feature = mean & dispersion & skewness & kurtosis & median & root mean square of the frame-level features f1, f2, …, fn over time.
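The six-statistic pooling can be sketched as follows (a hand-rolled illustration; the authors' exact conventions, e.g. excess vs. raw kurtosis, may differ):

```python
import numpy as np

def pool_visual_frames(frames):
    """Aggregate frame-level visual descriptors over time with six
    statistics per dimension: mean, dispersion (std. dev.), skewness,
    kurtosis, median and root mean square."""
    x = np.asarray(frames, dtype=float)
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd_safe = np.where(sd == 0, 1.0, sd)      # avoid division by zero
    z = (x - mu) / sd_safe
    skew = (z ** 3).mean(axis=0)
    kurt = (z ** 4).mean(axis=0) - 3.0        # excess kurtosis
    med = np.median(x, axis=0)
    rms = np.sqrt((x ** 2).mean(axis=0))
    return np.concatenate([mu, sd, skew, kurt, med, rms])

h = pool_visual_frames([[0.0], [2.0], [4.0]])  # toy 1-D frame descriptor
```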
Video Content Description - visual
Feature descriptors: Bag of Words [CIVR 2009, J. Uijlings et al.]
• we train the model with 4,096 words
• rgbSIFT and spatial pyramids (2x2)
Bag-of-Visual-Words framework: detect interest points → build the codewords dictionary → generate BoW histograms → train the classifier.
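Given a precomputed codeword dictionary, histogram generation amounts to nearest-codeword assignment. A toy numpy sketch (the 2-word dictionary is illustrative; the paper uses 4,096 words and a 2x2 spatial pyramid, omitted here):

```python
import numpy as np

def bow_histogram(descriptors, dictionary):
    """Quantize local descriptors (e.g. rgbSIFT) against a codeword
    dictionary; return an L1-normalized Bag-of-Visual-Words histogram.
    descriptors: (n_points, dim); dictionary: (n_words, dim)."""
    d = np.asarray(descriptors, dtype=float)
    c = np.asarray(dictionary, dtype=float)
    # squared Euclidean distance of every descriptor to every codeword
    dist = ((d[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    words = dist.argmin(axis=1)               # nearest codeword index
    hist = np.bincount(words, minlength=len(c)).astype(float)
    return hist / hist.sum()

# Toy 2-word dictionary and three 2-D local descriptors.
dico = [[0.0, 0.0], [10.0, 10.0]]
bh = bow_histogram([[0.1, 0.2], [9.8, 9.9], [10.2, 10.1]], dico)
```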
Video Content Description - visual
Feature descriptors: Histogram of Oriented Gradients (HoG) [CITS 2009, O. Ludwig et al.]
• divides the image into 3x3 cells and for each of them builds a pixel-wise histogram of edge orientations.
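A toy, self-contained version of this cell-based scheme (unsigned orientations, no block normalization; full HoG implementations add those details):

```python
import numpy as np

def hog_3x3(image, n_bins=9):
    """Toy Histogram of Oriented Gradients: split the image into 3x3
    cells and build a magnitude-weighted histogram of gradient
    orientations (unsigned, in [0, 180)) per cell."""
    img = np.asarray(image, dtype=float)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    h, w = img.shape
    feats = []
    for i in range(3):
        for j in range(3):
            cm = mag[i*h//3:(i+1)*h//3, j*w//3:(j+1)*w//3]
            ca = ang[i*h//3:(i+1)*h//3, j*w//3:(j+1)*w//3]
            hist, _ = np.histogram(ca, bins=n_bins, range=(0.0, 180.0),
                                   weights=cm)
            feats.append(hist)
    return np.concatenate(feats)              # 9 cells x n_bins values

# A 9x9 ramp image: horizontal gradient only, all orientations = 0°.
v = hog_3x3(np.tile(np.arange(9.0), (9, 1)))
```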
Video Content Description - visual
Structural descriptors [IJCV, C. Rasche'10]
Objective: describe structural information in terms of contours and their relations.
Contour properties:
• b: degree of curvature (proportional to the maximum amplitude of the bowness space); straight vs. bow;
• degree of circularity: ½ circle vs. full circle;
• e: edginess parameter; zig-zag vs. sinusoid;
• y: symmetry parameter; irregular vs. "even".
+ Appearance parameters:
• c_m, c_s: mean and std. dev. of intensity along the contour;
• f_m, f_s: fuzziness, obtained from a blob (DOG) filter: I * DOG.
Video Content Description - text
TF-IDF descriptors (Term Frequency-Inverse Document Frequency)
Text sources: ASR transcripts and metadata.
1. remove XML markups,
2. remove terms below the 5%-percentile of the frequency distribution,
3. select the term corpus: retain for each genre class the m terms (e.g. m = 150 for ASR and m = 20 for metadata) with the highest χ² values that occur more frequently than in the complement classes,
4. represent each document by its TF-IDF values.
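Step 4 can be illustrated with a minimal TF-IDF computation (plain Python; the χ² term selection of step 3 is assumed to have been applied already, and the toy documents are illustrative):

```python
import math

def tfidf(docs):
    """Minimal TF-IDF: docs is a list of token lists; returns one
    {term: tf * idf} dict per document."""
    n = len(docs)
    df = {}                                   # document frequency
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    out = []
    for doc in docs:
        scores = {}
        for t in set(doc):
            tf = doc.count(t) / len(doc)      # term frequency
            idf = math.log(n / df[t])         # inverse document frequency
            scores[t] = tf * idf
        out.append(scores)
    return out

w = tfidf([["food", "drink", "food"], ["autos", "drink"]])
```

Terms that appear in every document (here "drink") get an IDF of zero, so only class-discriminative terms carry weight.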
Classifiers
We test a broad range of classifiers:
• SVM with linear, RBF and Chi-square kernels
• 5-NN
• Random Forests and Extremely Random Forests
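For illustration, a plain numpy k-nearest-neighbour vote (k = 5 as in the setup above; the Euclidean distance and tie handling are illustrative choices, and the toy genre data is hypothetical):

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest
    training samples (squared Euclidean distance)."""
    tx = np.asarray(train_x, dtype=float)
    q = np.asarray(query, dtype=float)
    dist = ((tx - q) ** 2).sum(axis=1)
    nearest = np.argsort(dist)[:k]
    labels, counts = np.unique(np.asarray(train_y)[nearest],
                               return_counts=True)
    return labels[counts.argmax()]

# Toy 2-D features with two genre clusters.
x = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]]
y = ["music", "music", "music", "sports", "sports", "sports"]
p = knn_predict(x, y, [8.5, 8.5], k=5)
```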
Fusion Techniques
Early Fusion
Pipeline: feature extraction (descriptor 1 … descriptor n) → feature normalization → feature concatenation into a global descriptor → classification step (a single classifier) → global confidence score → decision.
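A minimal sketch of the early-fusion step, assuming a simple per-descriptor min-max normalization (the paper's exact normalization may differ):

```python
import numpy as np

def early_fusion(feature_blocks):
    """Early fusion: normalize each descriptor independently, then
    concatenate everything into one global descriptor for a single
    classifier."""
    parts = []
    for block in feature_blocks:
        b = np.asarray(block, dtype=float)
        span = b.max() - b.min()
        # min-max rescale to [0, 1]; constant blocks map to zeros
        parts.append((b - b.min()) / span if span > 0 else np.zeros_like(b))
    return np.concatenate(parts)

# Two descriptors of one video, on very different scales.
gf = early_fusion([[0.0, 5.0, 10.0], [100.0, 300.0]])
```

Normalizing before concatenation keeps large-range descriptors from dominating the similarity computation.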
Fusion Techniques
Late Fusion
Pipeline: feature extraction (descriptor 1 … descriptor n) → classification step (one classifier per descriptor) → confidence score normalization (confidence values 1 … n) → aggregation into a global confidence score → decision.
Fusion Techniques
Late Fusion
where - cvi is the confidence value of classifier i for class q , d is the current video, i are some weights and N is the number of classifiers to be aggregated.- rank() represents the rank of classifier i.
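These standard combination rules can be sketched directly from their definitions (unit weights α_i = 1 by default, and the argsort-based tie handling is an assumption of this sketch):

```python
import numpy as np

def comb_sum(scores, weights=None):
    """CombSUM: weighted sum of per-classifier confidence values.
    scores: (n_classifiers, n_classes) normalized confidences."""
    s = np.asarray(scores, dtype=float)
    w = np.ones(len(s)) if weights is None else np.asarray(weights, float)
    return w @ s

def comb_mean(scores, weights=None):
    """CombMean: CombSUM divided by the number of classifiers."""
    return comb_sum(scores, weights) / len(scores)

def comb_mnz(scores):
    """CombMNZ: CombSUM times the number of classifiers giving the
    class a non-zero confidence."""
    s = np.asarray(scores, dtype=float)
    return s.sum(axis=0) * (s > 0).sum(axis=0)

def comb_rank(scores):
    """CombRank: sum of per-classifier ranks (higher confidence gets
    a higher rank via a double argsort)."""
    s = np.asarray(scores, dtype=float)
    return s.argsort(axis=1).argsort(axis=1).sum(axis=0)

# Two classifiers scoring three genre classes.
sc = [[0.7, 0.3, 0.0],
      [0.4, 0.6, 0.0]]
best = int(np.argmax(comb_mnz(sc)))       # class picked by CombMNZ
```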
Experimental Setup
MediaEval 2012 Dataset - Tagging Task
• 14,838 episodes from 2,249 shows (~3,260 hours of data)
• split into development and test sets: 5,288 videos for development / 9,550 for test
• focuses on semi-professional video on the Internet
Experimental Setup
MediaEval 2012 Dataset
• 26 genre labels (IDs 1000-1025): art, autos_and_vehicles, business, citizen_journalism, comedy, conferences_and_other_events, default_category, documentary, educational, food_and_drink, gaming, health, literature, movies_and_television, music_and_entertainment, personal_or_auto-biographical, politics, religion, school_and_education, sports, technology, the_environment, the_mainstream_media, travel, videoblogging, web_development_and_sites.
Experimental Setup
• Mean Average Precision (MAP) summarizes rankings from multiple queries by averaging the per-query average precision
• Classifier parameters and late fusion weights were optimized on the training dataset
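The MAP measure can be computed as follows (illustrative; the construction of the per-query rankings is omitted, and the toy relevance lists are hypothetical):

```python
def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance holds 1/0 relevance of the
    retrieved results in ranked order."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / i        # precision at each relevant hit
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)

m = mean_average_precision([[1, 0, 1], [0, 1]])
```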
Evaluation
(1) Classification performance on individual modalities (MAP values)

Feature                | SVM Linear | SVM RBF | SVM Chi | 5-NN   | Random Forest | Ext. Random Forests
HoG                    | 9.08%      | 25.63%  | 22.44%  | 17.92% | 16.62%        | 23.44%
Bag of Words           | 14.63%     | 17.61%  | 19.96%  | 8.55%  | 14.89%        | 16.32%
MPEG-7                 | 6.12%      | 4.26%   | 17.49%  | 9.61%  | 20.90%        | 26.17%
Structural descriptors | 7.55%      | 17.17%  | 22.76%  | 8.65%  | 13.85%        | 14.85%
Audio descriptors      | 20.68%     | 24.52%  | 35.56%  | 18.31% | 34.41%        | 42.33%
TF-IDF on ASR          | 32.96%     | 35.05%  | 28.85%  | 12.96% | 30.56%        | 27.93%
TF-IDF on Metadata     | 56.33%     | 58.14%  | 47.95%  | 57.19% | 58.66%        | 57.52%
Visual performance:
- best performance with MPEG-7 (Ext. Random Forests) and HoG (SVM-RBF);
- Bag-of-Visual-Words does not perform very well.
Audio performance:
- best performance with Extremely Random Forests (42.33%);
- audio features provide higher discriminative power than visual features.
Text performance:
- best performance with metadata and Random Forests (58.66%);
- ASR provides lower performance than audio;
- metadata features outperform all other features.
Evaluation
(2) Performance on multimodal integration (MAP values)

Modality   | CombSUM | CombMean | CombMNZ | CombRank | Early Fusion
All Visual | 35.82%  | 36.76%   | 38.21%  | 30.90%   | 30.11%
All Audio  | 43.86%  | 44.19%   | 44.50%  | 41.81%   | 42.33%
All Text   | 62.62%  | 62.81%   | 62.69%  | 50.60%   | 55.68%
All        | 64.24%  | 65.61%   | 65.82%  | 53.84%   | 60.12%

Fusion techniques performance:
- late fusion provides higher performance than early fusion;
- CombMNZ tends to provide the most accurate results.
Evaluation
(3) Comparison to MediaEval 2012 Tagging Task results (MAP values)

Team         | Modality      | Method                                                                            | MAP
proposed     | all           | Late fusion CombMNZ with all descriptors                                          | 65.82%
proposed     | text          | Late fusion CombMean with TF-IDF of ASR and metadata                              | 62.81%
TUB          | text          | Naive Bayes with Bag of Words on text (metadata)                                  | 52.25%
proposed     | all           | Late fusion CombMNZ with all descriptors except metadata                          | 51.90%
proposed     | audio         | Late fusion CombMean with standard audio descriptors                              | 44.50%
proposed     | visual        | Late fusion CombMean with MPEG-7 related, structural, HoG and B-o-VW with rgbSIFT | 38.21%
ARF          | text          | SVM linear on early fusion of TF-IDF of ASR and metadata                          | 37.93%
TUD          | visual & text | Late fusion of SVM with B-o-W (visual words, ASR & metadata)                      | 35.81%
KIT          | visual        | SVM with visual descriptors (color, texture, B-o-VW with rgbSIFT)                 | 35.81%
TUD-MM       | text          | Dynamic Bayesian networks on text (ASR & metadata)                                | 25.00%
UNICAMP-UFMG | visual        | Late fusion (KNN, Naive Bayes, SVM, Random Forests) with BoW (text ASR)           | 21.12%
ARF          | audio         | SVM linear with block-based audio features                                        | 18.92%
Conclusions
> we provided an in-depth evaluation of truly multimodal video description in the context of a real-world genre-categorization scenario;
> we showed that late fusion can boost the performance of automated content descriptors to achieve performance close to that of metadata-based approaches;
> we demonstrated the potential of appropriate late fusion for genre categorization, achieving very high categorization performance;
> we set a new baseline for the Genre Tagging Task by outperforming all other participants;
Acknowledgement: we would like to thank Prof. Nicu Sebe and Dr. J. Uijlings from the University of Trento for their support.
We also acknowledge the 2012 Genre Tagging Task of the MediaEval Multimedia Benchmark for the dataset (http://www.multimediaeval.org/).
Thank you!
Questions?