The CERTH-UNITN Participation @ Verifying Multimedia Use 2015
Christina Boididou1, Symeon Papadopoulos1, Duc-Tien Dang-Nguyen2, Giulia Boato2, and Yiannis Kompatsiaris1
MediaEval 2015 Workshop, Sept 14-15, 2015, Wurzen, Germany
This task is supported by the REVEAL EC FP7 Project.
1Information Technologies Institute (ITI), CERTH, Greece 2University of Trento, Italy
Overview
Aim
• Predict whether a tweet that shares multimedia content is fake or real
Approach
• Use of tweet-based, user-based and forensics features
• Supervised learning (SL) scheme
• Semi-supervised learning scheme: agreement-based retraining technique (SSL-AR)
Features
Feature sets used in the experiments:
Feature Set Description
TB–base Baseline tweet-based
TB–ext Extended tweet-based
UB–base Baseline user-based
UB–ext Extended user-based
FOR Forensics
Types
• Tweet-based: information coming from the tweet and its metadata
• User-based: information and metadata about the user posting (or retweeting) the tweet
• Multimedia forensics: based on the image that accompanies the tweet
Sets
• Baseline (base) set: features provided by the task organizers
• Extended (ext) set: newly extracted features
• Forensics (FOR) set: features distributed by the task plus some additional ones
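To make the tweet-based features concrete, a minimal sketch of how a few of them could be extracted (the function name, slang list and regexes are illustrative assumptions, not the task's actual implementation):

```python
import re

# Illustrative slang lexicon; the actual list used in the task is not shown here
SLANG = {"lol", "omg", "wtf", "lmao"}

def tweet_features(text: str) -> dict:
    """Extract a few simple tweet-based features (hypothetical sketch)."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "contains_please": "please" in words,
        "has_external_link": bool(re.search(r"https?://", text)),
        "num_slang_words": sum(w in SLANG for w in words),
    }

features = tweet_features("OMG please retweet this http://example.com")
```

Each feature becomes one column of the classifier's input vector.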
Additional Features
Tweet-based
• Contains word "please"
• Has external link
• Number of slang words
• Number of nouns
• Readability2
• For the links: Web Of Trust (WOT) score, in-degree centrality3, harmonic centrality3, Alexa rankings
User-based
• Account age
• Number of media content
• Shares location
• Shares location that exists1
Forensics
• AJPG-BAG combined
• NAJPG-BAG combined
1Geonames dataset (http://download.geonames.org/export/)
2Flesch Reading Ease method, which computes the complexity of a piece of text as a score in the interval [0, 100]
3Common Crawl WWW Ranking (http://wwwranking.webdatacommons.org/more.html)
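The Readability feature (footnote 2) is based on the Flesch Reading Ease formula, 206.835 − 1.015·(words/sentences) − 84.6·(syllables/words). A rough sketch, with syllables approximated by vowel groups (a simplification; real implementations use proper syllable counting):

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Rough Flesch Reading Ease score; syllables approximated by vowel runs."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # crude syllable estimate: count runs of vowels in each word
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
```

Higher scores indicate easier text; the formula can fall slightly outside [0, 100] on degenerate input.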
Additional Forensics Features
[Diagram: the AJPG map is thresholded into a binary 'object' mask (BAG); 'object' and 'background' features are extracted and combined into the AJPG-BAG feature]
• NAJPG-BAG was combined in the same way from the NAJPG and BAG features.
Agreement-based retraining method
Goals
• Make the initial model adaptable to the test data
• Predict the values of the disagreed samples more accurately
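The retraining step can be sketched as follows: two classifiers trained on different feature views label the test set; test samples on which they agree are added to the training data with their predicted labels, the model is retrained, and only the disagreed samples are re-predicted. A simplified sketch using scikit-learn (function and variable names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def agreement_retraining(X1_tr, X2_tr, y_tr, X1_te, X2_te):
    """Simplified agreement-based retraining with two feature views."""
    cl1 = RandomForestClassifier(random_state=0).fit(X1_tr, y_tr)
    cl2 = RandomForestClassifier(random_state=0).fit(X2_tr, y_tr)
    p1, p2 = cl1.predict(X1_te), cl2.predict(X2_te)
    agree = p1 == p2
    # retrain CL1 on the original training data plus the agreed test samples
    cl1 = RandomForestClassifier(random_state=0).fit(
        np.vstack([X1_tr, X1_te[agree]]),
        np.concatenate([y_tr, p1[agree]]))
    out = p1.copy()
    if (~agree).any():  # re-predict only the disagreed samples
        out[~agree] = cl1.predict(X1_te[~agree])
    return out
```

Because the agreed test samples enlarge the training set, the method needs a reasonably sized test batch to be effective.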
Bagging
Training set sampling
• N = 9 samples
• Equal number of samples from each class
• Final prediction: average result of the N predictors
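The bagging scheme above could look like this (a sketch under stated assumptions: N = 9 class-balanced samples, each training one Random Forest, with the final label taken as the average of the predictors' votes; parameter names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bagging_predict(X_tr, y_tr, X_te, n_bags=9, seed=0):
    """Bagging sketch: 9 class-balanced bootstrap samples, averaged votes."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.where(y_tr == 0)[0], np.where(y_tr == 1)[0]
    m = min(len(idx0), len(idx1))  # equal number of samples per class
    votes = np.zeros(len(X_te))
    for _ in range(n_bags):
        # draw a balanced bootstrap sample and train one predictor on it
        sel = np.concatenate([rng.choice(idx0, m, replace=True),
                              rng.choice(idx1, m, replace=True)])
        clf = RandomForestClassifier(random_state=0).fit(X_tr[sel], y_tr[sel])
        votes += clf.predict(X_te)
    # average the N predictors' votes and threshold at 0.5
    return (votes / n_bags >= 0.5).astype(int)
```

Balancing each bag counters the class imbalance of the training set.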
Submitted Runs
Run Learning Features
RUN-1 SL TB-base
RUN-2 SL TB-base + FOR
RUN-3 SSL-AR (TB-base + FOR) + UB-base
RUN-4 SL TB-ext + UB-ext + FOR
RUN-5 SSL-AR (TB-ext + FOR) + UB-ext
• RUN-1, RUN-2 & RUN-4: plain classification model
• RUN-3 & RUN-5: agreement-based retraining technique
• Random Forest classifier used for all models
SL: Supervised Learning; SSL-AR: Semi-Supervised Learning with Agreement-based Retraining
Results
Runs Recall Precision F-score
RUN-1 0.794 0.733 0.762
RUN-2 0.749 0.994 0.854
RUN-3 0.922 0.736 0.819
RUN-4 0.798 0.860 0.828
RUN-5 0.969 0.861 0.911
A. RUN-5 achieved the best score
B. The SSL-AR technique substantially improves performance (cf. RUN-4 vs. RUN-5)
C. RUN-2 outperforms RUN-1 → contribution of the FOR features
D. RUN-3 vs. RUN-5 comparison → contribution of the ext features
Examples
Fake example classified as real
Fake example classified as fake
Conclusions / Future Work
Features
• ext features perform better than base ones
• FOR features improve performance
Agreement-based retraining technique
• improves accuracy
• adapts to the new data
• requires a sufficient number of test samples to be applicable
Future Ideas
• Experiment with other sets of features
• Perform feature selection
• Adapt the method to be applied with fewer samples
Questions
Thank you for your attention!