Post on 06-Jan-2018
Improved Video Categorization from Text Metadata and User Comments
ACM SIGIR 2011: Research and Development in Information Retrieval
Katja Filippova, Keith B. Hall
Presenter: Viraja Sameera Bandhakavi
Contributions
• Analyze sources of text information such as title, description, and comments, and show that they provide valuable indications of the topic
• Show that a text-based classifier trained on the imperfect predictions of a weakly supervised video content-based classifier is not redundant
• Demonstrate that a simple model combining the predictions of the two classifiers outperforms each of them taken independently
Research questions not answered by related work
• Can a classifier learn from the imperfect predictions of a weakly supervised classifier? Is its accuracy comparable to that of the original classifier? Can a combination of the two classifiers outperform either one?
• Do the video-based and text-based classifiers capture different semantics?
• How useful is user-provided text metadata? Which source is the most helpful?
• Can reliable predictions be made from user comments? Can comments improve the performance of the classifier?
Methodology
• Builds on top of the predictions of Video2Text
• Uses Video2Text:
– Requires no labeled data other than video metadata
– Clusters similar videos and generates a text label for each cluster
– The resulting label set is larger and better suited for categorization of video content on YouTube
Video2Text
• Starts from a set of weak labels based on the video metadata
• Creates a vocabulary of concepts (unigrams or bigrams from the video metadata)
• Every concept is associated with a binary classifier trained from a large set of audio and video signals
• Positive instances: videos that mention the concept in the metadata
• Negative instances: videos that do not mention the concept in the metadata
Procedure
• A binary classifier is trained for every concept in the vocabulary
– Accuracy is assessed on a portion of a validation dataset
– Each iteration uses a subset of unseen videos from the validation set
– The classifier and concept are retained if precision and recall are above a threshold (0.7 in this paper)
• The remaining classifiers are used to update the feature vectors of all videos
• Repeated until the vocabulary size doesn't change much or the maximum number of iterations is reached
• Finer-grained concepts are learned from concepts added in the previous iteration
• Labels related to news, sports, film, etc. are grouped together, resulting in the final set of 75 two-level categories
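The retention rule and stopping condition above can be sketched in a few lines of Python. This is only an illustration: the function names, the `min_change` tolerance, and the use of `>=` for "above a threshold" are assumptions, not details given in the slides.

```python
from typing import Dict, Tuple

# Hypothetical representation: concept -> (precision, recall) measured on
# a held-out portion of the validation set.
Metrics = Dict[str, Tuple[float, float]]


def retain_concepts(metrics: Metrics, threshold: float = 0.7) -> Metrics:
    """Keep concepts whose classifier reaches the threshold on both metrics.

    The slides say "above a threshold"; whether the comparison is strict
    is not stated, so >= is an assumption here.
    """
    return {
        concept: (precision, recall)
        for concept, (precision, recall) in metrics.items()
        if precision >= threshold and recall >= threshold
    }


def should_stop(prev_size: int, new_size: int, iteration: int,
                max_iterations: int, min_change: int = 0) -> bool:
    """Stop when the vocabulary size barely changes or iterations run out.

    min_change is a hypothetical tolerance for "doesn't change much".
    """
    return iteration >= max_iterations or abs(new_size - prev_size) <= min_change
```

For example, a concept with precision 0.75 but recall 0.6 would be dropped, because both metrics have to clear the threshold.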
Categorization with Video2Text
• Use Video2Text to assign two-level categories to videos
• Total number of binary classifiers (hence labels) limited to 75
• Output of Video2Text represented as a list of strings: (v_i, c_j, s_ij)
Distributed MaxEnt
• Approach automatically generates training examples for the category classifier
• Uses the conditional maximum entropy optimization criterion to train the classifiers
• Results in a conditional probability model over the classes given the YouTube videos.
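A conditional maximum entropy model has the same form as multinomial logistic regression: each category gets a weighted sum of the active features, and a softmax turns those sums into a distribution over categories. The sketch below shows only that prediction step, assuming binary token features; the function name and the weight layout are hypothetical, and the distributed training itself is not shown.

```python
import math
from typing import Dict, Iterable


def maxent_probs(features: Iterable[str],
                 weights: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return p(category | video) for a video described by binary features.

    weights maps category -> {feature: weight}; this layout is assumed
    for illustration only.
    """
    feats = list(features)
    scores = {
        category: sum(w.get(f, 0.0) for f in feats)
        for category, w in weights.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {category: math.exp(s - top) for category, s in scores.items()}
    z = sum(exps.values())
    return {category: e / z for category, e in exps.items()}
```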
Data and Models
• Text models differ in the text sources from which the features are extracted: title, description, comments, etc.
• All features are token-based
• Infrequent tokens are filtered out to reduce the feature space
• Token frequencies are calculated over 150K videos
• Every unique token is counted once per video
• A token frequency threshold of 10 is used
• Tokens are prefixed with the first letter of the source in which they were found, e.g. T:xbox, D:xbox, U:xbox, C:xbox, etc.
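The feature extraction described above could look roughly like the following Python sketch. Whitespace tokenization, lowercasing, and counting frequencies on the prefixed tokens (rather than the raw ones) are assumptions; the slides only specify the source prefix, the once-per-video counting, and the threshold of 10.

```python
from collections import Counter
from typing import Dict, Iterable, Set

# Hypothetical input format: source prefix ("T", "D", "U", "C", ...) -> raw text.
Video = Dict[str, str]


def prefixed_tokens(video: Video) -> Set[str]:
    """Return the unique source-prefixed tokens of one video, e.g. 'T:xbox'."""
    tokens = set()
    for prefix, text in video.items():
        for token in text.lower().split():
            tokens.add(f"{prefix}:{token}")
    return tokens


def build_vocabulary(videos: Iterable[Video], min_videos: int = 10) -> Set[str]:
    """Keep tokens that occur in at least min_videos videos."""
    counts = Counter()
    for video in videos:
        # prefixed_tokens returns a set, so each token counts once per video.
        counts.update(prefixed_tokens(video))
    return {token for token, n in counts.items() if n >= min_videos}
```

Because of the prefix, the same word in the title and in the description yields two distinct features, so the model can weight each source separately.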
Combined Classifier
• Used to see whether combining the two views, video-based and text-based, is beneficial
• A simple meta classifier is used, which ranks the video categories based on the predictions of the two classifiers
• Video-based predictions are converted to a probability distribution
• The distribution from the video-based prediction and the one from MaxEnt (the maximum entropy classifier) are multiplied
• This approach proved to be effective
• Idea: each classifier has veto power
• The final prediction for each video is the category with the highest product score
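A minimal Python sketch of this product rule follows. The slides do not say how video scores are turned into a distribution, so dividing by the sum of the scores is an assumption, as are the function names.

```python
from typing import Dict


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn non-negative scores into a probability distribution.

    Dividing by the sum is an assumed conversion, not the one from the paper.
    """
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {category: s / total for category, s in scores.items()}


def combine(video_scores: Dict[str, float],
            text_probs: Dict[str, float]) -> str:
    """Return the category with the highest product of the two probabilities."""
    video_probs = normalize(video_scores)
    categories = set(video_probs) | set(text_probs)
    products = {
        # A category missing from either model gets probability 0,
        # which is how one classifier can veto the other.
        c: video_probs.get(c, 0.0) * text_probs.get(c, 0.0)
        for c in categories
    }
    return max(products, key=products.get)
```

With video scores of 0.9 for sports and 0.6 for news, and text probabilities of 0.2 and 0.7, the products are 0.12 and 0.28, so the combined prediction is news even though the video model alone preferred sports.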
Experiments - Evaluation of Text Models
• Training data set containing 100K videos that received a high-scoring prediction
• Correct prediction – score of at least 0.85 from Video2Text
• Text-based prediction must be in the set of video-assigned categories
• Evaluation was done on two sets of videos:
– Videos with at least one comment
– Videos with at least 10 comments
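One possible reading of this evaluation, sketched in Python: a text-based prediction counts as correct when it falls in the set of categories that Video2Text assigns to the same video with a score of at least 0.85. The (video, category, score) triple layout and the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical triple layout: (video id, category, Video2Text score).
Triple = Tuple[str, str, float]


def video_assigned(triples: Iterable[Triple],
                   min_score: float = 0.85) -> Dict[str, Set[str]]:
    """Collect, per video, the categories scored at least min_score."""
    assigned: Dict[str, Set[str]] = {}
    for video_id, category, score in triples:
        if score >= min_score:
            assigned.setdefault(video_id, set()).add(category)
    return assigned


def text_accuracy(text_predictions: Dict[str, str],
                  assigned: Dict[str, Set[str]]) -> float:
    """Fraction of evaluated videos whose text prediction is video-assigned."""
    evaluated = [v for v in text_predictions if v in assigned]
    if not evaluated:
        return 0.0
    hits = sum(text_predictions[v] in assigned[v] for v in evaluated)
    return hits / len(evaluated)
```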
Experiments - Evaluation of Text Models Contd…
• The best model is TDU+YT+C for both sets
• This model is used for comparison against the Video2Text model with human raters
• This model is also used in the combination model
Experiments with Human Raters
• A total of 750 videos are extracted equally from the 15 YouTube categories
• A human rater rates each (video, category) pair as fully correct (3), partially correct (2), somewhat related (1) or off topic (0)
• Every pair is rated by 3 human raters
• The three ratings are summed, normalized (by dividing by 9) and rounded off to get the resulting score
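The scoring arithmetic can be written as a short Python sketch. The function names are hypothetical, the rounding step is left out because its precision is not stated, and the 0.5 cut-off is the one given on the follow-up slide.

```python
from typing import Sequence


def rating_score(ratings: Sequence[int]) -> float:
    """Sum three ratings in {0, 1, 2, 3} and normalize by the maximum of 9."""
    if len(ratings) != 3 or any(r not in (0, 1, 2, 3) for r in ratings):
        raise ValueError("expected exactly three ratings between 0 and 3")
    return sum(ratings) / 9


def is_correct(ratings: Sequence[int], threshold: float = 0.5) -> bool:
    """A category counts as correct when the score is at least the threshold."""
    return rating_score(ratings) >= threshold
```

For instance, ratings of 3, 2 and 0 sum to 5, giving a score of 5/9 (about 0.56), which passes the 0.5 cut-off; ratings of 2, 1 and 1 give 4/9 (about 0.44), which does not.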
Experiments with Human Raters Contd…
• Score of at least 0.5 – correct category
• The text-based model performs significantly better than the video model
• The combination model improved accuracy
• Accuracy of all models increases with the number of comments
Conclusion
• Text-based approach for assigning categories to videos
• Competitive classifier trained on high-scoring predictions made by a weakly supervised classifier (video features)
• Text and video models provide complementary views on the data
• Simple combination model outperforms each model on its own
• Accurate predictions from user comments
• Reasons for the impact of comments:
– Substitute for a proper title
– Disambiguate the category
– Help correct wrong predictions
• Future work: investigate the usefulness of user comments for other tasks