© 2007 IBM Corporation
From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts
Rong Yan
IBM T. J. Watson Research Center, Hawthorne, NY 10532 USA
Email: [email protected]
The growth in video search has the potential to benefit both enterprise and consumer segments across the world
[Chart: Growth in Online Video (U.S.): Video Streams Served (B) and Online Video Advertising Spending ($B), 2006-2010]
Sources: eMarketer Research, Veronis Suhler Stevenson Research, AccuStream Market Research
Though there are numerous video-search options, none of them has yet proven reliable and accurate
Google Video — “Basketball”
Scope of results
– Does not broadly search the Web
– Does not search inside video
– Cannot distinguish matches showing basketball
Favors Google silo
– YouTube videos prominent

YouTube — “Basketball”
Scope of results
– Similar results to Google Video
Favors own silo
Video quality mixed
– User-generated content and user-provided video

SearchVideo (AOL) — “Basketball”
Scope of results
– Top matches all related to Imus comments
– Again limited by inability to detect basketball scenes
Favors own silo
– Preference for AOL and AOL partner content

SearchVideo (Blinkx) — “Basketball”
Scope of results
– 214,000 matches related to “basketball”
– No way to limit results to relevant scenes showing basketball games
Favors own silo
– Results limited to partner content
Issues of Current Video Search Systems
• Rely on text metadata and/or manual tags, which are unavailable for much of the video out there
• Unable to search inside video clips, since metadata is typically attached only at the clip level
Concept-based Video Search
An exciting new direction
– Visual indexing with semantic concept detection
– (Semi-)automatically produce frame-level indexing using statistical learning techniques
– Search by text keywords even when no text metadata is available
(Courtesy of Lyndon Kennedy)
Thousands of video concepts are required to produce good performance for concept-based video retrieval
Need ~3,000 video concepts to achieve performance comparable to web (text) retrieval
– Extrapolated from search results on 3 standard large-scale video collections
– Concept detection accuracy and combination strategies are calibrated against state-of-the-art results
– Details: [Hauptmann, Yan and Lin]
Challenges: Efficient (and Effective) Approaches to Detect Thousands of Video Concepts are yet to be Developed
Case study: TRECVID’05-’07
– 39 video concepts are defined
– A baseline SVM classifier takes ~7 days to train on 100,000 frames for 39 concepts using a 2.16GHz dual-core CPU
– Generating predictions on 1 million test frames for 39 concepts takes ~3.5 days
– Processing at 100 frames per second for 39 concepts would require 30 machines
TREC Concepts (by category)
– Program: Weather, Entertainment, Sports
– Location: Office, Meeting, Studio, Outdoor, Road, Sky, Snow, Urban, Water, Mountain, Desert, Building, Plants, Court
– People: Crowd, Face, Person; Roles: G. Leader, C. Leader, Police, Prisoner, Military
– Objects: Flag-US, Animal, Screen, Vehicle (Airplane, Car, Boat, Bus, Truck)
– Activities: People Walk, March
– Events: Explosion/Fire, Natural Disaster
– Graphics: Maps, Chart
New Approaches for a Wide Spectrum of Video Concepts
[Figure: concept types arranged by learnability across digital items]

Domain-Independent Concepts
– Concepts that can be learned across multiple domains
– Size: tens; examples: Sky, Urban, Night
– Approach (automatic): Model-shared Subspace Boosting

Domain-Dependent Concepts
– Concepts that can be learned on some specific domains
– Size: hundreds; examples: Anchor, Basketball
– Approach (semi-automatic): Cross-domain Concept Adaptation

Out-of-Domain Concepts
– Concepts that are difficult to learn with low-level features
– Size: thousands; examples: Paris, Grandma
– Approach (semi-manual): Learning-based Hybrid Manual Annotation
Roadmap
Motivation and Challenges: Why Efficiency?
(Automatic) Model-shared Subspace Boosting [KDD’07]
(Semi-automatic) Cross-domain Concept Adaptation [MM’07]
(Semi-manual) Learning-based Hybrid Annotation [Submitted]
Conclusions
Prior Art on Automatic Concept Detection
Standard multi-label classification [City U., 07] [IBM, 07] [Tsinghua, 07]
– Need to learn an independent classifier for every possible label using all the data examples and the entire feature space.
Other image annotation methods [Snoek et al., 05] [Torralba et al., 04]
– No mechanism to reduce redundancy among labels beyond exploiting multi-label relations
Multi-task learning [Ando and Zhang, 05] [Caruana, 97] [Zhang et al., 05]
– Treats each label as a single task and handles the tasks in an iterative process
– Requires complex and inefficient inference to estimate the task parameters
Related Work: Random Subspace Bagging
Improves computational efficiency by removing redundancy in both the data space and the feature space (a minimal sketch follows below)
1. For each concept, select a number of bags of training examples, where each bag is randomly sampled from both the training data and the feature space
2. Learn a base model on each bag of training examples using an arbitrary learning algorithm
3. Add the base models into a composite classifier
Also known as asymmetric bagging with random subspaces, or random forests (with decision trees)
[Figure: training examples sampled along both the example and feature dimensions; base models M1, M2 combined into composite classifiers]
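To make the procedure concrete, here is a minimal sketch of RSBag in Python, assuming scikit-learn SVMs as base models; the function names, sampling ratios, and score-averaging aggregation are illustrative choices, not the exact setup from the talk.

# A minimal sketch of random subspace bagging (RSBag); model type,
# sampling ratios, and names are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

def train_rsbag(X, y, n_bags=10, data_ratio=0.2, feat_ratio=0.2, seed=0):
    rng = np.random.RandomState(seed)
    n, d = X.shape
    models = []
    for _ in range(n_bags):
        rows = rng.choice(n, size=int(data_ratio * n), replace=True)   # bootstrap examples
        cols = rng.choice(d, size=int(feat_ratio * d), replace=False)  # random feature subspace
        m = SVC(kernel='rbf', probability=True).fit(X[rows][:, cols], y[rows])
        models.append((m, cols))
    return models

def predict_rsbag(models, X):
    # Average the base-model scores to form the composite classifier
    return np.mean([m.predict_proba(X[:, cols])[:, 1] for m, cols in models], axis=0)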
Missing Pieces for Random Subspace Bagging
RSBag learns classifiers for each concept separately, and thus cannot reduce information redundancy across multiple concepts
– It should be possible to share and re-use some base models across different concepts
[Figure: base models M1, M2 learned independently for Label 1 (Car) and Label 2 (Road); shared base models could be re-used across both]
Model-shared subspace boosting [with J. Tesic and J. Smith]
Model-shared subspace boosting (MSSBoost) iteratively finds the most useful subspace base models, shares them across concepts, and combines them into a composite classifier for each concept.
– MSSBoost follows the formulation of LogitBoost [Friedman et al., 1998]
– The base models are learned from bootstrapped data samples in randomly selected feature subspaces, and can be trained with any learning algorithm
– The classifier for each concept is an ensemble of multiple base models
– The base models are shared across multiple concepts, so that the same base models can be re-used in different decision functions.
MSSBoost Algorithm: Overview
Step 1 (model initialization)
– initialize a number of base models for each label, where each base model is learned on a label using random subspace and data bootstrapping
Step 2 (iterative update)
1. Search the model pool for the optimal base model and its weight by minimizing a "joint logistic loss function" over all the concepts
2. Update the classifier of every concept by sharing and combining the selected model
3. Replace this model by a new subspace model learned on the same concept
[Figure: a pool of subspace base models shared across composite classifiers F1, F2, F3 for labels L1, L2, L3]

Selection step (joint logistic loss over all concepts):
$\min_{h^t,\,\alpha} \sum_{l}\sum_{i} \log\left(1 + \exp\left(-y_{il}\left(F_l^t(x_i) + \alpha_l^t h^t(x_i)\right)\right)\right)$

Update step:
$F_l^{t+1}(x) = F_l^t(x) + \alpha_l^t h^t(x), \quad l = 1, \ldots, L$
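A compact sketch of the selection-and-sharing loop above, assuming precomputed base-model score vectors and a crude grid line search for the per-label weights; the replacement policy and helper names are illustrative simplifications, not the paper's exact algorithm.

# Sketch of the MSSBoost iterative update; pool entries, line search,
# and variable names are illustrative assumptions.
import numpy as np

def logistic_loss(F, Y):
    # Y in {-1, +1}; joint logistic loss, numerically stable
    return np.sum(np.logaddexp(0.0, -Y * F))

def mssboost(pool, Y, n_rounds=50, alphas=np.linspace(0.0, 2.0, 21)):
    """pool: list of base-model score vectors, each shape (n,);
    Y: {-1,+1} label matrix of shape (n, L)."""
    n, L = Y.shape
    F = np.zeros((n, L))                       # composite classifiers, one per label
    for _ in range(n_rounds):
        best = None
        for h in pool:                         # search the shared model pool
            # per-label grid search for this model's weight alpha_l
            a = np.array([min(alphas, key=lambda a_: logistic_loss(F[:, l] + a_ * h, Y[:, l]))
                          for l in range(L)])
            loss = logistic_loss(F + h[:, None] * a[None, :], Y)
            if best is None or loss < best[0]:
                best = (loss, h, a)
        _, h, a = best
        F += h[:, None] * a[None, :]           # share the selected model across all labels
        # (the full algorithm would now replace the selected model with a
        #  freshly trained subspace model on the same concept)
    return F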
Experiments
Two large-scale image/video collections
– TRECVID’05 sub-collection: 6,525 keyframes with 39 concepts
– Consumer collection: 8,390 images with 33 concepts
Low-level visual features
– 166-dimensional color correlogram
– 81-dimensional color moments
– 96-dimensional co-occurrence texture
RBF-kernel support vector machines as base models for MSSBoost
75%-25% training-testing split
Concept Detection Performance
MSSBoost outperforms baseline SVMs using a small number of base models (100 for 39 labels) with a small data/feature sampling ratio (~0.1)
MSSBoost consistently outperforms RSBag and non-sharing boosting (NSBoost)
– e.g., the number of models needed to reach 90% of the baseline MAP is only 60% of that of RSBag / NSBoost
[Figures: detection performance on the TRECVID and Consumer collections]
Concept Detection Efficiency
MSSBoost vs. baseline SVMs (with the same classification performance)
– 60-fold / 170-fold speedup on training and 20-fold / 25-fold speedup on testing
[Figures: training and testing time, baseline SVMs vs. MSSBoost at 100% of baseline performance]
– TREC: baseline 5898 sec vs. MSSBoost 94 sec
– Photo: baseline 5158 sec vs. MSSBoost 31 sec
Roadmap
Motivation and Challenges: Why Efficiency?
(Automatic) Model-shared Subspace Boosting [KDD’07]
– Automatically exploit information redundancy across the concepts
(Semi-automatic) Cross-domain Concept Adaptation [MM’07]
(Semi-manual) Learning-based Hybrid Annotation [Submitted]
Conclusions
Cross-domain Concept Detection
Adapt concept classifiers from one domain to other domains
– Domains can be genres, data sources, programs, e.g., “CNN”, “CCTV”
– Adapt from auxiliary dataset(s) to a target dataset
Adaptation is more critical for video (than text)
– Bigger semantic gap, e.g., “tennis”
– More sensitive to domain change
– e.g., average precision of “anchor” drops from 0.9 on TREC’04 to 0.5 on TREC’05
Prior Art on Cross-Domain Detection
Data-level adaptation [Wu et al., 04] [Liao et al., 05] [Dai et al., 07]
– Combine auxiliary and target data for training a new classifier
– Computationally expensive due to the large amount of training data
Parametric-level adaptation [Marx et al., 05] [Raina et al., 06] [Zhang et al., 06]
– Use the model parameters of auxiliary data as prior distribution
– Models must be parametric and of the same type
Incremental Learning [Syed et al., 99] [Cauwenberghs and Poggio, 00]
– Continuously update models with subsets of data
– Assume the same underlying distributions without any domain changes
Sample bias correction, concept drift, speaker adaptation...
Function-level Adaptation [with J. Yang and A. Hauptmann]
Function-level adaptation: modifies the decision function of old models
– Flexibility: the auxiliary classifier can be a “black-box” classifier of any type
– Efficiency: the auxiliary data is NOT involved in training
– Applicability: works even when the auxiliary data is no longer accessible
adapted classifier = auxiliary classifier + delta function
[Figure: the auxiliary classifier is trained on auxiliary data; the delta function is learned from the target data]
Learning “Delta Function”: Risk Minimization
General framework: regularized empirical loss minimization over two terms (a plausible formulation is sketched below):
(1) classification errors, measured by any loss function L(y, x), and
(2) the complexity (norm) of Δf(x), which equals the distance between the auxiliary and adapted classifiers in function space.
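Based on the two terms just described, the objective plausibly takes the following regularized form; this is a reconstruction from the surrounding text, not necessarily the paper's exact notation, with λ trading off the two terms:

$$\min_{\Delta f}\; \sum_{i=1}^{n} L\!\left(y_i,\; f^{a}(x_i) + \Delta f(x_i)\right) \;+\; \lambda\, \lVert \Delta f \rVert^{2}, \qquad f(x) = f^{a}(x) + \Delta f(x)$$

where f^a is the auxiliary classifier and the sum runs over the labeled target examples.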
Illustration of function-level adaptation
Intuition: seek the new classification boundary that (1) is close to the original boundary and (2) can correctly classify the labeled examples
– A cost factor C determines the contribution of the auxiliary classifier
[Figure: decision boundaries on auxiliary data vs. target data; the adapted boundary stays close to the auxiliary boundary while correctly classifying the labeled target examples]
Adaptive SVMs
Adaptive SVMs: a special case of adaptation with the hinge loss function
A quadratic programming (QP) problem solved by a modified sequential minimal optimization (SMO) algorithm
Training cost: similar to standard SVMs apart from the one-time cost of computing the auxiliary predictions
Adapted classifier: see the formulation below
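The adapted-classifier formula on the original slide did not survive extraction; based on the function-level framework above, it plausibly takes the usual kernel-expansion form (a reconstruction, with α_i the dual variables of the QP and x_i the labeled target examples):

$$f(x) = f^{a}(x) + \Delta f(x) = f^{a}(x) + \sum_{i} \alpha_i\, y_i\, K(x_i, x)$$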
Experiments
TREC Video Retrieval Evaluation (TRECVID) 2005
– 74,523 video shots, 39 labels, 13 programs from 6 channels
– Adapt concepts learned from one program to another program
Name                                     Training data                              Algorithm
Our approach:
  Adapted classifier (Adapt)             Target-Prog (labeled)                      Adaptive SVMs
Baseline approaches:
  Auxiliary classifier (Aux)             Aux-Prog                                   SVMs
  Target classifier (Target)             Target-Prog (labeled)                      SVMs
Competing approaches:
  Aggregation classifier (Aggr)          Aux-Prog + Target-Prog (“early fusion”)    SVMs
  Ensemble classifier (Ensemble)         Aux-Prog + Target-Prog (“late fusion”)     SVMs
Cross-Domain Detection Performance
Average Precision: Adapt > Aggr ≈ Ensemble > Aux > Target
– Using knowledge from the auxiliary data almost always helps in this setting
More classification results in the paper [MM’07]
Cross-Domain Detection Efficiency
Total training time for 39 concepts and 13 programs
Training cost: Target = Ensemble < Adapt << Aggr
Adaptive SVMs achieve good tradeoff between concept detection effectiveness and efficiency
Roadmap
Motivation and Challenges: Why Efficiency?
(Automatic) Model-shared Subspace Boosting [KDD’07]
– Automatically exploit information redundancy across the concepts
(Semi-automatic) Cross-domain Concept Adaptation [MM’07]
– Function-level adaptation with high efficiency and flexibility
(Semi-manual) Learning-based Hybrid Annotation [Submitted]
Conclusions
Manual Concept Annotation
Limitations of automatic annotation
– Needs sufficient training data
– Sometimes hard to learn from low-level visual features
Popularity of manual annotation
– High annotation quality and social bookmarking functionality
– But: labor-intensive and time-consuming
– And subject to the “vocabulary mismatch” problem
[Example: Flickr tags for “Book”]
How about speeding up manual annotation? Let users drive, but have computers suggest the right words / images / interface to annotate.
Related Work on Efficient Manual Annotation
Active learning: maximizing automatic annotation accuracy with a minimal amount of manual annotation effort
– Aims to optimize learning performance, rather than annotation time
– Most images are annotated automatically (and thus inaccurately), with quality depending heavily on the underlying low-level features
– Users are asked to annotate the most “ambiguous” images, leading to a poor user experience
Leveraging other modalities: e.g., speech recognition, semantic network, time/location
– Require support from other information sources
Challenges and Proposed Work
Challenges on investigating manual annotation
– No formal time models exist for manual annotation
– Studying it requires large-scale user studies, which entail a time-consuming annotation process and high user variance
– Existing work provides no guidance on developing better manual annotation approaches
Proposed work
– Formal time models for two annotation approaches: tagging / browsing
– A much more efficient annotation approach based on these models
Manual Annotation (I) : Tagging
Allow users to associate a single image at a time with one or more keywords (the most widely used manual annotation approach)
Advantages
– Freely choose arbitrary keywords to annotate
– Only need to annotate relevant keywords
Disadvantages
– “Vocabulary mismatch” problem
– Inefficient to design and type keywords
Suitable for annotating rare keywords
Formal Time Model for Tagging
Annotation time for one image:
– Factors: number of keywords K, time t′_fk for the kth word, setup time t′_s for a new image
Total expected annotation time for an image collection
– Assumption: the expected time to annotate the kth word is a constant t_f
User study on TRECVID’05 development data
– Manually tag 100 images using 303 keywords
– If the model is correct, the results should show a linear fit
– The annotation results fit the model very well: t_f = 6.8 sec, t_s = 5.6 sec; e.g., under this model, tagging an image with 3 keywords takes about 3 × 6.8 + 5.6 ≈ 26 sec
Per-image time: $T = \sum_{k=1}^{K} t'_{fk} + t'_s$
Collection total: $E[T_{\mathrm{total}}] = \sum_l \left( E\big[\textstyle\sum_k t'_{fk}\big] + E[t'_s] \right) = \sum_l \left( K_l\, t_f + t_s \right)$, where l indexes images
Manual Annotation (II) : Browsing
Allow users to associate multiple images with a single word at the same time
Advantages
– Efficient to annotate each pair of images / words
– No “vocabulary mismatch”
Disadvantages
– Users must judge both relevant and irrelevant pairs
– Must start from a controlled vocabulary
Suitable for annotating frequent keywords
Formal Time Model for Browsing
Annotation time for all images w.r.t. a keyword:
– Factors: number of relevant images L_k, annotation time t′_p (t′_n) for a relevant (irrelevant) image
Total expected annotation time for an image collection
– Assumption: the expected time to annotate a relevant (irrelevant) image is a constant t_p (t_n)
User study on TRECVID’05 development data
– Three users manually browsed images for 15 minutes (covering 25 keywords)
– If the model is correct, the results should show a linear fit
– The annotation results fit the model for all users; on average, t_p = 1.4 sec, t_n = 0.2 sec
Per-keyword time: $T = \sum_{l=1}^{L_k} t'_{pl} + \sum_{l=1}^{L - L_k} t'_{nl}$
Collection total: $E[T_{\mathrm{total}}] = \sum_k \left( L_k\, t_p + (L - L_k)\, t_n \right)$, where L is the collection size
Learning-based Hybrid Annotation [with A. P. Natsev and M. Campbell]
Combine both tagging and browsing interfaces to optimize the annotation time for manually annotating the image/video collections
– Formally model the annotation time as functions of word frequency, time per word, and annotation interfaces
– Learn the visual patterns of existing annotations on the fly
– Automatically suggest the right images, keyword, and annotation interface (tagging vs. browsing) to minimize overall annotation time (a decision sketch follows below)
– Combine the advantages of both tagging and browsing
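The two time models above imply a simple per-keyword interface choice. A minimal sketch, using the fitted constants from the user studies; the decision rule and function names are an illustrative simplification of the hybrid strategy, not the paper's exact algorithm.

# Sketch of choosing tagging vs. browsing per keyword, using the fitted
# time-model constants; a simplification of the hybrid strategy.
T_F, T_S = 6.8, 5.6   # tagging: sec per keyword, setup per image
T_P, T_N = 1.4, 0.2   # browsing: sec per relevant / irrelevant image

def tagging_time(n_relevant):
    # Tagging touches only relevant images: type the keyword on each
    # (the setup time T_S is shared across keywords and omitted here)
    return n_relevant * T_F

def browsing_time(n_relevant, n_images):
    # Browsing must judge every image in the collection for this keyword
    return n_relevant * T_P + (n_images - n_relevant) * T_N

def choose_interface(n_relevant, n_images):
    tag, browse = tagging_time(n_relevant), browsing_time(n_relevant, n_images)
    return ('browse', browse) if browse < tag else ('tag', tag)

# Frequent keywords favor browsing; rare ones favor tagging:
print(choose_interface(n_relevant=500, n_images=6525))   # ('browse', 1905.0)
print(choose_interface(n_relevant=5, n_images=6525))     # ('tag', 34.0)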
An Illustrative Example for Hybrid Annotation
Users start the annotation process in the tagging interface
– No limitation on keywords
(Automatically) switch to the browsing interface to annotate a set of selected images
– Images predicted as relevant to a given keyword with high confidence
– Much less annotation time per image, since the same keyword need not be re-typed
Switch back to the tagging interface when necessary
[Figure: switching between the tagging and browsing interfaces]
Simulation Results
Results on two large-scale collections: TRECVID and Corel
– More accurate than automatic annotation (100% accurate)
– More efficient than tagging / browsing annotation (2-fold speedup)
– More effective than tagging / browsing in a given amount of time
[Figures: simulation results on the TRECVID and Corel collections]
Empirical Results
A user spent 1 hour annotating 10 TRECVID videos with each of tagging, browsing, and hybrid annotation
– The proposed time models correctly estimate the true annotation time
– Hybrid annotation provides much better annotation results
Estimated vs. true annotation time:
Method   Estimate   True
Tag      3649 s     3600 s
Browse   3603 s     3608 s
Hybrid   3478 s     3601 s
[Figure: empirical annotation performance]
Conclusions: Efficient Approaches for Learning Large-Scale Video Concepts
Automatic: Model-shared Subspace Boosting
– Automatically exploit information redundancy across concepts
– Orders of magnitude speedup on both training and testing process
Semi-automatic: Cross-domain Concept Adaptation
– Function-level adaptation with high efficiency and flexibility
– Fast cross-domain model update with a limited number of training examples
Semi-manual: Learning-based Hybrid Annotation
– Optimize overall annotation time using formal annotation time models
– Significantly faster than simple tagging or browsing, with accurate annotation
Thank You!
Backup
Properties of MSSBoost
The combination weights α_t are computed with adaptive Newton steps, minimizing the joint logistic loss function [Proposition 1]
The learning process is guaranteed to converge after a limited number of steps under some general conditions [Theorem 3]
Computational complexity can be considerably reduced by using small sampling ratios and sharing base models across labels
– e.g., 100 base models with a 20% data/feature sampling ratio for 40 concepts
– Achieves a 50-fold speedup for training and a 10-fold speedup for testing
$\alpha_l = (H^T W_l H)^{-1} H^T W_l Z_l$, where
$H = (h^t(x_1), \ldots, h^t(x_N))^T$, $p_{il} = 1 / (1 + e^{-F_l^t(x_i)})$, $z_{il} = (y_{il} - p_{il}) / (p_{il}(1 - p_{il}))$,
$W_l = \mathrm{diag}(p_{1l}(1 - p_{1l}), \ldots, p_{Nl}(1 - p_{Nl}))$, $Z_l = (z_{1l}, \ldots, z_{Nl})^T$
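A small sketch of this Newton step for one label; since h is a single base model's score vector, the weighted least-squares solve reduces to a scalar, as in the standard LogitBoost update. Function and variable names are illustrative.

# Sketch of the adaptive Newton step for one label's combination weight,
# matching the LogitBoost-style formula above.
import numpy as np

def newton_weight(h, F_l, y_l):
    """h: base-model scores (N,); F_l: current ensemble scores (N,);
    y_l: labels in {0, 1} (N,). Returns the Newton-optimal weight alpha_l."""
    p = 1.0 / (1.0 + np.exp(-F_l))           # current probability estimates
    w = p * (1.0 - p)                         # Newton weights
    z = (y_l - p) / np.maximum(w, 1e-12)      # working responses, guarded
    # alpha_l = (h^T W h)^{-1} h^T W z for a single base model h
    return np.sum(w * h * z) / np.maximum(np.sum(w * h * h), 1e-12)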
Problem Formulation
Adapt classifiers trained on auxiliary datasets to a target dataset
– Assumption 1: target data follows a different but related distribution
– Assumption 2: a limited number of target examples is additionally collected
[Figure: bias-variance tradeoff. The auxiliary classifier, trained on auxiliary data and applied to target data, is biased; a new classifier trained only on the limited target data has high variance; the adapted classifier balances the two.]
Example: Synthetic Data Examples
2-D data examples
1,000 data points with 3 labels, shown with the optimal decision boundary
Example: Results of Random Subspace Bagging
Random Subspace Bagging
– Base model: decision stump (1-level tree)
– 8 base models
RSBag cannot model the decision boundary well with such a small number of base models
Example: Results of MSSBoost
Model-Shared Subspace Boosting
– 8 base models
MSSBoost can model the decision boundary much better than its non-shared counterpart with the same number of base models
In other words, it can improve classification efficiency without hurting performance