From Image Analysis to Content Extraction:
Are We There Yet?
Tsuhan Chen
Carnegie Mellon University
Pittsburgh, [email protected]
A Journey of 10+ Years
• Multimedia Signal Processing (MMSP) Technical Committee
– Founding Chair 1996~1999
• MMSP Workshops
– Princeton 1997, Los Angeles 1998, Copenhagen 1999, Cannes 2001, St. Thomas 2002, Siena 2004, Shanghai 2005, Victoria 2006…
• IEEE Transactions on Multimedia
– Editor-in-Chief: 2002~2004
• International Conference on Multimedia and Expo (ICME)
– New York 2000, Tokyo 2001, Lausanne 2002, Baltimore 2003, Taipei 2004, Amsterdam 2005, Toronto 2006, Beijing 2007…
• IEEE Fellow, 2007~, “…multimedia signal processing”
• IEEE Distinguished Lecturer, 2007~2008
Signal vs. Content
[Baker and Kanade]
What is “content”?
Number of all possible 16×12 images $= 2^{16 \times 12 \times 8} \gg 30 \times 60 \times 60 \times 24 \times 365 \times \text{human history} \times \text{world population}$
“Content” is based on signals, i.e., prior, statistics, data-driven…
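The size of that gap can be checked with a few lines of integer arithmetic. This is only a rough sketch: reading the factor 30 as frames per second is an assumption, and the values for the length of human history and for world population are illustrative assumptions, not figures from the slide.

```python
# Number of distinct 16x12 images with 8 bits per pixel.
num_images = 2 ** (16 * 12 * 8)

# Assumption: the factor 30 on the slide is frames per second.
FRAMES_PER_SECOND = 30
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# Illustrative assumptions, not taken from the slide.
HISTORY_YEARS = 10_000
WORLD_POPULATION = 7 * 10**9

# Upper bound on frames ever seen if everyone watched all the time.
frames_ever_seen = (
    FRAMES_PER_SECOND * SECONDS_PER_YEAR * HISTORY_YEARS * WORLD_POPULATION
)

print("decimal digits in number of possible images:", len(str(num_images)))
print("decimal digits in frames ever seen:", len(str(frames_ever_seen)))
```

Even with much more generous assumptions, the number of images anyone has seen stays a vanishing fraction of the images that could exist, which is the slide's argument for priors and data-driven statistics.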
Thoughts
• “The most compelling shapes are those near to our hearts: people’s faces, a gracefully moving body, a natural scene with rustling leaves and flowing water. Evolution has tuned us to these sights…”
[Lengyel, 1998]
• How do we see such “objects of interest”?
• Content extraction is more than processing bits…it’s signal processing + statistical learning
[Chen, 2007]
Sample Projects in Content Retrieval
Beyond digital images/videos…
Trademark Retrieval
[Leung & Chen ICME’02]
Hand-Drawn Query
Retrieved Trademarks
Sketch Retrieval
User sketches a query…
Query Sketch
Similar Sketch
Page Stored in Database
[Leung & Chen ICME’03]
3D Object Retrieval
[Zhang & Chen ACM MM’01]
3D Protein Retrieval
[Chen & Chen ICIP’02]
Object Discovery
Object Discovery ≠ Object Detection
Object Detection
Training Data (Labeled)
Test Data
[BioID Face Database]
Object Discovery
[Caltech Face+Background Dataset]
Discover = Categorize + Localize
How did we do that?
Object Discovery
[UIUC Car Dataset]
Discover = Categorize + Localize
How did we do that?
Discovering Objects in Video
Discover = Categorize + Localize
[YouTube/Google Video]
How did we do that?
The Approach
Feature Extraction
Statistical Learning
Feature Extraction
Maximally Stable Extremal Regions (MSER) [Matas et al., 02]
“Patch”
Scale Invariant Feature Transform (SIFT)
[Lowe, 04]
• Robust to viewpoint, illumination, blurring, rotation, and scale changes
Quantization into Visual Words
128-dim SIFT features → K-means → Visual Words (discrete symbols)
[Leung and Malik, 01]
Every image becomes a bag of words…
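A minimal sketch of the quantization step, assuming plain Lloyd-style K-means written directly in NumPy. The random 128-dimensional vectors are stand-ins for real SIFT descriptors, and the codebook size of 10 and the helper names `kmeans` and `quantize` are illustrative choices, not values or names from the talk.

```python
import numpy as np


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns a (k, dim) array of cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every descriptor to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers


def quantize(X, centers):
    """Map each descriptor to the index of its nearest center (its visual word)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


rng = np.random.default_rng(1)
descriptors = rng.random((300, 128))      # stand-in for SIFT descriptors
codebook = kmeans(descriptors, k=10)      # the visual vocabulary
words = quantize(descriptors[:50], codebook)  # patches of one "image"
bag = np.bincount(words, minlength=10)    # bag-of-words histogram
print(bag)
```

After this step each patch is a discrete symbol, so the appearance term of a probabilistic model can be a simple table over words.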
Statistical Learning
Feature Extraction
Statistical Learning
Single Image
Collection of Images
Video
Goal
• Label each patch as background or object of interest
Patch at r = (200, 200) (“Location”), w = w2 (“Appearance”): z = object of interest
Patch at r = (300, 100) (“Location”), w = w3 (“Appearance”): z = background
Probabilistic Model
$p(z, r, w) = p(z)\,p(r, w \mid z) = p(z)\,p(r \mid z)\,p(w \mid z)$
r: Location, w: Appearance, z: Obj/Bg
Image Characteristic $p(z)$: $p(z_1) = 0.7$, $p(z_2) = 0.3$
Location Semantics $p(r \mid z)$: $p(r \mid z_1)$ Gaussian, $p(r \mid z_2)$ uniform
Topic Appearance $p(w \mid z)$:
      z1    z2
w1    0.4   0.1
w2    0.4   0.0
w3    0.2   0.9
Posterior Probability
Patches: r = (300, 100), w = w3 and r = (200, 200), w = w2
$p(z \mid r, w) = \dfrac{p(z, r, w)}{\sum_{z} p(z, r, w)} = \dfrac{p(z)\,p(r \mid z)\,p(w \mid z)}{\sum_{z} p(z)\,p(r \mid z)\,p(w \mid z)}$
$\hat{z} = \arg\max_{z}\, p(z \mid r, w)$
Posterior Probabilities ~ (Soft) Labels
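The posterior can be evaluated for the two example patches with a short sketch. The values of p(z) and p(w|z) are the ones shown on the Probabilistic Model slide (0.7/0.3 and the 3×2 word table, with z1 taken as the object topic). The Gaussian mean and spread and the 400×300 image size are assumptions made only so that the example runs.

```python
import math

# Values read off the "Probabilistic Model" slide.
p_z = {"z1": 0.7, "z2": 0.3}
p_w_given_z = {
    "z1": {"w1": 0.4, "w2": 0.4, "w3": 0.2},
    "z2": {"w1": 0.1, "w2": 0.0, "w3": 0.9},
}

# Assumptions for illustration only: image size and Gaussian parameters.
WIDTH, HEIGHT = 400, 300
MEAN = (200.0, 200.0)
SIGMA = 50.0


def p_r_given_z(r, z):
    """Gaussian location density for z1 (object), uniform for z2 (background)."""
    if z == "z2":
        return 1.0 / (WIDTH * HEIGHT)
    dx, dy = r[0] - MEAN[0], r[1] - MEAN[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * SIGMA**2)) / (2 * math.pi * SIGMA**2)


def posterior(r, w):
    """p(z | r, w) by normalizing the joint p(z) p(r|z) p(w|z) over z."""
    joint = {z: p_z[z] * p_r_given_z(r, z) * p_w_given_z[z][w] for z in p_z}
    total = sum(joint.values())
    return {z: value / total for z, value in joint.items()}


post_a = posterior((200, 200), "w2")
post_b = posterior((300, 100), "w3")
print(post_a, max(post_a, key=post_a.get))
print(post_b, max(post_b, key=post_b.get))
```

Under these assumed location parameters the first patch is assigned to z1 and the second to z2, matching the labels on the Goal slide; the posterior values themselves serve as soft labels.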
Only half of the story…
$p(z)$, $p(w \mid z)$, $p(r \mid z)$
$p(z \mid r, w)$
r: Location, w: Appearance, z: Obj/Bg
Estimate Image Characteristic
How to estimate $p(z)$:
• If label $z = z_1$ is known: $p(z = z_1) = 1/2$
• If $p(z \mid w, r)$ is known: $p(z = z_1) = \dfrac{\frac{1}{4} + \frac{3}{4}}{2} = 1/2$
r: Location, w: Appearance, z: Obj/Bg
Estimate Topic Appearance
How to estimate $p(w \mid z)$:
• If label $z = z_1$ is known: $p(w = w_1 \mid z = z_1) = \dfrac{1/2}{1/2} = 1$
• If $p(z \mid w, r)$ is known: $p(w = w_1 \mid z = z_1) = \left( \dfrac{\frac{3}{4} + 0}{2} \right) \Big/ \dfrac{1}{2} = \dfrac{3}{4}$
Patches shown: w1, w2
r: Location, w: Appearance, z: Obj/Bg
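The arithmetic on the two estimation slides can be reproduced exactly with rational numbers. This sketch assumes a toy image with two patches, one with word w1 and one with word w2, whose posteriors for z1 are 3/4 and 1/4. That pairing is inferred from the numbers on the slides, and the function name is hypothetical.

```python
from fractions import Fraction


def estimate(patches, word="w1"):
    """Return (p(z = z1), p(w = word | z = z1)) from per-patch weights.

    `patches` is a list of (visual_word, weight) pairs, where the weight is
    p(z = z1 | w, r) for soft labels, or 1/0 for known hard labels.
    """
    n = len(patches)
    p_z1 = sum(weight for _, weight in patches) / n
    joint = sum(weight for w, weight in patches if w == word) / n
    return p_z1, joint / p_z1


# Soft labels: posteriors of 3/4 and 1/4 (pairing with words is an assumption).
soft = [("w1", Fraction(3, 4)), ("w2", Fraction(1, 4))]
# Hard labels: the w1 patch is labeled z1, the w2 patch is not.
hard = [("w1", Fraction(1)), ("w2", Fraction(0))]

print(estimate(soft))
print(estimate(hard))
```

Hard labels are the special case in which every weight is 0 or 1, so one routine covers both bullets on each slide.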
Estimate Location Semantics
How to estimate mean and var of $p(r \mid z = z_1)$:
• If label $z = z_1$ is known
• If $p(z \mid w, r)$ is known
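One standard way to write both cases is as weighted sample moments. The slide does not give formulas, so the expressions below are an assumption based on the usual weighted maximum-likelihood estimates for a Gaussian; the symbols $\gamma_i$, $\mu_1$, and $\Sigma_1$ are introduced here for illustration.

```latex
\gamma_i = p(z = z_1 \mid w_i, r_i), \qquad
\mu_1 = \frac{\sum_i \gamma_i \, r_i}{\sum_i \gamma_i}, \qquad
\Sigma_1 = \frac{\sum_i \gamma_i \, (r_i - \mu_1)(r_i - \mu_1)^{\top}}{\sum_i \gamma_i}
```

With known labels, $\gamma_i$ is 1 for patches labeled $z_1$ and 0 otherwise, so the estimates reduce to the ordinary mean and variance of the object patches.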
An Iterative Algorithm
$p(z)$, $p(w \mid z)$
Location Estimation: $p(r \mid z)$
$p(z \mid r, w)$
r: Location, w: Appearance, z: Obj/Bg
Can start anywhere, can seed anyhow…
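The loop between parameters and posteriors can be sketched as an EM-style iteration for a single image. This is a hedged illustration, not the speaker's implementation: the diagonal Gaussian, the bounding-box uniform background, the small smoothing constants, the synthetic data, and the function name `discover` are all assumptions.

```python
import numpy as np


def discover(words, locs, n_words, iters=30, seed=0):
    """Alternate between posteriors p(z|r,w) and p(z), p(w|z), p(r|z).

    words: (N,) visual-word indices; locs: (N, 2) patch locations.
    Topic 0 is the object (Gaussian location), topic 1 the background (uniform).
    """
    rng = np.random.default_rng(seed)
    post = rng.random(len(words))  # "can seed anyhow": random soft labels
    area = np.ptp(locs[:, 0]) * np.ptp(locs[:, 1])  # support of the uniform
    for _ in range(iters):
        resp = np.stack([post, 1.0 - post], axis=1)  # (N, 2) soft labels
        # Re-estimate the model from the soft labels.
        p_z = resp.mean(axis=0)
        counts = np.full((n_words, 2), 1e-6)  # tiny smoothing, an assumption
        np.add.at(counts, words, resp)
        p_w = counts / counts.sum(axis=0)
        weight = resp[:, 0] / (resp[:, 0].sum() + 1e-12)
        mean = weight @ locs
        var = weight @ ((locs - mean) ** 2) + 1e-6  # per-axis variance
        # Re-estimate the posteriors from the model.
        expo = -0.5 * ((locs - mean) ** 2 / var).sum(axis=1)
        gauss = np.exp(expo) / (2 * np.pi * np.sqrt(var.prod()))
        lik_obj = p_z[0] * gauss * p_w[words, 0]
        lik_bg = p_z[1] * (1.0 / area) * p_w[words, 1]
        post = lik_obj / (lik_obj + lik_bg + 1e-300)
    return post, p_z, p_w, mean, var


# Synthetic image: a compact cluster of patches plus scattered clutter.
rng = np.random.default_rng(0)
obj_locs = rng.normal([200, 200], 20, size=(60, 2))
bg_locs = rng.uniform([0, 0], [400, 300], size=(140, 2))
locs = np.vstack([obj_locs, bg_locs])
words = np.concatenate([rng.integers(0, 3, 60), rng.integers(3, 6, 140)])

post, p_z, p_w, mean, var = discover(words, locs, n_words=6)
print("estimated object center:", mean)
```

The loop could equally start from an initial guess of the parameters rather than of the posteriors, which is one reading of "can start anywhere".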
Collection of Images
$p(z, r, w) = p(z)\,p(r \mid z)\,p(w \mid z)$
$p(z, r, w \mid d) = p(z \mid d)\,p(r \mid z, d)\,p(w \mid z, d) = p(z \mid d)\,p(r \mid z, d)\,p(w \mid z)$
$p(z \mid d)$:
      d1    d2
z1    0.4   0.8
z2    0.6   0.2
$p(r \mid z_1, d_1)$, $p(r \mid z_1, d_2)$
$p(w \mid z)$:
      z1    z2
w1    0.4   0.1
w2    0.4   0.0
w3    0.2   0.9
r: Location, w: Appearance, z: Obj/Bg, d: Image
An Iterative Algorithm
$p(w \mid z)$, $p(z \mid d)$
Location Estimation: $p(r \mid z, d)$
$p(z \mid r, w, d)$
r: Location, w: Appearance, z: Obj/Bg, d: Image
Same as before, but location/characteristics are image-dependent
An Example
[Caltech Face+Background Dataset]
Location Semantics: $p(r \mid z = z_1, d)$
Topic Appearance: $p(w \mid z = z_1)$, $p(w \mid z = z_2)$
Posterior: $p(z \mid r, w, d)$
r: Location, w: Appearance, z: Obj/Bg, d: Image
Video ≠ Collection of Images
Time
Smooth trajectory expected
Motion Information
$\nu^{(i)}$
$\nu \equiv \sum_i \beta^{(i)} \nu^{(i)}$
$\beta^{(i)} \propto \mathcal{N}(\nu^{(i)} \mid 0, S)$
[Bar-Shalom, 80]
$\nu \equiv \sum_i \beta^{(i)} \nu^{(i)}$
$\beta^{(i)} \propto \mathcal{N}(\nu^{(i)} \mid 0, S)$ [Bar-Shalom, 80]
$\beta^{(i)} \propto \mathcal{N}(\nu^{(i)} \mid 0, S)\; p(z = z_1 \mid w^{(i)}, r^{(i)}, d)$
$\hat{s}^{+} = \hat{s}^{-} + W\nu$
$W = f\left(\Sigma_{\text{system}},\ \Sigma_{\text{measurement}},\ E\left[(s - \hat{s}^{+})^{2}\right]\right)$
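A minimal sketch of one such measurement update, assuming a linear-Gaussian model with the textbook Kalman gain standing in for the unspecified function f. Each candidate patch contributes an innovation $\nu^{(i)}$, and its weight $\beta^{(i)}$ combines the Gaussian gate with the appearance posterior $p(z = z_1 \mid w^{(i)}, r^{(i)}, d)$. The function name, matrices, and numbers are illustrative assumptions, and the covariance update of a full probabilistic data association filter is omitted.

```python
import numpy as np


def pda_update(s_pred, P_pred, H, R, measurements, appearance_post):
    """Combine candidate measurements into one state correction.

    s_pred: (n,) predicted state; P_pred: (n, n) its covariance.
    H: (m, n) measurement matrix; R: (m, m) measurement noise covariance.
    measurements: (k, m) candidate patch locations.
    appearance_post: (k,) values of p(z = z1 | w, r, d) for each candidate.
    """
    S = H @ P_pred @ H.T + R                      # innovation covariance
    S_inv = np.linalg.inv(S)
    innovations = measurements - H @ s_pred       # nu^(i), shape (k, m)
    maha = np.einsum("ki,ij,kj->k", innovations, S_inv, innovations)
    # beta^(i) proportional to N(nu^(i) | 0, S) * p(z = z1 | w, r, d).
    beta = np.exp(-0.5 * maha) * appearance_post
    beta = beta / beta.sum()
    nu = beta @ innovations                       # combined innovation
    W = P_pred @ H.T @ S_inv                      # standard Kalman gain (assumed f)
    return s_pred + W @ nu, beta


s_pred = np.array([100.0, 100.0])                 # predicted object location
P_pred = 25.0 * np.eye(2)
H = np.eye(2)
R = 4.0 * np.eye(2)
measurements = np.array([[102.0, 101.0], [130.0, 90.0]])
appearance_post = np.array([0.9, 0.1])

s_post, beta = pda_update(s_pred, P_pred, H, R, measurements, appearance_post)
print(s_post, beta)
```

Patches that are both close to the prediction and likely to belong to the object dominate the correction, which is how appearance can steer the location estimate.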
An Iterative Algorithm
$p(w \mid z)$, $p(z \mid d)$
Location Estimation → Motion Modeling: $p(r \mid z, d)$
$p(z \mid w, r, d)$
• Knowledge of appearance improves location estimate
• Knowledge of location improves appearance estimate
Applications
• Object localization
• Categorization
– Video skimming
• Keyframe extraction
– Video summarization
Input Video
CMU dataset
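Comparison
APP+LOC+MOTION: $p(w \mid z)$, $p(z \mid d)$, Motion Model $p(r \mid z, d)$, posterior $p(z \mid w, r, d)$
APP+LOC: $p(w \mid z)$, $p(z \mid d)$, Location Estim. $p(r \mid z, d)$, posterior $p(z \mid w, r, d)$
APP: $p(w \mid z)$, $p(z \mid d)$, posterior $p(z \mid w, d)$ [Sivic et al. 05]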
Localization
[CMU Dataset]
APP+LOC+MOTION
APP+LOC
APP
Categorization
• Top 40 frames out of 181, according to $p(z = z_1 \mid d)$
[YouTube/Google Video]
Categorization
[YouTube/Google Video]
• Top 40 frames out of 711, according to $p(z = z_1 \mid d)$
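Ranking frames by the per-frame object probability is a one-line sort. In this sketch the scores are random placeholders standing in for $p(z = z_1 \mid d)$, and the function name is hypothetical.

```python
import numpy as np


def top_frames(p_object_given_frame, k=40):
    """Indices of the k frames with the highest p(z = z1 | d), best first."""
    order = np.argsort(p_object_given_frame)[::-1]
    return order[:k]


rng = np.random.default_rng(0)
scores = rng.random(181)          # placeholder for p(z = z1 | d) per frame
selected = top_frames(scores, k=40)
print(selected[:5])
```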
Keyframe Extraction on YouTube
[YouTube/Google Video]
Keyframe Extraction – Our Result
181 frames, 2 frames/sec.
5 keyframes from top 40 frames, according to $p(z = z_1 \mid d)$
[YouTube/Google Video]
Keyframe Extraction on YouTube
[YouTube/Google Video]
Keyframe Extraction – Our Result
711 frames, 2 frames/sec.
5 keyframes from top 40 frames, according to $p(z = z_1 \mid d)$
[YouTube/Google Video]
Extensions
• Geometric Consistency
• Semi-supervised
• Multiple classes and instances
• Hierarchical semantics of objects
Geometric Consistency
– Background random, object consistent
– Matched patches more likely from object of interest
[Caltech-4 data set]
Geometric Consistency
Correspondence Info $p(m \mid z)$:
# matches   z1     z2
0 ~ 5       0.36   0.97
6 ~ 10      0.37   0.02
11 ~ 15     0.20   0.01
> 16        0.07   0.00
$p(z, w, r, m \mid d) = p(z \mid d)\,p(w \mid z)\,p(r \mid z, d)\,p(m \mid z)$
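The effect of the extra factor $p(m \mid z)$ can be illustrated with the table values. This sketch takes z1 (the object) to be the column with more mass on high match counts, which is how the table reads once reconstructed. The argument `prior_z1` stands for the normalized product of the other three factors for one patch, and placing a count of exactly 16 in the last bin is an assumption, since the slide labels that bin "> 16".

```python
# p(m | z) per match-count bin: 0~5, 6~10, 11~15, >16.
P_M_GIVEN_Z = {
    "z1": [0.36, 0.37, 0.20, 0.07],   # object: consistent matches
    "z2": [0.97, 0.02, 0.01, 0.00],   # background: mostly few matches
}


def bin_index(num_matches):
    """Map a match count to its row in the table."""
    if num_matches <= 5:
        return 0
    if num_matches <= 10:
        return 1
    if num_matches <= 15:
        return 2
    return 3  # assumption: 16 itself is placed in the "> 16" bin


def update(prior_z1, num_matches):
    """Posterior of z1 after multiplying in p(m | z) and renormalizing.

    Undefined (division by zero) if prior_z1 is 0 and the count is in the last bin.
    """
    b = bin_index(num_matches)
    obj = prior_z1 * P_M_GIVEN_Z["z1"][b]
    bg = (1.0 - prior_z1) * P_M_GIVEN_Z["z2"][b]
    return obj / (obj + bg)


for m in (2, 8, 20):
    print(m, update(0.5, m))
```

A patch that is undecided on appearance and location alone moves strongly toward the object topic once it has more than a handful of geometric matches.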
Semi-Supervised
• User provides limited information
– e.g., Label one frame
$p(w \mid z)$, $p(z \mid d)$
Location Estimation: $p(r \mid z, d)$
$p_L(z \mid w, r, d)$, $p_U(z \mid w, r, d)$
Multiple Classes and Instances
• Multiple classes
• Multiple instances of the same object class
– Parametric methods: Model selection with BIC [Schwarz 78], Variational Bayes [Attias 99]
– Nonparametric methods: Mean-shift [Comaniciu & Meer 01]
Hierarchical Semantics of Objects
[Parikh & Chen CVPR’07]
Collection of images → Corresponding hSO
Figure labels: OFFICE, CHAIR, PHONE, MONITOR, KEYBOARD, computer, desk-area
Summary
• Probabilistic framework for object discovery
– Incorporate information from appearance / location / motion / geometry
– Multiple classes and multiple instances possible
– Unsupervised and semi-supervised possible
– Discovery of hierarchical semantics of objects
Finally…
Some Related Work
Camera Array
What can be done…
[EyeVision]
[CMU 3D Dome]
[CMU CamArray]
Beyond Camera Array: “Active Sensing”
Advanced Multimedia Processing Lab
Please visit us at: http://amp.ece.cmu.edu