Using Large-Scale Web Data to Facilitate Textual Query-Based Retrieval of Consumer Photos
Yiming Liu, Dong Xu, Ivor W. Tsang, Jiebo Luo
Nanyang Technological University & Kodak Research Lab
Motivation
• Digital cameras and mobile phone cameras are rapidly becoming popular:
– More and more personal photos;
– Retrieving images from enormous collections of personal photos becomes an important topic.
How to retrieve?
Previous Work
• Content-Based Image Retrieval (CBIR)– Users provide images as queries to retrieve
personal photos.
• The paramount challenge -- semantic gap:– The gap between the low-level visual features and
the high-level semantic concepts.
[Diagram: a query image is mapped to a low-level feature vector and compared with feature vectors in the DB; the semantic gap separates low-level feature vectors from images with high-level concepts.]
A More Natural Way For Consumer Applications
• Let the user retrieve the desired personal photos using textual queries.
• Image annotation is used to classify images w.r.t. high-level semantic concepts.
– Semantic concepts are analogous to the textual terms describing document contents.
• An intermediate stage for textual query based image retrieval.
[Diagram: a textual query ("Sunset") is compared with the annotation results (high-level concepts) of the database images, and the photos are ranked to produce the result.]
Our Goal
• Web images are accompanied by tags, categories and titles.
[Diagram: web images with contextual information, e.g. "building", "people, family", "people, wedding", "sunset".]
Web Images → Consumer Photos
• Leverage information from web images to retrieve consumer photos in a personal photo collection.
– No intermediate image annotation process.
• A real-time textual query based consumer photo retrieval system without any intermediate annotation stage.
• When the user provides a textual query:
[Flow diagram: Textual Query → Automatic Web Image Retrieval (over a large collection of web images with descriptive words, guided by WordNet) → Relevant/Irrelevant Images → Classifier → Consumer Photo Retrieval over raw consumer photos → Top-Ranked Consumer Photos; Relevance Feedback yields refined top-ranked photos.]
• The query is first used to find relevant/irrelevant images in the web image collection.
• Then a classifier is trained based on these web images.
• Consumer photos are then ranked by the classifier's decision values.
• The user can also give relevance feedback to refine the retrieval results.
System Framework
[Diagram: semantic word trees; e.g. "boat" with descendants ark, barge, dredger, houseboat; an inverted file maps the query "boat" to relevant and irrelevant web images.]
Semantic Word Trees Based on WordNet
• For the user's textual query, first search for it in the semantic word trees.
• Web images whose descriptions contain the query word are considered "relevant web images".
• Web images whose descriptions contain neither the query word nor its two-level descendants are considered "irrelevant web images".
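The selection rule above can be sketched as follows. This is a toy illustration, not the paper's implementation: the inverted file, image ids, and the `descendants` table are all hypothetical stand-ins for the real 1.3M-image index and WordNet traversal.

```python
# Toy inverted file: word -> set of web image ids whose text contains it.
inverted_file = {
    "boat": {1, 2, 3},
    "ark": {4},
    "houseboat": {2, 5},
    "sunset": {6, 7},
}

# Toy semantic word tree: the query word's two-level descendants.
descendants = {"boat": ["ark", "barge", "dredger", "houseboat"]}

def split_web_images(query, all_ids):
    """Relevant: images whose text contains the query word.
    Irrelevant: images containing neither the query word
    nor any of its two-level descendants."""
    relevant = set(inverted_file.get(query, set()))
    excluded = set(relevant)
    for word in descendants.get(query, []):
        excluded |= inverted_file.get(word, set())
    irrelevant = all_ids - excluded
    return relevant, irrelevant

rel, irr = split_web_images("boat", set(range(1, 8)))
```

Note that images tagged only with a descendant (e.g. "houseboat") are excluded from the irrelevant set but not added to the relevant set.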
Automatic Web Image Retrieval
Decision Stump Ensemble
• Train a decision stump on each dimension.
• Combine them, weighted according to their training error rates.
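A minimal sketch of the per-dimension stump ensemble. The weighting here, (1 − training error) per stump, is an illustrative error-based combination, not necessarily the paper's exact formula:

```python
import numpy as np

def train_stump(x, y):
    """Train a 1-D decision stump: pick the (threshold, polarity)
    pair with the lowest training error on labels y in {+1, -1}."""
    best = (1.0, 0.0, 1)  # (error, threshold, polarity)
    for t in np.unique(x):
        for s in (1, -1):
            pred = np.where(s * (x - t) >= 0, 1, -1)
            err = np.mean(pred != y)
            if err < best[0]:
                best = (err, t, s)
    return best

def train_ensemble(X, y):
    # One stump per feature dimension.
    return [train_stump(X[:, d], y) for d in range(X.shape[1])]

def decision_value(stumps, x):
    # Weight each stump's vote by (1 - its training error).
    score = 0.0
    for (err, t, s), xi in zip(stumps, x):
        score += (1.0 - err) * (1 if s * (xi - t) >= 0 else -1)
    return score

# Toy data: dimension 0 separates at ~1, dimension 1 at ~4.
X = np.array([[0.0, 5.0], [1.0, 4.0], [8.0, 1.0], [9.0, 0.0]])
y = np.array([1, 1, -1, -1])
stumps = train_ensemble(X, y)
```

Training each stump is a scan over one dimension, and the per-dimension loop is trivially parallelizable, which is the speed advantage the slides emphasize.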
Why Decision Stump Ensemble?
• Main reason: low time cost
– Our goal: a (quasi) real-time retrieval system;
– As basic classifiers, SVMs are much slower;
– As a combination scheme, boosting is also much slower.
• The advantages of decision stump ensembles:
– Low training cost;
– Low testing cost;
– Very easy to parallelize.
Asymmetric Bagging
• Imbalance: count(irrelevant) >> count(relevant)
– Side effects, e.g. overfitting.
• Solution: asymmetric bagging
– Repeat 100 times, each time using a different randomly sampled subset of irrelevant web images.
[Diagram: all relevant images are paired with 100 different random samples of irrelevant images to form 100 training sets.]
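A sketch of the bag construction. One assumption here: each bag samples as many irrelevant images as there are relevant ones; the slides do not state the exact per-bag sample size.

```python
import random

def asymmetric_bags(relevant, irrelevant, n_bags=100, seed=0):
    """Build n_bags training sets: every bag keeps ALL relevant
    images and pairs them with a fresh random sample of
    irrelevant images (sample size assumed equal to len(relevant))."""
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        sampled = rng.sample(irrelevant, len(relevant))
        bags.append((list(relevant), sampled))
    return bags

relevant = ["r1", "r2", "r3"]
irrelevant = [f"i{k}" for k in range(1000)]
bags = asymmetric_bags(relevant, irrelevant)
```

Each bag then trains its own stump ensemble; because the bags are independent, the 100 trainings parallelize cleanly.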
Relevance Feedback
• The user labels nl relevant or irrelevant consumer photos.– Use this information to further refine the
retrieval results;
• Challenge 1: Usually nl is small;
• Challenge 2: Cross-domain learning
– The source classifier is trained on the web image domain;
– The user labels some personal photos in the consumer domain.
Method 1: Cross-Domain Combination of Classifiers
• Re-train classifiers with data from both domains?
– Neither effective nor efficient;
• A simple but effective method:
– Train an SVM on the consumer photo domain with the user-labeled photos;
– Convert the responses of the source classifier and the SVM to probabilities, and add them up;
– Rank consumer photos based on this summed value.
• Referred to as DS_S+SVM_T.
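The combination step can be sketched as follows. The sigmoid mapping from decision values to probabilities is an assumption for illustration; the slides only say the responses are converted to probabilities and summed.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def combined_score(source_decision, svm_decision):
    """source_decision: output of the web-domain stump ensemble (DS_S);
    svm_decision: output of the SVM trained on user-labeled
    consumer photos (SVM_T). Both mapped to [0, 1] and summed."""
    return sigmoid(source_decision) + sigmoid(svm_decision)

def rank_photos(photos, source_scores, svm_scores):
    scored = zip(photos, map(combined_score, source_scores, svm_scores))
    return [p for p, _ in sorted(scored, key=lambda t: -t[1])]

# Toy decision values for three photos.
ranked = rank_photos(["a", "b", "c"], [2.0, -1.0, 0.5], [1.0, 3.0, -2.0])
```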
Method 2: Cross-Domain Regularized Regression (CDRR)
• Construct a linear regression function f_T(x):
– For labeled photos: f_T(x_i) ≈ y_i;
– For unlabeled photos: f_T(x_i) ≈ f_s(x_i).
• Design a target linear classifier f_T(x) = w^T x:
– For user-labeled images x_1, …, x_l, f_T(x) should match the user's label y(x);
– For other images, f_T(x) should match the source classifier output f_s(x);
– A regularizer controls the complexity of the target classifier f_T(x).
• This problem can be solved with a least-squares solver.
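A regularized least-squares sketch of CDRR. The objective below, with illustrative weights `lam`, `a`, `b` (not the paper's exact trade-off parameters), is: minimize lam·||w||² + a·Σ_labeled (wᵀx_i − y_i)² + b·Σ_unlabeled (wᵀx_i − f_s(x_i))², which has the closed-form normal-equations solution used here.

```python
import numpy as np

def cdrr(X_l, y_l, X_u, f_s_u, lam=0.1, a=1.0, b=0.5):
    """Solve the regularized regression in closed form:
    (lam*I + a*X_l^T X_l + b*X_u^T X_u) w = a*X_l^T y_l + b*X_u^T f_s_u."""
    d = X_l.shape[1]
    A = lam * np.eye(d) + a * X_l.T @ X_l + b * X_u.T @ X_u
    rhs = a * X_l.T @ y_l + b * X_u.T @ f_s_u
    return np.linalg.solve(A, rhs)

# Toy example: user labels two photos; the source classifier
# scores two unlabeled photos consistently with those labels.
X_l = np.array([[1.0, 0.0], [0.0, 1.0]])
y_l = np.array([1.0, -0.1])
X_u = np.array([[2.0, 0.0], [0.0, 2.0]])
f_s_u = np.array([2.0, -0.2])
w = cdrr(X_l, y_l, X_u, f_s_u)
```

Since the system is a d×d linear solve, it stays fast even when the unlabeled set is large (the Gram matrices can be accumulated incrementally).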
Hybrid Method
• A combination of the two methods.
• For labeled consumer photos:
– Measure the average distance d_avg to their 30 nearest unlabeled neighbors in feature space;
– If d_avg < ε: use DS_S+SVM_T;
– Otherwise: use CDRR.
• Reason:
– Consumer photos that are visually similar to the user-labeled images should be influenced more by those labeled images.
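The switch can be sketched as below; `eps` and the toy data are hypothetical, and the slides do not specify the distance metric (Euclidean is assumed here).

```python
import numpy as np

def choose_method(labeled, unlabeled, k=30, eps=0.5):
    """For each labeled photo, average the distances to its k nearest
    unlabeled neighbors; pick DS_S+SVM_T when the overall average
    d_avg is below eps, otherwise CDRR."""
    dists = []
    for x in labeled:
        d = np.linalg.norm(unlabeled - x, axis=1)  # to all unlabeled photos
        dists.append(np.sort(d)[:k].mean())        # mean over k nearest
    d_avg = float(np.mean(dists))
    return "DS_S+SVM_T" if d_avg < eps else "CDRR"

rng = np.random.default_rng(0)
unlabeled = rng.normal(scale=0.01, size=(200, 3))  # tight cluster at origin
near = np.zeros((2, 3))           # labeled photos inside the cluster
far = np.full((2, 3), 10.0)       # labeled photos far from the cluster
```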
Experimental Results
Dataset and Experimental Setup
• Web Image Database:
– 1.3 million photos from photoSIG;
– Relatively professional photos.
• Text descriptions for web images:
– Title, portfolio, and categories accompanying the web images;
– Remove the common high-frequency words;
– Remove the rarely used words;
– Finally, 21,377 words in our vocabulary.
Dataset and Experimental Setup
• Testing Dataset #1: Kodak dataset
– Collected by Eastman Kodak Company:
• From about 100 real users;
• Over a period of one year.
– 1358 images:
• The first keyframe from each video.
– 21 concepts:
• We merge "group_of_two" and "group_of_three_or_more" into one concept.
Dataset and Experimental Setup
• Testing Dataset #2: Corel dataset
– 4999 images:
• 192×128 or 128×192 pixels.
– 43 concepts:
• We remove all concepts with fewer than 100 images.
Visual Features
• Grid-based color moments (225D)
– Three moments of three color channels from each block of a 5×5 grid.
• Edge direction histogram (73D)
– 72 edge direction bins plus one non-edge bin.
• Wavelet texture (128D)
• Concatenate all three kinds of features:
– Normalize each dimension to mean 0, stddev 1;
– Use the first 103 principal components.
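The feature pipeline can be sketched as follows (toy random data stands in for the real descriptors; the dimensionalities 225 + 73 + 128 = 426 match the slides):

```python
import numpy as np

def standardize(X):
    """Normalize each dimension to mean 0, stddev 1
    (guarding against zero-variance dimensions)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sigma > 0, sigma, 1.0)

def pca_reduce(X, n_components):
    """Project onto the first n_components principal components,
    obtained as right singular vectors of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
color = rng.normal(size=(200, 225))    # grid-based color moments
edge = rng.normal(size=(200, 73))      # edge direction histogram
texture = rng.normal(size=(200, 128))  # wavelet texture
features = standardize(np.hstack([color, edge, texture]))
reduced = pca_reduce(features, 103)
```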
Retrieval without Relevance Feedback
• For all concepts:– Average number of relevant images: 3703.5.
Retrieval without Relevance Feedback
• kNN: rank consumer photos by their average distance to the 300 nearest neighbors among the relevant web images.
• DS_S: decision stump ensemble.
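The kNN baseline can be sketched as below (Euclidean distance is assumed; k = 300 in the slides, a small k is used here on toy data):

```python
import numpy as np

def knn_rank(photos, relevant_web, k=300):
    """Score each consumer photo by its average distance to its
    k nearest neighbors among the relevant web images; smaller
    average distance ranks higher."""
    scores = []
    for x in photos:
        d = np.linalg.norm(relevant_web - x, axis=1)
        scores.append(np.sort(d)[: min(k, len(d))].mean())
    return np.argsort(scores)  # indices from best to worst

relevant_web = np.zeros((10, 2))  # toy "relevant" cluster at the origin
photos = np.array([[5.0, 5.0], [0.1, 0.1], [2.0, 2.0]])
order = knn_rank(photos, relevant_web, k=3)
```

Each query must scan all relevant web images per photo, which is why kNN is expected to scale poorly compared with the stump ensemble.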
Retrieval without Relevance Feedback
• Time cost:
– We use OpenMP to parallelize our method;
– With 8 threads, both methods reach interactive-level speed;
– But kNN is expected to be costly on large-scale datasets.
Retrieval with Relevance Feedback
• In each round, the user labels at most 1 positive and 1 negative image among the top 40;
• Methods for comparison:
– kNN_RF: add user-labeled photos to the relevant image set, and re-apply kNN;
– SVM_T: train an SVM on the user-labeled images in the target domain;
– A-SVM: Adaptive SVM;
– MR: Manifold Ranking based relevance feedback.
Retrieval with Relevance Feedback
• Setting of y(x) for CDRR:
– Positive: +1.0;
– Negative: −0.1.
• Reason:
– The top-ranked negative images are not extremely negative;
– Positive means "what it is"; negative means "what it is not".
[Diagram: positive images vs. negative images.]
Retrieval with Relevance Feedback
• On Corel dataset:
Retrieval with Relevance Feedback
• On Kodak dataset:
Retrieval with Relevance Feedback
• Time cost:– All methods except A-SVM can achieve real-time
speed.
System Demonstration
Query: Sunset
Query: Plane
The user is providing relevance feedback …
After 2 positive and 2 negative feedback labels…
Summary
• Our goal: (quasi) real-time textual query based consumer photo retrieval.
• Our method:– Use web images and their surrounding text
descriptions as an auxiliary database;– Asymmetric bagging with decision stumps;– Several simple but effective cross-domain
learning methods to help relevance feedback.
Future Work
• How to efficiently use more powerful source classifiers?
• How to further improve the speed:
– Keep training time within 1 second;
– Control testing time when the consumer photo set is very large.
Thank you!
• Any questions?