Visual Object Category Recognition
Ashish Gupta
Centre for Vision, Speech, and Signal Processing
Contents
• Introduction
• Related work
• Overview: Object recognition system
• Object classification & detection
• Conclusions
• Future work
Introduction
Research Topic: Visual object category recognition using weakly supervised learning.
DIPLECS: Artificial cognitive system for autonomous systems.
• Interested in object interactions determined by their functional properties.
• All objects in same category have the same functional properties.
• Recognition is based on object’s visual properties.
• A very large training set is required to learn the large appearance variation in a category.
• So we utilize huge image datasets like Flickr® and Google™ Images.
• The images in these datasets are noisy and incompletely labelled.
• Therefore, weakly supervised learning, which can handle corrupt and noisy training data, is utilized.
Challenges
• Intra-category appearance
• Pose
• Clutter
• Scale
• Occlusion
• Illumination
• Articulation
• Camouflage
Background
Work done
Visual Recognition System
SIFT feature descriptor
Occurrence frequency of visual words is characteristic of the object
Object model : bag-of-visual words
Creating a visual codebook
Object model : bag-of-visual words
A test image can be classified based on the distance of its normalized codebook from the codebooks of positive and negative training samples.
Figure panels: codebook of positive samples, codebook of negative samples, codebook of test image.
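The distance-based classification described above can be sketched as follows. This is a minimal illustration, not the system used in this work: it assumes that the "normalized codebook" of an image is its L1-normalized histogram of visual-word occurrences, that distances are Euclidean, and that the test image is compared against the mean histogram of each training set. The function names and the toy 2-D descriptors (standing in for 128-D SIFT vectors) are hypothetical.

```python
import math

def nearest_word(descriptor, codebook):
    """Index of the visual word (cluster centre) closest to a descriptor."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))

def bow_histogram(descriptors, codebook):
    """L1-normalized occurrence frequency of each visual word in one image."""
    counts = [0.0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]

def mean_histogram(histograms):
    """Element-wise mean of a list of histograms."""
    n = len(histograms)
    return [sum(column) / n for column in zip(*histograms)]

def classify(test_hist, pos_hists, neg_hists):
    """True if the test histogram is closer to the positive mean (assumed rule)."""
    d_pos = math.dist(test_hist, mean_histogram(pos_hists))
    d_neg = math.dist(test_hist, mean_histogram(neg_hists))
    return d_pos < d_neg

# Toy data: three visual words, one positive and one negative training image.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
pos = [bow_histogram([(0.1, 0.0), (0.9, 1.1), (1.0, 0.9)], codebook)]
neg = [bow_histogram([(5.1, 4.9), (4.8, 5.2), (0.0, 0.1)], codebook)]
test = bow_histogram([(1.1, 1.0), (0.2, 0.1), (0.8, 0.9)], codebook)
is_positive = classify(test, pos, neg)
```

In this toy case the test image has the same word frequencies as the positive sample, so it is assigned to the positive class.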
Object model : bag-of-visual words
Visual codebooks for positive and negative samples of ‘car’ category in PASCAL VOC 2006
Object model : bag-of-visual words
Visual codebooks for ‘car’ and ‘cow’ categories in PASCAL VOC 2009 dataset
Classification
ROC (Receiver Operating Characteristics): evaluating classification performance.
ROC for ‘car’ category in PASCAL VOC 2006
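For readers unfamiliar with ROC curves, the sketch below shows one common way to compute the curve and the area under it from classifier scores. It is an illustration only: the scores and labels are made up, the function names are hypothetical, and tied scores are not treated specially.

```python
def roc_points(scores, labels):
    """(false-positive rate, true-positive rate) pairs, one per score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sweep the threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up scores for six test images; label 1 means the category is present.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

A curve that hugs the top-left corner (area close to 1) indicates good separation of positive and negative images; the diagonal (area 0.5) corresponds to chance.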
The linear kernel, $K(x, y) = x^{T} y$, was used since it is fast.
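The linear kernel is simply the dot product of two feature vectors, which is why it is cheap to evaluate. A minimal sketch, assuming the feature vectors are bag-of-visual-words histograms (the kernel classifier itself is not specified here, and the names below are hypothetical):

```python
def linear_kernel(x, y):
    """K(x, y) = x^T y: the dot product of two feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))

def gram_matrix(vectors):
    """Kernel matrix with entry [i][j] = K(vectors[i], vectors[j])."""
    return [[linear_kernel(u, v) for v in vectors] for u in vectors]

# Three made-up histograms over a 3-word codebook.
histograms = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]
gram = gram_matrix(histograms)
```

Each kernel evaluation costs one pass over the histogram, so the cost grows linearly with codebook size.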
Improve Classification
Larger Visual Codebook:
• More representative of category
• Higher computational cost
ROC of ‘car’ category in PASCAL VOC 2006 for codebook sizes from 20 to 20000 visual words.
Improve Classification
Training and test images in the dataset scaled down by same factor.
Training and test images scaled down by different factors.
Improve Classification
[Figure: table showing whether the test image is classified correctly (Y/N) for training samples from Dataset 1 and Dataset 2 at scale-down factors /1 and /2.]
Improve Classification
ROC for 20 visual categories in PASCAL VOC 2009
The PASCAL VOC 2009 dataset is larger and more challenging than the 2006 dataset.
Improve Classification
ROC for PASCAL VOC 2009 training and test images scaled down by a factor of 2
ROC for PASCAL VOC 2009 using a universal visual vocabulary
Object localization using sliding window
The poor localization results are due to:
• Lack of structural information in the bag-of-words object model
• Classifier learning object background
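A sliding-window localizer of the kind referred to above can be sketched as follows. This is an assumption-laden toy, not the detector evaluated here: the window size, stride, key-point format `(x, y, word_index)` and the scoring function are all hypothetical. Note that the score only sees which visual words fall inside a window, not how they are arranged, which is the lack of structural information mentioned in the first bullet.

```python
def sliding_windows(width, height, win_w, win_h, stride):
    """Yield (x, y, w, h) boxes that fit entirely inside the image."""
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield (x, y, win_w, win_h)

def localize(keypoints, width, height, win_w, win_h, stride, score_fn):
    """Return the best-scoring window and its score.

    keypoints: list of (x, y, word_index).
    score_fn: hypothetical classifier mapping the visual words inside a
    window to a score; only the bag of words is passed, not positions.
    """
    best_box, best_score = None, float("-inf")
    for (x, y, w, h) in sliding_windows(width, height, win_w, win_h, stride):
        words = [k for (px, py, k) in keypoints
                 if x <= px < x + w and y <= py < y + h]
        score = score_fn(words)
        if score > best_score:
            best_box, best_score = (x, y, w, h), score
    return best_box, best_score

def count_object_words(words):
    """Toy score: number of key-points assigned to visual word 1."""
    return sum(1 for k in words if k == 1)

# Toy 10x10 image; word 1 plays the role of an "object" visual word.
keypoints = [(1, 1, 0), (6, 6, 1), (7, 6, 1), (7, 7, 1), (2, 8, 0)]
box, score = localize(keypoints, 10, 10, 4, 4, 2, count_object_words)
```

Several overlapping windows reach the same top score in this example, and the first one found is returned; a classifier that has learned background words would show the same ambiguity on a larger scale.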
Visual codebook
Training images with bounding-boxes
Training images without bounding-boxes
Good Codebook with equal population of positive and negative visual words
Positive background different from negative images
Positive background similar to negative images
With no bounding-box utilized, the codebook consists of a majority of negative visual words.
Visual codebook
Training images with bounding-boxes
Training images without bounding-boxes
Good Codebook with equal population of positive and negative visual words
Positive background different from negative images
Positive background similar to negative images
Classification based on object context (background) rather than object features.
Improve Classification
The detection at each iteration estimates a bounding box, which provides a better visual codebook, which in turn leads to better detection.
• Key-point configurations as features are a discriminative object feature set.
• A configuration of visual words appends structural information to the bag-of-words model.
Object detection
• Harvest frequent and discriminative configurations.
• Encode configurations as transaction vectors.
• The association between a transaction vector and the training type is an association rule.
• The Apriori algorithm finds association rules with high confidence in a support-confidence framework.
Figure: transaction vector encoding a key-point configuration
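The support-confidence framework can be made concrete with a small sketch. It assumes a transaction is the set of visual-word indices in a key-point's neighbourhood paired with the label of its training image; this encoding and the names below are illustrative assumptions, not the exact encoding used in this work. For a rule "configuration → label", support is the fraction of all transactions that contain the configuration and carry the label, and confidence is that count divided by the number of transactions containing the configuration.

```python
def rule_stats(config, label, transactions):
    """Support and confidence of the association rule config -> label.

    transactions: list of (frozenset_of_visual_words, label) pairs.
    """
    config = frozenset(config)
    n_config = sum(1 for items, _ in transactions if config <= items)
    n_both = sum(1 for items, lab in transactions
                 if config <= items and lab == label)
    support = n_both / len(transactions)
    confidence = n_both / n_config if n_config else 0.0
    return support, confidence

# Made-up transactions: visual-word sets with the label of their training image.
transactions = [
    (frozenset({3, 7, 12}), "pos"),
    (frozenset({3, 7}), "pos"),
    (frozenset({3, 9}), "neg"),
    (frozenset({7, 12}), "neg"),
    (frozenset({3, 7, 9}), "pos"),
]
sup_pair, conf_pair = rule_stats({3, 7}, "pos", transactions)
sup_single, conf_single = rule_stats({3}, "pos", transactions)
```

In this toy data the two-word configuration {3, 7} predicts the positive class with higher confidence than the single word {3}, since {3} also occurs in a negative transaction.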
Apriori algorithm
• Uses breadth-first search and tree structure.
• Longer configurations will have lower support as they are infrequent but higher confidence as they are more discriminative.
• Downward closure lemma: prune configurations with infrequent subsets.
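The level-wise search and the downward-closure pruning described above can be sketched as follows. This is a simplified frequent-itemset miner under stated assumptions: it uses plain Python sets rather than the tree structure mentioned above, it mines frequent configurations only (rule confidence would be computed afterwards), and the transactions are made up.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def support(candidate):
        return sum(1 for t in transactions if candidate <= t) / n

    frequent = {}
    # Level 1 candidates: every single item that occurs in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while candidates:
        # Breadth-first: evaluate all candidates of size k before size k + 1.
        level = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (downward closure): a candidate survives only if
        # every one of its (k - 1)-item subsets is itself frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
    return frequent

transactions = [frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({1, 3}),
                frozenset({2, 3}), frozenset({1, 2, 3})]
frequent = apriori(transactions, min_support=0.5)
```

Here each single item has support 0.8 and each pair 0.6, while the triple {1, 2, 3} has support 0.4 and is rejected, which illustrates how support falls as configurations grow longer.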
Object localization
[Flowchart: the training data set and the test image are each converted to transactions; Apriori data mining on the training transactions yields association rules; a confidence is generated for each test transaction and the confidences are thresholded.]
• A confidence is assigned to every key-point in the image.
• Key-points with sufficiently high confidence are retained.
• Key-points which occur on common background objects like doors and windows can have high confidence.
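The thresholding step can be sketched as below. Several details are assumptions made for illustration: a key-point's confidence is taken to be the highest confidence among the mined rules whose configuration occurs in its transaction, and the retained key-points are summarized by their enclosing box. Neither choice is confirmed by the slides, and the rule values are made up.

```python
def keypoint_confidence(transaction, rules):
    """Highest confidence among rules whose configuration is in the transaction."""
    matches = [conf for config, conf in rules if config <= transaction]
    return max(matches, default=0.0)

def retain_keypoints(keypoints, rules, threshold):
    """Positions of key-points whose confidence reaches the threshold."""
    return [(x, y) for (x, y, transaction) in keypoints
            if keypoint_confidence(transaction, rules) >= threshold]

def enclosing_box(points):
    """(x_min, y_min, x_max, y_max) around the retained key-points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

# Hypothetical mined rules: (configuration, confidence for the object class).
rules = [(frozenset({3, 7}), 0.9), (frozenset({9}), 0.4)]
# Key-points as (x, y, transaction of visual words around the key-point).
keypoints = [
    (10, 12, frozenset({3, 7, 12})),
    (40, 44, frozenset({3, 7})),
    (80, 5, frozenset({9, 12})),
    (90, 90, frozenset({1})),
]
kept = retain_keypoints(keypoints, rules, threshold=0.8)
box = enclosing_box(kept)
```

A background key-point whose transaction happens to match a high-confidence rule would be retained in exactly the same way, which is the failure case noted in the last bullet.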
Object classification using Apriori
[Flowchart: the training data set and the test images are each converted to transactions; Apriori data mining on the training transactions yields association rules; a confidence is generated for each test transaction and the confidences are summed.]
ROC ‘car’ in PASCAL VOC 2006
The summed confidence score depends upon object scale in the image, which explains the comparatively poor performance of this approach.
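The scale dependence can be seen in a toy calculation. The sketch assumes that a larger object simply contributes more key-points (and hence more transactions) of the same per-transaction confidence; the numbers are made up. The mean is shown only as a contrast that does not grow with the number of key-points; it is a hypothetical alternative that was not evaluated in this work.

```python
def summed_confidence(confidences):
    """Image score as the sum of per-transaction confidences."""
    return sum(confidences)

def mean_confidence(confidences):
    """Hypothetical scale-independent alternative: average confidence."""
    return sum(confidences) / len(confidences) if confidences else 0.0

# Same object category, same per-transaction confidence, different scale.
small_object = [0.9] * 5    # few key-points fall on a small object
large_object = [0.9] * 50   # many key-points fall on a large object

sum_small = summed_confidence(small_object)
sum_large = summed_confidence(large_object)
mean_small = mean_confidence(small_object)
mean_large = mean_confidence(large_object)
```

With a single decision threshold on the summed score, the small instance can fall below the threshold even though each of its transactions is as confident as those of the large instance.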
Conclusions
• The ‘bag-of-words’ model is good for classification, but poor for localization.
• Separate foreground-background for better visual codebooks.
• The good classification using the PASCAL VOC 2006 dataset is attributed to recognition of object context rather than object features.
• The dataset utilized should have sufficient variation in appearance of the object and its background.
• A larger visual vocabulary gives slightly better classification, but is computationally more expensive.
• The visual vocabulary built has a majority of background visual words since bounding-boxes are not utilized during training.
Conclusions
• Improving the proportion of visual words representing the object in the vocabulary is vital for good classification.
• Incorporate the object boundary contour into the descriptor.
• Use of frequent and discriminative key-point configurations is a promising approach for object localization.
• A low quality dataset results in a weak visual codebook and classifiers biased to the training data.
• Classification using key-point configurations was poor compared to ‘bag-of-words’ for PASCAL VOC 2006.
Future Work
• Improve a visual codebook by increasing the proportion of visual words pertaining to object features. Combine Apriori based localization and clustering for visual word selection in an iterative approach.
• Model visual scene information (use the GIST descriptor by Torralba). Learn co-occurrence statistics of a scene and a visual category. Recognition of the scene serves as a prior for object presence and improves object recognition performance.
• Improve object localization by using context priming.
• Model object contextual information to aid foreground-background disambiguation for better object localization.
Future Work
• Share feature information between visual categories. The size of a universal visual vocabulary should increase sub-linearly with the number of visual categories.
• Combine image segmentation and classification to improve the object model and provide better classification performance.
• Build a hierarchical framework for visual categorization:
• Representation: combine local and global features.
• Model: combine semantic and structural object models.
• Classification: combine generative and discriminative approaches.
Questions?