PANDA: Pose Aligned Networks for Deep Attribute Modeling
Ning Zhang1,2, Manohar Paluri1, Marc'Aurelio Ranzato1, Trevor Darrell2, Lubomir Bourdev1
1: Facebook AI Research, 2: EECS, UC Berkeley
{nzhang, trevor}@eecs.berkeley.edu, {mano, ranzato, lubomir}@fb.com
Outline
• Introduction
• Related work
• Pose Aligned Networks for Deep Attribute modeling
• Datasets
• Results
• Conclusion
Introduction
• Recognizing human attributes, such as gender, age, hair style, and
clothing style, has many applications.
• The signal associated with some attributes is subtle and the image is
dominated by the effects of pose and viewpoint.
• Deep learning methods, in particular convolutional nets [20], have
achieved very good performance on several tasks.
• Moreover, Donahue et al. [8] show that features extracted from
a deep convolutional network trained on large datasets are
generic and can help in other visual recognition problems.
• [20] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
• [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. arXiv:1310.1531, 2013.
• We conjecture that available training data, even ImageNet-scale, is
presently insufficient for learning pose normalization in a CNN.
• Part-based methods have gained significant recent attention
as a way to deal with pose variation and are the state-of-the-art
approach for attribute prediction today.
• Our method can use other part detectors, and we show the performance
using DPM [12] as well. We demonstrate the effectiveness of
PANDA on attribute classification problems and present state-of-the-art
experimental results on three datasets.
• [12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
Related work
• 2.1. Attribute classification
Attributes are used as an intermediate representation for
knowledge transfer in [17, 10] for object recognition tasks.
• There is also related work on automatic attribute discovery:
Berg et al. [1] proposed automatic discovery of attribute vocabularies
by mining unlabeled text and image data sampled from the web.
• [10] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing Objects by their Attributes. In CVPR, 2009.
• [17] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer. In CVPR, 2009.
• [1] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. In ECCV, 2010.
• In [16], facial attributes such as gender, mouth shape, and facial
expression are learned for face verification and image
search tasks.
• A very closely related work on attribute prediction is Bourdev et
al. [4], a three-layer feed-forward classification system.
• [16] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, 2009.
• [4] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In ICCV, 2011.
• 2.2. Deep learning
• The most popular deep learning method for vision is the
convolutional neural network (CNN).
• Although very successful when provided with very large labeled
datasets, convolutional nets usually generalize poorly on
smaller datasets.
Pose Aligned Networks for Deep Attribute modeling
• We explore part-based models, specifically poselets, and deep
learning.
• Our goal is to use poselets for part localization and
incorporate these normalized parts into deep convolutional
nets in order to extract pose normalized representations.
• Towards this goal, we leverage both the power of
convolutional nets for learning discriminative features from
data and the ability of poselets to simplify the learning task by
decomposing the objects into their canonical poses.
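As a rough illustration of what such a pose-normalized representation could look like, the sketch below concatenates fixed-length per-poselet features into one vector, zero-filling parts whose poselet did not fire on a given person. The feature dimensions and the zero-filling convention are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def pose_normalized_representation(part_features, active):
    """Concatenate per-poselet features into one pose-normalized vector.

    Undetected poselets contribute zeros, so every example yields a
    vector of the same length regardless of which parts were found.
    """
    parts = [f if on else np.zeros_like(f)
             for f, on in zip(part_features, active)]
    return np.concatenate(parts)

# Three poselets, 16-d features each; poselet 2 did not fire.
feats = [np.ones(16), 2 * np.ones(16), 3 * np.ones(16)]
active = [True, False, True]
rep = pose_normalized_representation(feats, active)
assert rep.shape == (48,)
```

A downstream attribute classifier then always sees features in the same positional slots, which is what makes the representation pose aligned.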
• Specifically, we start from poselet patches, resize them to
64x64 pixels (Figure 3), randomly jitter each patch and flip it
horizontally with probability 0.5 to improve generalization, and
train a CNN for each poselet.
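The augmentation step described above (random jitter plus a 50% horizontal flip) can be sketched as follows; the jitter range and the edge-padding scheme are assumptions, since the slides only state that patches are jittered and flipped.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_patch(patch, max_jitter=4, flip_prob=0.5):
    """Randomly translate (jitter) a 64x64 patch and flip it horizontally.

    Pads the patch at its edges, then crops at a random offset to
    simulate translation; mirrors left-to-right with probability 0.5.
    """
    h, w = patch.shape[:2]
    padded = np.pad(patch, ((max_jitter, max_jitter),
                            (max_jitter, max_jitter), (0, 0)), mode="edge")
    dy, dx = rng.integers(0, 2 * max_jitter + 1, size=2)
    out = padded[dy:dy + h, dx:dx + w]
    if rng.random() < flip_prob:
        out = out[:, ::-1]  # horizontal flip
    return out

patch = rng.random((64, 64, 3))
aug = augment_patch(patch)
assert aug.shape == (64, 64, 3)
```

Each poselet's CNN would then be trained on a stream of such randomly perturbed copies of its patches, which is what improves generalization on limited data.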
• The whole network is trained jointly by standard back-propagation of
the error [24] and stochastic gradient descent [2], using as a loss
function the sum of the log-losses of each attribute for each training
sample.
• The details of the layers are given in Figure 2 and further implementation details can be found in [15].
• [24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
• [2] L. Bottou. Stochastic Gradient Descent Tricks. In G. Montavon, G. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 421–436. Springer Berlin Heidelberg, 2012.
• [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
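The loss described above — the sum of the log-losses of each attribute for one training sample — can be sketched by treating each attribute as an independent sigmoid output; the attribute names in the example are illustrative.

```python
import numpy as np

def attribute_log_loss(logits, labels):
    """Sum of independent per-attribute log-losses for one sample.

    Each attribute gets its own sigmoid probability; the total loss is
    the sum of the binary cross-entropies over attributes.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))   # per-attribute sigmoid
    eps = 1e-12                             # numerical safety for log
    losses = -(labels * np.log(probs + eps)
               + (1 - labels) * np.log(1 - probs + eps))
    return losses.sum()

# Three attributes (e.g. gender, glasses, hat), one training sample.
logits = np.array([2.0, -1.0, 0.0])
labels = np.array([1.0, 0.0, 1.0])
loss = attribute_log_loss(logits, labels)
```

Summing over attributes lets all attribute classifiers share the same convolutional layers while being trained jointly by back-propagation.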
• The overall convolutional net architecture is shown in Figure 2.
• Based on our experiments, we find that a more complex net is
needed for the whole-person region than for the part regions.
We extract deep convolutional features from the model
trained on ImageNet [15], using the open-source package
provided by [8], as our deep representation of the full-image
patch.
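The idea of reusing a pretrained network's late-layer activations as a generic image feature (the DeCAF approach of [8]) can be sketched as below. A single fully connected layer with random weights stands in for the ImageNet-trained model of [15], so the weights here are an illustrative assumption; only the 4096-d feature size echoes that architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(image, weights, bias):
    """Return late-layer ReLU activations as a generic image feature.

    In practice the weights would come from a network pretrained on
    ImageNet; random weights here only demonstrate the data flow.
    """
    x = image.reshape(-1)          # flatten the image patch
    acts = weights @ x + bias      # one "late" fully connected layer
    return np.maximum(acts, 0.0)   # ReLU activations as the feature

image = rng.random((64, 64, 3))
W = rng.standard_normal((4096, 64 * 64 * 3)) * 0.01
b = np.zeros(4096)
feat = extract_features(image, W, b)
assert feat.shape == (4096,)
```

The extracted vector is then used as the full-image-patch representation alongside the part-based features.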
Datasets
• 4.1. The Berkeley Human Attributes Dataset
We tested our method on the Berkeley Human Attributes Dataset [4].
This dataset consists of 4013 training and 4022 test images collected
from the PASCAL and H3D datasets.
• 4.2. Attributes 25K Dataset
Unfortunately, the training portion of the Berkeley dataset is not
large enough for training our deep-net models (they severely
overfit when trained only on these images).
We collected an additional dataset from Facebook of 24,963
people, split into 8,737 training, 8,737 validation, and 7,489 test
examples.
Results
• [27] N. Zhang, R. Farrell, F. Iandola, and T. Darrell. Deformable Part Descriptors for Fine-grained Recognition and Attribute Prediction. In ICCV, 2013.
Conclusion
• We presented a method for attribute classification of people
that improves performance compared with previously published
methods.
• Our feature representation is generic, and we achieve state-of-the-art
results on the Berkeley Attributes of People dataset
and on LFW even though we train our CNNs on a different dataset.