Learning to Rotate 3D Objects: Weakly-supervised Disentangling with
Recurrent Transformations
Jimei Yang [1,3], Scott Reed [2], Ming-Hsuan Yang [1] and Honglak Lee [2]
[1] UC Merced  [2] University of Michigan, Ann Arbor  [3] Adobe Research
3D Vision from a Single Image
• 3D object recognition, Asthana et al., ICCV 2011
• 3D object manipulation, Banerjee et al., SIGGRAPH 2014
3D Vision from a Single Image
• Challenges:
– Partial observability inherent in projecting a 3D object onto the image space
– Ill-posedness of inferring object shape and pose
• Classic approach
– 3D object reconstruction
• Our approach …
Human 3D Vision
https://psychlopedia.wikispaces.com/mental+rotation
Human 3D Vision
• Mental rotation of three-dimensional objects, Shepard and Metzler, Science, 1971
– People can mentally rotate two objects to decide whether they are actually the same object seen from different perspectives
– The greater the angle an object is rotated, the longer it takes people to identify it
3D Vision from a Single Image
• Solution inspired by mental rotation
– Jointly model 3D recognition and view synthesis
– Learn distributed representations (neural networks) instead of recovering a 3D model
Deep Convolutional Networks
• Convolutional networks have demonstrated a remarkable ability to recognize and generate objects
– Discriminative CNNs, Krizhevsky et al., NIPS 2012
– Generative CNNs, Dosovitskiy et al., CVPR 2015
Discriminative CNNs
• Given an image, the discriminative CNNs produce high-level feature representations
Generative CNNs
• Given high-level abstract representations, the generative CNNs produce object images
Action-driven Convolutional Encoder-Decoder Networks
• Input: an image of a 3D object
• Output: its rotated view
• Latent units: pose and identity
• Action units: [100], [010], [001]
• Transforming autoencoder, Hinton et al.
• What-where autoencoder, Zhao et al.
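The encoder-decoder factorization above can be sketched with toy linear maps; all dimensions are illustrative, and the random matrices stand in for the learned convolutional encoder, decoder, and per-action pose transforms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's architecture)
IMG_DIM, ID_DIM, POSE_DIM, ACT_DIM = 64, 8, 4, 3

# Random linear "encoder" and "decoder" stand in for the conv nets
W_enc = rng.normal(size=(ID_DIM + POSE_DIM, IMG_DIM))
W_dec = rng.normal(size=(IMG_DIM, ID_DIM + POSE_DIM))
# One pose-unit transform per action unit
W_act = rng.normal(size=(ACT_DIM, POSE_DIM, POSE_DIM))

def encode(x):
    """Split the latent code into identity and pose units."""
    z = W_enc @ x
    return z[:ID_DIM], z[ID_DIM:]

def transform_pose(z_pose, action):
    """Apply the action-selected transform to the pose units.
    `action` is a one-hot vector, e.g. [0, 0, 1]."""
    T = np.tensordot(action, W_act, axes=1)  # select the transform
    return T @ z_pose

def decode(z_id, z_pose):
    """Render the rotated view from identity + transformed pose."""
    return W_dec @ np.concatenate([z_id, z_pose])

x = rng.normal(size=IMG_DIM)              # stand-in for an input image
z_id, z_pose = encode(x)
y = decode(z_id, transform_pose(z_pose, np.array([0.0, 0.0, 1.0])))
```

The key design choice this sketch mirrors is that the action units act only on the pose units, leaving the identity units untouched.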
(Figure: input rotated by 15° and 30° via the action unit [001])
Recurrent Convolutional Encoder-Decoder Networks
• To enable long-term rotations, we allow the pose units to be recurrent (45° → {[001],[001],[001]})
• And fix the identity units across all views in a sequence
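A minimal sketch of this recurrence: the pose units are updated once per action step while the identity units stay fixed. The tanh update and the random transform are assumptions standing in for the learned network:

```python
import numpy as np

rng = np.random.default_rng(0)
ID_DIM, POSE_DIM = 8, 4

# A 45° rotation is three 15° steps: the action sequence {[001],[001],[001]}.
# One fixed matrix stands in for the learned pose-unit recurrence.
T_cw = rng.normal(size=(POSE_DIM, POSE_DIM)) / np.sqrt(POSE_DIM)

z_id = rng.normal(size=ID_DIM)      # identity units: held fixed across the sequence
z_pose = rng.normal(size=POSE_DIM)  # pose units: updated at every step
z_id_before = z_id.copy()

poses = []
for _ in range(3):                    # three applications of action [001]
    z_pose = np.tanh(T_cw @ z_pose)   # recurrent pose update (tanh is an assumption)
    poses.append(z_pose.copy())
# z_id is untouched: the same identity code renders every rotated view
```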
Curriculum Training
• We present sequences of continuous views of the same object (training as pose manifold traversal)
• We gradually increase the difficulty of training by increasing the trajectory length (Bengio et al., ICML 2009)
One-step rotation: RNN1
Two-step rotation: RNN2
Four-step rotation: RNN4
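The staged schedule can be sketched as a loop over growing trajectory lengths; `train_stage` is a hypothetical callback standing in for one stage of sequence training, with weights carried over between stages:

```python
# Curriculum: train on progressively longer rotation trajectories,
# reusing the weights from the previous stage (RNN1 -> RNN2 -> RNN4).
def run_curriculum(stage_lengths, train_stage):
    """Run the stages in order; `train_stage(length)` is a hypothetical
    stand-in for training the RNN on sequences of that length."""
    for length in stage_lengths:
        train_stage(length)

completed = []
run_curriculum([1, 2, 4], completed.append)
```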
Multi-PIE Faces
• Data (Gross et al., IVC 2010)
– 337 people, 7 viewpoints from −45° to 45°
– 200 people for training, 137 people for test
– 80×60×3 pixels per image
• Models: RNN1, RNN2, RNN4 and RNN6
• Comparisons
– 3D face morphable model for pose normalization (Zhu et al., CVPR 2015)
– Discriminative CNN for face recognition
3D View Synthesis for Novel Objects
Comparing to 3D Face Morphable Model for Pose Normalization
• Zhu et al., CVPR 2015
Cross-View Face Recognition
• In the test set, one view serves as the gallery and the other views as probes
• Results are measured by matching success rates at different angle offsets between gallery and probe views
• Compared to a discriminative CNN
– Train a 5-layer CNN with face identity labels
– Extract features from the layer before the labels for matching
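The matching protocol above can be sketched as nearest-neighbor retrieval under cosine similarity; the random features here are stand-ins for the learned identity units or CNN features:

```python
import numpy as np

rng = np.random.default_rng(0)

def match_rate(gallery_feats, probe_feats):
    """Row i of the gallery and row i of the probes share an identity.
    A probe counts as a success when its nearest gallery feature
    (by cosine similarity) belongs to its own identity."""
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    nearest = np.argmax(p @ g.T, axis=1)
    return float(np.mean(nearest == np.arange(len(probe_feats))))

# Toy check: probes are slightly perturbed copies of the gallery features
gallery = rng.normal(size=(10, 16))
probes = gallery + 0.01 * rng.normal(size=(10, 16))
rate = match_rate(gallery, probes)
```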
Cross-View Face Recognition
Average success rates: RNN 93.3, CNN 92.6
3D Chairs
• Data (Aubry et al., CVPR 2014)
– 809 chair instances, rendered from 31 azimuth angles and 2 elevation angles
– 500 instances for training, 309 instances for test
– 64×64×3 pixels per image
• Models: RNN1, RNN2, RNN4, RNN8, RNN16
• KNN baseline
– Extract "fc7" features with the VGG-16 CNN to retrieve the K nearest neighbors in the training set
– Given the target rotation angles, average the corresponding views of the K retrieved neighbors
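A sketch of this baseline, with random arrays standing in for the fc7 features and the rendered training views (all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: "fc7" features and rendered views of the training chairs
# (in the paper these come from VGG-16 and the 3D chair dataset)
N_TRAIN, FEAT_DIM, N_VIEWS, IMG_DIM, K = 500, 32, 31, 64, 5
train_feats = rng.normal(size=(N_TRAIN, FEAT_DIM))
train_views = rng.normal(size=(N_TRAIN, N_VIEWS, IMG_DIM))

def knn_synthesize(query_feat, target_view):
    """Retrieve the K training chairs nearest in feature space and
    average their renderings at the target rotation angle."""
    d = np.linalg.norm(train_feats - query_feat, axis=1)
    idx = np.argsort(d)[:K]
    return train_views[idx, target_view].mean(axis=0)

pred = knn_synthesize(train_feats[0], target_view=10)
```

Averaging the retrieved views explains the baseline's blurry outputs when no training chair closely matches the query.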
Comparing RNNs
• Perform 16-step rotations using RNNs at different curriculum training stages
Comparing RNNs with KNNs
(Figure: synthesized chair rotations, alternating KNN and RNN rows)
Cross-View Chair Recognition
• The same setup as the Multi-PIE dataset
• Compared to the VGG-16 CNN
Average success rates: RNN 56.8, CNN 52.2
Class Interpolation and View Synthesis
• Given two chair images of the same view from different classes,
– the encoder computes their identity units z1_id, z2_id and pose units z1_pose, z2_pose,
– the interpolation, for b ∈ {0.0, 0.2, …, 0.8, 1.0}, is
  z_id = b · z1_id + (1 − b) · z2_id
  z_pose = b · z1_pose + (1 − b) · z2_pose,
– z_id and z_pose are fed into the recurrent decoder to render novel images
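The interpolation rule, sketched in numpy; the random vectors stand in for the encoder's outputs, and the decoder step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
ID_DIM, POSE_DIM = 8, 4

# Identity and pose units of two chairs (random stand-ins for encoder outputs)
z1_id, z2_id = rng.normal(size=(2, ID_DIM))
z1_pose, z2_pose = rng.normal(size=(2, POSE_DIM))

# Linear interpolation over b = 0.0, 0.2, ..., 1.0
betas = np.linspace(0.0, 1.0, 6)
z_id = betas[:, None] * z1_id + (1 - betas[:, None]) * z2_id
z_pose = betas[:, None] * z1_pose + (1 - betas[:, None]) * z2_pose
# Each (z_id[i], z_pose[i]) pair would be fed to the recurrent decoder
# to render one image along the class-interpolation path
```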
Class Interpolation and View Synthesis
• Each column: pose manifold traversal
• Each row: style manifold traversal
Concluding Remarks
• High-quality 3D view synthesis from a single image is achieved with general deep convolutional networks
• Disentangled representations are learned with recurrent transformations, without class labels
– Cross-view object recognition
– Chair interpolation
• The curriculum strategy helps with RNN training