Learning to Rotate 3D Objects: Weakly-supervised Disentangling with
Recurrent Transformations
Jimei Yang [1,3], Scott Reed [2], Ming-Hsuan Yang [1] and Honglak Lee [2]
[1] UC Merced  [2] University of Michigan, Ann Arbor  [3] Adobe Research
3D Vision from a Single Image
• 3D object recognition, Asthana et al., ICCV 2011
• 3D object manipulation, Banerjee et al., SIGGRAPH 2014
3D Vision from a Single Image
• Challenges:
– Partial observability inherent in projecting a 3D object onto the image space
– Ill-posedness of inferring object shape and pose
• Classic approach
– 3D object reconstruction
• Our approach …
Human 3D Vision
https://psychlopedia.wikispaces.com/mental+rotation
Human 3D Vision
• Mental rotation of three-dimensional objects, Shepard and Metzler, Science, 1971
– People can mentally rotate two objects to decide whether they are actually the same object seen from different perspectives
– The greater the angle an object is rotated, the longer it takes people to identify it
3D Vision from a Single Image
• Solution inspired by mental rotation
– Jointly model 3D recognition and view synthesis
– Learn distributed representations (neural networks) instead of recovering a 3D model
Deep Convolutional Networks
• Convolutional networks have demonstrated a remarkable ability to recognize and generate objects
– Discriminative CNNs, Krizhevsky et al., NIPS 2012
– Generative CNNs, Dosovitskiy et al., CVPR 2015
Discriminative CNNs
• Given an image, the discriminative CNNs produce high-level feature representations
Generative CNNs
• Given high-level abstract representations, the generative CNNs produce object images
Action-driven Convolutional Encoder-Decoder Networks
• Input: an image of a 3D object
• Output: its rotated view
• Latent units: pose and identity
• Action units: [100], [010], [001]
• Transforming autoencoder, Hinton et al.
• What-where autoencoder, Zhao et al.
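The encoder-decoder factorization above can be sketched with toy linear maps; all dimensions are illustrative, and the random matrices stand in for the learned convolutional encoder, decoder, and per-action pose transforms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's architecture)
IMG_DIM, ID_DIM, POSE_DIM, ACT_DIM = 64, 8, 4, 3

# Random linear "encoder" and "decoder" stand in for the conv nets
W_enc = rng.normal(size=(ID_DIM + POSE_DIM, IMG_DIM))
W_dec = rng.normal(size=(IMG_DIM, ID_DIM + POSE_DIM))
# One pose-unit transform per action unit
W_act = rng.normal(size=(ACT_DIM, POSE_DIM, POSE_DIM))

def encode(x):
    """Split the latent code into identity and pose units."""
    z = W_enc @ x
    return z[:ID_DIM], z[ID_DIM:]

def transform_pose(z_pose, action):
    """Apply the action-selected transform to the pose units.
    `action` is a one-hot vector, e.g. [0, 0, 1]."""
    T = np.tensordot(action, W_act, axes=1)  # select the transform
    return T @ z_pose

def decode(z_id, z_pose):
    """Render the rotated view from identity + transformed pose."""
    return W_dec @ np.concatenate([z_id, z_pose])

x = rng.normal(size=IMG_DIM)              # stand-in for an input image
z_id, z_pose = encode(x)
y = decode(z_id, transform_pose(z_pose, np.array([0.0, 0.0, 1.0])))
```

The key design choice this sketch mirrors is that the action units act only on the pose units, leaving the identity units untouched.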
(Figure: input rotated by 15° and 30° via the action unit [001])
Recurrent Convolutional Encoder-Decoder Networks
• To enable long-term rotations, we allow the pose units to be recurrent (45° → {[001],[001],[001]})
• And fix the identity units across all views in a sequence
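A minimal sketch of this recurrence: the pose units are updated once per action step while the identity units stay fixed. The tanh update and the random transform are assumptions standing in for the learned network:

```python
import numpy as np

rng = np.random.default_rng(0)
ID_DIM, POSE_DIM = 8, 4

# A 45° rotation is three 15° steps: the action sequence {[001],[001],[001]}.
# One fixed matrix stands in for the learned pose-unit recurrence.
T_cw = rng.normal(size=(POSE_DIM, POSE_DIM)) / np.sqrt(POSE_DIM)

z_id = rng.normal(size=ID_DIM)      # identity units: held fixed across the sequence
z_pose = rng.normal(size=POSE_DIM)  # pose units: updated at every step
z_id_before = z_id.copy()

poses = []
for _ in range(3):                    # three applications of action [001]
    z_pose = np.tanh(T_cw @ z_pose)   # recurrent pose update (tanh is an assumption)
    poses.append(z_pose.copy())
# z_id is untouched: the same identity code renders every rotated view
```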
Curriculum Training
• We present sequences of continuous views of the same object (training as pose manifold traversal)
• We gradually increase the difficulty of training by increasing the trajectory length (Bengio et al., ICML 2009)
One-step rotation: RNN1
Two-step rotation: RNN2
Four-step rotation: RNN4
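The staged schedule can be sketched as a loop over growing trajectory lengths; `train_stage` is a hypothetical callback standing in for one stage of sequence training, with weights carried over between stages:

```python
# Curriculum: train on progressively longer rotation trajectories,
# reusing the weights from the previous stage (RNN1 -> RNN2 -> RNN4).
def run_curriculum(stage_lengths, train_stage):
    """Run the stages in order; `train_stage(length)` is a hypothetical
    stand-in for training the RNN on sequences of that length."""
    for length in stage_lengths:
        train_stage(length)

completed = []
run_curriculum([1, 2, 4], completed.append)
```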
Multi-PIE Faces
• Data (Gross et al., IVC 2010)
– 337 people, 7 viewpoints from −45° to 45°
– 200 people for training, 137 people for test
– 80×60×3 pixels per image
• Models: RNN1, RNN2, RNN4 and RNN6
• Comparisons
– 3D face morphable model for pose normalization (Zhu et al., CVPR 2015)
– Discriminative CNN for face recognition
3D View Synthesis for Novel Objects
Comparing to 3D Face Morphable Model for Pose Normalization
• Zhu et al., CVPR 2015
Cross-View Face Recognition
• In the test set, one view serves as the gallery and the other views as probes
• Results are measured by matching success rates at different angle offsets between gallery and probe views
• Compared to a discriminative CNN
– Train a 5-layer CNN with face identity labels
– Extract features from the layer before the labels for matching
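The matching protocol above can be sketched as nearest-neighbor retrieval under cosine similarity; the random features here are stand-ins for the learned identity units or CNN features:

```python
import numpy as np

rng = np.random.default_rng(0)

def match_rate(gallery_feats, probe_feats):
    """Row i of the gallery and row i of the probes share an identity.
    A probe counts as a success when its nearest gallery feature
    (by cosine similarity) belongs to its own identity."""
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    nearest = np.argmax(p @ g.T, axis=1)
    return float(np.mean(nearest == np.arange(len(probe_feats))))

# Toy check: probes are slightly perturbed copies of the gallery features
gallery = rng.normal(size=(10, 16))
probes = gallery + 0.01 * rng.normal(size=(10, 16))
rate = match_rate(gallery, probes)
```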
Cross-View Face Recognition
Average success rates: RNN 93.3, CNN 92.6
3D Chairs
• Data (Aubry et al., CVPR 2014)
– 809 chair instances, rendered from 31 azimuth angles and 2 elevation angles
– 500 instances for training, 309 instances for test
– 64×64×3 pixels per image
• Models: RNN1, RNN2, RNN4, RNN8, RNN16
• KNN baseline
– Extract "fc7" features with the VGG-16 CNN to retrieve the K nearest neighbors in the training set
– Given the target rotation angles, average the corresponding views of the K retrieved neighbors
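A sketch of this baseline, with random arrays standing in for the fc7 features and the rendered training views (all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: "fc7" features and rendered views of the training chairs
# (in the paper these come from VGG-16 and the 3D chair dataset)
N_TRAIN, FEAT_DIM, N_VIEWS, IMG_DIM, K = 500, 32, 31, 64, 5
train_feats = rng.normal(size=(N_TRAIN, FEAT_DIM))
train_views = rng.normal(size=(N_TRAIN, N_VIEWS, IMG_DIM))

def knn_synthesize(query_feat, target_view):
    """Retrieve the K training chairs nearest in feature space and
    average their renderings at the target rotation angle."""
    d = np.linalg.norm(train_feats - query_feat, axis=1)
    idx = np.argsort(d)[:K]
    return train_views[idx, target_view].mean(axis=0)

pred = knn_synthesize(train_feats[0], target_view=10)
```

Averaging the retrieved views explains the baseline's blurry outputs when no training chair closely matches the query.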
Comparing RNNs
• Perform 16-step rotations using RNNs at different curriculum training stages
Comparing RNNs with KNNs
(Figure: synthesized chair rotations, alternating KNN and RNN rows)
Cross-View Chair Recognition
• The same setup as the Multi-PIE dataset
• Compared to the VGG-16 CNN
Average success rates: RNN 56.8, CNN 52.2
Class Interpolation and View Synthesis
• Given two chair images of the same view from different classes,
– the encoder computes their identity units z1_id, z2_id and pose units z1_pose, z2_pose,
– the interpolation, for b ∈ {0.0, 0.2, …, 0.8, 1.0}, is
  z_id = b · z1_id + (1 − b) · z2_id
  z_pose = b · z1_pose + (1 − b) · z2_pose,
– z_id and z_pose are fed into the recurrent decoder to render novel images
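The interpolation rule, sketched in numpy; the random vectors stand in for the encoder's outputs, and the decoder step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
ID_DIM, POSE_DIM = 8, 4

# Identity and pose units of two chairs (random stand-ins for encoder outputs)
z1_id, z2_id = rng.normal(size=(2, ID_DIM))
z1_pose, z2_pose = rng.normal(size=(2, POSE_DIM))

# Linear interpolation over b = 0.0, 0.2, ..., 1.0
betas = np.linspace(0.0, 1.0, 6)
z_id = betas[:, None] * z1_id + (1 - betas[:, None]) * z2_id
z_pose = betas[:, None] * z1_pose + (1 - betas[:, None]) * z2_pose
# Each (z_id[i], z_pose[i]) pair would be fed to the recurrent decoder
# to render one image along the class-interpolation path
```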
Class Interpolation and View Synthesis
• Each column: pose manifold traversal
• Each row: style manifold traversal
Concluding Remarks
• High-quality 3D view synthesis from a single image is achieved with general deep convolutional networks
• Disentangled representations are learned with recurrent transformations, without class labels
– Cross-view object recognition
– Chair interpolation
• The curriculum strategy helps with RNN training