Deep Learning for Computer Vision, Spring 2019
http://vllab.ee.ntu.edu.tw/dlcv.html (primary)
https://ceiba.ntu.edu.tw/1072CommE5052 (grade, etc.)
FB: DLCV Spring 2019
Yu-Chiang Frank Wang 王鈺強, Associate Professor
Dept. Electrical Engineering, National Taiwan University
2019/05/29
• Domain Adaptation
  • From one domain to another (e.g., USPS → SVHN)
• Multi-Source Domain Adaptation
  • A more practical scenario where training data are collected from multiple sources
2
Final Challenge #1: Visual Domain Adaptation Challenge (VisDA)
Final Challenge #2: Wider Face and Person Challenge
• ICCV 2019 Workshop Challenge: WIDER Face Detection
• http://wider-challenge.org/2019.html
• Since you’ve worked on the challenging HW #2…
• Goal: to search for a person in a large database with just a single query image
• In this challenge, you are given an image of a target cast member and some candidates (movie frames with person bounding boxes); you will need to search for all instances belonging to that cast member.
3
[Figure: example query (cast) image and candidate frames]
Final Challenge #2Wider Face and Person Challenge• Remarks
• The dataset considered in this challenge is collected from movies or TV series.
• The main cast (top 10 in the IMDb cast list) are collected as the queries.
• A candidate should be annotated as positive or negative: positive means that the candidate belongs to the cast member of interest, while negative means an incorrect/mismatched one.
• You should also produce a confidence score for each candidate with respect to a query (cast member).
4
[Figure: query image and candidate frames]
What’s to be Covered …
• Learning Beyond Images
  • 2D/3D Visual Data
  • Depth Images
• 2 guest talks next week
  • Dr. Wei-Sheng Lai (B97), UC Merced (Google/MSR/Adobe Research/Nvidia Research)
  • Dr. Shang-Hong Lai, Principal Researcher, Microsoft AI R&D Center
5
From 2D to 3D Visual Data
• Robotics
• Augmented Reality
• Autonomous Driving
6
http://cseweb.ucsd.edu/~haosu/slides/3ddl.pdf
https://www.androidauthority.com/shop-amazon-augmented-reality-right-now-841238/
https://arxiv.org/pdf/1711.08488.pdf
3D Deep Learning Tasks
• 3D geometry analysis
• 3D synthesis
• 3D-assisted image analysis
7
3D Deep Learning Tasks
• What We Will Focus on Today…
8
Representations of 3D Data
• Multi-view RGB(D) Images
• Volumetric
• Polygonal Mesh
• Point Cloud
• Primitive-based CAD Models
9
In this lecture, we mainly focus on…
• Multi-view RGB(D) Images
• Volumetric
• Polygonal Mesh
• Point Cloud
• Primitive-based CAD Models
15
Can We Directly Apply CNN on 3D Data?
16
Can We Directly Apply CNN on 3D Data?
• What kind of 3D data?
  • Yes (directly applicable): Multi-view RGB(D) images, volumetric data
  • No (not directly applicable): Polygonal mesh, point cloud, primitive-based CAD models
17
Can We Directly Apply CNN on 3D Data?
18
• Convolution for 2D images
• Convolution in 3D (see the sketch below)
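A minimal sketch (not from the slides) of what the extra dimension means: the only change from 2D to 3D convolution is that the kernel also slides over the depth axis. The shapes and channel counts below are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# 2D convolution: input is (batch, channels, H, W); the kernel slides over H and W.
img = torch.randn(1, 3, 64, 64)               # e.g., an RGB image
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv2d(img).shape)                       # torch.Size([1, 16, 64, 64])

# 3D convolution: input is (batch, channels, D, H, W); the kernel also slides over
# the depth axis, which is what volumetric (voxel) representations require.
vox = torch.randn(1, 1, 32, 32, 32)            # e.g., a 32^3 occupancy grid
conv3d = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
print(conv3d(vox).shape)                       # torch.Size([1, 16, 32, 32, 32])
```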
Multi-view Representation
● Able to leverage the large body of CNN literature on image analysis
● What viewpoints should be selected?
● What if the input is noisy and incomplete?
● Invisible (occluded) points are not processed
● Aggregating the view representations is non-trivial
19
Related Work on Multi-View Representation
• Classification
  • Multi-view CNN (MVCNN) for shape recognition [ICCV’15]
• Segmentation
  • 3D shape segmentation with projective conv nets [CVPR’17]
20
Classification: MVCNN
• State-of-the-art performance for 3D classification (>90%)
• View pooling: all branches in the first stage of the network share the same parameters (CNN1)
21
Classification: MVCNN
• Synthesize the info from all views into a single & compact 3D shape descriptor
• Element-wise max operation across views in the view-pooling layer
• Closely related to max-pooling and max-out layers; the only difference is the dimension over which the operation is performed
• Features are obtained after view pooling.
22
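A toy sketch of the view-pooling idea above, assuming a placeholder CNN1 and 12 rendered views per shape; this is not the authors' exact architecture, only the shared-weight view branches plus the element-wise max across views.

```python
import torch
import torch.nn as nn

class MVCNNSketch(nn.Module):
    """Toy multi-view CNN: shared CNN1 per view, max view pooling, then a classifier."""
    def __init__(self, num_classes=40):
        super().__init__()
        # CNN1 is shared across all view branches (same parameters for every view).
        self.cnn1 = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())        # per-view feature: (B*V, 32)
        self.cnn2 = nn.Linear(32, num_classes)

    def forward(self, views):                             # views: (B, V, 3, H, W)
        B, V = views.shape[:2]
        feats = self.cnn1(views.flatten(0, 1))            # run CNN1 on every view
        feats = feats.view(B, V, -1)
        pooled, _ = feats.max(dim=1)                      # element-wise max across views
        return self.cnn2(pooled)                          # single compact shape descriptor -> logits

logits = MVCNNSketch()(torch.randn(2, 12, 3, 64, 64))     # 12 rendered views per shape
print(logits.shape)                                       # torch.Size([2, 40])
```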
Related Work on Multi-View Representation
• Classification
  • Multi-view CNN (MVCNN) for shape recognition [ICCV’15]
• Segmentation
  • 3D shape segmentation with projective conv nets [CVPR’17]
23
Segmentation: ShapePFCN
• Combines image-based FCNs and surface-based CRFs (conditional random fields)
• Preprocesses 3D mesh data into shaded images and depth images
24
Segmentation: ShapePFCN
• FCN:
  • VGG-16 pretrained network
  • Two additional modifications (see the sketch below):
    1. The input is a 2-channel image (2-channel 3x3 filters instead of 3-channel RGB ones).
    2. The output of the original FCN is modified. The original FCN outputs L confidence maps of size 64 x 64 pixels, followed by a conversion into L probability maps via softmax. Instead, ShapePFCN upsamples the confidence maps to 512 x 512 pixels through a transposed convolutional layer.
25
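A hedged sketch of the two modifications listed above; the number of part labels L and the 8x transposed convolution are illustrative choices, not the exact ShapePFCN configuration.

```python
import torch
import torch.nn as nn

L = 8                                    # number of part labels (placeholder)

# Modification 1: the first conv takes a 2-channel image (shaded + depth) instead of RGB.
first_conv = nn.Conv2d(2, 64, kernel_size=3, padding=1)

# Modification 2: upsample the L confidence maps from 64x64 to 512x512 with a
# transposed convolution (stride 8 gives the 8x upsampling factor).
upsample = nn.ConvTranspose2d(L, L, kernel_size=8, stride=8)

maps_64 = torch.randn(1, L, 64, 64)      # confidence maps produced by the FCN trunk
maps_512 = upsample(maps_64)
print(maps_512.shape)                    # torch.Size([1, 8, 512, 512])
```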
Segmentation: ShapePFCN
• Image2Surface Projection Layer
  • Aggregates the confidence maps across views and projects them back onto the 3D surface.
  • The L confidence maps extracted from the FCN are stacked into an M x 512 x 512 x L image; the projection layer takes this 4D image as input (M input images).
  • It also takes the surface reference images, stacked into a 3D M x 512 x 512 image.
  • The layer outputs an F_S x L array, where F_S is the number of polygons of the shape S.
26
Segmentation: ShapePFCN
• Surface CRF (Conditional Random Field)
  • Converts label confidences (i.e., soft labels) into hard labels.
  • In addition, because of the upsampling in the FCN, the CRF is expected to handle label discontinuities across complex surfaces.
27
Segmentation: ShapePFCN
• Example results
28
Properties of Volumetric Data
• Easy to operate on
• Information loss in voxelization (see the sketch below)
• But low resolution…
29
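A small illustration of my own (not from the slides) of why voxel grids are easy to operate on yet lossy: quantizing a point cloud into a low-resolution occupancy grid collapses many points into the same cell.

```python
import torch

def voxelize(points, resolution=32):
    """Quantize an (N, 3) point cloud in [-1, 1]^3 into a binary occupancy grid.

    Points falling into the same cell collapse to a single voxel, which is exactly
    the information loss (and low resolution) mentioned above.
    """
    idx = ((points + 1.0) / 2.0 * resolution).long().clamp(0, resolution - 1)
    grid = torch.zeros(resolution, resolution, resolution)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

pts = torch.rand(2048, 3) * 2 - 1            # 2048 random points in [-1, 1]^3
print(voxelize(pts).sum())                   # usually far fewer than 2048 occupied voxels
```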
Related Work on Volumetric/Voxel Data
• Classification
  • 3D ShapeNets: A Deep Representation for Volumetric Shapes [CVPR’15]
• 3D Reconstruction
  • 3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction [ECCV’16]
  • Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision [NIPS’16]
  • Weakly Supervised 3D Reconstruction with Adversarial Constraint [3DV’17]
30
Classification: 3DShapeNets
• Directly utilize 3D shape info
• Accuracy: ~77%
• MVCNN outperforms volumetric methods in terms of classification.
31
Classification: 3DShapeNets
• Why does MVCNN work better?
  • It leverages the capabilities of 2D-based DL/CNN models.
  • It benefits from a large amount of 2D image data (e.g., ImageNet) to pretrain the CNN architectures.
32
Related Work on Volumetric/Voxel Data
• Classification
  • 3D ShapeNets: A Deep Representation for Volumetric Shapes [CVPR’15]
• 3D Reconstruction
  • 3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction [ECCV’16]
  • Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision [NIPS’16]
  • Weakly Supervised 3D Reconstruction with Adversarial Constraint [3DV’17]
33
3D Reconstruction: 3D-R2N2
• Supervised Learning
• Recurrent 3D CNN
• 3D LSTM units
• Single or multi-view 3D reconstruction
• Voxel-wise cross-entropy loss
34
3D Reconstruction: 3D-R2N2
• Supervised learning
• Ground truth 3D volume/voxels available
• Voxel-wise cross-entropy loss
35
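A sketch of the voxel-wise loss mentioned above, written here with a per-voxel sigmoid/binary-cross-entropy formulation (the paper phrases it as a per-voxel cross-entropy over occupied/empty); all shapes are placeholders.

```python
import torch
import torch.nn.functional as F

# Predicted occupancy logits and ground-truth voxels, e.g. a 32^3 grid per sample.
pred_logits = torch.randn(4, 32, 32, 32)                 # network output (before sigmoid)
gt_voxels = (torch.rand(4, 32, 32, 32) > 0.7).float()    # placeholder ground truth

# Voxel-wise cross-entropy: every voxel is an independent occupied/empty classification.
loss = F.binary_cross_entropy_with_logits(pred_logits, gt_voxels)
print(loss.item())
```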
3D Reconstruction: 3D-R2N2
• 3D LSTM units
• The input can be either a single image or a series of images.
• Multiple viewpoints are resolved seamlessly.
36
3D Reconstruction: 3D-R2N2
• Example reconstruction results (left: single-image input; right: multiple-image input)
37
Related Work on Volumetric/Voxel Data
• Classification
  • 3D ShapeNets: A Deep Representation for Volumetric Shapes [CVPR’15]
• 3D Reconstruction
  • 3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction [ECCV’16]
  • Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision [NIPS’16]
  • Weakly Supervised 3D Reconstruction with Adversarial Constraint [3DV’17]
38
3D Reconstruction: PTN
• Projecting 3D volume into 2D masks
39
3D Reconstruction: PTN
• Loss functions:
  • Reconstruction loss
  • Projection loss (see the sketch below)
40
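A simplified sketch of the projection-loss idea: PTN projects the volume with a perspective transformer, but an orthographic max projection along one axis already conveys how a 2D mask can supervise a 3D volume without any 3D ground truth. The orthographic simplification and all shapes are my assumptions.

```python
import torch
import torch.nn.functional as F

def project_silhouette(voxels, dim=2):
    """Orthographic stand-in for PTN's perspective projection: a pixel of the 2D mask
    is 'on' if any voxel along the corresponding ray is occupied, which the max over
    one spatial axis captures."""
    return voxels.max(dim=dim).values            # (B, H, W) silhouette in [0, 1]

vox_pred = torch.rand(4, 32, 32, 32)              # predicted occupancy probabilities
gt_mask = (torch.rand(4, 32, 32) > 0.5).float()   # 2D mask supervision (placeholder)

# Projection loss: compare the projected silhouette against the 2D mask.
proj_loss = F.binary_cross_entropy(project_silhouette(vox_pred), gt_mask)
print(proj_loss.item())
```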
Related Work on Volumetric/Voxel Data
• Classification
  • 3D ShapeNets: A Deep Representation for Volumetric Shapes [CVPR’15]
• 3D Reconstruction
  • 3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction [ECCV’16]
  • Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision [NIPS’16]
  • Weakly Supervised 3D Reconstruction with Adversarial Constraint [3DV’17]
41
Weakly Supervised 3D Reconstruction
• Image and approximated viewpoints as inputs
• 2D masks as supervision
• Raytrace pooling layer enables perspective projection and backpropagation
• Constrain 3D reconstruction to the manifold of unlabeled realistic 3D shapes
42
Point Cloud
• Unordered point set
• The same object can be represented with its points listed in different orders.
• Representations need to be invariant to transformations (e.g., rotation, translation).
44
Related Work on Point Cloud
• Classification/segmentation
  • PointNet: Deep Learning on Point Sets for 3D Classification/Segmentation [CVPR’17]
• 3D Reconstruction
  • A Point Set Generation Net for 3D Object Reconstruction from a Single Image [CVPR’17]
• Unsupervised Learning (i.e., Autoencoder for Data Recovery)
  • FoldingNet: Interpretable Unsupervised Learning on 3D Point Clouds [CVPR’18]
45
Classification/Segmentation
• Classification/segmentation via Point Clouds
46
Classification/Segmentation via PointNet
• Unordered input -> per-point perceptron + max pooling (see the sketch below)
• Interaction among points: concatenate local and global features
• Invariance under transformation
47
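A toy PointNet-style sketch of "per-point perceptron + max pooling"; layer widths are placeholders and the alignment (T-Net) modules are omitted. The final check illustrates the permutation invariance that the symmetric max pooling provides.

```python
import torch
import torch.nn as nn

class PointNetSketch(nn.Module):
    """Toy PointNet-style classifier: shared per-point MLP + global max pooling."""
    def __init__(self, num_classes=40):
        super().__init__()
        # The same MLP is applied to every point independently (per-point perceptron).
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 1024), nn.ReLU())
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, pts):                       # pts: (B, N, 3), in any point order
        feat = self.point_mlp(pts)                # (B, N, 1024) per-point features
        global_feat, _ = feat.max(dim=1)          # symmetric max pool -> order-invariant
        return self.classifier(global_feat)       # (for segmentation, the global feature
                                                  #  would be concatenated back to each point)

net = PointNetSketch()
pts = torch.randn(2, 1024, 3)
perm = torch.randperm(1024)
# Permuting the points leaves the prediction unchanged (up to float precision).
print(torch.allclose(net(pts), net(pts[:, perm]), atol=1e-5))
```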
Related Work on Point Cloud
• Classification/segmentation
  • PointNet: Deep Learning on Point Sets for 3D Classification/Segmentation [CVPR’17]
• 3D Reconstruction
  • A Point Set Generation Net for 3D Object Reconstruction from a Single Image [CVPR’17]
• Unsupervised Learning (i.e., Autoencoder for Data Recovery)
  • FoldingNet: Interpretable Unsupervised Learning on 3D Point Clouds [CVPR’18]
48
3D Reconstruction via Point Cloud
• A Point Set Generation Net for 3D Object Reconstruction from a Single Image
49
3D Reconstruction via Point Cloud
• Settings:
  • Supervised learning with ground-truth point clouds
  • 3D shapes represented as unordered point sets
  • Two-branch prediction: fully connected layers for intrinsic structure + deconvolution for smooth surfaces
  • Loss function: Chamfer distance (see the sketch below)
50
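A minimal sketch of the Chamfer distance used as the loss; the squared-distance and mean-reduction choices below are common conventions and may differ in detail from the paper.

```python
import torch

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between point sets of shape (B, N, 3) and (B, M, 3).

    For every predicted point, find its nearest ground-truth point (and vice versa)
    and average the squared distances; no point ordering is assumed.
    """
    d = torch.cdist(pred, gt) ** 2                # (B, N, M) pairwise squared distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

pred = torch.rand(4, 1024, 3)                     # predicted point cloud
gt = torch.rand(4, 1024, 3)                       # ground-truth point cloud
print(chamfer_distance(pred, gt).item())
```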
3D Reconstruction via Point Cloud
• Example results
51
Related Work on Point Cloud
• Classification/segmentation
  • PointNet: Deep Learning on Point Sets for 3D Classification/Segmentation [CVPR’17]
• 3D Reconstruction
  • A Point Set Generation Net for 3D Object Reconstruction from a Single Image [CVPR’17]
• Unsupervised Learning (i.e., Autoencoder for Data Recovery)
  • FoldingNet: Interpretable Unsupervised Learning on 3D Point Clouds [CVPR’18]
52
Unsupervised Learning for Point Clouds
• Autoencoder for 3D Point Clouds
53
FoldingNet for Recovering Point Clouds
• Graph representation for point clouds
• Analogous to convolution on images: each pixel’s spatial ordering and neighborhood remain unchanged even as the feature channels of the input expand in deeper conv layers.
54
Encoder in FoldingNet
• n: number of points in input point cloud
• Use kNN graph, compute local covariance of k = 16 points along xyz
• Perceptron: per-point function
55
Encoder in FoldingNet
• n: number of points in input point cloud
• Use kNN graph, compute local covariance of k = 16 points along xyz
• Perceptron: per-point function
• Graph layer: perceptron + graph max pooling
56
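A sketch of the graph-layer idea above (kNN graph + graph max pooling); k = 16 follows the slide, while the feature sizes and helper names are illustrative, not the paper's exact layers.

```python
import torch

def knn_indices(points, k=16):
    """Indices of the k nearest neighbors for each point. points: (N, 3) -> (N, k)."""
    dists = torch.cdist(points, points)            # (N, N) pairwise distances
    return dists.topk(k, largest=False).indices    # includes the point itself

def graph_max_pool(feats, nn_idx):
    """Graph max pooling: replace each point's feature with the max over its
    kNN neighborhood. feats: (N, C), nn_idx: (N, k) -> (N, C)."""
    neighbor_feats = feats[nn_idx]                 # (N, k, C) gathered neighbor features
    return neighbor_feats.max(dim=1).values

pts = torch.rand(2048, 3)
idx = knn_indices(pts, k=16)
feats = torch.rand(2048, 64)                       # per-point features from the perceptron
print(graph_max_pool(feats, idx).shape)            # torch.Size([2048, 64])
```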
Decoder in FoldingNet
57
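A toy sketch of the FoldingNet-style decoder: a fixed 2D grid is concatenated with the replicated shape codeword and "folded" twice by MLPs into a 3D point cloud. The grid size, codeword dimension, and MLP widths are placeholders rather than the paper's exact values.

```python
import torch
import torch.nn as nn

class FoldingDecoderSketch(nn.Module):
    """Toy folding decoder: deform a fixed 2D grid into a 3D point cloud,
    conditioned on the codeword produced by the encoder."""
    def __init__(self, code_dim=512, grid_size=45):
        super().__init__()
        xs = torch.linspace(-0.3, 0.3, grid_size)
        grid = torch.stack(torch.meshgrid(xs, xs, indexing='ij'), dim=-1).reshape(-1, 2)
        self.register_buffer('grid', grid)                  # (M, 2) fixed 2D grid points
        self.fold1 = nn.Sequential(nn.Linear(code_dim + 2, 512), nn.ReLU(), nn.Linear(512, 3))
        self.fold2 = nn.Sequential(nn.Linear(code_dim + 3, 512), nn.ReLU(), nn.Linear(512, 3))

    def forward(self, code):                                # code: (B, code_dim)
        B, M = code.shape[0], self.grid.shape[0]
        code_rep = code.unsqueeze(1).expand(B, M, -1)       # replicate codeword per grid point
        pts = self.fold1(torch.cat([code_rep, self.grid.expand(B, M, 2)], dim=-1))
        pts = self.fold2(torch.cat([code_rep, pts], dim=-1))  # second fold refines the surface
        return pts                                          # (B, M, 3) reconstructed point cloud

print(FoldingDecoderSketch()(torch.randn(2, 512)).shape)    # torch.Size([2, 2025, 3])
```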
Remarks for FoldingNet
• Transfer classification accuracy
• Efficiency of representation learning and feature extraction
• Use point clouds from ShapeNet to train autoencoder
• Classification: train an SVM on another dataset using the codewords obtained from the encoder
58
What’s to be Covered …
• Learning Beyond Images
  • 2D/3D Visual Data
  • Depth Images
59
Depth Estimation from a Single Image
• Depth estimation from a single image in a semi-supervised setting
• Use supervised and unsupervised cues simultaneously
  • Supervised cue: sparse depth image
  • Unsupervised cue: stereo pair consistency
60
Semi-Supervised Deep Learning for Monocular Depth Map Prediction, CVPR 2017
• Overall loss function:
61
• The supervised loss measures the deviation of the predicted depth map from the ground truth depth values.
62
• The unsupervised loss quantifies the direct image alignment error in both directions:
63
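A rough sketch, under my own simplifications, of how the two cues above could be combined: an L1 term on pixels with valid sparse ground truth plus a photometric term on the stereo reconstruction. The paper additionally uses a regularization term and different weighting; the 0.5 weight and all shapes below are placeholders.

```python
import torch

def supervised_depth_loss(pred_depth, sparse_gt):
    """Supervised term: penalize deviation from ground truth only where the sparse
    depth map (e.g., projected LiDAR) has valid measurements."""
    valid = sparse_gt > 0
    return (pred_depth[valid] - sparse_gt[valid]).abs().mean()

def photometric_loss(img, img_reconstructed):
    """Unsupervised term: direct image alignment error between a view and its
    reconstruction warped from the other stereo view (warping omitted here)."""
    return (img - img_reconstructed).abs().mean()

pred = torch.rand(1, 1, 128, 256) * 80                       # predicted depth in meters
gt = torch.where(torch.rand_like(pred) > 0.95,               # ~5% of pixels carry sparse GT
                 pred + torch.randn_like(pred), torch.zeros_like(pred))
left, left_rec = torch.rand(1, 3, 128, 256), torch.rand(1, 3, 128, 256)

total = supervised_depth_loss(pred, gt) + 0.5 * photometric_loss(left, left_rec)
print(total.item())
```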
• Network architecture: ResNet encoder + decoder
64
• Example results
65
Unsupervised Depth Estimation: Depth Image Estimation from a Single Image
• Estimate the depth image from a single RGB input image without supervision
• Render the disparity map from a single image
  • Depth info can be estimated from the disparity map
  • The disparity map can warp the left image to the right image (and vice versa)
  • Training data: stereo image pairs only
66
Unsupervised Monocular Depth Estimation with Left-Right Consistency, CVPR 2017
• Simultaneously render the disparity maps for both views
• Enforce consistency between the two recovered disparity maps, which leads to accurate results without supervision during training.
67
Unsupervised Depth Estimation: Depth Image Estimation from a Single Image (cont’d)
68
● Reconstruction loss: photometric error between each view and its reconstruction warped from the other view
● Smoothness loss: the disparity should vary smoothly, except across image edges
● Left-right consistency loss: the two predicted disparity maps should agree with each other (see the sketch below)
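The three losses above are only named on the slide; below is a hedged sketch of the left-right consistency and edge-aware smoothness terms (the paper's reconstruction term also includes an SSIM component, omitted here). Function names and weights are placeholders.

```python
import torch

def lr_consistency_loss(disp_left, disp_right_warped_to_left):
    """Left-right consistency: the left disparity map should match the right
    disparity map warped into the left view (the warping step is omitted here)."""
    return (disp_left - disp_right_warped_to_left).abs().mean()

def smoothness_loss(disp, img):
    """Edge-aware smoothness: penalize disparity gradients, downweighted at image
    edges so that depth discontinuities at object boundaries are still allowed."""
    dx_d = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    dy_d = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    dx_i = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

disp_l = torch.rand(1, 1, 128, 256)
disp_r_warped = torch.rand(1, 1, 128, 256)
img_l = torch.rand(1, 3, 128, 256)
print(lr_consistency_loss(disp_l, disp_r_warped).item(), smoothness_loss(disp_l, img_l).item())
```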
Unsupervised Depth Estimation: Depth Image Estimation from a Single Image (cont’d)
• Estimate the depth image by inferring the disparity maps (left & right) from the single input image.
69
Unsupervised Depth Estimation: Depth Image Estimation from a Single Image (cont’d)
• Example results
70
Unsupervised Depth Estimation: Depth Image Estimation from a Single Image (cont’d)
• Goal: depth estimation from a single image/frame + camera pose estimation from multiple consecutive frames
• Challenge
  • No supervision (i.e., no ground-truth info available)
  • Only the input video sequence is available
• Method
  • Use single-view depth and pose networks for video frame recovery
• Experiments
  • KITTI, Cityscapes, and Make3D datasets
71
Unsupervised Learning of Depth and Ego-motion from Video, CVPR 2017 (oral)
Unsupervised Depth Estimation: Depth Image Estimation from a Video Sequence
Introduction
72
Goal (Inference)
• Estimate the depth image and camera pose without supervision from ground-truth depth, stereo pairs, or camera-pose information
Method
• The key supervision signal for the proposed method comes from view synthesis: synthesizing a new image of the scene as seen from a different camera pose.
Method Overview
[Diagram: the target frame and a nearby frame from the training sequence are fed to a Depth NN and a Pose NN; their outputs (depth and pose information) are used to synthesize the predicted target image.]
Method
74
[Diagram: the Depth NN predicts depth $\hat{D}_t$ from the RGB image $I_t$ (target view at time $t$); the Cam-Pose NN predicts the transformation matrix $\hat{T}_{t \to t+1}$ between the frames at times $t$ and $t+1$; $K$ denotes the camera parameters.]
Concept: render the target view by warping the pixels of $I_t$ with the predicted depth and camera pose,
$\hat{I}_{t+1} = K \, \hat{T}_{t \to t+1} \, \hat{D}_t(I_t) \, K^{-1} I_t$,
and train with the photometric loss
$\mathcal{L} = \sum_t \big| \hat{I}_{t+1} - I_{t+1} \big|$
(see the code sketch below).
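A sketch of the view-synthesis warping in the equation above: back-project target pixels with the predicted depth, transform them with the predicted pose, project them into the source view, and bilinearly sample it. This is a common re-implementation pattern, not the authors' code; the tensor layouts are assumptions.

```python
import torch
import torch.nn.functional as F

def view_synthesis(src_img, depth_tgt, T_tgt_to_src, K):
    """Synthesize the target view by sampling the source image at the locations given
    by  p_src ~ K T D(p_tgt) K^{-1} p_tgt  (the warping equation above).

    src_img: (B, 3, H, W), depth_tgt: (B, 1, H, W), T: (B, 4, 4), K: (B, 3, 3).
    """
    B, _, H, W = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing='ij')
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(1, 3, -1).expand(B, 3, -1)   # homogeneous pixels

    cam = torch.inverse(K) @ pix * depth_tgt.reshape(B, 1, -1)        # back-project to 3D
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)          # homogeneous 3D points
    proj = K @ (T_tgt_to_src @ cam_h)[:, :3]                          # transform + project
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                   # pixel coords in source view

    # Normalize to [-1, 1] for grid_sample and bilinearly sample the source image.
    u = 2 * uv[:, 0] / (W - 1) - 1
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

# Quick sanity check: identity pose + constant depth should reproduce the source image.
src = torch.rand(1, 3, 64, 64)
out = view_synthesis(src, torch.ones(1, 1, 64, 64), torch.eye(4).unsqueeze(0), torch.eye(3).unsqueeze(0))
print(torch.allclose(out, src, atol=1e-5))
```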
Experiment: Depth Estimation
75
• Datasets: KITTI, Cityscapes, and Make3D
• 3D reconstruction from a single image (with GT 3D but no pose info)
• A single-image & pose-aware 3D reconstruction DL framework
• Extract camera pose info from 2D-3D self-consistency without supervision
76
Liao, Yang, Lin, Chen, Kuo, Chiu, & Wang, Learning Pose-Aware 3D Reconstruction via 2D-3D Self-Consistency, IEEE ICASSP 2019.
Learning Pose-Aware 3D Reconstruction via 2D-3D Self-Consistency
• Experiments
77
Learning Pose-Aware 3D Reconstruction via 2D-3D Self-Consistency
(a) Input image (b) GT mask (c) Predicted mask (d) Projection of predicted shapes
(e) GT voxel (f) Predicted voxel (g) GT pose-aware mesh (h) Predicted pose-aware mesh
• Experiments
78
Learning Pose-Aware 3D Reconstruction via 2D-3D Self-Consistency
Comparison with other fully supervised methods in terms of IoU.
Quantitative results of pose-aware 3D shape reconstruction in terms of IoU; 'pred' means that shapes or poses are predicted.
Quantitative results of 3D-2D projection in terms of IoU; we evaluate IoU between GT masks and different projections.
Quantitative results of pose estimation and mask segmentation.
• Unsupervised monocular depth estimation
• Joint exploitation of scene semantics via semantic segmentation
• No GT for the depth image
• Improved scene representation for joint depth estimation & semantic segmentation
79
P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation, CVPR 2019.
Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation
Po-Yi Chen*1, Alexander H. Liu*1, Yen-Cheng Liu2, Yu-Chiang Frank Wang1
1 National Taiwan University, 2 Georgia Institute of Technology
Introduction – Unsupervised Monocular Depth Estimation
[Diagram: monocular depth estimation maps a single 2D image to a disparity map (depth); for unsupervised training with a stereo pair, the predicted left-view disparity warps the right view to reconstruct the left view.]
Introduction – Our Goal
Learning a semantic-aware scene representation for depth estimation.
[Diagram: a shared scene representation of the 2D scene supports both semantic segmentation (semantic understanding) and depth estimation (geometric understanding), linked through content consistency.]
Methodology – Overview
- Scene representation from a single network
- Multi-task learning on depth estimation and semantic segmentation
- Refine depth estimation with semantic information
Methodology – Shared Scene Representation
- Unified network architecture
- Controllable cross-modal prediction
- Multi-task learning from disjoint datasets
[Diagram: an encoder-decoder produces a shared scene representation; conditioned on a task identity, the network outputs semantic labels (softmax) or a disparity map (sigmoid with pixel-wise avg. pooling).]
Methodology – Self-supervised learning
1. Left-Right Semantic Consistency
- Unsupervised learning has relied on left-right consistency over color values (RGB images)
- Such consistency may be affected by optical changes (e.g., reflections on glass)
- We propose left-right consistency at the semantic level
[Diagram: besides reconstructing the left view from the right view with the left disparity and enforcing RGB consistency, the left semantic map is reconstructed from the right semantic prediction and made consistent with the left semantic prediction.]
Methodology – Self-supervised learning
2. Semantic-Guided Disparity Smoothness
- Disparity should change smoothly within a single object
- A pseudo object boundary can be obtained from the semantic prediction
- We propose to regularize disparity smoothness within the boundary (see the sketch below)
[Diagram: semantic prediction → object boundary → regularized smoothness]
P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding:Unsupervised Monocular Depth Estimation with Semantic-aware Representation. CVPR 2019
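A hedged sketch of the semantic-guided smoothness idea described above: disparity gradients are penalized, downweighted where the predicted semantic distribution changes (i.e., at pseudo object boundaries). The exact weighting in the paper differs; names and shapes here are illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_guided_smoothness(disp, sem_logits):
    """Penalize disparity gradients away from object boundaries, where the boundary
    is taken from the semantic prediction rather than from raw image edges."""
    sem_prob = F.softmax(sem_logits, dim=1)
    # Boundary strength: how much the predicted class distribution changes between neighbors.
    bx = (sem_prob[:, :, :, :-1] - sem_prob[:, :, :, 1:]).abs().sum(1, keepdim=True)
    by = (sem_prob[:, :, :-1, :] - sem_prob[:, :, 1:, :]).abs().sum(1, keepdim=True)
    dx = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    dy = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    # Inside an object (small boundary strength) the disparity should change smoothly.
    return (dx * torch.exp(-bx)).mean() + (dy * torch.exp(-by)).mean()

disp = torch.rand(1, 1, 128, 256)
sem = torch.randn(1, 19, 128, 256)              # e.g., 19 Cityscapes classes
print(semantic_guided_smoothness(disp, sem).item())
```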
Experiments – Setup
- Dataset
  - Stereo pairs from the KITTI dataset (for unsupervised depth estimation)
  - Single-view images from the Cityscapes dataset (for supervised semantic segmentation)
- Model
  - Encoder: 14-layer dilated residual network
  - Decoder: 8-layer transposed convolution network
  - Instead of using a separate decoder for each view, we introduce a horizontal-flipping technique
Experiments – Results on depth estimation
- State-of-the-art results on unsupervised depth estimation
- Leveraging semantic segmentation, the performance can be further improved
P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding:Unsupervised Monocular Depth Estimation with Semantic-aware Representation. CVPR 2019
Experiments – Study on multi-task learning
- Semantic segmentation & depth estimation benefit each other
- A shared decoder & task identity improve the robustness of multi-task learning
- Our framework yields a better scene representation
P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding:Unsupervised Monocular Depth Estimation with Semantic-aware Representation. CVPR 2019
References
● 3D Deep Learning Tutorial
  ○ http://3ddl.stanford.edu/CVPR17_Tutorial_Overview.pdf
  ○ http://3ddl.stanford.edu/CVPR17_Tutorial_MVCNN_3DCNN_v3.pdf
  ○ http://cseweb.ucsd.edu/~haosu/slides/3ddl.pdf
  ○ https://cse291-i.github.io/
● List of 3D deep learning related projects
○ https://github.com/timzhang642/3D-Machine-Learning
89
What’s to Be Covered Next Week
• Guest Lecture
• Title: NTUEE Alumni Q&A Series – Sharing Experiences on Graduate Study, Research, and Internships in the CV Field
• Speaker: Dr. Wei-Sheng Lai 賴威昇 (B97), Univ. of California, Merced
• Time/Location: 10am @ BL113 (i.e., the 2nd class)
90
What’s to Be Covered Next Week
• VIP Talk
• Title: Face Recognition & Anti-Spoofing for Identity Authentication
• Speaker: Dr. Shang-Hong Lai, Principal Researcher, Microsoft AI R&D Center
• Time/Location: 11am @ BL113 (i.e., the 3rd class)
91