Semantic Image Segmentation via Deep Parsing Network
Ziwei Liu*, Xiaoxiao Li*, Ping Luo, Chen Change Loy, Xiaoou Tang
Multimedia Lab, The Chinese University of Hong Kong
Problem
[Figure: example image with segments labeled Person, Table, TV, Plant, and Background]
Previous Attempts
SVM → SVM + MRF
CNN → CNN + MRF ?
State of the Art
Method | Learned Features | Pairwise Relations | Joint Training | # Iterations
Fully Convolutional Network [Long et al. CVPR 2015] | ✓ | ✗ | - | -
DeepLab [Chen et al. ICLR 2015] | ✓ | ✓ | ✗ | 10
CRF as RNN [Zheng et al. ICCV 2015] | ✓ | ✓ | ✓ | 10
Deep Parsing Network (DPN) | ✓ | ✓ | ✓ | 1
Contributions
• Extend MRF to incorporate richer relationships
• Formulate mean field inference of high-order MRF as CNN
• Capable of joint training and one-pass inference
Revisit MRF
Energy Function: $\min E = \mathrm{Unary} + \mathrm{Pair}$
Unary Term: $\mathrm{Unary} = -\sum_i \ln p_i(\mathrm{label})$
Example: $p_i(\mathrm{label} = \text{'table'}) = 0.8$
Pairwise Term: $\mathrm{Pair} = \sum_{i,j} \mathrm{cost}(i) * \mathrm{dist}(i, j)$
Appearance Consistency: $\mathrm{dist}(i, j)$ compares the image patches at pixels $i$ and $j$, e.g. $\mathrm{dist}(i, j) = 0.8$
Label Consistency: e.g. $\mathrm{cost}(i; \mathrm{label} = \text{'table'}) = 0.1$
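A toy sketch of the energy function above, evaluated by brute force on two pixels. The data layout is an assumption made for illustration: `probs[i][label]` plays the role of p_i(label), `cost[i][label]` the role of cost(i; label), and `dist[i][j]` the role of dist(i, j). All numbers other than 0.8 and 0.1 are made up.

```python
import math
from itertools import product

def unary_term(probs, labels):
    # Unary = -sum_i ln p_i(label_i)
    return -sum(math.log(probs[i][lab]) for i, lab in enumerate(labels))

def pair_term(cost, dist, labels):
    # Pair = sum_{i,j} cost(i) * dist(i, j), over ordered pairs with i != j
    n = len(labels)
    return sum(cost[i][labels[i]] * dist[i][j]
               for i in range(n) for j in range(n) if i != j)

def energy(probs, cost, dist, labels):
    # E = Unary + Pair
    return unary_term(probs, labels) + pair_term(cost, dist, labels)

def best_labeling(probs, cost, dist, classes):
    # Exhaustive search over all labelings; only feasible for toy problems.
    return min(product(classes, repeat=len(probs)),
               key=lambda labels: energy(probs, cost, dist, labels))

# Hypothetical two-pixel example.
probs = [{"table": 0.8, "person": 0.2}, {"table": 0.3, "person": 0.7}]
cost = [{"table": 0.1, "person": 0.9}, {"table": 0.5, "person": 0.2}]
dist = [[0.0, 0.8], [0.8, 0.0]]

best = best_labeling(probs, cost, dist, ["table", "person"])
print(best, round(energy(probs, cost, dist, best), 4))
```

Exhaustive search grows exponentially with the number of pixels, which is why approximate inference such as mean field is used on real images.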
Richer Relationships in DPN
Energy Function: $\min E = \mathrm{Unary} + \mathrm{Pair}$
Unary Term: $\mathrm{Unary} = -\sum_i \ln p_i(\mathrm{label})$
Pairwise Term: $\mathrm{Pair} = \sum_{i,j} \mathrm{cost}(i) * \mathrm{dist}(i, j)$
Triple Penalty: the pair $(i, j)$ is extended to triples $(i, j, z)$ with $z \in \{z_1, \dots, z_n\}$
Pairwise Term with Triple Penalty: $\mathrm{Pair} = \sum_{i,j} \mathrm{cost}(i) * \sum_z \mathrm{dist}(i, j; z)$
Richer Relationships in DPN
Mixture of Label Contexts
Pairwise Term: $\mathrm{Pair} = \sum_{i,j} \mathrm{cost}(i) * \sum_z \mathrm{dist}(i, j; z)$
Example: pixel $i$ with table 0.8 and bus 0.6: $\mathrm{cost} = 0.7$
Pairwise Term with a cost on both pixels: $\mathrm{Pair} = \sum_{i,j} \mathrm{cost}(i, j) * \sum_z \mathrm{dist}(i, j; z)$
Example: $i$ = table (0.8), $j$ = person (0.6): $\mathrm{cost}(i, j) = 0.2$
Spatial Order: with the order of $i$ and $j$ reversed ($j$ = person 0.6, $i$ = table 0.8): $\mathrm{cost}(i, j) = 0.8$
Pairwise Term with a mixture over $k$: $\mathrm{Pair} = \sum_{i,j} \sum_k \mathrm{cost}_k(i, j) * \sum_z \mathrm{dist}(i, j; z)$
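A minimal sketch of the pairwise term with a mixture of label contexts and a triple penalty. The nested-list layout is an assumption for illustration: `cost[k][i][j]` stands for cost_k(i, j) and `dist[i][j]` is a list of dist(i, j; z) values over z. The costs 0.2 and 0.8 echo the spatial-order example; the rest are made up.

```python
def pair_term(cost, dist):
    # Pair = sum_{i,j} sum_k cost_k(i, j) * sum_z dist(i, j; z)
    n = len(dist)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            mixture = sum(cost_k[i][j] for cost_k in cost)  # sum over k
            triple = sum(dist[i][j])                         # sum over z
            total += mixture * triple
    return total

# Hypothetical example: 2 pixels, K = 2 mixture components, 2 neighbors z.
cost = [
    [[0.0, 0.2], [0.8, 0.0]],  # component 1: cost depends on the order of i, j
    [[0.0, 0.1], [0.1, 0.0]],  # component 2: symmetric cost
]
dist = [
    [[0.0, 0.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.0, 0.0]],
]
total = pair_term(cost, dist)
print(total)
```

Because `cost[k][i][j]` and `cost[k][j][i]` may differ, the same pair of labels can be cheap in one spatial arrangement and expensive in the other.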
Solve High-order MRF as Convolution
Pairwise Term: $\mathrm{Pair} = \sum_{i,j} \sum_k \mathrm{cost}_k(i, j) * \sum_z \mathrm{dist}(i, j; z)$
Mean Field Solver (Iterative Updating Formula): $p_i \propto \exp\{-(\mathrm{Unary}_i + \sum_j \mathrm{Pair}_{i,j} * p_j)\}$
Convolution: the summation $\sum_j \mathrm{Pair}_{i,j} * p_j$
$\mathrm{Pair}_{i,j}$: Different Types of Local and Global Filters
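One mean field update can be sketched on a 1-D row of pixels. This is an illustration under stated assumptions, not the DPN layers themselves: the pairwise term is split into a shared spatial filter `w` (hypothetical) and a label compatibility table `compat` (hypothetical, Potts-style), so that the sum over j becomes a small convolution.

```python
import math

def mean_field_step(unary, p, w, compat):
    """One update p_i ∝ exp(-(Unary_i + sum_j Pair_ij * p_j)).

    unary[i][u]: unary energy of label u at pixel i
    p[i][u]:     current label distribution at pixel i
    w:           odd-length spatial filter shared by all pixels (assumption)
    compat[u][v]: cost of label u at i next to label v at j (assumption)
    """
    n, num_labels = len(unary), len(unary[0])
    r = len(w) // 2
    new_p = []
    for i in range(n):
        energies = []
        for u in range(num_labels):
            msg = 0.0
            for off in range(-r, r + 1):
                j = i + off
                if off == 0 or not 0 <= j < n:
                    continue
                msg += w[off + r] * sum(compat[u][v] * p[j][v]
                                        for v in range(num_labels))
            energies.append(unary[i][u] + msg)
        # Normalize with a numerically stable softmax over negative energies.
        lowest = min(energies)
        weights = [math.exp(-(e - lowest)) for e in energies]
        z = sum(weights)
        new_p.append([x / z for x in weights])
    return new_p

# Hypothetical example: 3 pixels, 2 labels, middle pixel disagrees with neighbors.
probs = [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]
unary = [[-math.log(q) for q in row] for row in probs]
w = [1.0, 0.0, 1.0]
compat = [[0.0, 1.0], [1.0, 0.0]]
updated = mean_field_step(unary, probs, w, compat)
print([[round(q, 3) for q in row] for row in updated])
```

In this toy case a single pass is enough to flip the middle pixel to its neighbors' label, which mirrors the one-iteration setting listed for DPN.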
Deep Parsing Network
Stages: Unary Term, then Pairwise Term (Triple Penalty, then Label Contexts)
Layer types: Convolution, Max Pooling, Deconvolution, Local Convolution, Min Pooling
Deep Parsing Network: Unary Term
Fine-tuned VGG-16 Network
Layer types: Convolution, Max Pooling, Deconvolution
[Figure panels: Original Image, Unary Term, Ground Truth]
Deep Parsing Network: Triple Penalty
$\mathrm{Pair} = \sum_{i,j} \sum_k \mathrm{cost}_k(i, j) * \sum_z \mathrm{dist}(i, j; z) * p_z$
Triple Penalty term: $\sum_z \mathrm{dist}(i, j; z) * p_z$, computed at each pixel $j$ as $\sum_z \mathrm{dist}(j; z) * p_z$
Local Conv: the Unary Term output $p_z$ (# classes channels) is filtered with $\mathrm{dist}(j; z)$, giving $\sum_z \mathrm{dist}(j; z) * p_z$ (# classes channels)
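The local convolution for the triple penalty can be sketched in 1-D. Unlike an ordinary convolution, each position j has its own filter dist(j; z). The layout is assumed for illustration: `p[c][x]` holds one channel per class, and `dist[j]` is the window of weights for position j, applied to every class channel.

```python
def local_conv(p, dist):
    """Compute sum_z dist(j; z) * p_z at every position j, per class channel.

    p[c][x]:     per-class probability maps (# classes channels)
    dist[j][t]:  position-specific filter for j, odd window length (assumption)
    Positions outside the row are treated as zero.
    """
    n = len(p[0])
    r = len(dist[0]) // 2
    out = []
    for channel in p:
        row = []
        for j in range(n):
            acc = 0.0
            for off in range(-r, r + 1):
                z = j + off
                if 0 <= z < n:
                    acc += dist[j][off + r] * channel[z]
            row.append(acc)
        out.append(row)
    return out

# Hypothetical example: 2 classes, 3 positions, a different filter per position.
p = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
dist = [[0.0, 1.0, 0.0],
        [0.25, 0.5, 0.25],
        [0.5, 0.5, 0.0]]
tri = local_conv(p, dist)
print(tri)  # [[1.0, 0.5, 0.5], [0.0, 0.5, 0.5]]
```

The output keeps one channel per class, matching the "# classes" annotation on both sides of the Local Conv layer.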
Deep Parsing Network
[Figure panels: Original Image, Unary Term, Ground Truth, Triple Penalty]
Deep Parsing Network: Mixture of Label Contexts
$\mathrm{Pair} = \sum_{i,j} \sum_k \mathrm{cost}_k(i, j) * \sum_z \mathrm{dist}(i, j; z) * p_z$
Label Contexts term: $\sum_{i,j} \sum_k \mathrm{cost}_k(i, j)$
Input: Triple Penalty Result $\mathrm{tri}$ (# classes channels)
Convolution with $\mathrm{cost}_k(i, j)$, repeated for each class (class 1, class 2, class 3, ...): $\sum_j \mathrm{cost}_k(i, j) * \mathrm{tri}(j)$
Min Pooling: $\sum_k \sum_j \mathrm{cost}_k(i, j) * \mathrm{tri}(j)$ (# classes channels)
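A 1-D sketch of the label-context stage for a single output class. It assumes, following the "Min Pooling" label, that the K filter responses are combined by taking their minimum at each position. The filter layout `filters[k][v][t]` (component k, input class v, window offset t) is hypothetical.

```python
def context_response(tri, cost_k):
    """Compute sum_j cost_k(i, j) * tri(j) at every position i.

    tri[v][x]:    triple penalty result, one channel per class
    cost_k[v][t]: one filter spanning all input classes, odd window length
    Positions outside the row are treated as zero.
    """
    n = len(tri[0])
    r = len(cost_k[0]) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for v, channel in enumerate(tri):
            for off in range(-r, r + 1):
                j = i + off
                if 0 <= j < n:
                    acc += cost_k[v][off + r] * channel[j]
        out.append(acc)
    return out

def min_pool_over_mixtures(tri, filters):
    # Assumption: combine the K responses with a position-wise minimum.
    responses = [context_response(tri, cost_k) for cost_k in filters]
    return [min(values) for values in zip(*responses)]

# Hypothetical example: 2 input classes, 3 positions, K = 2 components.
tri = [[1.0, 0.5, 0.5],
       [0.0, 0.5, 0.5]]
filters = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],  # component 1: copies class-0 channel
    [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]],  # component 2: doubles class-1 channel
]
pooled = min_pool_over_mixtures(tri, filters)
print(pooled)  # [0.0, 0.5, 0.5]
```

Running the same two functions once per output class would produce the "# classes" output channels shown on the slide.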
Deep Parsing Network
[Figure panels: Original Image, Unary Term, Ground Truth, Triple Penalty, Label Contexts]
Deep Parsing Network: Joint Tuning
Unary Term and Pairwise Term (Triple Penalty, Label Contexts) are tuned jointly
[Figure panels: Original Image, Unary Term, Ground Truth, Triple Penalty, Label Contexts, Joint Tuning]
Overall Performance (Published Results)
(PASCAL VOC 2012 Challenge test set)
Method | mIoU (%)
FCN | 62.2
DeepLab† | 73.9
CRFasRNN† | 74.7
BoxSup† | 75.2
DPN† | 77.5
Label Contexts Learned
[Figure: matrix of learned label contexts over the classes bkg, aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv; color scale from favor to penalty; mbike, bike, and person highlighted]
[Figure: learned label contexts for the pairs person : mbike and chair : person; color scale from favor to penalty]
Challenging Case
[Figure panels: Original Image, Ground Truth, FCN, DPN, DeepLab, CRFasRNN]
Failure Case
[Figure panels: Original Image, Ground Truth, Our Result; labeled class: car]
Conclusions
• General framework of one-pass CNN to model high-order MRF
• Various types of pairwise terms are formulated as local and global filters
• High performance and easy to speed up
Semantic Image Segmentation via Deep Parsing Network
Thanks!
Project Page: http://personal.ie.cuhk.edu.hk/~lz013/projects/DPN.html