© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Performance Optimizations for Deep Image Matting in Photoshop
Betty Leong | Photoshop Engineering Manager, Adobe
Salil Tambe | Computer Vision Engineer, Adobe
Chris Hebert | DevTech Engineer, Nvidia
Topics
▪ Introduction to Matting
▪ Deep Matting vs Photoshop Matting
▪ Deployment Challenges
▪ Optimization
▪ Results and Demo
▪ Conclusion
Select and Mask in Photoshop
Matting in Photoshop
Deep Matting
Image Input Output
Xu et al. Deep Image Matting. CVPR 2017.
Brian Price (GTC 2017)
Deep Matting vs Photoshop Matting
Photoshop Matte vs Deep Matte
[Figure: Image, Trimap, Photoshop Matte, Deep Matte side by side. Photoshop matte: hair that should be white is missed. Deep matte: correct matting for hair.]
Photoshop Matte vs Deep Matte
[Figure: Image, Trimap, Photoshop Matte, Deep Matte side by side. Photoshop matte: background grass included. Deep matte: grass excluded from the matte.]
Deployment Challenges
Tech Transfer Challenges
• Resolution (320 x 320)
• Model size (80 MB)
• Memory
• Run-time performance
• Cross-platform support
Image Credits: Deep Image Matting [Xu et al., CVPR 2017]
© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Challenges
▪ Interactive editing with Deep Matting should be possible, but it is computationally expensive!
Matting uses a VGG-Net based encoder-decoder network that requires > 1 sec and 600 MB of memory to run inference on a 320 x 320 image on an Intel i7 8650K CPU using Caffe
▪ It should be deployable on all platforms supported by Photoshop,
i.e. all combinations of Intel, AMD and Nvidia hardware on Mac and Windows
Deep Matting Deep Dive
[Input image + trimap]
Inference Per Tile
Framework used: Caffe (with CUDA and CUDNN)
Encoder:
Conv1 [320 x 320 x 64]
Conv2 [160 x 160 x 128]
Conv3 [80 x 80 x 256]
Conv4 [40 x 40 x 512]
Conv5 [20 x 20 x 512]
Decoder:
Deconv1 [40 x 40 x 256]
Deconv2 [80 x 80 x 128]
Deconv3 [160 x 160 x 64]
Deconv4 [320 x 320 x 64]
Deconv5 / Alpha Prediction [320 x 320 x 1]
Image Credits: Deep Image Matting [Xu et al., CVPR 2017]
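As a sanity check on the shapes above: the encoder follows the usual VGG-16 pattern of halving the spatial size at each block while widening the channels. A small illustrative sketch (not the shipping code):

```python
# Illustrative sketch: reproduce the encoder tensor shapes on this slide.
# Each VGG-16 style conv block is followed by 2x2 max pooling, so the
# spatial size halves while the channel count grows.
def encoder_shapes(size=320):
    channels = [64, 128, 256, 512, 512]  # Conv1..Conv5 widths from the slide
    return [(size // 2**i, size // 2**i, c) for i, c in enumerate(channels)]

print(encoder_shapes())
# [(320, 320, 64), (160, 160, 128), (80, 80, 256), (40, 40, 512), (20, 20, 512)]
```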
Fine Network
[Image + coarse matte → Refinement Network → Final Matte]
Explorations/Experiments
▪ Collaborated with the Nvidia DevTech ProVis team to improve per-tile inference performance
▪ Chris Hebert – DevTech Engineer
▪ Customized inference with cuDNN kernels for optimal performance
Optimization
Caffe: Inference Timeline (nvprof)
[nvprof timeline: 160ms per tile, Titan-X]
Caffe: CPU Overhead
[Timeline: 160ms per tile, Titan-X]
Layer weights transferred one at a time
Caffe: Uses FFT (memory intensive)
[Timeline: 160ms per tile, Titan-X; FFT convolution kernels highlighted]
cuDNN – A bit like OpenGL for Neural Networks
▪ Networks for inferencing are not difficult to implement with cuDNN
▪ cuDNN provides a set of common network operations
▪ Convolution
▪ Activation
▪ Tensor ops – add, multiply, etc.
▪ Highly optimized for the respective HW architectures
▪ cuDNN is the backend for most frameworks that target NVIDIA hardware
Optimization 1: Better memory management
▪ Pre-allocate the max buffer size required for the workspace
Output layer sizes
Conv1_1[320 x 320 x 64] Memory: 25mb
Conv1_2[320 x 320 x 64] Memory: 25mb
Conv2_1[160 x 160 x 128] Memory: 12.5mb
Conv2_2[160 x 160 x 128] Memory: 12.5mb
Conv3_1[80 x 80 x 256] Memory: 6.25mb
Conv3_2[80 x 80 x 256] Memory: 6.25mb
Conv3_3[80 x 80 x 256] Memory: 6.25mb
Conv4_1[40 x 40 x 512] Memory: 3.125mb
Conv4_2[40 x 40 x 512] Memory: 3.125mb
Conv4_3[40 x 40 x 512] Memory: 3.125mb
Conv5_1[20 x 20 x 512] Memory: 0.78125mb
Conv5_2[20 x 20 x 512] Memory: 0.78125mb
Conv5_3[20 x 20 x 512] Memory: 0.78125mb
Deconv1[40 x 40 x 256] Memory: 1.5625mb
Deconv2[80 x 80 x 128] Memory: 3.125mb
Deconv3[160 x 160 x 64] Memory: 6.25mb
Deconv4[320 x 320 x 64] Memory: 25mb
Deconv5[320 x 320 x 1] Memory: 0.390625mb
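The figures in this table are just H x W x C fp32 activations; a quick sketch verifying them, assuming 4-byte floats and MiB units:

```python
def activation_mib(h, w, c, bytes_per_elem=4):
    # One fp32 activation tensor: H * W * C elements, reported in MiB.
    return h * w * c * bytes_per_elem / 2**20

print(activation_mib(320, 320, 64))   # 25.0 -> matches Conv1_1
print(activation_mib(160, 160, 128))  # 12.5 -> matches Conv2_1
print(activation_mib(20, 20, 512))    # 0.78125 -> matches Conv5_1
```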
Optimization 1(a): Pre-allocate the max buffer size required
Optimization 1: Better memory management
▪ Pre-allocate the max buffer size required for the workspace
[Diagram: memory pool = 2x the largest tensor (buffer A: 25mb, buffer B: 25mb); Conv1_1, Conv1_2, Conv1_3, Pool1, Conv2_1, ... alternate writing between buffers A and B]
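The scheme above can be sketched as a tiny hypothetical pool class (illustrative only, not the Photoshop implementation): two buffers sized to the largest activation alternate as each layer's input and output, so total activation memory stays at 2x the largest tensor regardless of layer count:

```python
class PingPongPool:
    """Two fixed buffers sized to the largest tensor; layers alternate
    between them, so layer N's output buffer is layer N+1's input."""
    def __init__(self, largest_tensor_bytes):
        self.buffers = [bytearray(largest_tensor_bytes),
                        bytearray(largest_tensor_bytes)]
        self.current = 0

    def next_output(self):
        self.current ^= 1          # flip A <-> B
        return self.buffers[self.current]

pool = PingPongPool(64 * 320 * 320 * 4)   # 25 MiB, the largest tensor above
out1 = pool.next_output()                 # Conv1_1 writes here
out2 = pool.next_output()                 # Conv1_2 reads out1, writes here
```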
Optimization 1: Better memory management
▪ Pre-allocate the max buffer size required for the workspace
▪ Carefully choose convolution algorithm for performance and memory requirements
▪ FFT – Fast, but requires a lot of device workspace memory
▪ GEMM – In place general matrix multiply
▪ Winograd – fast, but unstable for large filter sizes.
▪ Convolution algorithm chosen on a per layer basis
▪ According to per layer constraints
▪ Share max per layer workspace memory between all layers
▪ Re-use the buffer for the computation of each layer
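The per-layer selection can be sketched as follows. The (algorithm, time, workspace) candidates here are hypothetical stand-ins for what cuDNN's algorithm queries would report; the point is the policy of picking the fastest algorithm that fits a memory budget, then sharing one workspace sized to the maximum:

```python
def choose_algorithms(layers, budget_bytes):
    """Per layer, pick the fastest algorithm whose workspace fits the budget;
    the single shared workspace is the max over the chosen layers."""
    chosen, shared_ws = {}, 0
    for name, candidates in layers.items():
        feasible = [c for c in candidates if c[2] <= budget_bytes]
        algo, _, ws = min(feasible, key=lambda c: c[1])  # fastest feasible
        chosen[name] = algo
        shared_ws = max(shared_ws, ws)
    return chosen, shared_ws

# Hypothetical (algo, time_ms, workspace_bytes) candidates per layer:
layers = {
    "Conv1_2": [("FFT", 0.8, 40 << 20), ("WINOGRAD", 1.0, 1 << 20)],
    "Conv5_1": [("IMPLICIT_GEMM", 1.5, 0), ("FFT", 0.9, 80 << 20)],
}
print(choose_algorithms(layers, budget_bytes=16 << 20))
```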
Optimization 1(b): Load the entire model in one go (Titan-X)
Weights + biases ~ 87 MB
Conv1_1 [3 x 3 x 4 x 64] Memory: 0.00878906mb
Conv1_2 [3 x 3 x 64 x 64] Memory: 0.140625mb
Conv2_1 [3 x 3 x 64 x 128] Memory: 0.28125mb
Conv2_2 [3 x 3 x 128 x 128] Memory: 0.5625mb
Conv3_1 [3 x 3 x 128 x 256] Memory: 1.125mb
Conv3_2 [3 x 3 x 256 x 256] Memory: 2.25mb
Conv3_3 [3 x 3 x 256 x 256] Memory: 2.25mb
Conv4_1 [3 x 3 x 256 x 512] Memory: 4.5mb
Conv4_2 [3 x 3 x 512 x 512] Memory: 9mb
Conv4_3 [3 x 3 x 512 x 512] Memory: 9mb
Conv5_1 [3 x 3 x 512 x 512] Memory: 9mb
Conv5_2 [3 x 3 x 512 x 512] Memory: 9mb
Deconv5 [5 x 5 x 512 x 512] Memory: 25mb
Deconv4 [5 x 5 x 512 x 256] Memory: 12.5mb
Deconv3 [5 x 5 x 256 x 128] Memory: 3.125mb
Deconv2 [5 x 5 x 128 x 64] Memory: 0.78125mb
Deconv1 [5 x 5 x 64 x 64] Memory: 0.390625mb
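One way to avoid the one-layer-at-a-time copies seen in the Caffe timeline is to pack all parameters into a single contiguous blob and upload it in one transfer. An illustrative numpy sketch (the layer names and the `pack_weights` helper are hypothetical):

```python
import numpy as np

def pack_weights(tensors):
    """Concatenate all parameter arrays into one contiguous fp32 blob so the
    whole model uploads in a single host-to-device copy; remember each
    tensor's (offset, length) so layers can bind into the blob later."""
    offsets, cursor = {}, 0
    for name, t in tensors.items():
        offsets[name] = (cursor, t.size)
        cursor += t.size
    blob = np.empty(cursor, dtype=np.float32)
    for name, t in tensors.items():
        start, n = offsets[name]
        blob[start:start + n] = t.ravel()
    return blob, offsets

params = {"Conv1_1/w": np.ones((3, 3, 4, 64), np.float32),
          "Conv1_1/b": np.zeros(64, np.float32)}
blob, offsets = pack_weights(params)   # one transfer instead of one per layer
```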
Optimization 1(c): Choose optimal convolution algorithm (Titan-X)
Conv1_1 [3 x 3 x 4 x 64] Memory: 0.59mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Conv1_2 [3 x 3 x 64 x 64] Memory: 0.25mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv2_1 [3 x 3 x 64 x 128] Memory: 0.50mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv2_2 [3 x 3 x 128 x 128] Memory: 1mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv3_1 [3 x 3 x 128 x 256] Memory: 2mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv3_2 [3 x 3 x 256 x 256] Memory: 4mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv3_3 [3 x 3 x 256 x 256] Memory: 4mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv4_1 [3 x 3 x 256 x 512] Memory: 8mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv4_2 [3 x 3 x 512 x 512] Memory: 16mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv4_3 [3 x 3 x 512 x 512] Memory: 16mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv5_1 [3 x 3 x 512 x 512] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
Conv5_2 [3 x 3 x 512 x 512] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
Deconv5 [5 x 5 x 512 x 512] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Deconv4 [5 x 5 x 512 x 256] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
Deconv3 [5 x 5 x 256 x 128] Memory: 0.04mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Deconv2 [5 x 5 x 128 x 64] Memory: 0.15mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Deconv1 [5 x 5 x 64 x 64] Memory: 0.59mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Optimization 1
Inference: 17ms; start-up time (one-time overhead): 41ms
Titan-X
Optimization 1: Results
Memory (Titan-V)
Image size Caffe Optimization 1 %Reduction
320 x 320 588mb 159mb 73%
640 x 640 4113mb 323mb 92%
960 x 960 8778mb 643mb 92.7%
1280 x 1280 cannot run 977mb -
Performance (Titan-V)
Image size Caffe Optimization 1 %Reduction
320 x 320 210ms 20ms 90.5%
640 x 640 540ms 68ms 87.4%
960 x 960 1.04sec 153ms 85%
1280 x 1280 cannot run 261ms -
Optimization 2: Use FP16 instead of FP32
▪ All the weights for convolution and de-convolution were converted to float16.
▪ Volta has hardware for FAST fp16 – TRUE_HALF_CONFIG
▪ On Pascal and below, store in fp16 but process in fp32 – PSEUDO_HALF_CONFIG
▪ Pooling and un-pooling indices were stored with 8 bits (for 2 x 2 kernel size).
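Both ideas can be sketched in numpy (illustrative only; the real work happens in cuDNN kernels): convert the trained fp32 weights to fp16 once, and store the 2x2 pooling argmax indices as uint8, since each index is only 0..3:

```python
import numpy as np

# fp16 storage: convert the trained fp32 weights once; for typical weight
# magnitudes the rounding error is on the order of 1e-3.
w = np.random.default_rng(0).standard_normal((3, 3, 64, 64)).astype(np.float32)
w16 = w.astype(np.float16)
max_err = np.abs(w16.astype(np.float32) - w).max()

# 8-bit pooling indices: a 2x2 max-pool argmax is only 0..3, so uint8
# (instead of a 32-bit int) is more than enough to drive un-pooling.
def maxpool2x2_with_indices(x):
    h, w_ = x.shape
    windows = (x.reshape(h // 2, 2, w_ // 2, 2)
                 .transpose(0, 2, 1, 3)
                 .reshape(h // 2, w_ // 2, 4))
    return windows.max(axis=-1), windows.argmax(axis=-1).astype(np.uint8)

pooled, idx = maxpool2x2_with_indices(np.arange(16, dtype=np.float32).reshape(4, 4))
```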
Optimization 2: Use FP16 instead of FP32
▪ Results differ slightly -> retrain with FP16
[Figure: FP32 result | FP16 result | Abs difference x100]
Optimization 2: Results
Performance (Titan-V)
Image size FP32 FP16 %Reduction
320 x 320 20ms 13ms 35%
640 x 640 68ms 35ms 48.5%
960 x 960 153ms 87ms 43.1%
1280 x 1280 261ms 153ms 41.4%
Memory (Titan-V)
Image size FP32 FP16 %Reduction
320 x 320 159mb 99mb 37.7%
640 x 640 323mb 210mb 35%
960 x 960 643mb 361mb 43.8%
1280 x 1280 977mb 559mb 42.8%
Optimization 3: Use Tensor Core on Volta
▪ Tensor Core performs half matrix multiply accumulate (HMMA)
▪ cuDNN 7.0 has optimizations for HMMA
▪ Convolutions must use
▪ CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED
▪ CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
▪ x, y, w tensors must be FP16
▪ Input and output filter maps must be a multiple of 8 for alignment
Optimization 3: Use Tensor Core on Volta
▪ Mixed-precision matrix math on 4 x 4 matrices: D = A*B + C
▪ A and B are FP16; C and D may be FP16 or FP32
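The precision contract can be emulated in numpy (a sketch, not actual Tensor Core code): keep the multiplicands in FP16 but accumulate at full precision, which is what distinguishes HMMA from a naive all-FP16 computation:

```python
import numpy as np

# Emulate the HMMA contract D = A*B + C: A and B are fp16, the
# multiply-accumulate happens at higher precision, C and D may be fp32.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C   # fp32 accumulation
naive = (A @ B).astype(np.float32) + C                # fp16 accumulation
```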
Optimization 3: Results
Performance (Titan-V)
Image size FP16 FP16 (with Tensor Core) %Reduction
320 x 320 13ms 5ms 61.5%
640 x 640 35ms 15ms 57.1%
960 x 960 87ms 32ms 63.2%
1280 x 1280 153ms 53ms 65.3%
Memory (Titan-V)
Image size FP16 FP16 (with Tensor Core) %Increase
320 x 320 99mb 111mb 12%
640 x 640 210mb 262mb 24.7%
960 x 960 361mb 533mb 47.6%
1280 x 1280 559mb 914mb 63.5%
Optimization 4: Network Fusing
▪ Original Caffe implementation used 2 networks:
▪ Coarse matting : VGG16 autoencoder
▪ Fine matting : shallow 4 layer cnn
▪ Input A : Original mean-subtracted RGB, as for the coarse network
▪ Input B : Output of coarse network scaled back to 0:255 and mean subtracted.
[Diagram: mean-subtracted RGB + trimap → Coarse Network (conv/deconv encoder-decoder) → rescale coarse output and mean-subtract → Fine Network (conv layers) → Final Output]
Optimization 4: Network Fusing
▪ Causes unnecessary driver overhead copying to and from the CPU
▪ Pre and post processing can be done on the GPU
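These pre/post-processing steps reduce to simple tensor ops that can live on the GPU. A hypothetical numpy sketch (the mean values here are the typical Caffe-style ImageNet means, used only for illustration; the actual constants live in the trained model):

```python
import numpy as np

# Hypothetical channel means for illustration only.
MEAN = np.array([104.0, 117.0, 123.0], dtype=np.float32)

def preprocess(image_u8):
    """Mean subtraction, expressed as a tensor op so it can run on the GPU."""
    return image_u8.astype(np.float32) - MEAN

def coarse_postprocess(coarse_alpha):
    """Rescale the coarse matte from 0..1 back to 0..255 and mean-subtract it
    before it feeds the fine network -- as a layer, not a CPU round trip.
    (Which mean is subtracted here is an assumption for illustration.)"""
    return coarse_alpha * 255.0 - MEAN.mean()
```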
Optimization 4: Network Fusing
▪ Treat both networks as a single network.
▪ Keep mean subtracted RGB on GPU.
▪ Treat coarse output post processing as a custom network layer.
[Diagram: RGB + trimap → preprocess → Coarse network → postprocess → Fine network → Final Output. Everything stays on the GPU.]
Optimization 5: Layer fusing
▪ Some layer operations can be fused
▪ Convolution
▪ Bias add
▪ Activation (e.g. ReLU)
▪ Advantages
▪ Reduces kernel launch overhead
▪ Some arithmetic operations can be combined (e.g. FMAD)
▪ cuDNN has a combined version of Convolution+Bias+Activation
▪ cudnnStatus_t cudnnConvolutionBiasActivationForward(…)
▪ TensorRT will find the best fused configuration at serialization time
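The effect of fusing can be sketched with a toy numpy analogy, writing a 1x1 convolution as a channel matmul (illustrative only; the real fusion happens inside the cuDNN kernel). The arithmetic is identical, but the fused form pays kernel launch overhead once instead of three times:

```python
import numpy as np

def conv_bias_relu_separate(x, w, b):
    y = x @ w                    # "convolution" (1x1 conv == channel matmul)
    y = y + b                    # bias add: a second kernel launch on a GPU
    return np.maximum(y, 0.0)    # ReLU: a third launch

def conv_bias_relu_fused(x, w, b):
    # One pass, analogous to cudnnConvolutionBiasActivationForward:
    # same math, one kernel launch.
    return np.maximum(x @ w + b, 0.0)

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 64))
w = rng.standard_normal((64, 32))
b = rng.standard_normal(32)
assert np.allclose(conv_bias_relu_separate(x, w, b), conv_bias_relu_fused(x, w, b))
```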
Results and Demo
Caffe vs Caffe2 vs Our Optimizations
Performance (Titan-V)
Image size Caffe Caffe2 cuDNN optimized
320 x 320 210ms 76ms 5ms
640 x 640 540ms 195ms 15ms
960 x 960 1040ms 375ms 32ms
1280 x 1280 cannot run 605ms 53ms
Caffe vs Caffe2 vs Our Optimizations
Memory (Titan-V)
Image size Caffe Caffe2 cuDNN optimized
320 x 320 588mb 235mb 111mb
640 x 640 4113mb 1645mb 262mb
960 x 960 8778mb 3582mb 533mb
1280 x 1280 cannot run 6369mb 914mb
Summary
✓ Do better memory management
✓ Use the optimal convolution algorithm based on image and filter size (GEMM/FFT/Winograd)
✓ Do inference at lower precision (fp16 or uint8) if possible
✓ Use hardware-specific optimizations (e.g. HMMA on Tensor Cores)
✓ Do layer fusion
Conclusion and Future Work
▪ Explore WindowsML/DirectML for optimized cross platform inference
▪ Try TensorRT, Nvidia’s latest solution for optimized high performance inference