© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Performance Optimizations for Deep Image Matting in Photoshop
Betty Leong | Photoshop Engineering Manager, Adobe
Salil Tambe | Computer Vision Engineer, Adobe
Chris Hebert | DevTech Engineer, Nvidia
Topics
▪ Introduction to Matting
▪ Deep Matting vs Photoshop Matting
▪ Deployment Challenges
▪ Optimization
▪ Results and Demo
▪ Conclusion
Select and Mask in Photoshop
Matting in Photoshop
Deep Matting
Image Input Output
Xu et al. Deep Image Matting. CVPR 2017.
Brian Price (GTC 2017)
Deep Matting vs Photoshop Matting
Photoshop Matte vs Deep Matte
[Figure: Image, Trimap, Photoshop Matte, Deep Matte side by side. Photoshop matte: hair that should be white is missed. Deep matte: correct matting for hair.]
Photoshop Matte vs Deep Matte
[Figure: Image, Trimap, Photoshop Matte, Deep Matte side by side. Photoshop matte: background grass included. Deep matte: grass excluded from the matte.]
Deployment Challenges
Tech Transfer Challenges
• Resolution (320 x 320)
• Model size (80 MB)
• Memory
• Run-time performance
• Cross-platform support
Image Credits: Deep Image Matting [Xu et al., CVPR 2017]
© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Challenges
▪ Interactive editing with Deep Matting should be possible, but it is computationally expensive!
Matting uses a VGG-Net based encoder-decoder network that requires > 1 sec and 600 MB of memory to run inference on a 320 x 320 image on an Intel i7 8650K CPU using Caffe
▪ It should be deployable on all platforms supported by Photoshop,
i.e. all combinations of Intel, AMD and Nvidia hardware on Mac and Windows
Deep Matting Deep Dive
[Input image + trimap]
Inference Per Tile
Framework used: Caffe (with CUDA and CUDNN)
Encoder:
Conv1 [320 x 320 x 64]
Conv2 [160 x 160 x 128]
Conv3 [80 x 80 x 256]
Conv4 [40 x 40 x 512]
Conv5 [20 x 20 x 512]
Decoder:
Deconv1 [40 x 40 x 256]
Deconv2 [80 x 80 x 128]
Deconv3 [160 x 160 x 64]
Deconv4 [320 x 320 x 64]
Deconv5 / Alpha Prediction [320 x 320 x 1]
Image Credits: Deep Image Matting [Xu et al., CVPR 2017]
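As a sanity check on the shapes above: the encoder follows the usual VGG-16 pattern of halving the spatial size at each block while widening the channels. A small illustrative sketch (not the shipping code):

```python
# Illustrative sketch: reproduce the encoder tensor shapes on this slide.
# Each VGG-16 style conv block is followed by 2x2 max pooling, so the
# spatial size halves while the channel count grows.
def encoder_shapes(size=320):
    channels = [64, 128, 256, 512, 512]  # Conv1..Conv5 widths from the slide
    return [(size // 2**i, size // 2**i, c) for i, c in enumerate(channels)]

print(encoder_shapes())
# [(320, 320, 64), (160, 160, 128), (80, 80, 256), (40, 40, 512), (20, 20, 512)]
```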
Fine Network
[Image + coarse matte → Refinement Network → Final Matte]
Explorations/Experiments
▪ Collaborated with the Nvidia DevTech ProVis team to improve per-tile inference performance
▪ Chris Hebert – DevTech Engineer
▪ Customized inference with cuDNN kernels for optimal performance
Optimization
Caffe: Inference Timeline (nvprof)
[nvprof timeline: 160ms per tile, Titan-X]
Caffe: CPU Overhead
[Timeline: 160ms per tile, Titan-X]
Layer weights transferred one at a time
Caffe: Uses FFT (memory intensive)
[Timeline: 160ms per tile, Titan-X; FFT convolution kernels highlighted]
cuDNN – A bit like OpenGL for Neural Networks
▪ Networks for inferencing are not difficult to implement with cuDNN
▪ cuDNN provides a set of common network operations
▪ Convolution
▪ Activation
▪ Tensor ops – add, multiply, etc.
▪ Highly optimized for the respective HW architectures
▪ cuDNN is the backend for most frameworks that target NVIDIA hardware
Optimization 1: Better memory management
▪ Pre-allocate the max buffer size required for the workspace
Output layer sizes
Conv1_1[320 x 320 x 64] Memory: 25mb
Conv1_2[320 x 320 x 64] Memory: 25mb
Conv2_1[160 x 160 x 128] Memory: 12.5mb
Conv2_2[160 x 160 x 128] Memory: 12.5mb
Conv3_1[80 x 80 x 256] Memory: 6.25mb
Conv3_2[80 x 80 x 256] Memory: 6.25mb
Conv3_3[80 x 80 x 256] Memory: 6.25mb
Conv4_1[40 x 40 x 512] Memory: 3.125mb
Conv4_2[40 x 40 x 512] Memory: 3.125mb
Conv4_3[40 x 40 x 512] Memory: 3.125mb
Conv5_1[20 x 20 x 512] Memory: 0.78125mb
Conv5_2[20 x 20 x 512] Memory: 0.78125mb
Conv5_3[20 x 20 x 512] Memory: 0.78125mb
Deconv1[40 x 40 x 256] Memory: 1.5625mb
Deconv2[80 x 80 x 128] Memory: 3.125mb
Deconv3[160 x 160 x 64] Memory: 6.25mb
Deconv4[320 x 320 x 64] Memory: 25mb
Deconv5[320 x 320 x 1] Memory: 0.390625mb
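The figures in this table are just H x W x C fp32 activations; a quick sketch verifying them, assuming 4-byte floats and MiB units:

```python
def activation_mib(h, w, c, bytes_per_elem=4):
    # One fp32 activation tensor: H * W * C elements, reported in MiB.
    return h * w * c * bytes_per_elem / 2**20

print(activation_mib(320, 320, 64))   # 25.0 -> matches Conv1_1
print(activation_mib(160, 160, 128))  # 12.5 -> matches Conv2_1
print(activation_mib(20, 20, 512))    # 0.78125 -> matches Conv5_1
```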
Optimization 1(a): Pre-allocate the max buffer size required
Optimization 1: Better memory management
▪ Pre-allocate the max buffer size required for the workspace
[Diagram: memory pool = 2x the largest tensor (buffer A: 25mb, buffer B: 25mb); Conv1_1, Conv1_2, Conv1_3, Pool1, Conv2_1, ... alternate writing between buffers A and B]
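The scheme above can be sketched as a tiny hypothetical pool class (illustrative only, not the Photoshop implementation): two buffers sized to the largest activation alternate as each layer's input and output, so total activation memory stays at 2x the largest tensor regardless of layer count:

```python
class PingPongPool:
    """Two fixed buffers sized to the largest tensor; layers alternate
    between them, so layer N's output buffer is layer N+1's input."""
    def __init__(self, largest_tensor_bytes):
        self.buffers = [bytearray(largest_tensor_bytes),
                        bytearray(largest_tensor_bytes)]
        self.current = 0

    def next_output(self):
        self.current ^= 1          # flip A <-> B
        return self.buffers[self.current]

pool = PingPongPool(64 * 320 * 320 * 4)   # 25 MiB, the largest tensor above
out1 = pool.next_output()                 # Conv1_1 writes here
out2 = pool.next_output()                 # Conv1_2 reads out1, writes here
```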
Optimization 1: Better memory management
▪ Pre-allocate the max buffer size required for the workspace
▪ Carefully choose convolution algorithm for performance and memory requirements
▪ FFT – Fast, but requires a lot of device workspace memory
▪ GEMM – In place general matrix multiply
▪ Winograd – fast, but unstable for large filter sizes.
▪ Convolution algorithm chosen on a per layer basis
▪ According to per layer constraints
▪ Share max per layer workspace memory between all layers
▪ Re-use the buffer for the computation of each layer
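The per-layer selection can be sketched as follows. The (algorithm, time, workspace) candidates here are hypothetical stand-ins for what cuDNN's algorithm queries would report; the point is the policy of picking the fastest algorithm that fits a memory budget, then sharing one workspace sized to the maximum:

```python
def choose_algorithms(layers, budget_bytes):
    """Per layer, pick the fastest algorithm whose workspace fits the budget;
    the single shared workspace is the max over the chosen layers."""
    chosen, shared_ws = {}, 0
    for name, candidates in layers.items():
        feasible = [c for c in candidates if c[2] <= budget_bytes]
        algo, _, ws = min(feasible, key=lambda c: c[1])  # fastest feasible
        chosen[name] = algo
        shared_ws = max(shared_ws, ws)
    return chosen, shared_ws

# Hypothetical (algo, time_ms, workspace_bytes) candidates per layer:
layers = {
    "Conv1_2": [("FFT", 0.8, 40 << 20), ("WINOGRAD", 1.0, 1 << 20)],
    "Conv5_1": [("IMPLICIT_GEMM", 1.5, 0), ("FFT", 0.9, 80 << 20)],
}
print(choose_algorithms(layers, budget_bytes=16 << 20))
```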
Optimization 1(b): Load the entire model in one go (Titan-X)
Weights + biases ~ 87 MB
Conv1_1 [3 x 3 x 4 x 64] Memory: 0.00878906mb
Conv1_2 [3 x 3 x 64 x 64] Memory: 0.140625mb
Conv2_1 [3 x 3 x 64 x 128] Memory: 0.28125mb
Conv2_2 [3 x 3 x 128 x 128] Memory: 0.5625mb
Conv3_1 [3 x 3 x 128 x 256] Memory: 1.125mb
Conv3_2 [3 x 3 x 256 x 256] Memory: 2.25mb
Conv3_3 [3 x 3 x 256 x 256] Memory: 2.25mb
Conv4_1 [3 x 3 x 256 x 512] Memory: 4.5mb
Conv4_2 [3 x 3 x 512 x 512] Memory: 9mb
Conv4_3 [3 x 3 x 512 x 512] Memory: 9mb
Conv5_1 [3 x 3 x 512 x 512] Memory: 9mb
Conv5_2 [3 x 3 x 512 x 512] Memory: 9mb
Deconv5 [5 x 5 x 512 x 512] Memory: 25mb
Deconv4 [5 x 5 x 512 x 256] Memory: 12.5mb
Deconv3 [5 x 5 x 256 x 128] Memory: 3.125mb
Deconv2 [5 x 5 x 128 x 64] Memory: 0.78125mb
Deconv1 [5 x 5 x 64 x 64] Memory: 0.390625mb
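One way to avoid the one-layer-at-a-time copies seen in the Caffe timeline is to pack all parameters into a single contiguous blob and upload it in one transfer. An illustrative numpy sketch (the layer names and the `pack_weights` helper are hypothetical):

```python
import numpy as np

def pack_weights(tensors):
    """Concatenate all parameter arrays into one contiguous fp32 blob so the
    whole model uploads in a single host-to-device copy; remember each
    tensor's (offset, length) so layers can bind into the blob later."""
    offsets, cursor = {}, 0
    for name, t in tensors.items():
        offsets[name] = (cursor, t.size)
        cursor += t.size
    blob = np.empty(cursor, dtype=np.float32)
    for name, t in tensors.items():
        start, n = offsets[name]
        blob[start:start + n] = t.ravel()
    return blob, offsets

params = {"Conv1_1/w": np.ones((3, 3, 4, 64), np.float32),
          "Conv1_1/b": np.zeros(64, np.float32)}
blob, offsets = pack_weights(params)   # one transfer instead of one per layer
```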
Optimization 1(c): Choose optimal convolution algorithm (Titan-X)
Conv1_1 [3 x 3 x 4 x 64] Memory: 0.59mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Conv1_2 [3 x 3 x 64 x 64] Memory: 0.25mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv2_1 [3 x 3 x 64 x 128] Memory: 0.50mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv2_2 [3 x 3 x 128 x 128] Memory: 1mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv3_1 [3 x 3 x 128 x 256] Memory: 2mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv3_2 [3 x 3 x 256 x 256] Memory: 4mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv3_3 [3 x 3 x 256 x 256] Memory: 4mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv4_1 [3 x 3 x 256 x 512] Memory: 8mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv4_2 [3 x 3 x 512 x 512] Memory: 16mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv4_3 [3 x 3 x 512 x 512] Memory: 16mb CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
Conv5_1 [3 x 3 x 512 x 512] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
Conv5_2 [3 x 3 x 512 x 512] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
Deconv5 [5 x 5 x 512 x 512] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Deconv4 [5 x 5 x 512 x 256] Memory: 0mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
Deconv3 [5 x 5 x 256 x 128] Memory: 0.04mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Deconv2 [5 x 5 x 128 x 64] Memory: 0.15mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Deconv1 [5 x 5 x 64 x 64] Memory: 0.59mb CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Optimization 1
Inference: 17ms; start-up time (one-time overhead): 41ms
Titan-X
Optimization 1: Results
Memory (Titan-V)
Image size Caffe Optimization 1 %Reduction
320 x 320 588mb 159mb 73%
640 x 640 4113mb 323mb 92%
960 x 960 8778mb 643mb 92.7%
1280 x 1280 cannot run 977mb -
Performance (Titan-V)
Image size Caffe Optimization 1 %Reduction
320 x 320 210ms 20ms 90.5%
640 x 640 540ms 68ms 87.4%
960 x 960 1.04sec 153ms 85%
1280 x 1280 cannot run 261ms -
Optimization 2: Use FP16 instead of FP32
▪ All the weights for convolution and de-convolution were converted to float16.
▪ Volta has hardware for FAST fp16 – TRUE_HALF_CONFIG
▪ On Pascal and below, store in fp16 but process in fp32 – PSEUDO_HALF_CONFIG
▪ Pooling and un-pooling indices were stored with 8 bits (for 2 x 2 kernel size).
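Both ideas can be sketched in numpy (illustrative only; the real work happens in cuDNN kernels): convert the trained fp32 weights to fp16 once, and store the 2x2 pooling argmax indices as uint8, since each index is only 0..3:

```python
import numpy as np

# fp16 storage: convert the trained fp32 weights once; for typical weight
# magnitudes the rounding error is on the order of 1e-3.
w = np.random.default_rng(0).standard_normal((3, 3, 64, 64)).astype(np.float32)
w16 = w.astype(np.float16)
max_err = np.abs(w16.astype(np.float32) - w).max()

# 8-bit pooling indices: a 2x2 max-pool argmax is only 0..3, so uint8
# (instead of a 32-bit int) is more than enough to drive un-pooling.
def maxpool2x2_with_indices(x):
    h, w_ = x.shape
    windows = (x.reshape(h // 2, 2, w_ // 2, 2)
                 .transpose(0, 2, 1, 3)
                 .reshape(h // 2, w_ // 2, 4))
    return windows.max(axis=-1), windows.argmax(axis=-1).astype(np.uint8)

pooled, idx = maxpool2x2_with_indices(np.arange(16, dtype=np.float32).reshape(4, 4))
```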
Optimization 2: Use FP16 instead of FP32
▪ Results differ slightly -> retrain with FP16
[Figure: FP32 result | FP16 result | Abs difference x100]
Optimization 2: Results
Performance (Titan-V)
Image size FP32 FP16 %Reduction
320 x 320 20ms 13ms 35%
640 x 640 68ms 35ms 48.5%
960 x 960 153ms 87ms 43.1%
1280 x 1280 261ms 153ms 41.4%
Memory (Titan-V)
Image size FP32 FP16 %Reduction
320 x 320 159mb 99mb 37.7%
640 x 640 323mb 210mb 35%
960 x 960 643mb 361mb 43.8%
1280 x 1280 977mb 559mb 42.8%
Optimization 3: Use Tensor Core on Volta
▪ Tensor Core performs half matrix multiply accumulate (HMMA)
▪ cuDNN 7.0 has optimizations for HMMA
▪ Convolutions must use
▪ CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED
▪ CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
▪ x, y, w tensors must be FP16
▪ Input and output filter maps must be a multiple of 8 for alignment
Optimization 3: Use Tensor Core on Volta
▪ Mixed-precision matrix math on 4 x 4 matrices: D = A*B + C
▪ A and B are FP16; C and D may be FP16 or FP32
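The precision contract can be emulated in numpy (a sketch, not actual Tensor Core code): keep the multiplicands in FP16 but accumulate at full precision, which is what distinguishes HMMA from a naive all-FP16 computation:

```python
import numpy as np

# Emulate the HMMA contract D = A*B + C: A and B are fp16, the
# multiply-accumulate happens at higher precision, C and D may be fp32.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C   # fp32 accumulation
naive = (A @ B).astype(np.float32) + C                # fp16 accumulation
```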
Optimization 3: Results
Performance (Titan-V)
Image size FP16 FP16 (with Tensor Core) %Reduction
320 x 320 13ms 5ms 61.5%
640 x 640 35ms 15ms 57.1%
960 x 960 87ms 32ms 63.2%
1280 x 1280 153ms 53ms 65.3%
Memory (Titan-V)
Image size FP16 FP16 (with Tensor Core) %Increase
320 x 320 99mb 111mb 12%
640 x 640 210mb 262mb 24.7%
960 x 960 361mb 533mb 47.6%
1280 x 1280 559mb 914mb 63.5%
Optimization 4: Network Fusing
▪ Original Caffe implementation used 2 networks:
▪ Coarse matting : VGG16 autoencoder
▪ Fine matting : shallow 4 layer cnn
▪ Input A : Original mean-subtracted RGB, as for the coarse network
▪ Input B : Output of coarse network scaled back to 0:255 and mean subtracted.
[Diagram: mean-subtracted RGB + trimap → Coarse Network (conv/deconv encoder-decoder) → rescale coarse output and mean-subtract → Fine Network (conv layers) → Final Output]
Optimization 4: Network Fusing
▪ Causes unnecessary driver overhead copying to and from the CPU
▪ Pre and post processing can be done on the GPU
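These pre/post-processing steps reduce to simple tensor ops that can live on the GPU. A hypothetical numpy sketch (the mean values here are the typical Caffe-style ImageNet means, used only for illustration; the actual constants live in the trained model):

```python
import numpy as np

# Hypothetical channel means for illustration only.
MEAN = np.array([104.0, 117.0, 123.0], dtype=np.float32)

def preprocess(image_u8):
    """Mean subtraction, expressed as a tensor op so it can run on the GPU."""
    return image_u8.astype(np.float32) - MEAN

def coarse_postprocess(coarse_alpha):
    """Rescale the coarse matte from 0..1 back to 0..255 and mean-subtract it
    before it feeds the fine network -- as a layer, not a CPU round trip.
    (Which mean is subtracted here is an assumption for illustration.)"""
    return coarse_alpha * 255.0 - MEAN.mean()
```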
Optimization 4: Network Fusing
▪ Treat both networks as a single network.
▪ Keep mean subtracted RGB on GPU.
▪ Treat coarse output post processing as a custom network layer.
[Diagram: RGB + trimap → preprocess → Coarse network → postprocess → Fine network → Final Output. Everything stays on the GPU.]
Optimization 5: Layer fusing
▪ Some layer operations can be fused
▪ Convolution
▪ Bias add
▪ Activation (e.g. ReLU)
▪ Advantages
▪ Reduces kernel launch overhead
▪ Some arithmetic operations can be combined (e.g. FMAD)
▪ cuDNN has a combined version of Convolution+Bias+Activation
▪ cudnnStatus_t cudnnConvolutionBiasActivationForward(…)
▪ TensorRT will find the best fused configuration at serialization time
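The effect of fusing can be sketched with a toy numpy analogy, writing a 1x1 convolution as a channel matmul (illustrative only; the real fusion happens inside the cuDNN kernel). The arithmetic is identical, but the fused form pays kernel launch overhead once instead of three times:

```python
import numpy as np

def conv_bias_relu_separate(x, w, b):
    y = x @ w                    # "convolution" (1x1 conv == channel matmul)
    y = y + b                    # bias add: a second kernel launch on a GPU
    return np.maximum(y, 0.0)    # ReLU: a third launch

def conv_bias_relu_fused(x, w, b):
    # One pass, analogous to cudnnConvolutionBiasActivationForward:
    # same math, one kernel launch.
    return np.maximum(x @ w + b, 0.0)

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 64))
w = rng.standard_normal((64, 32))
b = rng.standard_normal(32)
assert np.allclose(conv_bias_relu_separate(x, w, b), conv_bias_relu_fused(x, w, b))
```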
Results and Demo
Caffe vs Caffe2 vs Our Optimizations
Performance (Titan-V)
Image size Caffe Caffe2 cuDNN optimized
320 x 320 210ms 76ms 5ms
640 x 640 540ms 195ms 15ms
960 x 960 1040ms 375ms 32ms
1280 x 1280 cannot run 605ms 53ms
Caffe vs Caffe2 vs Our Optimizations
Memory (Titan-V)
Image size Caffe Caffe2 cuDNN optimized
320 x 320 588mb 235mb 111mb
640 x 640 4113mb 1645mb 262mb
960 x 960 8778mb 3582mb 533mb
1280 x 1280 cannot run 6369mb 914mb
Summary
✓ Do better memory management
✓ Use the optimal convolution algorithm based on image and filter size (GEMM/FFT/Winograd)
✓ Do inference at lower precision (fp16 or uint8) if possible
✓ Use hardware-specific optimizations (e.g. HMMA on Tensor Cores)
✓ Do layer fusion
Conclusion and Future Work
▪ Explore WindowsML/DirectML for optimized cross platform inference
▪ Try TensorRT, Nvidia’s latest solution for optimized high performance inference