CNN & cuDNN - Piazza
CNN & cuDNN, Bin ZHOU, USTC, Jan. 2015
Transcript
Page 1: CNN & cuDNN - Piazza

CNN & cuDNN

Bin ZHOU

USTC Jan.2015

Page 2: CNN & cuDNN - Piazza

Acknowledgement

Reference:

1) Introducing NVIDIA® cuDNN, Sharan Chetlur, Software Engineer, CUDA Libraries and Algorithms Group

2) A Multi-GPU Parallel Framework for Deep Convolutional Neural Networks (CNNs) and Its Application to Image Recognition -- http://data.qq.com/article?id=1516

Page 3: CNN & cuDNN - Piazza

CNN

Figure 1. ImageNet CNN Model

Page 4: CNN & cuDNN - Piazza

Recall BP Network

(Figure: numbered neurons of layer l connected to numbered neurons of layer l-1.)

Page 5: CNN & cuDNN - Piazza

BP Brief Review

• Cost (Loss) Function: evaluates the output of the network

• Common Cost Functions

• MSE (Mean Squared Error)

• Cross Entropy
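For concreteness, the two cost functions can be sketched in plain Python (the function names are illustrative, not from any library):

```python
import math

def mse(outputs, targets):
    # Mean squared error: average of squared differences
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

def cross_entropy(outputs, targets):
    # Binary cross-entropy for outputs in (0, 1)
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for o, t in zip(outputs, targets)) / len(outputs)
```

Cross-entropy penalizes confident wrong answers much more heavily than MSE, which is why it is the usual choice for classification networks.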

Page 6: CNN & cuDNN - Piazza
Page 7: CNN & cuDNN - Piazza

CNN Brief

Interpret an AI task as the evaluation of a complex function

Facial Recognition: map a bunch of pixels to a name

Handwriting Recognition: image to a character

Neural Network: network of interconnected simple "neurons"

A neuron is typically made up of 2 stages:

Linear transformation of the data

Point-wise application of a non-linear function

In a CNN, the linear transformation is a convolution
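A minimal sketch of those two stages in plain Python (names are illustrative):

```python
def relu(x):
    # Stage 2: point-wise non-linear function
    return max(0.0, x)

def neuron(inputs, weights, bias):
    # Stage 1: linear transformation of the data
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Stage 2 applied to the result
    return relu(z)
```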

2015/1/25

Page 8: CNN & cuDNN - Piazza

cuDNN

Implementations of routines:

Convolution

Pooling

Softmax

Neuron activations, including: Sigmoid, Rectified Linear (ReLU), Hyperbolic Tangent (TANH)


Page 9: CNN & cuDNN - Piazza

CNNs: Stacked Repeating Triplets

Convolution: filtering by kernel, with stride; linear

Activation: non-linear

Pooling: block-wise max (max-pooling), by kernel with stride; non-linear
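The triplet can be sketched end-to-end in plain Python. This is an illustrative toy, not cuDNN's implementation; note that CNN "convolution" is coded here, as in most frameworks, as cross-correlation:

```python
def conv2d(image, kernel):
    # "Valid" 2-D convolution (really cross-correlation, as in most CNN code)
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def relu2d(fmap):
    # Point-wise non-linearity over a feature map
    return [[max(0.0, v) for v in row] for row in fmap]

def maxpool2d(fmap, size=2, stride=2):
    # Block-wise max over windows of the feature map
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, stride)]
            for i in range(0, len(fmap) - size + 1, stride)]
```

Stacking `conv2d`, `relu2d`, `maxpool2d` repeatedly is exactly the "repeating triplets" pattern of the slide.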


Page 10: CNN & cuDNN - Piazza

Applications?

Anyone Enlighten me?

You can bring more brilliant applications


Page 11: CNN & cuDNN - Piazza

Multi-convolve overview

Linear Transformation part of the CNN neuron

Main computational workload

80-90% of execution time

Generalization of the 2D convolution (a 4D tensor convolution)

Very compute intensive, therefore good for GPUs

However, not easy to implement efficiently
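A naive reference implementation of the 4D tensor convolution (batch N, input channels C, filters K), purely illustrative and nowhere near the efficiency cuDNN targets:

```python
def multi_convolve(x, w):
    # x: input   [N][C][H][W]  (batch, channels, height, width)
    # w: filters [K][C][R][S]  (output channels, input channels, filter h, w)
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P, Q = H - R + 1, W - S + 1          # "valid" output size
    y = [[[[0.0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for n in range(N):                   # every image in the batch
        for k in range(K):               # every output feature map
            for p in range(P):
                for q in range(Q):
                    y[n][k][p][q] = sum(
                        x[n][c][p + r][q + s] * w[k][c][r][s]
                        for c in range(C) for r in range(R) for s in range(S))
    return y
```

The seven nested loops make the compute intensity obvious: this is the 80-90% of execution time the slide refers to.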


Page 12: CNN & cuDNN - Piazza

Multi-convolve, pictorially

Good Parallelism

Why do it once if you can do it n times ? Batch the whole thing, to get parallelism.


Page 13: CNN & cuDNN - Piazza

cuDNN: GPU-accelerated CNN lib

Low-level library of GPU-accelerated routines; similar in intent to BLAS

Out-of-the-box speedup of Neural Networks

Developed and maintained by NVIDIA

Optimized for current and future NVIDIA GPU generations

First release focused on Convolutional Neural Networks


Page 14: CNN & cuDNN - Piazza

cuDNN Features

Flexible API: arbitrary dimension ordering, striding, and sub-regions for 4D tensors

Less memory, more performance: efficient forward and backward convolution routines with zero memory overhead

Easy integration: black-box implementation of convolution and other routines: ReLU, Sigmoid, Tanh, Pooling, Softmax


Page 15: CNN & cuDNN - Piazza

Tensor-4d: Important

Image batches described as a 4D Tensor[n, c, h, w] with stride support [nStride, cStride, hStride, wStride]

Allows flexible data layout

Easy access to subsets of features (Caffe's "groups")

Implicit cropping of sub-images

Plan to handle negative strides
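The stride vector is what makes the layout flexible: it turns a 4D index into a flat buffer offset. A sketch of the addressing rule (the packed-NCHW stride values in the test are one possible layout, not a cuDNN requirement):

```python
def tensor4d_index(n, c, h, w, n_stride, c_stride, h_stride, w_stride):
    # Flat offset of element (n, c, h, w) under arbitrary strides.
    # Sub-images and feature subsets fall out of this for free: shift
    # the base offset and shrink the extents, keeping the same strides.
    return n * n_stride + c * c_stride + h * h_stride + w * w_stride
```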

Page 16: CNN & cuDNN - Piazza

Example: OverFeat Layer 1


Page 17: CNN & cuDNN - Piazza

Real code that runs

Under Linux

Demonstration


Page 18: CNN & cuDNN - Piazza

Implementation 1: 2D conv as a GEMV
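One way to realize "2D conv as a GEMV" (a sketch of the general idea, not necessarily the exact scheme on the slide): gather each filter-sized patch of the image into a row of a matrix, so the whole convolution becomes a single matrix-vector product:

```python
def im2row(image, kh, kw):
    # Each output position becomes one row holding its kh*kw input patch
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            for i in range(out_h) for j in range(out_w)]

def conv2d_as_gemv(image, kernel):
    # y = A @ k : one matrix-vector product computes the whole 2-D conv
    kh, kw = len(kernel), len(kernel[0])
    a = im2row(image, kh, kw)
    kvec = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    return [sum(ai * ki for ai, ki in zip(row, kvec)) for row in a]
```

The payoff is that the irregular convolution is reduced to a dense linear-algebra primitive for which highly tuned GPU kernels already exist.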


Page 19: CNN & cuDNN - Piazza

Multi-convolve

More of the same, just a little different: longer dot products, more filter kernels, a batch of images rather than just one. Mathematically:


Page 20: CNN & cuDNN - Piazza

Implementation 2: Multi-convolve as GEMM
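The GEMM formulation can be sketched for a single image: an im2col step lowers the input into a patch matrix, the filters form the other operand, and one matrix-matrix multiply produces every output map at once (illustrative only, not cuDNN's actual kernel):

```python
def conv_as_gemm(x, w):
    # x: one image [C][H][W]; w: filters [K][C][R][S]
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P, Q = H - R + 1, W - S + 1
    # im2col: one column of length C*R*S per output position
    cols = [[x[c][p + r][q + s]
             for c in range(C) for r in range(R) for s in range(S)]
            for p in range(P) for q in range(Q)]
    # filter matrix: one row of length C*R*S per output channel
    rows = [[w[k][c][r][s]
             for c in range(C) for r in range(R) for s in range(S)]
            for k in range(K)]
    # GEMM: [K x C*R*S] @ [C*R*S x P*Q] -> [K x P*Q]
    return [[sum(f * v for f, v in zip(frow, col)) for col in cols]
            for frow in rows]
```

Batching simply widens the right-hand matrix, which is exactly why the batched multi-convolve maps so well onto GPU GEMM.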


Page 21: CNN & cuDNN - Piazza

Performance


Page 22: CNN & cuDNN - Piazza

cuDNN Integration

cuDNN is already integrated in major open-source frameworks

Caffe

Torch


Page 23: CNN & cuDNN - Piazza

Using Caffe with cuDNN

Accelerates Caffe layer types by 1.2-3x

On average, 36% faster overall for training on AlexNet

Integrated into the Caffe dev branch today! (official release with Caffe 1.0)

Seamless integration with a global switch

*CPU is 24 core E5-2697v2 @ 2.4GHz Intel MKL 11.1.3

Page 24: CNN & cuDNN - Piazza

Caffe with cuDNN: No Programming Required

layers {
  name: "MyData"
  type: DATA
  top: "data"
  top: "label"
}
layers {
  name: "Conv1"
  type: CONVOLUTION
  bottom: "MyData"
  top: "Conv1"
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 4
  }
}
layers {
  name: "Conv2"
  type: CONVOLUTION
  bottom: "Conv1"
  top: "Conv2"
  convolution_param {
    num_output: 256
    kernel_size: 5
  }
}


Page 25: CNN & cuDNN - Piazza

Caffe with cuDNN: Life is easy

install cuDNN

uncomment the USE_CUDNN := 1 flag in Makefile.config when installing Caffe.

Acceleration is automatic


Page 26: CNN & cuDNN - Piazza

NVIDIA® cuDNN Roadmap


Page 27: CNN & cuDNN - Piazza

cuDNN availability

Free for registered developers!

Release 1 / Release 2 – RC

available on Linux/Windows 64bit

GPU support for Kepler and newer

Already done: Tegra K1 (Jetson board), Mac OS X support


Page 28: CNN & cuDNN - Piazza

Multi-GPU with CNN

Problem:
1) A single GPU has limited memory, which limits the size of the network
2) A single GPU is still too slow for some very large-scale networks


Page 29: CNN & cuDNN - Piazza

Multi-GPU Challenge

First, how to parallelize the whole process, to avoid or reduce data dependencies between different nodes

Data IO and distribution to the different nodes

Pipelining and IO/execution overlap to hide latency

Synchronization between all the nodes
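A toy sketch of the data-parallel pieces of that process: splitting a batch across workers, and the synchronization step that averages their gradients (illustrative only; the framework described in the slides is not reproduced here):

```python
def split_batch(batch, n_workers):
    # Distribute a batch across workers; round-robin keeps shard sizes balanced
    return [batch[i::n_workers] for i in range(n_workers)]

def average_gradients(per_worker_grads):
    # The synchronization step: workers combine by averaging element-wise
    n = len(per_worker_grads)
    return [sum(g) / n for g in zip(*per_worker_grads)]
```

In a real system the averaging is the communication bottleneck, which is why it must be overlapped with IO and compute as the slide suggests.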


Page 30: CNN & cuDNN - Piazza

Multi-GPU Strategy


Page 31: CNN & cuDNN - Piazza

Data distribution, IO/Exe overlap


Page 32: CNN & cuDNN - Piazza

8-GPU server


Page 33: CNN & cuDNN - Piazza

Pipeline and Stream processing in CNN


Page 34: CNN & cuDNN - Piazza

Familiar? It's a DAG!


Page 35: CNN & cuDNN - Piazza

My Algorithm for DAG auto-Parallelization
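The author's actual algorithm is shown only as a figure on this slide. As a generic illustration of the idea, a DAG can be auto-parallelized by grouping nodes into topological levels, where everything within one level can run concurrently:

```python
def parallel_levels(deps):
    # deps: node -> set of prerequisite nodes. Returns lists of nodes that
    # can run concurrently: each level depends only on earlier levels.
    remaining = {n: set(d) for n, d in deps.items()}
    done, levels = set(), []
    while remaining:
        ready = [n for n, d in remaining.items() if d <= done]
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        levels.append(sorted(ready))
        done.update(ready)
        for n in ready:
            del remaining[n]
    return levels
```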


Page 36: CNN & cuDNN - Piazza

Test Case


Page 37: CNN & cuDNN - Piazza

A More Complex Case


Page 38: CNN & cuDNN - Piazza

Speed with Multi-GPUs

Configuration                Speedup vs. 1 GPU
2 GPUs, Model P.             1.71
2 GPUs, Data P.              1.85
4 GPUs, Data P. + Model P.   2.52
4 GPUs, Data P.              2.67


Page 39: CNN & cuDNN - Piazza

Conclusion

GPUs are very well suited for CNNs

cuDNN is easy to use and delivers good performance

Multi-GPU is improving further

A carefully designed parallel scheme on multiple GPUs can achieve adequate scalability


