A Powerful, Flexible, and Intuitive Deep Learning Framework
@ NVIDIA GTC, April 6th, 2016
Shohei Hido
Chief Research Officer
Preferred Networks, Inc.
Overview
- Chainer is a Python-based deep learning framework
- Chainer v1.0 was released as open source in June 2015
- It DOESN'T rely on Theano, unlike other Python frameworks
- Chainer uses a unique scheme named Define-by-Run
- http://chainer.org/
- Why do users still need another framework?
- How different and effective is Chainer?
Preferred Networks (PFN): A startup that applies deep learning to industrial IoT
- Founded: March 2014
- Headquarters: Tokyo, Japan
- U.S. subsidiary: San Mateo, California
- Company size: 35 engineers & researchers
- Investors: Toyota, FANUC, NTT
- Focus: deep learning for industrial IoT in manufacturing, automotive, and healthcare
Partnering with world-leading companies using Chainer
- R&D collaborations on industrial problems with real-world data: specific requirements, modified algorithms, many trials and errors, etc.
- Different from building a general-purpose recognition system
- Partners include Toyota, FANUC, Panasonic, NTT, Cisco, and NVIDIA
Two types of background behind DL frameworks
1. Scalability-oriented
- Use cases in mind: image/speech recognition systems, fast DL as a service in the cloud
- Problem type: a few general applications, 10+ million training samples, 10+ node clusters with fast networks
- Possible bottlenecks: tuning of well-known algorithms, distributed computation for model/data-parallel training
2. Flexibility-oriented
- Use cases in mind: algorithm research, R&D projects for new products
- Problem type: various specific applications, 10+k training samples, one node with multiple GPUs
- Possible bottlenecks: trial-and-error in prototyping; debugging, profiling & refactoring (wait time during compilation)
Designed for efficient research & development
- Flexible: new kinds of complex models for various applications
- Intuitive: rapid prototyping and efficient trial-and-error
- Powerful: comparable performance for 1 node & multiple GPUs
[Figure: frameworks positioned on a spectrum from scalability-oriented to flexibility-oriented.]
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Neural network and computation
[Figure: a feed-forward network with inputs x1...xN, hidden units h1...hH and k1...kM, and outputs y1...yM. Forward computation maps inputs (text, image, sensor) to outputs (object: tulip, anomaly score: 0.35, category: sports); backward computation (backpropagation) propagates gradients back through the network.]
Chainer focuses on network representation/training
- Design choices for deep learning frameworks:
  How to build neural networks?
  How to train neural networks?
  Which text format/language for modeling?
  Which language for computing?
  Run with a GPU?
  Run on multiple GPUs?
  Run on multiple compute nodes?
Building and training neural networks: Computational graph construction is the key
1. Construct a computational graph
   - Based on the network definition given by users
   - Chains of functions and operations on input variables
2. Compute loss and gradients
   - Forward computation calculates the loss for a minibatch
   - Backpropagation gives gradients for all parameters
3. Optimize the model
   - Update each parameter with its gradient
   - Repeat until convergence
Step 1 is the most important, and there are many approaches (a minimal sketch of steps 2 and 3 follows).
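Concretely, steps 2 and 3 reduce to a few imperative lines in Chainer. This is only a minimal sketch using the names that appear in the code samples later in this talk (model, optimizer, x, t), not a complete program:

  loss = model(x, t)     # step 2a: the forward pass builds the graph and computes the loss
  model.zerograds()      # clear gradients accumulated in the previous iteration
  loss.backward()        # step 2b: backpropagation fills .grad of every parameter
  optimizer.update()     # step 3: update each parameter using its gradient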
Building blocks
- These functionalities are very similar between frameworks
- But the structure, abstraction level, and interface are different
- It comes down to the design of a domain-specific language for NN
  Array data structure (vector/matrix/tensor)
  Operations & functions
  Network (computational graph)
  Optimizer (SGD/AdaGrad/Adam)
  (a loose mapping of these blocks to Chainer objects follows)
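As a rough orientation (an illustrative mapping, not part of the original slide), these four building blocks correspond to the following NumPy/Chainer objects:

  import numpy as np
  import chainer.functions as F
  import chainer.links as L
  from chainer import Chain, Variable, optimizers

  a = np.zeros((32, 784), dtype=np.float32)  # array data structure (NumPy/CuPy ndarray)
  net = Chain(l1=L.Linear(784, 10))          # network: a Chain of Links
  y = F.relu(net.l1(Variable(a)))            # operations & functions (chainer.functions)
  opt = optimizers.SGD()                     # optimizer (SGD/AdaGrad/Adam)
  opt.setup(net)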
Types of domain-specific language for neural networks
- Text DSL: Ex. Caffe (prototxt), Ex. CNTK (NDL)

  %% Definition in text (f.txt)
  f: {"A": "Variable", "B": "Variable", "C": ["B", "*", "A"], "ret": ["C", "+", 1]}
  # Compile
  f = compile("f.txt")
  d = f(A=np.ones(10), B=np.ones(10)*2)

- Symbolic program: operations on symbols. Ex. Theano, Ex. TensorFlow

  # Symbolic definition
  A = Variable('A')
  B = Variable('B')
  C = B * A
  D = C + Constant(1)
  # Compile
  f = compile(D)
  d = f(A=np.ones(10), B=np.ones(10)*2)

- Imperative program: direct computations on raw data arrays. Ex. Torch.nn, Ex. Chainer

  # Imperative declaration
  a = np.ones(10)
  b = np.ones(10)*2
  c = b * a
  d = c + 1

- Ex. MXNet (supports both symbolic and imperative styles)
Comparison of DSL types
- Text DSL
  Pros: human-readable definition; non-programmers can easily edit the network
  Cons: users must study the format; the format might have to be extended for new algorithms
- Internal DSL, symbolic
  Pros: static analysis at compile time; optimization before training; easy to parallelize
  Cons: users must study special syntax; may need more effort to implement new algorithms
- Internal DSL, imperative
  Pros: less effort to learn syntax; easy debugging and profiling; suitable for new algorithms with complex logic
  Cons: hard to optimize in advance; less efficient in memory allocation and parallelization
Chainer is at the extreme end of imperative programs, for high flexibility.
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Chainer as an open-source project
l hZps://github.com/pfnet/chainerl 50contributors
l 1,277stars&255fork
l 3,708commits
l AcOvedevelopment&releaseforlast10months v1.0.0(June2015)tov1.7.2(March2016)
15
Original developerSeiya Tokui
Chainer software stack
- Chainer is built on top of NumPy and CUDA
- CuPy is also introduced as an equivalent of NumPy on GPU
[Stack diagram: Chainer runs on NumPy (CPU, on top of BLAS) and on CuPy (NVIDIA GPU, on top of CUDA and cuDNN).]
Graph build scheme (1/2) - Define-and-Run: most frameworks use this scheme (Chainer does not)
- Define: build a computational graph based on the network definition
- Run: update the model (parameters) using the training dataset
[Diagram: in the Define phase, the network definition is turned by auto-differentiation into a computational graph, gradient functions, and parameters; in the Run phase, training data is fed through the graph to compute loss & gradients and update the parameters.]
Graph build scheme (2/2) - Define-by-Run: computational graph construction on the fly
- No graph is constructed before training
- Instead, the graph is built at each forward computation
- The computational graph can be modified dynamically for each iteration/sample or depending on some conditions
[Diagram: the model definition, training data, and dynamic conditions together produce the computational graph, gradient functions, and parameter updates at run time.]
Define-by-Run example: MLP for MNIST
- Only the transformations between units are set up before training
- The connection is given as forward computation

  l1 = Linear(784, n_units)
  l2 = Linear(n_units, 10)

  def forward(x):
      h1 = ReLU(l1(x))
      return l2(h1)

[Figure: x (handwritten digit image) -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y.]
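Because the graph is rebuilt at every forward call, ordinary Python control flow can change its shape. The following hedged variation of the forward function above is purely illustrative; l_extra is a hypothetical additional Linear link, not part of the original slide:

  def forward(x, use_extra_layer=False):
      h1 = ReLU(l1(x))
      if use_extra_layer:            # plain Python branching decides the graph shape
          h1 = ReLU(l_extra(h1))     # this node exists only in graphs where the branch was taken
      return l2(h1)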
Define-by-Run: An interpreted language for neural networks
- Idea
  Forward computation actually goes through the computational graph
  By remembering the history, the actual graph can be obtained
- Advantages
  Flexibility for new algorithms with complex components
    e.g. recurrent, recursive, attention, memory, adversarial, etc.
  Intuitive coding with a highly imperative network definition
    e.g. stochastic networks whose graph changes for each iteration
- Current drawbacks
  The graph is regenerated every time, even for fixed networks
  No optimization, even for static parts of graphs
    JIT-like analysis and subgraph caching might be useful
Basic components (1/2): Variable and Function
- Variable
  Variable wraps arrays (.data)
  It remembers its parent function (.creator)
  It will be assigned a gradient (.grad)
  It keeps track of not only data but also computations
- Function
  Transformation between Variables
  Stateless
  e.g. sigmoid, tanh, ReLU, max pooling, dropout
  (a short usage sketch follows)
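As a rough illustration of how Variable and Function interact, the following sketch (assuming Chainer v1.x with chainer.functions imported as F) wraps a NumPy array, applies functions, and backpropagates; exact attribute behavior may differ slightly between versions:

  import numpy as np
  from chainer import Variable
  import chainer.functions as F

  x = Variable(np.array([[1.0, -2.0, 3.0]], dtype=np.float32))
  y = F.sum(F.relu(x))      # each applied Function is recorded in the graph
  print(y.data)             # 4.0: the wrapped array
  print(y.creator)          # the Function (Sum) that produced y
  y.backward()              # backpropagation through the recorded history
  print(x.grad)             # [[1., 0., 1.]]: gradient assigned to the input Variable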
Basic components (2/2): Link and Chain
- Link = function with state
  Parameters are also Variables, and gradients will be assigned to them
  e.g. Linear (fully-connected), LSTM, Convolution2D, word embedding
- Chain = network
  A Chain has a set of child Links
  Forward computation is defined in .__call__()
  e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq
[Figures: a Link (Linear) computes y = f(W*x + b) with parameters W and b; a Chain (MLP2) composes Linear l1, ReLU, and Linear l2 to map x to y.]
Backpropagation through computational graph
- Consider an objective using Link.Linear: L = f(x * W + b)
- This computes the value of L in forward computation, and simultaneously builds the following computational graph, where x, W, b, and L are Variables and *, +, and f are Functions:
  x, W -> (*) -> (+ b) -> f -> L
- The gradient of L can be computed with respect to any variable by backpropagation
- Then the optimizer updates the values of the parameters (sketched in code below)
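A minimal sketch of this slide in code, assuming Chainer v1.x (chainer.links as L, chainer.functions as F) and using a squared-sum as the outer function f; it is only illustrative:

  import numpy as np
  from chainer import Variable
  import chainer.links as L
  import chainer.functions as F

  lin = L.Linear(3, 2)                          # holds parameters W and b as Variables
  x = Variable(np.random.randn(1, 3).astype(np.float32))
  lin.zerograds()                               # clear/initialize parameter gradients
  h = lin(x)                                    # forward: x * W^T + b, graph is recorded
  loss = F.sum(h * h)                           # f: reduce to a scalar objective L
  loss.backward()                               # backprop through the recorded graph
  print(lin.W.grad.shape, lin.b.grad.shape)     # (2, 3) and (2,): gradients of L w.r.t. W and b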
Code sample (1/4): Multi-layer perceptron

class MLP2(Chain):
    def __init__(self):
        super(MLP2, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 10),
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        y = self.l2(h1)
        return y

class Classifier(Chain):
    def __init__(self, predictor):
        super(Classifier, self).__init__(predictor=predictor)

    def __call__(self, x, t):
        y = self.predictor(x)
        self.accuracy = F.accuracy(y, t)
        self.loss = F.softmax_cross_entropy(y, t)
        return self.loss, self.accuracy

# Model and optimizer setup
model = Classifier(MLP2())
optimizer = optimizers.SGD()
optimizer.setup(model)

# Training loop with minibatches
for i in range(0, datasize, batchsize):
    x = Variable(x_tr[i:i+batchsize])
    t = Variable(y_tr[i:i+batchsize])
    model.zerograds()
    loss, acc = model(x, t)
    loss.backward()
    optimizer.update()
Code sample (2/4): Convolutional neural network

class AlexNet(Chain):
    def __init__(self):
        super(AlexNet, self).__init__(
            conv1=L.Convolution2D(3, 96, 11, stride=4),
            conv2=L.Convolution2D(96, 256, 5, pad=2),
            conv3=L.Convolution2D(256, 384, 3, pad=1),
            conv4=L.Convolution2D(384, 384, 3, pad=1),
            conv5=L.Convolution2D(384, 256, 3, pad=1),
            fc6=L.Linear(9216, 4096),
            fc7=L.Linear(4096, 4096),
            fc8=L.Linear(4096, 1000),
        )

    def __call__(self, x, t):
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv1(x))), 3, stride=2)
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv2(h))), 3, stride=2)
        h = F.relu(self.conv3(h))
        h = F.relu(self.conv4(h))
        h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
        h = F.dropout(F.relu(self.fc6(h)), train=self.train)
        h = F.dropout(F.relu(self.fc7(h)), train=self.train)
        y = self.fc8(h)
        return y

* ImageNet Classification with Deep Convolutional Neural Networks
  http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
Code sample (3/4): Recurrent neural network

class SimpleRNN(Chain):
    def __init__(self, n_vocab, n_units):
        super(SimpleRNN, self).__init__(
            embed=L.EmbedID(n_vocab, n_units),
            x2h=L.Linear(n_units, n_units),
            h2h=L.Linear(n_units, n_units),
            h2y=L.Linear(n_units, n_vocab),
        )
        self.h = None

    def __call__(self, x):
        y, h_new = self.fwd_one_step(x, self.h)
        self.h = h_new
        return y

    def fwd_one_step(self, x, h):
        x = F.tanh(self.embed(x))
        if h is None:
            h = F.tanh(self.x2h(x))
        else:
            h = F.tanh(self.x2h(x) + self.h2h(h))
        y = F.softmax(self.h2y(h))
        return y, h

[Figure: the recurrent state h is unrolled over the input words x_1..x_4 to produce outputs y_1..y_4; BPTT length = 3.]

# Truncated BPTT (length = 3)
for i in range(0, datasize, batchsize):
    ...
    accum_loss += model(x, t)
    if i % bptt_length == 0:
        model.zerograds()
        accum_loss.backward()
        accum_loss.unchain_backward()
        optimizer.update()
Code sample (4/4): Deep Networks with Stochastic Depth
A paper published on arXiv, March 30, 2016
- A variant of Residual Net that skips connections stochastically
  Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
  Stochastic skip: H_l = ReLU(b_l * f_l(H_{l-1}) + H_{l-1}), where b_l is a Bernoulli variable with survival probability p_l
  Taken from http://arxiv.org/abs/1603.09382v2 (G. Huang et al.)

# Mock code in Chainer
class StochasticResNet(Chain):
    def __init__(self, prob, size, ...):
        super(StochasticResNet, self).__init__(
            # define f[i] in the same way as for a Residual Net
        )
        self.size = size
        self.p = prob  # survival probabilities

    def __call__(self, h):
        for i in range(self.size):
            b = numpy.random.binomial(1, self.p[i])
            c = self.f[i](h) + h if b == 1 else h
            h = F.relu(c)
        return h
Miscellaneous
- Other features
  Install with pip in one line: $ pip install chainer
  Multi-GPU support by explicitly selecting the device ID to use
  Pre-trained Caffe model import from Model Zoo
  Model serialization & save & load: HDF5 or NumPy npz
  (a short sketch of GPU selection and serialization follows)
- Future directions (not only for Chainer)
  JIT-like optimization during Define-by-Run
  Memory consumption reduction (GPU memory is still small)
  Handling variable-length inputs without minibatches
  Maximizing performance on multi-node & multi-GPU environments
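The following sketch illustrates the GPU-selection and serialization features listed above, assuming the Chainer v1.x API (chainer.cuda and chainer.serializers); the file name model.npz and the model object are placeholders:

  from chainer import cuda, serializers

  # Select a GPU explicitly by ID and move the model's parameters onto it
  cuda.get_device(0).use()
  model.to_gpu(0)

  # Save and load the model parameters (NumPy npz shown; HDF5 variants also exist)
  serializers.save_npz('model.npz', model)
  serializers.load_npz('model.npz', model)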
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
CuPy: (partially) NumPy-compatible GPU library
- Motivation: NumPy + CUDA = CuPy
  NumPy is the standard library in Python for numerical computation
  CUDA is the standard API for using GPUs for high performance
  Unfortunately, NumPy does NOT work with CUDA
- CuPy supports:
  Fast computation using NVIDIA's cuBLAS and cuDNN
  Array indexing, slicing, transpose, and reshape
  Most of the operations/functions in NumPy
    Chainer v1.7.2 already supports more than 170 functions
  User-defined functions and kernels
  All dtypes, broadcasting, memory pool, etc.
How to use CuPy
- Usage of CuPy: just replace NumPy with CuPy

  import numpy, cupy
  enable_cupy = True
  xp = cupy if enable_cupy else numpy

- Conversion between numpy.ndarray and cupy.ndarray

  w_c = cupy.asarray(numpy.ones(10))   # cupy.ndarray
  w_n = cupy.asnumpy(cupy.ones(10))    # numpy.ndarray

- Ex. CPU/GPU-agnostic logsumexp function (a usage example follows)

  from chainer import cuda

  def logsumexp(x, axis=None):
      xp = cuda.get_array_module(x)    # get CuPy or NumPy depending on the input
      x_max = x.max(axis)
      exp_sum = xp.exp(x - x_max).sum(axis)
      return x_max + xp.log(exp_sum)
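As a quick, illustrative check that the same function runs on both backends (axis=0 is chosen here so the subtraction broadcasts cleanly):

  x_cpu = numpy.random.rand(4, 5).astype(numpy.float32)
  x_gpu = cupy.asarray(x_cpu)                      # copy to the GPU
  print(logsumexp(x_cpu, axis=0))                  # computed with NumPy on the CPU
  print(cupy.asnumpy(logsumexp(x_gpu, axis=0)))    # computed with CuPy on the GPU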
CuPy implementation: optimized for performance & NumPy compatibility
- Cython is used for cupy.core & cupy.cuda
- Dynamic code generation & compilation
  CUDA code is generated for the specific tensor dimensions & data types
  On-the-fly compilation by nvcc, with a binary cache (faster after the first use)
[Stack diagram: the cupy package provides tensor operations & functions on top of cupy.core (ndarray; ufunc, elementwise, and reduction kernels) and cupy.cuda (a CUDA Python wrapper over the CUDA libraries cuBLAS, cuRAND, and cuDNN).]
CuPy performance on linear algebra: 5 to 25 times faster than NumPy

def test(xp):
    a = xp.arange(1000000).reshape(1000, -1)
    return a.T * 2

test(numpy)
t1 = datetime.datetime.now()
for i in range(1000):
    test(numpy)
t2 = datetime.datetime.now()
print(t2 - t1)

test(cupy)
t1 = datetime.datetime.now()
for i in range(1000):
    test(cupy)
t2 = datetime.datetime.now()
print(t2 - t1)

Results (32 GB RAM, GeForce GTX 970):
  NumPy:               2,929 msec (1.0x)
  CuPy:                  585 msec (5.0x)
  CuPy + memory pool:    123 msec (23.8x)
Use CuPy for GPU-based computation
- Three patterns are supported as wrappers:
  ElementwiseKernel: for element-wise computation
  ReductionKernel: for reduce operations along an axis (a sketch follows)
  ufunc: universal functions as in NumPy
- Ex. definition of an element-wise function

  squared_diff = cupy.ElementwiseKernel(
      'float32 x, float32 y',   # input
      'float32 z',              # output
      'z = (x - y) * (x - y)',  # operation
      'squared_diff')           # name

- Usage (automatic broadcasting and type checking are supported)

  squared_diff(cupy.arange(10), 10)
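ReductionKernel is only named above; as a rough illustration, here is the classic L2-norm example in the style of the CuPy documentation (a sketch assuming the generic ReductionKernel interface; argument details may differ across CuPy versions):

  l2norm = cupy.ReductionKernel(
      'T x',          # input params
      'T y',          # output params
      'x * x',        # map: applied to each element
      'a + b',        # reduce: how mapped values are combined
      'y = sqrt(a)',  # post-reduction map
      '0',            # identity value of the reduction
      'l2norm')       # kernel name

  x = cupy.arange(10, dtype=cupy.float32).reshape(2, 5)
  print(l2norm(x, axis=1))   # per-row L2 norms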
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Public benchmark results (CNN): Chainer shows comparable performance
- Forward computation time is almost the same as TensorFlow
- Training with backward computation is slower, but this can be offset by there being no compilation time while debugging/tuning
[Charts: forward and backward computation time (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat, comparing Torch, TensorFlow, Chainer, and Caffe (native).]
Taken from https://github.com/soumith/convnet-benchmarks, using cuDNN except for Caffe
Chainer can benefit from the latest CUDA libraries: Ex. Winograd algorithm in cuDNN v5
- 3x3 convolutions are common in CNNs & are now computed with the Winograd algorithm
- State-of-the-art CNN models (e.g., GoogLeNet, VGG-A) can be accelerated by up to 2.0x at test time (forward only)
[Charts: forward and backward computation time (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat with cuDNN v4 vs. cuDNN v5.]
Independently measured with a modified version of soumith/convnet-benchmarks; cuDNN v5 can be used from Chainer v1.8.0
Algorithm implementation in Chainer: A Neural Algorithm of Artistic Style (Gatys et al., 2015)
- https://github.com/mattya/chainer-gogh
- Content image (cat) + style image = new artistic image
- Main code: 45 lines
Chainer in industry: used in demonstrations & being commercialized
- Many collaborations are on-going with Chainer-based computer vision, deep reinforcement learning, etc.
- Ex. 1: Chainer-controlled toy cars in the Toyota booth at CES 2016 (http://tinyurl.com/pfn-ces16)
- Ex. 2: Highly accurate FANUC bin-picking robot at IREX 2015: 8 hours of training to reach expert level, commercialization by the end of 2016 (http://tinyurl.com/pfn-irex15)
Summary
- Chainer is a Python-based deep learning framework with a dynamic network construction scheme and CuPy
- It is designed for efficient research and prototyping while keeping comparable performance thanks to NVIDIA GPUs
- Official web: http://chainer.org/
- GitHub: https://github.com/pfnet/chainer
- Your contributions will be appreciated & we are hiring!