A Powerful, Flexible, and Intuitive Deep Learning Framework
@ NVIDIA GTC, April 6th, 2016
Shohei Hido
Chief Research Officer
Preferred Networks, Inc.
Overview
- Chainer is a Python-based deep learning framework
- Chainer v1.0 was released as open source in June 2015
- It DOESN'T rely on Theano, unlike other Python frameworks
- Chainer uses a unique scheme named Define-by-Run
- http://chainer.org/
- Why do users still need another framework?
- How different and effective is Chainer?
Preferred Networks (PFN): A startup that applies deep learning to industrial IoT
- Founded: March 2014
- Headquarters: Tokyo, Japan
- U.S. subsidiary: San Mateo, California
- Company size: 35 engineers & researchers
- Investors: Toyota, FANUC, NTT
- Focus: deep learning for industrial IoT in manufacturing, automotive, and healthcare
Partnering with world-leading companies using Chainer
- R&D collaborations on industrial problems with real-world data: specific requirements, modified algorithms, many trials and errors, etc.
- Different from building a general-purpose recognition system
- Partners include Toyota, FANUC, Panasonic, NTT, Cisco, and NVIDIA
Two types of background behind DL frameworks
1. Scalability-oriented
- Use cases in mind: image/speech recognition systems, fast DL as a service in the cloud
- Problem type: a few general applications, 10+ million training samples, 10+ node clusters with fast networks
- Possible bottlenecks: tuning of well-known algorithms, distributed computation for model/data-parallel training
2. Flexibility-oriented
- Use cases in mind: algorithm research, R&D projects for new products
- Problem type: various specific applications, 10+k training samples, one node with multiple GPUs
- Possible bottlenecks: trial-and-error in prototyping; debugging, profiling & refactoring (wait time during compilation)
Designed for efficient research & development
- Flexible: new kinds of complex models for various applications
- Intuitive: rapid prototyping and efficient trial-and-error
- Powerful: comparable performance for 1 node & multiple GPUs
[Figure: frameworks positioned on a spectrum from scalability-oriented to flexibility-oriented.]
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Neural network and computation
[Figure: a feed-forward network with inputs x1...xN, hidden units h1...hH and k1...kM, and outputs y1...yM. Forward computation maps inputs (text, image, sensor) to outputs (object: tulip, anomaly score: 0.35, category: sports); backward computation (backpropagation) propagates gradients back through the network.]
Chainer focuses on network representation/training
- Design choices for deep learning frameworks:
  How to build neural networks?
  How to train neural networks?
  Which text format/language for modeling?
  Which language for computing?
  Run with a GPU?
  Run on multiple GPUs?
  Run on multiple compute nodes?
Building and training neural networks: Computational graph construction is the key
1. Construct a computational graph
   - Based on the network definition given by users
   - Chains of functions and operations on input variables
2. Compute loss and gradients
   - Forward computation calculates the loss for a minibatch
   - Backpropagation gives gradients for all parameters
3. Optimize the model
   - Update each parameter with its gradient
   - Repeat until convergence
Step 1 is the most important, and there are many approaches (a minimal sketch of steps 2 and 3 follows).
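Concretely, steps 2 and 3 reduce to a few imperative lines in Chainer. This is only a minimal sketch using the names that appear in the code samples later in this talk (model, optimizer, x, t), not a complete program:

  loss = model(x, t)     # step 2a: the forward pass builds the graph and computes the loss
  model.zerograds()      # clear gradients accumulated in the previous iteration
  loss.backward()        # step 2b: backpropagation fills .grad of every parameter
  optimizer.update()     # step 3: update each parameter using its gradient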
Building blocks
- These functionalities are very similar between frameworks
- But the structure, abstraction level, and interface are different
- It comes down to the design of a domain-specific language for NN
  Array data structure (vector/matrix/tensor)
  Operations & functions
  Network (computational graph)
  Optimizer (SGD/AdaGrad/Adam)
  (a loose mapping of these blocks to Chainer objects follows)
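As a rough orientation (an illustrative mapping, not part of the original slide), these four building blocks correspond to the following NumPy/Chainer objects:

  import numpy as np
  import chainer.functions as F
  import chainer.links as L
  from chainer import Chain, Variable, optimizers

  a = np.zeros((32, 784), dtype=np.float32)  # array data structure (NumPy/CuPy ndarray)
  net = Chain(l1=L.Linear(784, 10))          # network: a Chain of Links
  y = F.relu(net.l1(Variable(a)))            # operations & functions (chainer.functions)
  opt = optimizers.SGD()                     # optimizer (SGD/AdaGrad/Adam)
  opt.setup(net)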
Types of domain-specific language for neural networks
- Text DSL: Ex. Caffe (prototxt), Ex. CNTK (NDL)

  %% Definition in text (f.txt)
  f: {"A": "Variable", "B": "Variable", "C": ["B", "*", "A"], "ret": ["C", "+", 1]}
  # Compile
  f = compile("f.txt")
  d = f(A=np.ones(10), B=np.ones(10)*2)

- Symbolic program: operations on symbols. Ex. Theano, Ex. TensorFlow

  # Symbolic definition
  A = Variable('A')
  B = Variable('B')
  C = B * A
  D = C + Constant(1)
  # Compile
  f = compile(D)
  d = f(A=np.ones(10), B=np.ones(10)*2)

- Imperative program: direct computations on raw data arrays. Ex. Torch.nn, Ex. Chainer

  # Imperative declaration
  a = np.ones(10)
  b = np.ones(10)*2
  c = b * a
  d = c + 1

- Ex. MXNet (supports both symbolic and imperative styles)
Comparison of DSL types
- Text DSL
  Pros: human-readable definition; non-programmers can easily edit the network
  Cons: users must study the format; the format might have to be extended for new algorithms
- Internal DSL, symbolic
  Pros: static analysis at compile time; optimization before training; easy to parallelize
  Cons: users must study special syntax; may need more effort to implement new algorithms
- Internal DSL, imperative
  Pros: less effort to learn syntax; easy debugging and profiling; suitable for new algorithms with complex logic
  Cons: hard to optimize in advance; less efficient in memory allocation and parallelization
Chainer is at the extreme end of imperative programs, for high flexibility.
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Chainer as an open-source project
l hZps://github.com/pfnet/chainerl 50contributors
l 1,277stars&255fork
l 3,708commits
l AcOvedevelopment&releaseforlast10months v1.0.0(June2015)tov1.7.2(March2016)
15
Original developerSeiya Tokui
Chainer software stack
- Chainer is built on top of NumPy and CUDA
- CuPy is also introduced as an equivalent of NumPy on GPU
[Stack diagram: Chainer runs on NumPy (CPU, on top of BLAS) and on CuPy (NVIDIA GPU, on top of CUDA and cuDNN).]
Graph build scheme (1/2) - Define-and-Run: most frameworks use this scheme (Chainer does not)
- Define: build a computational graph based on the network definition
- Run: update the model (parameters) using the training dataset
[Diagram: in the Define phase, the network definition is turned by auto-differentiation into a computational graph, gradient functions, and parameters; in the Run phase, training data is fed through the graph to compute loss & gradients and update the parameters.]
Graph build scheme (2/2) - Define-by-Run: computational graph construction on the fly
- No graph is constructed before training
- Instead, the graph is built at each forward computation
- The computational graph can be modified dynamically for each iteration/sample or depending on some conditions
[Diagram: the model definition, training data, and dynamic conditions together produce the computational graph, gradient functions, and parameter updates at run time.]
Define-by-Run example: MLP for MNIST
- Only the transformations between units are set up before training
- The connection is given as forward computation

  l1 = Linear(784, n_units)
  l2 = Linear(n_units, 10)

  def forward(x):
      h1 = ReLU(l1(x))
      return l2(h1)

[Figure: x (handwritten digit image) -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y.]
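Because the graph is rebuilt at every forward call, ordinary Python control flow can change its shape. The following hedged variation of the forward function above is purely illustrative; l_extra is a hypothetical additional Linear link, not part of the original slide:

  def forward(x, use_extra_layer=False):
      h1 = ReLU(l1(x))
      if use_extra_layer:            # plain Python branching decides the graph shape
          h1 = ReLU(l_extra(h1))     # this node exists only in graphs where the branch was taken
      return l2(h1)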
Define-by-Run: An interpreted language for neural networks
- Idea
  Forward computation actually goes through the computational graph
  By remembering the history, the actual graph can be obtained
- Advantages
  Flexibility for new algorithms with complex components
    e.g. recurrent, recursive, attention, memory, adversarial, etc.
  Intuitive coding with a highly imperative network definition
    e.g. stochastic networks whose graph changes for each iteration
- Current drawbacks
  The graph is regenerated every time, even for fixed networks
  No optimization, even for static parts of graphs
    JIT-like analysis and subgraph caching might be useful
Basic components (1/2): Variable and Function
- Variable
  Variable wraps arrays (.data)
  It remembers its parent function (.creator)
  It will be assigned a gradient (.grad)
  It keeps track of not only data but also computations
- Function
  Transformation between Variables
  Stateless
  e.g. sigmoid, tanh, ReLU, max pooling, dropout
  (a short usage sketch follows)
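As a rough illustration of how Variable and Function interact, the following sketch (assuming Chainer v1.x with chainer.functions imported as F) wraps a NumPy array, applies functions, and backpropagates; exact attribute behavior may differ slightly between versions:

  import numpy as np
  from chainer import Variable
  import chainer.functions as F

  x = Variable(np.array([[1.0, -2.0, 3.0]], dtype=np.float32))
  y = F.sum(F.relu(x))      # each applied Function is recorded in the graph
  print(y.data)             # 4.0: the wrapped array
  print(y.creator)          # the Function (Sum) that produced y
  y.backward()              # backpropagation through the recorded history
  print(x.grad)             # [[1., 0., 1.]]: gradient assigned to the input Variable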
Basic components (2/2): Link and Chain
- Link = function with state
  Parameters are also Variables, and gradients will be assigned to them
  e.g. Linear (fully-connected), LSTM, Convolution2D, word embedding
- Chain = network
  A Chain has a set of child Links
  Forward computation is defined in .__call__()
  e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq
[Figures: a Link (Linear) computes y = f(W*x + b) with parameters W and b; a Chain (MLP2) composes Linear l1, ReLU, and Linear l2 to map x to y.]
Backpropagation through computational graph
- Consider an objective using Link.Linear: L = f(x * W + b)
- This computes the value of L in forward computation, and simultaneously builds the following computational graph, where x, W, b, and L are Variables and *, +, and f are Functions:
  x, W -> (*) -> (+ b) -> f -> L
- The gradient of L can be computed with respect to any variable by backpropagation
- Then the optimizer updates the values of the parameters (sketched in code below)
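A minimal sketch of this slide in code, assuming Chainer v1.x (chainer.links as L, chainer.functions as F) and using a squared-sum as the outer function f; it is only illustrative:

  import numpy as np
  from chainer import Variable
  import chainer.links as L
  import chainer.functions as F

  lin = L.Linear(3, 2)                          # holds parameters W and b as Variables
  x = Variable(np.random.randn(1, 3).astype(np.float32))
  lin.zerograds()                               # clear/initialize parameter gradients
  h = lin(x)                                    # forward: x * W^T + b, graph is recorded
  loss = F.sum(h * h)                           # f: reduce to a scalar objective L
  loss.backward()                               # backprop through the recorded graph
  print(lin.W.grad.shape, lin.b.grad.shape)     # (2, 3) and (2,): gradients of L w.r.t. W and b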
Code sample (1/4): Multi-layer perceptron

class MLP2(Chain):
    def __init__(self):
        super(MLP2, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 10),
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        y = self.l2(h1)
        return y

class Classifier(Chain):
    def __init__(self, predictor):
        super(Classifier, self).__init__(predictor=predictor)

    def __call__(self, x, t):
        y = self.predictor(x)
        self.accuracy = F.accuracy(y, t)
        self.loss = F.softmax_cross_entropy(y, t)
        return self.loss, self.accuracy

# Model and optimizer setup
model = Classifier(MLP2())
optimizer = optimizers.SGD()
optimizer.setup(model)

# Training loop with minibatches
for i in range(0, datasize, batchsize):
    x = Variable(x_tr[i:i+batchsize])
    t = Variable(y_tr[i:i+batchsize])
    model.zerograds()
    loss, acc = model(x, t)
    loss.backward()
    optimizer.update()
Code sample (2/4): Convolutional neural network

class AlexNet(Chain):
    def __init__(self):
        super(AlexNet, self).__init__(
            conv1=L.Convolution2D(3, 96, 11, stride=4),
            conv2=L.Convolution2D(96, 256, 5, pad=2),
            conv3=L.Convolution2D(256, 384, 3, pad=1),
            conv4=L.Convolution2D(384, 384, 3, pad=1),
            conv5=L.Convolution2D(384, 256, 3, pad=1),
            fc6=L.Linear(9216, 4096),
            fc7=L.Linear(4096, 4096),
            fc8=L.Linear(4096, 1000),
        )

    def __call__(self, x, t):
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv1(x))), 3, stride=2)
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv2(h))), 3, stride=2)
        h = F.relu(self.conv3(h))
        h = F.relu(self.conv4(h))
        h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
        h = F.dropout(F.relu(self.fc6(h)), train=self.train)
        h = F.dropout(F.relu(self.fc7(h)), train=self.train)
        y = self.fc8(h)
        return y

* ImageNet Classification with Deep Convolutional Neural Networks
  http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
Code sample (3/4): Recurrent neural network

class SimpleRNN(Chain):
    def __init__(self, n_vocab, n_units):
        super(SimpleRNN, self).__init__(
            embed=L.EmbedID(n_vocab, n_units),
            x2h=L.Linear(n_units, n_units),
            h2h=L.Linear(n_units, n_units),
            h2y=L.Linear(n_units, n_vocab),
        )
        self.h = None

    def __call__(self, x):
        y, h_new = self.fwd_one_step(x, self.h)
        self.h = h_new
        return y

    def fwd_one_step(self, x, h):
        x = F.tanh(self.embed(x))
        if h is None:
            h = F.tanh(self.x2h(x))
        else:
            h = F.tanh(self.x2h(x) + self.h2h(h))
        y = F.softmax(self.h2y(h))
        return y, h

[Figure: the recurrent state h is unrolled over the input words x_1..x_4 to produce outputs y_1..y_4; BPTT length = 3.]

# Truncated BPTT (length = 3)
for i in range(0, datasize, batchsize):
    ...
    accum_loss += model(x, t)
    if i % bptt_length == 0:
        model.zerograds()
        accum_loss.backward()
        accum_loss.unchain_backward()
        optimizer.update()
Code sample (4/4): Deep Networks with Stochastic Depth
A paper published on arXiv, March 30, 2016
- A variant of Residual Net that skips connections stochastically
  Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
  Stochastic skip: H_l = ReLU(b_l * f_l(H_{l-1}) + H_{l-1}), where b_l is a Bernoulli variable with survival probability p_l
  Taken from http://arxiv.org/abs/1603.09382v2 (G. Huang et al.)

# Mock code in Chainer
class StochasticResNet(Chain):
    def __init__(self, prob, size, ...):
        super(StochasticResNet, self).__init__(
            # define f[i] in the same way as for a Residual Net
        )
        self.size = size
        self.p = prob  # survival probabilities

    def __call__(self, h):
        for i in range(self.size):
            b = numpy.random.binomial(1, self.p[i])
            c = self.f[i](h) + h if b == 1 else h
            h = F.relu(c)
        return h
Miscellaneous
- Other features
  Install with pip in one line: $ pip install chainer
  Multi-GPU support by explicitly selecting the device ID to use
  Pre-trained Caffe model import from Model Zoo
  Model serialization & save & load: HDF5 or NumPy npz
  (a short sketch of GPU selection and serialization follows)
- Future directions (not only for Chainer)
  JIT-like optimization during Define-by-Run
  Memory consumption reduction (GPU memory is still small)
  Handling variable-length inputs without minibatches
  Maximizing performance on multi-node & multi-GPU environments
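The following sketch illustrates the GPU-selection and serialization features listed above, assuming the Chainer v1.x API (chainer.cuda and chainer.serializers); the file name model.npz and the model object are placeholders:

  from chainer import cuda, serializers

  # Select a GPU explicitly by ID and move the model's parameters onto it
  cuda.get_device(0).use()
  model.to_gpu(0)

  # Save and load the model parameters (NumPy npz shown; HDF5 variants also exist)
  serializers.save_npz('model.npz', model)
  serializers.load_npz('model.npz', model)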
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
CuPy: (partially) NumPy-compatible GPU library
- Motivation: NumPy + CUDA = CuPy
  NumPy is the standard library in Python for numerical computation
  CUDA is the standard API for using GPUs for high performance
  Unfortunately, NumPy does NOT work with CUDA
- CuPy supports:
  Fast computation using NVIDIA's cuBLAS and cuDNN
  Array indexing, slicing, transpose, and reshape
  Most of the operations/functions in NumPy
    Chainer v1.7.2 already supports more than 170 functions
  User-defined functions and kernels
  All dtypes, broadcasting, memory pool, etc.
How to use CuPy
- Usage of CuPy: just replace NumPy with CuPy

  import numpy, cupy
  enable_cupy = True
  xp = cupy if enable_cupy else numpy

- Conversion between numpy.ndarray and cupy.ndarray

  w_c = cupy.asarray(numpy.ones(10))   # cupy.ndarray
  w_n = cupy.asnumpy(cupy.ones(10))    # numpy.ndarray

- Ex. CPU/GPU-agnostic logsumexp function (a usage example follows)

  from chainer import cuda

  def logsumexp(x, axis=None):
      xp = cuda.get_array_module(x)    # get CuPy or NumPy depending on the input
      x_max = x.max(axis)
      exp_sum = xp.exp(x - x_max).sum(axis)
      return x_max + xp.log(exp_sum)
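As a quick, illustrative check that the same function runs on both backends (axis=0 is chosen here so the subtraction broadcasts cleanly):

  x_cpu = numpy.random.rand(4, 5).astype(numpy.float32)
  x_gpu = cupy.asarray(x_cpu)                      # copy to the GPU
  print(logsumexp(x_cpu, axis=0))                  # computed with NumPy on the CPU
  print(cupy.asnumpy(logsumexp(x_gpu, axis=0)))    # computed with CuPy on the GPU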
CuPy implementation: optimized for performance & NumPy compatibility
- Cython is used for cupy.core & cupy.cuda
- Dynamic code generation & compilation
  CUDA code is generated for the specific tensor dimensions & data types
  On-the-fly compilation by nvcc, with a binary cache (faster after the first use)
[Stack diagram: the cupy package provides tensor operations & functions on top of cupy.core (ndarray; ufunc, elementwise, and reduction kernels) and cupy.cuda (a CUDA Python wrapper over the CUDA libraries cuBLAS, cuRAND, and cuDNN).]
CuPy performance on linear algebra: 5 to 25 times faster than NumPy

def test(xp):
    a = xp.arange(1000000).reshape(1000, -1)
    return a.T * 2

test(numpy)
t1 = datetime.datetime.now()
for i in range(1000):
    test(numpy)
t2 = datetime.datetime.now()
print(t2 - t1)

test(cupy)
t1 = datetime.datetime.now()
for i in range(1000):
    test(cupy)
t2 = datetime.datetime.now()
print(t2 - t1)

Results (32 GB RAM, GeForce GTX 970):
  NumPy:               2,929 msec (1.0x)
  CuPy:                  585 msec (5.0x)
  CuPy + memory pool:    123 msec (23.8x)
Use CuPy for GPU-based computation
- Three patterns are supported as wrappers:
  ElementwiseKernel: for element-wise computation
  ReductionKernel: for reduce operations along an axis (a sketch follows)
  ufunc: universal functions as in NumPy
- Ex. definition of an element-wise function

  squared_diff = cupy.ElementwiseKernel(
      'float32 x, float32 y',   # input
      'float32 z',              # output
      'z = (x - y) * (x - y)',  # operation
      'squared_diff')           # name

- Usage (automatic broadcasting and type checking are supported)

  squared_diff(cupy.arange(10), 10)
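ReductionKernel is only named above; as a rough illustration, here is the classic L2-norm example in the style of the CuPy documentation (a sketch assuming the generic ReductionKernel interface; argument details may differ across CuPy versions):

  l2norm = cupy.ReductionKernel(
      'T x',          # input params
      'T y',          # output params
      'x * x',        # map: applied to each element
      'a + b',        # reduce: how mapped values are combined
      'y = sqrt(a)',  # post-reduction map
      '0',            # identity value of the reduction
      'l2norm')       # kernel name

  x = cupy.arange(10, dtype=cupy.float32).reshape(2, 5)
  print(l2norm(x, axis=1))   # per-row L2 norms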
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Public benchmark results (CNN): Chainer shows comparable performance
- Forward computation time is almost the same as TensorFlow
- Training with backward computation is slower, but this can be offset by there being no compilation time while debugging/tuning
[Charts: forward and backward computation time (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat, comparing Torch, TensorFlow, Chainer, and Caffe (native).]
Taken from https://github.com/soumith/convnet-benchmarks, using cuDNN except for Caffe
Chainer can benefit from the latest CUDA libraries: Ex. Winograd algorithm in cuDNN v5
- 3x3 convolutions are common in CNNs & are now computed with the Winograd algorithm
- State-of-the-art CNN models (e.g., GoogLeNet, VGG-A) can be accelerated by up to 2.0x at test time (forward only)
[Charts: forward and backward computation time (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat with cuDNN v4 vs. cuDNN v5.]
Independently measured with a modified version of soumith/convnet-benchmarks; cuDNN v5 can be used from Chainer v1.8.0
Algorithm implementation in Chainer: A Neural Algorithm of Artistic Style (Gatys et al., 2015)
- https://github.com/mattya/chainer-gogh
- Content image (cat) + style image = new artistic image
- Main code: 45 lines
Chainer in industry: used in demonstrations & being commercialized
- Many collaborations are on-going with Chainer-based computer vision, deep reinforcement learning, etc.
- Ex. 1: Chainer-controlled toy cars in the Toyota booth at CES 2016 (http://tinyurl.com/pfn-ces16)
- Ex. 2: Highly accurate FANUC bin-picking robot at IREX 2015: 8 hours of training to reach expert level, commercialization by the end of 2016 (http://tinyurl.com/pfn-irex15)
Summary
- Chainer is a Python-based deep learning framework with a dynamic network construction scheme and CuPy
- It is designed for efficient research and prototyping while keeping comparable performance thanks to NVIDIA GPUs
- Official web: http://chainer.org/
- GitHub: https://github.com/pfnet/chainer
- Your contributions will be appreciated & we are hiring!