Dnn Tutorial

Post on 18-Jul-2016


Deep Learning: Past, Present and Future (?)

Kyunghyun Cho

Laboratoire d’Informatique des Systèmes Adaptatifs, Département d’informatique et de recherche opérationnelle,

Faculté des arts et des sciences, Université de Montréal

cho.k.hyun@gmail.com (chokyun@iro.umontreal.ca)

Machine Learning

Learning Inference

?

1. Let the model M learn the data D
2. Let the model M infer unknown quantities

Machine Learning & Perception

Learning Inference

?

Perception is the organization, identification, and interpretation of sensory information in order to represent and understand the environment.

–Wikipedia

(Farabet et al., 2013)

Machine Learning & Perception: Examples

Learning Inference

?

Data                 Sensory Information      Query
Labeled Images       An image                 Is a cat in the image?
Transcribed Speech   A speech segment         What is this person saying?
Paraphrases          A pair of sentences      Is this sentence a paraphrase?
Movie Ratings        Ratings of Y and by X    Will a user X like a movie Y?
Parallel Corpora     A Finnish sentence       What is “moi” in English?

A human possesses the best machinery for perception, called a brain.

But, how does our brain do it?

Deep Learning: Motivated from Human Learning

(Van Essen&Gallant, 1994)

Learn massive data simple functions Multi-layered

(Krizhevsky et al., 2012)

Boltzmann machines? I remember working on them in 80s and 90s..

– Anonymous Interviewer, 2011 (paraphrased)

Deep Learning: History

1958 Rosenblatt proposed perceptrons

1980 Neocognitron (Fukushima, 1980)

1982 Hopfield network, SOM (Kohonen, 1982), Neural PCA (Oja, 1982)

1985 Boltzmann machines (Ackley et al., 1985)

1986 Multilayer perceptrons and backpropagation (Rumelhart et al., 1986)

1988 RBF networks (Broomhead&Lowe, 1988)

1989 Autoencoders (Baldi&Hornik, 1989), Convolutional network (LeCun, 1989)

1992 Sigmoid belief network (Neal, 1992)

1993 Sparse coding (Field, 1993)

Why all this fuss about deep learning now?

ImageNet: ILSVRC 2012 – Classification Task

Top Rankers

1. SuperVision (0.153): Deep Conv. Neural Network (Krizhevsky et al.)

2. ISI (0.262): Features + FV + Linear classifier (Gunji et al.)

3. OXFORD_VGG (0.270): Features + FV + SVM (Simonyan et al.)

4. XRCE/INRIA (0.271): SIFT + FV + PQ + SVM (Perronnin et al.)

5. University of Amsterdam (0.300): Color desc. + SVM (van de Sande et al.)

(Krizhevsky et al., 2012)

ImageNet: ILSVRC 2013 – Classification Task

Top Rankers

1. Clarifai (0.117): Deep Convolutional Neural Networks (Zeiler)
2. NUS: Deep Convolutional Neural Networks
3. ZF: Deep Convolutional Neural Networks
4. Andrew Howard: Deep Convolutional Neural Networks
5. OverFeat: Deep Convolutional Neural Networks
6. UvA-Euvision: Deep Convolutional Neural Networks
7. Adobe: Deep Convolutional Neural Networks
8. VGG: Deep Convolutional Neural Networks
9. CognitiveVision: Deep Convolutional Neural Networks
10. decaf: Deep Convolutional Neural Networks
11. IBM Multimedia Team: Deep Convolutional Neural Networks
12. Deep Punx (0.209): Deep Convolutional Neural Networks
13. MIL (0.244): Local image descriptors + FV + linear classifier (Hidaka et al.)
14. Minerva-MSRA: Deep Convolutional Neural Networks
15. Orange: Deep Convolutional Neural Networks
16. BUPT-Orange: Deep Convolutional Neural Networks
17. Trimps-Soushen1: Deep Convolutional Neural Networks
18. QuantumLeap: 15 features + RVM (Shu&Shu)

(Sermanet et al., 2013)

You’re already using deep learning!

How do you tell deep learning from not-so-deep machine learning?

Not-So-Deep Machine Learning

1. Feature engineering ← not learned!
2. Learning
3. Inference

[Figure: inputs x1, x2, ..., xN are turned into features (1), which feed learning (2) and inference (3)]

Separation between domain knowledge and general machine learning

Unsupervised Learning of Representation

1a. Feature engineering
1b. Feature/Representation learning
2. Learning
3. Inference

[Figure: inputs x1, x2, ..., xN pass through feature engineering (1a) and representation learning (1b), followed by learning (2) and inference (3)]

Deep Learning: toward the Ultimate Machine Learning?

1. Jointly learn everything
2. Inference

[Figure: a single jointly learned model (1) followed by inference (2)]

The data decides –Yoshua Bengio

Why now? Why not 20 years ago?

What has happened in the last 20+ years?

- We have connected the dots, e.g.,
  - PCA ⇔ Neural PCA ⇔ Probabilistic PCA ⇔ Autoencoder
  - Autoencoder ⇔ Belief network ⇔ Restricted Boltzmann machine
- We understand learning better
  - Model structure matters a lot
  - Learning is, but is not, optimization
  - No need to be scared of non-convex optimization
- We understand how learning and inference interact
- Exponential growth of the amount of data and computational power

And Beyond. . .

(Goodfellow, 2013)

Today’s Tutorial: Introduction to Deep Learning

1. Deep Learning: Past, Present and Future
2. Supervised Neural Networks
   - Multilayer Perceptron and Learning
   - Regularization
   - Practical Recipe
   - Task-specific Neural Network
3. Unsupervised Neural Networks
   - Unsupervised Learning
   - Generative Modeling with Neural Networks
     - Density/Distribution Learning
     - Learning to Infer
   - Semi-Supervised Learning and Pretraining
4. Advanced Topics
   - Beyond Computer Vision
   - Advanced Topics

Supervised Neural Networks

Kyunghyun Cho


cho.k.hyun@gmail.com, kyunghyun.cho@umontreal.ca

Warning!!!

The next 9–10 slides may be extremely boring.

Supervised Learning: Rough Picture

Data:

\[ D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \]

Assumption:

\[ y_n = f^*(x_n) \]

Find a function $f$, using $D$, that emulates $f^*$ as well as possible on samples potentially not included in $D$.

Supervised Learning: Probabilistic Picture

Underlying distributions:

X and Y | X

Data:

\[ D = \{(x, y) \mid x \sim X,\ y \sim Y \mid X = x\} \]

Find a distribution $p(y \mid x)$, using $D$, that emulates $Y \mid X$ as well as possible on new samples from $p(x)$ potentially not included in $D$.

Supervised Learning: Evaluation and Generalization

Evaluation:

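\[ \mathbb{E}_{x \sim p(x)}\left[\|f(x) - f^*(x)\|_p\right] \approx \sum_{x \in D_{\text{test}}} \|f(x) - f^*(x)\|_p \]

or

\[ \mathbb{E}_{x \sim p(x)}\left[\mathrm{KL}\left(p(y \mid x) \,\|\, p^*(y \mid x)\right)\right] \approx \sum_{x \in D_{\text{test}}} \mathrm{KL}(p \,\|\, p^*), \]

where $D \neq D_{\text{test}}$.

Why do we test the found solution $f$ or $p$ on $D_{\text{test}}$, not on $D$?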

Linear/Ridge Regression

Underlying assumption:

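\[ y = W^* x + b^* + \epsilon, \]

where $\epsilon$ is white Gaussian noise.

Data:

\[ D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \]

where $y_n = W^* x_n + b^* + \epsilon$.

Learning:

\[ W, b = \arg\min_{W, b} \frac{1}{N} \sum_{n=1}^{N} \|W x_n + b - y_n\|_2^2 + \lambda\left(\|W\|_F^2 + \|b\|_2^2\right) \]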
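The ridge objective above has a closed-form minimizer. The sketch below is only an illustration for the scalar case (one input, one output), obtained by setting the two partial derivatives of the objective to zero. The function name `ridge_fit_1d` and the toy data are assumptions, not part of the slides.

```python
def ridge_fit_1d(xs, ys, lam):
    """Minimize (1/N) sum (w*x + b - y)^2 + lam * (w^2 + b^2) for scalar w, b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xx = sum(x * x for x in xs) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    # Stationarity conditions (a 2x2 linear system in w and b):
    #   w * (mean_xx + lam) + b * mean_x    = mean_xy
    #   w * mean_x          + b * (1 + lam) = mean_y
    a11, a12 = mean_xx + lam, mean_x
    a21, a22 = mean_x, 1.0 + lam
    det = a11 * a22 - a12 * a21
    w = (mean_xy * a22 - a12 * mean_y) / det
    b = (a11 * mean_y - a21 * mean_xy) / det
    return w, b


# Toy data generated by y = 2x + 1 without noise (hypothetical example).
xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
w_plain, b_plain = ridge_fit_1d(xs, ys, lam=0.0)  # ordinary least squares
w_ridge, b_ridge = ridge_fit_1d(xs, ys, lam=1.0)  # shrunk towards zero
```

With λ = 0 the fit recovers the generating line, and a positive λ pulls both parameters towards zero.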

Multilayer Perceptron: (Binary) Classification

Underlying assumption:

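\[ y \sim \mathcal{B}\left(p = f_{\theta^*}(x)\right), \]

where $f_{\theta^*}$ is a nonlinear function parameterized by $\theta^*$ and $\mathcal{B}(p)$ is a Bernoulli distribution with mean $p$.

Data:

\[ D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \]

where $y_n \sim \mathcal{B}\left(p = f_{\theta^*}(x_n)\right)$.

Learning:

\[ \theta = \arg\min_{\theta} -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log f_\theta(x_n) + (1 - y_n) \log\left(1 - f_\theta(x_n)\right) \right] + \lambda\Omega(\theta, D) \]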
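The quantity being minimized for a Bernoulli output is the average negative log-likelihood, often called the cross-entropy. A minimal sketch follows, without the regularizer. It assumes the model outputs are already probabilities strictly between 0 and 1, and the function name is hypothetical.

```python
import math


def bernoulli_nll(probs, labels):
    """Average negative log-likelihood of binary labels under predicted probabilities.

    probs[n] plays the role of f_theta(x_n); labels[n] is y_n in {0, 1}.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


uninformed = bernoulli_nll([0.5, 0.5], [1, 0])  # equals log(2)
confident = bernoulli_nll([0.9, 0.1], [1, 0])   # smaller: predictions match labels
```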

Learning as an Optimization

Ultimately, learning is (mostly)

\[ \theta = \arg\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} c\left((x_n, y_n) \mid \theta\right) + \lambda\Omega(\theta, D), \]

where $c((x, y) \mid \theta)$ is a per-sample cost function.

Gradient Descent

Gradient-descent Algorithm:

\[ \theta_t = \theta_{t-1} - \eta \nabla L(\theta_{t-1}) \]

where, in our case,

\[ L(\theta) = \frac{1}{N} \sum_{n=1}^{N} l\left((x_n, y_n) \mid \theta\right). \]

Let us assume that $\Omega(\theta, D) = 0$.
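As a rough illustration of the update rule, the sketch below runs gradient descent on a one-parameter least-squares problem with per-sample cost l = (w·x − y)². The toy data, the step size and the function names are assumptions chosen for the example.

```python
def grad_L(w, data):
    """Gradient of L(w) = (1/N) sum (w*x - y)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def gradient_descent(w0, data, eta, steps):
    """Repeat w_t = w_{t-1} - eta * grad L(w_{t-1}) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad_L(w, data)
    return w


# Hypothetical data generated by y = 2x, so the minimizer is w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_hat = gradient_descent(w0=0.0, data=data, eta=0.1, steps=50)
```

Every step touches the whole training set, which is the cost that stochastic gradient descent avoids.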

Stochastic Gradient Descent

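Often, it is too costly to compute $\nabla L(\theta)$ due to a large training set.

Stochastic gradient descent algorithm:

\[ \theta_t = \theta_{t-1} - \eta_t \nabla l\left((x', y') \mid \theta_{t-1}\right), \]

where $(x', y')$ is a randomly chosen sample from $D$, and

\[ \sum_{t=1}^{\infty} \eta_t \to \infty \quad \text{and} \quad \sum_{t=1}^{\infty} (\eta_t)^2 < \infty. \]

Let us assume that $\Omega(\theta, D) = 0$.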
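A minimal sketch of stochastic gradient descent on a one-parameter least-squares problem. It assumes the schedule η_t = η_0 / t, one common choice for which the sum of step sizes diverges while the sum of their squares converges. The data, η_0 and the function name are illustrative assumptions.

```python
import random


def sgd(w0, data, eta0, steps, seed=0):
    """SGD with per-sample cost l = (w*x - y)^2 and step size eta_t = eta0 / t."""
    rng = random.Random(seed)
    w = w0
    for t in range(1, steps + 1):
        x, y = rng.choice(data)              # one randomly chosen sample (x', y')
        eta_t = eta0 / t                     # decaying step size
        w = w - eta_t * 2.0 * (w * x - y) * x
    return w


# Hypothetical data generated by y = 2x; each update uses a single sample.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_hat = sgd(w0=0.0, data=data, eta0=0.1, steps=5000)
```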

Any question so far?

Almost there. . .

How do we compute the gradient efficiently for deep neural networks?

Backpropagation Algorithm – (1) Forward Pass

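[Figure: computational graph with inputs x1, x2, hidden units h1, h2, output f and loss L]

Forward Computation:

\[ L\left(f\left(h_1(x_1, x_2, \theta_{h_1}), h_2(x_1, x_2, \theta_{h_2}), \theta_f\right), y\right) \]

Multilayer Perceptron with a single hidden layer:

\[ L(x, y, \theta) = \frac{1}{2}\left(y - U^\top \phi\left(W^\top x\right)\right)^2 \]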

Backpropagation Algorithm – (2) Chain Rule

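[Figure: the same graph with edges labeled by the local derivatives ∂h1/∂x1, ∂h2/∂x1, ∂f/∂h1, ∂f/∂h2 and ∂L/∂f]

Chain rule of derivatives:

\[ \frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial x_1} = \frac{\partial L}{\partial f} \left( \frac{\partial f}{\partial h_1} \frac{\partial h_1}{\partial x_1} + \frac{\partial f}{\partial h_2} \frac{\partial h_2}{\partial x_1} \right) \]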

Backpropagation Algorithm – (3) Shared Derivatives

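[Figure: the same graph with the shared derivatives ∂f/∂h1, ∂f/∂h2 and ∂L/∂f highlighted]

Local derivatives are shared:

\[ \frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial f} \left( \frac{\partial f}{\partial h_1} \frac{\partial h_1}{\partial x_1} + \frac{\partial f}{\partial h_2} \frac{\partial h_2}{\partial x_1} \right) \]

\[ \frac{\partial L}{\partial x_2} = \frac{\partial L}{\partial f} \left( \frac{\partial f}{\partial h_1} \frac{\partial h_1}{\partial x_2} + \frac{\partial f}{\partial h_2} \frac{\partial h_2}{\partial x_2} \right) \]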
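The sharing of local derivatives can be sketched on the two-input, two-hidden-unit graph. The concrete choices below (tanh hidden units, a linear output, squared loss and the parameter names) are assumptions made only to get something runnable. The point is that ∂L/∂f, ∂f/∂h1 and ∂f/∂h2 are computed once and reused for both inputs.

```python
import math


def forward(x1, x2, y, p):
    """Forward pass: h1, h2 = tanh(linear), f = linear, L = 0.5 * (y - f)^2."""
    h1 = math.tanh(p["a1"] * x1 + p["a2"] * x2)
    h2 = math.tanh(p["b1"] * x1 + p["b2"] * x2)
    f = p["u1"] * h1 + p["u2"] * h2
    return h1, h2, f, 0.5 * (y - f) ** 2


def input_gradients(x1, x2, y, p):
    """Backward pass returning (dL/dx1, dL/dx2) with shared local derivatives."""
    h1, h2, f, _ = forward(x1, x2, y, p)
    dL_df = f - y                           # computed once, used for x1 and x2
    df_dh1, df_dh2 = p["u1"], p["u2"]       # also shared
    dh1_dx1, dh1_dx2 = (1 - h1 ** 2) * p["a1"], (1 - h1 ** 2) * p["a2"]
    dh2_dx1, dh2_dx2 = (1 - h2 ** 2) * p["b1"], (1 - h2 ** 2) * p["b2"]
    dL_dx1 = dL_df * (df_dh1 * dh1_dx1 + df_dh2 * dh2_dx1)
    dL_dx2 = dL_df * (df_dh1 * dh1_dx2 + df_dh2 * dh2_dx2)
    return dL_dx1, dL_dx2


# Hypothetical parameter values for the example.
params = {"a1": 0.5, "a2": -1.0, "b1": 1.5, "b2": 0.3, "u1": 2.0, "u2": -0.7}
g1, g2 = input_gradients(0.4, -0.2, 1.0, params)
```

A finite-difference comparison of `forward` at slightly perturbed inputs is a convenient sanity check for derivatives written by hand like this.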

Backpropagation Algorithm – (4) Local Computation

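[Figure: a node h with inputs a1, a2, ..., aq; the incoming ∂L/∂h is combined with the local derivatives ∂h/∂a1, ..., ∂h/∂aq to give ∂L/∂a1, ..., ∂L/∂aq]

Each node computes
- Forward: $h(a_1, a_2, \ldots, a_q)$
- Backward: $\frac{\partial h}{\partial a_1}, \frac{\partial h}{\partial a_2}, \ldots, \frac{\partial h}{\partial a_q}$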

Backpropagation Algorithm – Requirements

[Figure: a node h with inputs a1, a2, ..., aq and the derivatives ∂L/∂h, ∂h/∂a1, ..., ∂h/∂aq, ∂L/∂a1, ..., ∂L/∂aq]

- Each node computes a differentiable function [1]
- Directed Acyclic Graph [2]

[1] Well...?
[2] Well...?

Backpropagation Algorithm – Automatic Differentiation

[Figure: computational graph with inputs x1, x2, hidden units h1, h2, output f and loss L, with the derivatives ∂f/∂h1, ∂f/∂h2 and ∂L/∂f]

- Generalized approach to computing partial derivatives
- As long as your neural network fits the requirements, you do not need to derive the derivatives yourself!
- Theano, Torch, ...
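To make the idea concrete, here is a very small reverse-mode automatic differentiation sketch. It is not how Theano or Torch are implemented. The `Node` class and the three operations are hypothetical and only illustrate that each node needs a forward value and its local derivatives, and that the graph must be acyclic.

```python
import math


class Node:
    """A value in the graph plus (parent, local derivative) pairs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents
        self.grad = 0.0


def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def tanh(a):
    t = math.tanh(a.value)
    return Node(t, ((a, 1.0 - t * t),))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad; call once per graph."""
    order, seen = [], set()

    def visit(node):                 # depth-first topological sort
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):     # consumers are processed before producers
        for parent, local in node.parents:
            parent.grad += node.grad * local


# out = tanh(w * x) * x, where x is used twice in the graph.
x, w = Node(0.5), Node(2.0)
out = mul(tanh(mul(w, x)), x)
backward(out)
```

After `backward(out)`, `x.grad` and `w.grad` hold the partial derivatives of `out`, without any derivative having been worked out by hand for this particular expression.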

Any question on backpropagation and automatic differentiation?

Regularization – (1) Maximum a Posteriori

Probabilistic Perspective: Find a model $M$ by...
- Maximum Likelihood (ML): $\arg\max_\theta p(D \mid M)$
- Maximum a Posteriori (MAP): $\arg\max_\theta p(M \mid D)$

What is the probability of $\theta$ given the current data $D$?

\[ p(M \mid D) = \frac{p(D \mid M)\, p(M)}{\sum_{M} p(D, M)} \propto p(D \mid M)\, p(M) \]

$p(M)$: What do we think a good model should be?

Regularization – (2) Weight Decay

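$p(M)$: What do we think a good model should be?

Weight-Decay Regularization

- Prior Distribution: $\theta_j \sim \mathcal{N}\left(0, (M\lambda)^{-1}\right)$

Maximum a Posteriori Estimation (up to a constant rescaling of $\lambda$)

\[ \theta = \arg\max_{\theta} \sum_{n=1}^{N} \log p(y_n \mid x_n, \theta) - \lambda \sum_{j=1}^{M} \theta_j^2. \]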
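The link between a zero-mean Gaussian prior and a squared penalty can be written out in a few lines. The derivation below is a sketch for a generic prior variance $\sigma^2$ (a symbol introduced here, not on the slide), with $M$ denoting the number of parameters.

```latex
\begin{aligned}
\hat{\theta}_{\text{MAP}}
  &= \arg\max_{\theta}\; \log p(D \mid \theta) + \log p(\theta), \\
\log p(\theta)
  &= \sum_{j=1}^{M} \log \mathcal{N}(\theta_j;\, 0, \sigma^2)
   = -\frac{1}{2\sigma^2} \sum_{j=1}^{M} \theta_j^2 + \text{const}, \\
\hat{\theta}_{\text{MAP}}
  &= \arg\max_{\theta}\; \sum_{n=1}^{N} \log p(y_n \mid x_n, \theta)
     - \frac{1}{2\sigma^2} \sum_{j=1}^{M} \theta_j^2 .
\end{aligned}
```

So the weight-decay coefficient is $1/(2\sigma^2)$: a narrower prior means a stronger penalty. With the variance $(M\lambda)^{-1}$ the coefficient becomes $M\lambda/2$, which is $\lambda$ up to a constant factor.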

Regularization – (3) Smoothness and Noise Injection

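Prior on a Model: Smoothness
- $f(x) \approx f(x + \epsilon)$: the model should be insensitive to small change/noise $\iff$ Minimize $\sum_{n=1}^{N} \left| \frac{\partial f(x_n)}{\partial x} \right|^2$

Regularizing $\sum_{n=1}^{N} \left| \frac{\partial f(x_n)}{\partial x} \right|^2$ is equivalent to adding random Gaussian noise in the input (Bishop, 1995)

\[ \arg\min_{\theta} \sum_{n=1}^{N} \|f_\theta(x_n) - y_n\|^2 + \lambda \left| \frac{\partial f(x_n)}{\partial x} \right|^2 \approx \sum_{\epsilon} p(\epsilon) \left( \arg\min_{\theta} \sum_{n=1}^{N} \|f_\theta(x_n + \epsilon) - y_n\|^2 \right) \]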
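For a linear model the connection between input noise and a derivative penalty is exact, which makes it easy to check numerically. The sketch below assumes f(x) = w·x and Gaussian noise with standard deviation σ; then the expected noisy squared error equals the clean squared error plus σ²·w², and w² is exactly |∂f/∂x|². All names and numbers are illustrative.

```python
import random


def noisy_loss(w, x, y, sigma, samples, seed=0):
    """Monte Carlo estimate of E_eps[(w * (x + eps) - y)^2], eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        eps = rng.gauss(0.0, sigma)
        total += (w * (x + eps) - y) ** 2
    return total / samples


def penalized_loss(w, x, y, sigma):
    """Clean squared error plus sigma^2 times the squared derivative df/dx = w."""
    return (w * x - y) ** 2 + sigma ** 2 * w ** 2


estimate = noisy_loss(w=1.5, x=2.0, y=1.0, sigma=0.1, samples=200000)
exact = penalized_loss(w=1.5, x=2.0, y=1.0, sigma=0.1)
```

For nonlinear models the same relation holds only approximately, for small noise.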

Regularization – (4a) Ensemble Learning and Dropout

Wisdom of the Crowd: Train $M$ classifiers and let them vote

\[ f(x) = \frac{1}{M} \sum_{m=1}^{M} f_{\mathcal{M}_m}(x) \]

(Ciresan et al., 2012)

Regularization – (4b) Ensemble Learning and Dropout

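Dropout: Train one, but exponentially many classifiers (Hinton et al., 2012)

\[ L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log \mathbb{E}_{m}\left[p(y_n, m \mid x_n, \theta)\right] \geq \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{m}\left[\log p(y_n, m \mid x_n, \theta)\right] \]

[Figure: network with inputs x1, x2, hidden units h1, h2, h3, ..., hH and output f]

Each update samples one network out of exponentially many classifiers.

\[ \theta_t = \theta_{t-1} - \eta_t \nabla l\left((x', y'), m \mid \theta_{t-1}\right), \]

where $m_{i,l} \sim \mathcal{B}(0.5)$.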

Regularization – (4b) Ensemble Learning and Dropout

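Dropout: When testing, halve the activations

\[ L(\theta) \geq \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{m}\left[\log p(y_n, m \mid x_n, \theta)\right] \approx \frac{1}{N} \sum_{n=1}^{N} \log p\left(y_n, m = \tfrac{1}{2} \mid x_n, \theta\right) \]

[Figure: the same network with inputs x1, x2, hidden units h1, h2, h3, ..., hH and output f, every activation scaled by 1/2]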
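A small sketch of the two modes of dropout for a single linear unit. It assumes each input is kept with probability 0.5 during training and scaled by 1/2 at test time. For one linear unit the halved output equals the exact average over masks; for a deep nonlinear network it is only an approximation. Function names and values are hypothetical.

```python
import random


def dropout_train_output(x, w, rng):
    """One training-time forward pass: each input is dropped with probability 0.5."""
    return sum(wi * xi for wi, xi in zip(w, x) if rng.random() < 0.5)


def dropout_test_output(x, w):
    """Test-time forward pass: keep every input but halve its contribution."""
    return sum(0.5 * wi * xi for wi, xi in zip(w, x))


x, w = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
rng = random.Random(0)
runs = 100000
average_over_masks = sum(dropout_train_output(x, w, rng) for _ in range(runs)) / runs
halved = dropout_test_output(x, w)
```

Each call to `dropout_train_output` corresponds to sampling one of the exponentially many sub-networks.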

Do you see why I spent so much time on regularization?

Common Recipe for Deep Neural Networks

1. Use a piecewise linear hidden unit
   - Rectifier: $h(x) = \max\{0, x\}$ (Glorot&Bengio, 2011)
   - Maxout: $h(x_1, \ldots, x_p) = \max\{x_1, \ldots, x_p\}$ (Goodfellow et al., 2013)
2. Preprocess data and choose features carefully
   - Images: Whitening? Local contrast normalization? Raw? SIFT? HoG?
   - Speech: Raw? Spectrum?
   - Text: Characters? Words? Tree?
   - General: z-Normalization?
3. Use Dropout and other regularization methods
4. Unsupervised Pretraining (Hinton&Salakhutdinov, 2006)
   - Few labeled samples, but a lot of unlabeled samples
5. Carefully search for hyperparameters
   - Random search, Bayesian optimization (Bergstra&Bengio, 2013; Bergstra et al., 2011; Snoek et al., 2012)
6. Often, the deeper the better
7. Build an ensemble of neural networks
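The two piecewise linear units from the first item of the recipe are short enough to write out directly. This is only a sketch on plain Python numbers; the function names are assumptions.

```python
def rectifier(x):
    """Rectifier unit: h(x) = max{0, x}."""
    return max(0.0, x)


def maxout(xs):
    """Maxout unit: h(x_1, ..., x_p) = max{x_1, ..., x_p}."""
    return max(xs)


activations = [rectifier(v) for v in (-2.0, 0.0, 3.5)]
pooled = maxout([0.3, -1.2, 2.0])
```

Both are linear on each piece of their domain, so the derivative is either 0 or 1 with respect to each input.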

But, nobody seems to use a vanilla multilayer perceptron, right?

How to Encode Prior/Domain Knowledge?

Data Preprocessing and Feature Extraction:
- Object recognition from images
  - Lighting condition shouldn’t matter → Contrast normalization
- Gesture recognition from skeleton
  - Relative positions of joints are important → Relative coordinate system w.r.t. the body center
- Language Processing
  - Word counts are important → Bag-of-Words representation

Model Architecture Design

Convolutional Neural Networks – (1)

Suitable for Images, Videos and Speech

Prior/Domain Knowledge
- Translation invariance
- Rotation invariance: Images and Videos
- Temporal invariance: Videos and Speech
- Frequency invariance: Speech

Convolutional Neural Networks – (2) Convolution and Pooling

Convolution
- Global

Pooling
- Local

[Figure: max pooling units applied over local regions]
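The two operations can be sketched in one dimension. The code below is an illustration only: the same small kernel is slid over the whole signal (without flipping it, as is common in neural network code), and max pooling then keeps the strongest response in each local, non-overlapping window. Function names and the example signal are assumptions.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel over the whole signal ('valid' positions only)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(values, size):
    """Keep the maximum of each non-overlapping local window of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # same pattern, one step later
kernel = [1.0, 2.0]

response = conv1d(signal, kernel)
response_shifted = conv1d(shifted, kernel)
pooled = max_pool1d(response, size=2)
```

Shifting the input shifts the convolution response by the same amount while its peak value stays the same, which is the translation prior built into the architecture.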

Convolutional Neural Networks – (3) Convolutional Neural Network

(TeraDeep, 2013)

Convolutional Layer
1. Contrast Normalization
2. Convolution
3. Pooling
4. Nonlinearity

Convolutional Neural Networks – (3) Deep Convolutional Neural Network

(Krizhevsky et al., 2012)

Recursive Neural Networks – (1)

Suitable for Text and Variable-Length Sequences

Prior/Domain Knowledge
- Compositionality ≈ Tree-based Grammar (?)
- Location invariance
- Variable Length

Recursive Neural Networks – (2)

Compositional Structure

A small crowd quietly enters the historic church

Small (local) pieces are glued together to form a global structure

Recursive Neural Networks – (3)

Finding a good, compact representation of a variable-length sequence

(Socher et al., 2011)

What other architectures can you think of?

Further Topics

- Is learning solved for supervised neural networks?
- Recurrent neural networks: cope with variable-length inputs/outputs
- Beyond sigmoid and rectifier functions

Unsupervised Neural Networks


Warning!!!

The first half of this session can be boring.

Unsupervised Learning

No more labels!

\[ D = \{x_1, x_2, \ldots, x_N\} \]

What can we do?

(Exploratory) Data Analysis

The most important step in machine learning

[Figure: 3-D scatter plot with axes x, y and z]

Human Gestures Visualization by DNN

(Cho&Chen, 2014)

Feature Extraction

With domain knowledge → Engineered Features
Without domain knowledge → Learned Features

[Figure: input x mapped to a feature f?]

Generative Model: Probabilistic Picture

Underlying distribution:

\[ X \sim P_D \]

Data:

\[ D = \{x \mid x \sim P_D\} \]

Find a distribution $p(x)$, using $D$, that emulates $P_D$ as well as possible on new samples from $p(x)$ potentially not included in $D$.

What should we do with data $D = \{x_n\}_{n=1}^{N}$?

Example target tasks
- Classification $p(x_{\text{class}} \mid x_{\text{ins}})$
- Missing value reconstruction $p(x_m \mid x_o)$
- Denoising $p(x \mid \tilde{x})$
- Structured Output Prediction $p(x_{\text{out}} \mid x_{\text{in}})$
- Outlier Detection $p(x) \overset{?}{>} \tau$

Ultimately, it comes down to learning a distribution of x.

Density/Distribution Estimation – (1)

Latent Variable Models pθ(x)

\[ \theta^* = \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log \sum_{h} p(x_n \mid h)\, p(h) \]

1. Define a parametric form of joint distribution $p_\theta(x, h)$
2. Derive a learning rule for $\theta$

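Density/Distribution Estimation – (2) Restricted Boltzmann Machines

[Figure: bipartite graph with visible units x1, x2, ..., xp and hidden units h1, h2, ..., hq]

1. Joint distribution

\[ p_\theta(x, h) = \frac{1}{Z(\theta)} \exp\left( \sum_{i=1}^{p} \sum_{j=1}^{q} x_i h_j w_{ij} \right) \]

2. Marginal distribution

\[ \sum_{h} p_\theta(x, h) = \frac{1}{Z(\theta)} \prod_{j=1}^{q} \left( 1 + \exp\left( \sum_{i=1}^{p} w_{ij} x_i \right) \right) \]

3. Learning rule: Maximum Likelihood

\[ L(\theta) = \mathbb{E}_{x \sim P_d}\left[ \log \prod_{j=1}^{q} \left( 1 + \exp\left( \sum_{i=1}^{p} w_{ij} x_i \right) \right) - \log Z(\theta) \right] \]

\[ \nabla_{w_{ij}} = \langle x_i h_j \rangle_d - \langle x_i h_j \rangle_m \]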

(Smolensky, 1986)
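The closed form of the marginal follows because the sum over binary hidden vectors factorizes over hidden units. The sketch below checks this numerically on a tiny model by enumerating all hidden configurations. It ignores the normalization constant Z(θ), which is the same on both sides; the weights and function names are hypothetical.

```python
import itertools
import math


def unnormalized_joint(x, h, w):
    """exp(sum_i sum_j x_i h_j w_ij), i.e. Z(theta) * p(x, h)."""
    return math.exp(
        sum(x[i] * h[j] * w[i][j] for i in range(len(x)) for j in range(len(h)))
    )


def marginal_brute_force(x, w):
    """Sum the unnormalized joint over all 2^q binary hidden vectors."""
    q = len(w[0])
    return sum(
        unnormalized_joint(x, h, w) for h in itertools.product([0, 1], repeat=q)
    )


def marginal_closed_form(x, w):
    """prod_j (1 + exp(sum_i w_ij x_i)), the factorized form of the same sum."""
    q = len(w[0])
    return math.prod(
        1.0 + math.exp(sum(w[i][j] * x[i] for i in range(len(x)))) for j in range(q)
    )


w = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.7]]   # p = 3 visible, q = 2 hidden units
x = (1, 0, 1)
brute, closed = marginal_brute_force(x, w), marginal_closed_form(x, w)
```

The brute-force sum costs 2^q terms while the product costs q, which is why the marginal of an RBM is tractable up to Z(θ).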

Density/Distribution Estimation – (3) Belief Networks

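[Figure: layered directed network]

1. Joint distribution

\[ p_\theta(x, h) = p(x \mid h_1)\, p(h_1 \mid h_2) \cdots p(h) \]

2. Marginal distribution

\[ \sum_{h} p_\theta(x, h) = \ ? \]

3. Learning rule: Maximum Likelihood

\[ L(\theta) = \mathbb{E}_{x \sim P_d}\left[ \log \sum_{h} p_\theta(x, h) \right] \]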

(Neal, 1996)

Density/Distribution Estimation – (4) NADE

1. Joint distribution

\[ p_\theta(x) = p_\theta(x_1)\, p_\theta(x_2 \mid x_1) \cdots p_\theta(x_d \mid x_1, \ldots, x_{d-1}) \]

2. Marginal distribution: no latent variable $h$

3. Learning rule: Maximum Likelihood (fixed order), Order-agnostic (all orders)

(Larochelle&Murray, 2011)
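The autoregressive factorization can be illustrated with a deliberately simplified model: each binary variable gets a logistic conditional on the variables before it. This is not NADE's actual parameterization, which shares hidden-layer weights across conditionals. It only shows that a product of normalized conditionals is normalized by construction, so no constant Z(θ) is needed. Names and weights are hypothetical.

```python
import itertools
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def log_prob(x, b, w):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1}) with logistic conditionals."""
    total = 0.0
    for i in range(len(x)):
        a = b[i] + sum(w[i][k] * x[k] for k in range(i))   # depends on x_<i only
        p_one = sigmoid(a)
        total += math.log(p_one if x[i] == 1 else 1.0 - p_one)
    return total


b = [0.1, -0.4, 0.3]
w = [[0.0, 0.0, 0.0], [0.8, 0.0, 0.0], [-0.5, 1.2, 0.0]]   # strictly lower triangular use
total_mass = sum(
    math.exp(log_prob(x, b, w)) for x in itertools.product([0, 1], repeat=3)
)
```

Summing the probabilities of all eight binary vectors gives one, and the exact log-probability of any single vector costs one pass over its variables.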

Density/Distribution Estimation – (4) Issues

Intractability! Intractability! Intractability!

(General) Boltzmann Machines
- Normalization Constant $Z(\theta)$
- Marginal Probability $\sum_h p(x, h)$
- Posterior Probability $p(h \mid x)$
- Conditional Probability $p(x_{\text{mis}} \mid x_{\text{obs}})$

Restricted Boltzmann Machines
- Normalization Constant $Z(\theta)$
- Marginal Probability $\sum_h p(x, h)$
- Posterior Probability $p(h \mid x)$
- Conditional Probability $p(x_{\text{mis}} \mid x_{\text{obs}})$

Belief Networks
- Normalization Constant $Z(\theta)$
- Marginal Probability $\sum_h p(x, h)$
- Posterior Probability $p(h \mid x)$
- Conditional Probability $p(x_{\text{mis}} \mid x_{\text{obs}})$

NADE
- Normalization Constant $Z(\theta)$
- Marginal Probability $\sum_h p(x, h)$
- Posterior Probability $p(h \mid x)$
- Conditional Probability $p(x_{\text{mis}} \mid x_{\text{obs}})$
- Somewhat unsatisfactory performance

Do we want to learn the distribution?

Generative Model – (1) Learn to Infer

Example target tasks
- Classification $p(x_{\text{class}} \mid x_{\text{ins}})$
- Missing value reconstruction $p(x_m \mid x_o)$
- Denoising $p(x \mid \tilde{x})$
- Structured Output Prediction $p(x_{\text{out}} \mid x_{\text{in}})$
- Outlier Detection $p(x) \overset{?}{>} \tau$

All we want is to infer the conditional distribution of unknown variables

(Goodfellow et al., 2013; Brakel et al., 2013; Stoyanov et al., 2011, Raiko et al., 2014)

Generative Model – (2) Learn to Infer

Approximate Inference in a Graphical Model

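\[ p(x_{\text{mis}} \mid x_{\text{obs}}) \approx Q(x_{\text{mis}} \mid x_{\text{obs}}) \]

Methods:
- Loopy belief propagation
- Variational inference/message-passing

At the end of the day...

\[ Q^{k}(x_{\text{mis}} \mid x_{\text{obs}}) = f\left( Q^{k-1}(x_{\text{mis}} \mid x_{\text{obs}}) \right) \]

until convergence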

Generative Model – (3) Learn to Infer

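[Figure: the visible and hidden layers x, h unrolled into a chain x<0>, h<1>, x<1>, h<2>, x<2>, ..., h<k>, x<k>]

Approximate Inference in Restricted Boltzmann Machine

\[ p(x_{\text{mis}} \mid x_{\text{obs}}) \approx Q(x_{\text{mis}} \mid x_{\text{obs}}) \]

Mean-field Fixed-point Iteration

\[ \mu_x^{k} = \sigma\left( W \sigma\left( W^\top \mu_x^{k-1} + c \right) + b \right) \]

At the end of the day, a multilayer perceptron with $k - 1$ hidden layers.
→ Use backpropagation and stochastic gradient descent!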
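A rough sketch of the fixed-point iteration, unrolled for k steps. It assumes W has one row per visible unit and one column per hidden unit, that missing entries start at 0.5, and that observed entries are clamped back to their values after every step. The clamping and the initial value are assumptions of this example, as are all names and weights.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def mean_field_step(mu, W, b, c):
    """One application of mu <- sigma(W sigma(W^T mu + c) + b)."""
    p, q = len(W), len(W[0])
    hidden = [sigmoid(sum(W[i][j] * mu[i] for i in range(p)) + c[j]) for j in range(q)]
    return [sigmoid(sum(W[i][j] * hidden[j] for j in range(q)) + b[i]) for i in range(p)]


def infer_missing(x_obs, W, b, c, k):
    """Run k unrolled steps; entries of x_obs that are None are treated as missing."""
    mu = [0.5 if v is None else float(v) for v in x_obs]
    for _ in range(k):
        updated = mean_field_step(mu, W, b, c)
        # Clamp observed entries (assumption of this sketch), update missing ones.
        mu = [updated[i] if x_obs[i] is None else float(x_obs[i]) for i in range(len(mu))]
    return mu


W = [[2.0, -1.0], [2.0, 0.5], [-1.0, 1.0]]   # 3 visible units, 2 hidden units
b, c = [0.0, 0.0, 0.0], [0.0, 0.0]
mu = infer_missing([1, None, 0], W, b, c, k=5)
```

Each step is a deterministic, differentiable layer, which is what allows the unrolled procedure to be trained like a feedforward network.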

Generative Model – (4) Learn to Infer – NADE-k

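[Figure: the unrolled chain x<0>, h<1>, x<1>, h<2>, x<2>, ..., h<k>, x<k> generalized to a network v<0>, h<1>[1], h<1>[2], v<1>, h<2>[1], h<2>[2], v<2> with weight matrices W, U and V]

Further Generalization with Deep Neural Networks

\[ p(x_{\text{mis}} \mid x_{\text{obs}}) \approx Q(x_{\text{mis}} \mid x_{\text{obs}}) = f_\theta(x_{\text{obs}}) \]

Interpret the model as a mixture of NADEs with different orders of variables

- Exact computation of $p(x)$ possible
- Fast inference $p(x_{\text{mis}} \mid x_{\text{obs}})$
- Flexible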

(Raiko et al., 2014)

Generative Model – (5) Learn to Infer

Lesson:

- Do not maximize log-likelihood, but minimize the actual cost!

⇐⇒

- Don’t do what a model tells you to do, but do what you aim to do.

But, popular science journalists don’t care about generative models..

Manifold Learning – Semi-Supervised Learning (1)

[Figure: labeled red and blue points and one unlabeled dot marked "???"]

- Which class does the dot belong to, red or blue?

Manifold Learning – Semi-Supervised Learning (2)

[Figure: the same points with additional unlabeled black dots; the query dot is marked "???"]

- Now, which class does the dot belong to, red or blue?
- The black dots are unlabeled

Manifold Learning – New Representation (1)

Representation $\phi$ on the data manifold?

[Figure: data space and hidden space, related by κ(x)]

1. $\phi$ should reflect changes along the manifold

\[ \phi(x_i) \neq \phi(x_j), \text{ for all } x_i, x_j \in D \]

2. $\phi$ should not reflect any change orthogonal to the manifold

\[ \phi(x_i + \epsilon) = \phi(x_i) \]

Manifold Learning – New Representation (2) Denoising Autoencoder

(Vincent et al., 2011)

Representation that captures the manifold
1. $\phi(x_i) \neq \phi(x_j)$, for all $x_i, x_j \in D$
2. $\phi(x_i + \epsilon) = \phi(x_i)$

Denoising autoencoder achieves it by

\[ \min_{\theta, \theta'} \left\| x - g_{\theta'}\left( f_\theta(x + \epsilon) \right) \right\|^2 \]

[Figure: data space and hidden space, related by κ(x)]

Semi-Supervised Learning in Action (1) Layer-wise Pretraining

[Figure: network with input x, hidden layers h1, h2, h3 and output y]

Semi-Supervised Learning in Action (2) Layer-wise Pretraining

[Figure: pretraining of the 1st layer and pretraining of the 2nd layer, with input x, hidden layers h[1], h[2] and output y]

(Hinton&Salakhutdinov, 2006; Bengio et al., 2007; Ranzato et al., 2007)

Manifold Embedding and Visualization – (1)

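- Manifold Embedding: $\mathcal{M} \subset \mathbb{R}^d \to \mathbb{R}^q$, $q \ll d$
- If $q = 2$ or $3$, data visualization

[Figure: data mapped to a two-dimensional embedding with coordinates z1, z2]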

(Oja, 1991; Kramer, 1991; Hinton & Salakhutdinov, 2006)

Manifold Embedding and Visualization – (2)

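Handwritten Digits $[0, 1]^{196} \to \mathbb{R}^2$

[Figure: two-dimensional embedding; legend: digits 0 to 9]

Pose Frame Data $\mathbb{R}^{30} \to \mathbb{R}^2$

[Figure: two-dimensional embedding; legend: rotateArmsLBack, rotateArmsRBack, rotateArmsBBack]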

What other applications can you think of?

Advanced Topics


Is deep learning all about computer vision and speech recognition?

Deep Reinforcement Learning

[Figure: network with state input s, hidden layers h1, h2, h3 and action output a]

Q Learning
- $Q(s, a)$: state-action function
- Action at time $t$ $= \arg\max_{a \in [1, j]} Q(s, a)$
- Update $Q$ on-the-fly
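A tabular sketch of Q learning on a toy problem. The slide does not spell out the update, so the code assumes the standard rule Q(s, a) ← Q(s, a) + α (r + γ max_a' Q(s', a') − Q(s, a)). The four-state chain environment, the random behaviour policy and all parameter values are hypothetical.

```python
import random


def q_learning_chain(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    """States 0..3 on a line; action 0 moves left, action 1 moves right.

    Reaching state 3 gives reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    goal = 3
    Q = [[0.0, 0.0] for _ in range(goal + 1)]        # Q[s][a]
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.randrange(2)                     # explore uniformly at random
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            target = r + gamma * max(Q[s_next])      # Q[goal] stays at zero
            Q[s][a] += alpha * (target - Q[s][a])    # update Q on-the-fly
            s = s_next
    return Q


def greedy_action(Q, s):
    """argmax_a Q(s, a)."""
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


Q = q_learning_chain()
policy = [greedy_action(Q, s) for s in range(3)]
```

After training, the greedy action in every non-terminal state is to move right, towards the reward. Deep Q learning replaces the table `Q` with a neural network.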

Deep Q Learning
- Model $Q$ with a deep neural network
- Predict $Q(\cdot, a_j)$ for all $j$ at once
- State $s$: visual perception, not internal states!

(Mnih et al., 2013)

Natural Language Processing

In neuropsychology, linguistics and the philosophy of language, a natural language or ordinary language is any language which arises, unpremeditated, in the brains of human beings.

–Wikipedia

Natural Language Processing

To machine learning researchers:

Natural Language is a huge set of variable-length sequences of high-dimensional vectors.

Natural Language Processing – (1) How should we represent a linguistic symbol?

Say, we have four symbols (words):

[EU], [3], [France], [three]

Most naïve, uninformative coding:

[EU] = [1, 0, 0, 0]

[3] = [0, 1, 0, 0]

[France] = [0, 0, 1, 0]

[three] = [0, 0, 0, 1]

Not satisfying..
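The naive coding can be generated mechanically, and a quick distance computation shows why it is uninformative: every pair of distinct symbols is equally far apart. This is a minimal sketch; the helper names are assumptions.

```python
def one_hot(index, size):
    """Vector of zeros with a single one at the given position."""
    return [1 if i == index else 0 for i in range(size)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


vocabulary = ["EU", "3", "France", "three"]
codes = {word: one_hot(i, len(vocabulary)) for i, word in enumerate(vocabulary)}

d_similar = squared_distance(codes["3"], codes["three"])      # same meaning
d_unrelated = squared_distance(codes["EU"], codes["3"])       # unrelated meaning
```

Both distances come out identical, so the code carries no information about which symbols mean similar things.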

Natural Language Processing – (2) How should we represent a linguistic symbol?

Say, we have four symbols (words):

[EU], [3], [France], [three]

Is there a representation that preserves the similarities of meanings of symbols?

\[ D([\text{EU}], [\text{France}]) < D([\text{EU}], [3]), \]

\[ D([3], [\text{three}]) < D([\text{France}], [\text{three}]), \]

\[ D([3], [\text{three}]) < \epsilon \]

...

Natural Language Processing – (3) Continuous-Space Representation

Sample sentences:
1. There are three teams left for the qualification.
2. 3 teams have passed the first round.

Task: Predict the following word given the current word [three]

Naïve approach: build a table (so-called n-gram)
- (three, teams), (3, teams)
- The table can grow arbitrarily.

Machine learning: compress the table into a continuous function
- Map three and 3 to nearby points x in a continuous space
- From x, map to [teams].

Natural Language Processing – (4) Continuous-Space Representation

(Cho et al., 2014)

Natural Language Processing – (5) Beyond Word Representation

(Cho et al., 2014)

NN: I am very powerful and can model anything as long as I’m fed enough computational resources.

SVM: But, you have to optimize a high-dimensional, non-convex function which has many, many local minima!

NN: Really?

Advanced Optimization – (1) Statistical Physics says. . .

Not really

Advanced Optimization – (2) Local Minima? Saddle Points?

[Figure: three 3-D surface plots with axes X, Y and Z]

(Dauphin et al., 2014; Pascanu et al., 2014)

Advanced Optimization – (3) Beyond the 2nd-order Method

(Quasi-)Newton Method

\[ \theta \leftarrow \theta - H^{-1} \nabla L(\theta) \]

How well does the quadratic approximation hold when training neural networks?

Saddle-Free Newton Method (very new!!) (Dauphin et al., 2014)

\[ \theta \leftarrow \theta - |H|^{-1} \nabla L(\theta), \]

where $|H|$ is constructed by

\[ |H| = U |\Sigma| V \]

when $H = U \Sigma V$.
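The difference between the two updates is easiest to see on the textbook saddle L(x, y) = x² − y². Its Hessian is diagonal, so taking |H| amounts to taking the absolute value of each diagonal entry. That simplification, and the function names, are assumptions of this sketch; a general Hessian needs a full decomposition.

```python
def loss(theta):
    """L(x, y) = x^2 - y^2, which has a saddle point at the origin."""
    x, y = theta
    return x * x - y * y


def gradient(theta):
    x, y = theta
    return [2.0 * x, -2.0 * y]


HESSIAN_DIAGONAL = [2.0, -2.0]   # constant Hessian diag(2, -2)


def newton_step(theta):
    """theta - H^{-1} grad: jumps to the critical point, here the saddle."""
    return [t - g / h for t, g, h in zip(theta, gradient(theta), HESSIAN_DIAGONAL)]


def saddle_free_step(theta):
    """theta - |H|^{-1} grad: descends along the negative-curvature direction."""
    return [t - g / abs(h) for t, g, h in zip(theta, gradient(theta), HESSIAN_DIAGONAL)]


start = [1.0, 0.5]
after_newton = newton_step(start)
after_saddle_free = saddle_free_step(start)
```

From the same starting point the Newton step lands exactly on the saddle, while the saddle-free step minimizes along x and moves away from the saddle along y, reaching a lower loss.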

Last but not least, is there any theoretical ground for using deep neural networks?

Theoretical Analysis – Deep Rectifier Networks Fold the Space

[Figure: (a) the input space is folded along the horizontal and vertical axes in successive steps; (b), (c) the regions S1, S2, S3, S4 of the input space are mapped onto the regions S′1, S′2, S′3, S′4 in the first-layer space and the second-layer space]

(Montufar et al., 2014; Pascanu et al., 2014)

Is it the beginning of deep learning or the end of deep learning?