
Machine Translation 02: Neural Network Basics

Rico Sennrich

University of Edinburgh

R. Sennrich MT – 2018 – 02 1 / 21

The biggest revolution in the technological landscape for fifty years

Now accepting applications! Find out more and apply at:

pervasiveparallelism.inf.ed.ac.uk

• 4-year programme: MSc by Research + PhD

• Collaboration between:
  ▶ University of Edinburgh’s School of Informatics (ranked top in the UK by 2014 REF)
  ▶ Edinburgh Parallel Computing Centre (UK’s largest supercomputing centre)

• Full funding available

• Industrial engagement programme includes internships at leading companies

• Research-focused: Work on your thesis topic from the start

• Research topics in software, hardware, theory and application of:
  ▶ Parallelism
  ▶ Concurrency
  ▶ Distribution

R. Sennrich MT – 2018 – 02 1 / 21

Today’s Lecture

linear regression

stochastic gradient descent (SGD)

backpropagation

a simple neural network

R. Sennrich MT – 2018 – 02 2 / 21

Linear Regression

Parameters: θ = [θ0, θ1]

Model: hθ(x) = θ0 + θ1x

[Scatter plots of Profit against Population: first the training data alone, then with candidate fits y = −5.00 + 1.50x, y = −6.00 + 2.00x, y = −2.50 + 1.00x, and y = −3.90 + 1.19x]

R. Sennrich MT – 2018 – 02 3 / 21
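To make the model concrete, here is a minimal numpy sketch (not from the slides) of the hypothesis hθ(x) = θ0 + θ1x, evaluated at the candidate parameter settings from the plots; the input values are hypothetical and are not the dataset behind the slides.

import numpy as np

def h(theta, x):
    # linear hypothesis h_theta(x) = theta0 + theta1 * x
    theta0, theta1 = theta
    return theta0 + theta1 * x

x = np.array([5.0, 10.0, 15.0, 20.0])   # hypothetical population values

# candidate parameter settings shown on the slides
for theta in ([-5.0, 1.5], [-6.0, 2.0], [-2.5, 1.0], [-3.9, 1.19]):
    print(theta, h(theta, x))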

The cost (or loss) function

We try to find parameters θ̂ ∈ R² such that the cost function J(θ) is minimal:

J : R² → R

θ̂ = argmin_{θ ∈ R²} J(θ)

Mean Square Error:

J(θ) = 1/(2m) · Σ_{i=1..m} (hθ(x^(i)) − y^(i))²
     = 1/(2m) · Σ_{i=1..m} (θ0 + θ1x^(i) − y^(i))²

where m is the number of data points in the training set.

R. Sennrich MT – 2018 – 02 4 / 21


The cost (or loss) function

[Plots: the training data with each candidate line]

J([−5.00, 1.50]) = 6.1561
J([−6.00, 2.00]) = 19.3401
J([−2.50, 1.00]) = 4.7692
J([−3.90, 1.19]) = 4.4775

R. Sennrich MT – 2018 – 02 5 / 21
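As an illustration of the cost computation (not from the slides), a sketch of J(θ) in numpy; the training data here is hypothetical, so the values will not reproduce the 6.1561, 19.3401, 4.7692, and 4.4775 shown above.

import numpy as np

def J(theta, x, y):
    # mean square error: J(theta) = 1/(2m) * sum_i (h_theta(x_i) - y_i)^2
    theta0, theta1 = theta
    m = len(x)
    residuals = theta0 + theta1 * x - y
    return np.sum(residuals ** 2) / (2 * m)

# hypothetical (population, profit) pairs -- not the dataset behind the slides
x = np.array([6.1, 5.5, 8.5, 7.0, 5.9])
y = np.array([4.2, 3.1, 6.4, 4.9, 3.5])

for theta in ([-5.0, 1.5], [-6.0, 2.0], [-2.5, 1.0], [-3.9, 1.19]):
    print(theta, J(theta, x, y))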

The cost (or loss) function

So, how do we find θ̂ = argmin_{θ ∈ R²} J(θ) computationally?

[3D surface plot of J(θ) over (θ0, θ1)]

R. Sennrich MT – 2018 – 02 6 / 21


(Stochastic) gradient descent

θj := θj − α ∂/∂θj J(θ)    for each j

[Surface plots of J(θ) over (θ0, θ1), showing the parameters after step 0, 1, 20, 200, and 10000 with α = 0.01; after step 10000 with α = 0.005 and with α = 0.02; and after step 10 with α = 0.025]

R. Sennrich MT – 2018 – 02 7 / 21
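A sketch of plain (full-batch) gradient descent for this model, using the standard MSE gradients ∂J/∂θ0 = (1/m) · Σ (hθ(x^(i)) − y^(i)) and ∂J/∂θ1 = (1/m) · Σ (hθ(x^(i)) − y^(i)) · x^(i); the data and starting point are hypothetical. The stochastic variant would instead update on one example or a small mini-batch at a time.

import numpy as np

def gradient_descent(x, y, alpha=0.01, steps=10000):
    # minimise J(theta) for h_theta(x) = theta0 + theta1 * x
    theta = np.zeros(2)                        # hypothetical starting point [0, 0]
    m = len(x)
    for _ in range(steps):
        errors = theta[0] + theta[1] * x - y   # h_theta(x_i) - y_i for all i
        grad = np.array([errors.mean(),        # dJ/dtheta0
                         (errors * x).mean()]) # dJ/dtheta1
        theta -= alpha * grad                  # the update rule from the slide
    return theta

# hypothetical data; on the real dataset the fit shown above was roughly [-3.90, 1.19]
x = np.array([6.1, 5.5, 8.5, 7.0, 5.9, 8.3, 7.5])
y = np.array([4.2, 3.1, 6.4, 4.9, 3.5, 6.0, 5.2])
print(gradient_descent(x, y))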


Backpropagation

How do we calculate ∂/∂θj J(θ)?

In other words: how sensitive is the loss function to a change in a parameter θj?

why backpropagation?
we could do this by hand for linear regression...
but what about complex functions?
→ propagate error backward (a special case of automatic differentiation)

R. Sennrich MT – 2018 – 02 8 / 21

Computation Graphs

applying chain rule:

∂e/∂b = ∂e/∂c · ∂c/∂b + ∂e/∂d · ∂d/∂b = 1 · 2 + 1 · 3 = 5

next, let's use dynamic programming to avoid re-computing intermediate results...

Christopher Olah http://colah.github.io/posts/2015-08-Backprop/

R. Sennrich MT – 2018 – 02 9 / 21

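The computation graph itself is a figure and does not survive in this transcript; assuming the example graph from the cited Colah post (a = 2, b = 1, c = a + b, d = b + 1, e = c · d), the chain-rule computation on the slide can be reproduced by hand:

# graph values are an assumption taken from the cited Colah post
a, b = 2.0, 1.0
c = a + b          # c = 3
d = b + 1.0        # d = 2
e = c * d          # e = 6

# local derivatives at each node
de_dc = d          # de/dc = d = 2
de_dd = c          # de/dd = c = 3
dc_db = 1.0        # dc/db = 1
dd_db = 1.0        # dd/db = 1

# chain rule, summing over the two paths from b to e
de_db = de_dc * dc_db + de_dd * dd_db
print(de_db)       # 5.0, matching the slide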

Backpropagation

forward-mode differentiation lets us compute the partial derivative ∂x/∂b for all nodes x
→ still inefficient if you have many inputs

Christopher Olah http://colah.github.io/posts/2015-08-Backprop/

R. Sennrich MT – 2018 – 02 10 / 21

Backpropagation

backward-mode differentiation lets us efficiently compute ∂e/∂x for all inputs x in one pass
→ also known as error backpropagation

Christopher Olah http://colah.github.io/posts/2015-08-Backprop/

R. Sennrich MT – 2018 – 02 10 / 21
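Continuing with the same assumed graph, a sketch of the backward pass: starting from ∂e/∂e = 1 and walking the graph once from the output towards the inputs yields the derivative with respect to every input in a single sweep.

# one backward pass over the assumed graph (a = 2, b = 1, c = a + b, d = b + 1, e = c * d)
a, b = 2.0, 1.0
c, d = a + b, b + 1.0
e = c * d

grad = {"e": 1.0}                              # de/de
grad["c"] = grad["e"] * d                      # e = c * d
grad["d"] = grad["e"] * c
grad["a"] = grad["c"] * 1.0                    # c = a + b
grad["b"] = grad["c"] * 1.0 + grad["d"] * 1.0  # b feeds both c and d
print(grad["a"], grad["b"])                    # 2.0 5.0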

To summarize what we have learned

When approaching a machine learning problem, we need:

a suitable model;

(here: a linear model)

a suitable cost (or loss) function;

(here: mean square error)

an optimization algorithm;

(here: a variant of SGD)

the gradient(s) of the cost function (if required by the optimization algorithm).

R. Sennrich MT – 2018 – 02 11 / 21


What is a Neural Network?

A complex non-linear function which:
  is built from simpler units (neurons, nodes, gates, ...)
  maps vectors/matrices to vectors/matrices
  is parameterised by vectors/matrices

Why is this useful?
  very expressive
  can represent (e.g.) parameterised probability distributions
  evaluation and parameter estimation can be built up from components

relationship to linear regression:
  more complex architectures with hidden units (neither input nor output)
  neural networks typically use non-linear activation functions

R. Sennrich MT – 2018 – 02 12 / 21


An Artificial Neuron

[Diagram: inputs x1, x2, x3, ..., xn feed a single unit that computes g(w · x + b) and outputs y]

x is a vector input, y is a scalar output

w and b are the parameters (b is a bias term)

g is a (non-linear) activation function

R. Sennrich MT – 2018 – 02 13 / 21
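A minimal sketch of such a unit in numpy (not from the slides); the sigmoid is used here as one possible choice of g, and the input, weights, and bias are hypothetical.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, g=sigmoid):
    # a single unit: scalar output y = g(w . x + b)
    return g(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # vector input x (hypothetical)
w = np.array([0.1, 0.4, -0.3])   # weight vector w (hypothetical)
b = 0.2                          # bias term b (hypothetical)
print(neuron(x, w, b))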

Why Non-linearity?

Functions like XOR cannot be separated by a linear function

XOR truth table:

x1  x2  output
 0   0    0
 0   1    1
 1   0    1
 1   1    0

[Diagram: a small layered network with inputs x1, x2, hidden units A, B, C, and output D that computes XOR]

(neurons arranged in layers, and fire if input is ≥ 1)

R. Sennrich MT – 2018 – 02 14 / 21


Activation functions

desirable:
  differentiable (for gradient-based training)
  monotonic (for better training stability)
  non-linear (for better expressivity)

[Plot over x ∈ [−3, 3]: identity (linear), sigmoid, tanh, and rectified linear unit (ReLU)]

R. Sennrich MT – 2018 – 02 15 / 21
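For reference, the four activation functions plotted above, written out in numpy (a small sketch, not from the slides):

import numpy as np

def identity(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-3.0, 3.0, 7)
for g in (identity, sigmoid, tanh, relu):
    print(g.__name__, np.round(g(x), 3))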

A Simple Neural Network: Maths

we can use linear algebra to formalize our neural network:

the network

[Diagram: the XOR network with inputs x1, x2, hidden units A, B, C, and output D]

x = [x1 x2]        h1 = [A B C]        y = [D]

w1 = [ 1  0.5  0
       0  0.5  1 ]

w2 = [  1
       -2
        1 ]

calculation of x ↦ y:

h1 = ϕ(x w1)

y = ϕ(h1 w2)

R. Sennrich MT – 2018 – 02 16 / 21

A Simple Neural Network: Python Code

import numpy as np

# threshold activation function: fires (returns 1) if the input is >= 1
def phi(x):
    return np.greater_equal(x, 1).astype(int)

# two-layer feed-forward network: hidden layer h1, output y
def nn(x, w1, w2):
    h1 = phi(np.dot(x, w1))
    y = phi(np.dot(h1, w2))
    return y

w1 = np.array([[1, 0.5, 0], [0, 0.5, 1]])  # input -> hidden weights
w2 = np.array([[1], [-2], [1]])            # hidden -> output weights

x = np.array([1, 0])
print(nn(x, w1, w2))                       # outputs [1]

R. Sennrich MT – 2018 – 02 17 / 21
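Continuing from the definitions above, a short usage example (not on the slide): looping over all four inputs reproduces the XOR truth table.

# usage example, continuing from the code above: all four XOR inputs
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, nn(np.array([x1, x2]), w1, w2)[0])
# expected output:
# 0 0 0
# 0 1 1
# 1 0 1
# 1 1 0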


More Complex Architectures

Convolutional

[Figure 3 from Kalchbrenner et al. (2014): a Dynamic Convolutional Neural Network (DCNN) for a seven-word input sentence, with wide convolution, folding, and dynamic k-max pooling layers; word embeddings of size d = 4, filter widths 3 and 2, pooling values k of 5 and 3]

[Kalchbrenner et al., 2014]

Recurrent

Andrej Karpathy

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

R. Sennrich MT – 2018 – 02 18 / 21

Practical Considerations

efficiency:
  GPU acceleration of BLAS operations
  perform SGD in mini-batches (see the sketch after this slide)

hyperparameters:
  number and size of layers
  minibatch size
  learning rate
  ...

initialisation of weight matrices

stopping criterion

regularization (dropout)

bias units (always-on input)

R. Sennrich MT – 2018 – 02 19 / 21
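As an illustration of the mini-batch point, a sketch of mini-batch SGD for the earlier linear model (not from the slides); the batch size, learning rate, and data are hypothetical choices.

import numpy as np

def minibatch_sgd(x, y, alpha=0.01, batch_size=2, epochs=100, seed=0):
    # mini-batch SGD for h_theta(x) = theta0 + theta1 * x
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(epochs):
        order = rng.permutation(len(x))              # shuffle each epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            errors = theta[0] + theta[1] * xb - yb
            grad = np.array([errors.mean(), (errors * xb).mean()])
            theta -= alpha * grad                    # update on the mini-batch only
    return theta

x = np.array([6.1, 5.5, 8.5, 7.0, 5.9, 8.3, 7.5])   # hypothetical data
y = np.array([4.2, 3.1, 6.4, 4.9, 3.5, 6.0, 5.2])
print(minibatch_sgd(x, y))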

Toolkits for Neural Networks

What does a Toolkit Provide

Multi-dimensional matrices (tensors)

Automatic differentiation

Efficient GPU routines for tensor operations

Torch http://torch.ch/

TensorFlow https://www.tensorflow.org/

Theano http://deeplearning.net/software/theano/

There are many more!

R. Sennrich MT – 2018 – 02 20 / 21
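As a small illustration of the automatic-differentiation point, a sketch using TensorFlow's eager GradientTape API (TensorFlow 2 syntax, which post-dates the 2018 lecture); the graph values are the same assumed Colah example as before.

import tensorflow as tf

a = tf.Variable(2.0)
b = tf.Variable(1.0)

with tf.GradientTape() as tape:
    c = a + b
    d = b + 1.0
    e = c * d

# the toolkit applies backward-mode differentiation for us
de_da, de_db = tape.gradient(e, [a, b])
print(float(de_da), float(de_db))   # 2.0 5.0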

Further Reading

required reading: Koehn (2017), chapter 13.2-3.

further reading on backpropagation: http://colah.github.io/posts/2015-08-Backprop/

R. Sennrich MT – 2018 – 02 21 / 21


Slide Credits

some slides borrowed from:

Sennrich, Birch, and Junczys-Dowmunt (2016): Advances in Neural Machine Translation

Sennrich and Haddow (2017): Practical Neural Machine Translation

R. Sennrich MT – 2018 – 02 22 / 21

Bibliography I

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

R. Sennrich MT – 2018 – 02 23 / 21

