Page 1

Machine Translation
02: Neural Network Basics

Rico Sennrich

University of Edinburgh

R. Sennrich MT – 2018 – 02 1 / 21

Page 2

The biggest revolution in the technological landscape for fifty years

Now accepting applications! Find out more and apply at: pervasiveparallelism.inf.ed.ac.uk

• 4-year programme: MSc by Research + PhD
• Collaboration between:
  ▶ University of Edinburgh’s School of Informatics ✴ Ranked top in the UK by 2014 REF
  ▶ Edinburgh Parallel Computing Centre ✴ UK’s largest supercomputing centre
• Full funding available
• Industrial engagement programme includes internships at leading companies
• Research-focused: Work on your thesis topic from the start
• Research topics in software, hardware, theory and application of:
  ▶ Parallelism ▶ Concurrency ▶ Distribution

R. Sennrich MT – 2018 – 02 1 / 21

Page 3

Today’s Lecture

linear regression

stochastic gradient descent (SGD)

backpropagation

a simple neural network

R. Sennrich MT – 2018 – 02 2 / 21

Page 4

Linear Regression

Parameters: θ = [θ₀, θ₁]ᵀ

Model: h_θ(x) = θ₀ + θ₁x

[Scatter plot of the training data: Population (x-axis) vs. Profit (y-axis)]

R. Sennrich MT – 2018 – 02 3 / 21
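
A minimal numpy sketch of this hypothesis (the input value below is arbitrary; the parameters are one of the candidate settings shown on the following slides):

import numpy as np

def h(theta, x):
    # linear hypothesis: h_theta(x) = theta_0 + theta_1 * x
    return theta[0] + theta[1] * x

theta = np.array([-3.90, 1.19])   # candidate parameters from a later slide
print(h(theta, 10.0))             # predicted profit for a population of 10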

Page 6

Linear Regression

Parameters: θ = [θ₀, θ₁]ᵀ

Model: h_θ(x) = θ₀ + θ₁x

[Scatter plot: Population vs. Profit, with the candidate fit y = −5.00 + 1.50x]

R. Sennrich MT – 2018 – 02 3 / 21

Page 7

Linear Regression

Parameters: θ = [θ₀, θ₁]ᵀ

Model: h_θ(x) = θ₀ + θ₁x

[Scatter plot: Population vs. Profit, with the candidate fit y = −6.00 + 2.00x]

R. Sennrich MT – 2018 – 02 3 / 21

Page 8

Linear Regression

Parameters: θ = [θ₀, θ₁]ᵀ

Model: h_θ(x) = θ₀ + θ₁x

[Scatter plot: Population vs. Profit, with the candidate fit y = −2.50 + 1.00x]

R. Sennrich MT – 2018 – 02 3 / 21

Page 9

Linear Regression

Parameters: θ = [θ₀, θ₁]ᵀ

Model: h_θ(x) = θ₀ + θ₁x

[Scatter plot: Population vs. Profit, with the candidate fit y = −3.90 + 1.19x]

R. Sennrich MT – 2018 – 02 3 / 21

Page 10

The cost (or loss) function

We try to find parameters θ̂ ∈ ℝ² such that the cost function J(θ) is minimal:

J : ℝ² → ℝ

θ̂ = argmin_{θ ∈ ℝ²} J(θ)

Mean Square Error:

J(θ) = 1/(2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
     = 1/(2m) · Σᵢ₌₁ᵐ (θ₀ + θ₁x⁽ⁱ⁾ − y⁽ⁱ⁾)²

where m is the number of data points in the training set.

R. Sennrich MT – 2018 – 02 4 / 21
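
A minimal numpy sketch of this cost function (the four data points are invented here purely so the call is runnable; they are not the dataset used in the slides):

import numpy as np

def cost(theta, x, y):
    # mean square error: J(theta) = 1/(2m) * sum_i (theta_0 + theta_1*x_i - y_i)^2
    m = len(x)
    error = theta[0] + theta[1] * x - y
    return np.sum(error ** 2) / (2 * m)

x = np.array([6.0, 8.0, 10.0, 14.0])   # hypothetical populations
y = np.array([4.0, 7.0, 9.0, 12.0])    # hypothetical profits
print(cost(np.array([-2.50, 1.00]), x, y))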

Page 14

The cost (or loss) function

[Scatter plot: Population vs. Profit, with the fit y = −5.00 + 1.50x]

J([−5.00, 1.50]ᵀ) = 6.1561

R. Sennrich MT – 2018 – 02 5 / 21

Page 15

The cost (or loss) function

[Scatter plot: Population vs. Profit, with the fit y = −6.00 + 2.00x]

J([−6.00, 2.00]ᵀ) = 19.3401

R. Sennrich MT – 2018 – 02 5 / 21

Page 16

The cost (or loss) function

[Scatter plot: Population vs. Profit, with the fit y = −2.50 + 1.00x]

J([−2.50, 1.00]ᵀ) = 4.7692

R. Sennrich MT – 2018 – 02 5 / 21

Page 17

The cost (or loss) function

[Scatter plot: Population vs. Profit, with the fit y = −3.90 + 1.19x]

J([−3.90, 1.19]ᵀ) = 4.4775

R. Sennrich MT – 2018 – 02 5 / 21

Page 18

The cost (or loss) function

So, how do we find θ̂ = argmin_{θ ∈ ℝ²} J(θ) computationally?

[3D surface plot of the cost J(θ) as a function of θ₀ and θ₁]

R. Sennrich MT – 2018 – 02 6 / 21

Page 20

(Stochastic) gradient descent

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ    (for each j)

Step 0, α = 0.01

[3D surface plot of J(θ) over (θ₀, θ₁), showing the current position of gradient descent]

R. Sennrich MT – 2018 – 02 7 / 21
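
A minimal sketch of the update rule above for the linear model, using the analytic gradients of the mean square error (full-batch here; true SGD would use a single example or a mini-batch per update; the data and step count are placeholders):

import numpy as np

def gd_step(theta, x, y, alpha):
    # gradients of J: dJ/dtheta_0 = 1/m * sum(error), dJ/dtheta_1 = 1/m * sum(error * x)
    m = len(x)
    error = theta[0] + theta[1] * x - y
    grad = np.array([np.sum(error) / m, np.sum(error * x) / m])
    return theta - alpha * grad

x = np.array([6.0, 8.0, 10.0, 14.0])   # hypothetical data, as before
y = np.array([4.0, 7.0, 9.0, 12.0])
theta = np.zeros(2)
for step in range(10000):
    theta = gd_step(theta, x, y, alpha=0.01)
print(theta)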

Page 22

(Stochastic) gradient descent

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ    (for each j)

Step 1, α = 0.01

[3D surface plot of J(θ), showing the descent trajectory so far]

R. Sennrich MT – 2018 – 02 7 / 21

Page 23

(Stochastic) gradient descent

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ    (for each j)

Step 20, α = 0.01

[3D surface plot of J(θ), showing the descent trajectory so far]

R. Sennrich MT – 2018 – 02 7 / 21

Page 24

(Stochastic) gradient descent

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ    (for each j)

Step 200, α = 0.01

[3D surface plot of J(θ), showing the descent trajectory so far]

R. Sennrich MT – 2018 – 02 7 / 21

Page 25

(Stochastic) gradient descent

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ    (for each j)

Step 10000, α = 0.01

[3D surface plot of J(θ), showing the descent trajectory so far]

R. Sennrich MT – 2018 – 02 7 / 21

Page 26

(Stochastic) gradient descent

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ    (for each j)

Step 10000, α = 0.005

[3D surface plot of J(θ), showing the descent trajectory with a smaller learning rate]

R. Sennrich MT – 2018 – 02 7 / 21

Page 27

(Stochastic) gradient descent

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ    (for each j)

Step 10000, α = 0.02

[3D surface plot of J(θ), showing the descent trajectory with a larger learning rate]

R. Sennrich MT – 2018 – 02 7 / 21

Page 28

(Stochastic) gradient descent

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ    (for each j)

Step 10, α = 0.025

[3D surface plot of J(θ), showing the descent trajectory with a still larger learning rate]

R. Sennrich MT – 2018 – 02 7 / 21

Page 29

Backpropagation

How do we calculate ∂J(θ)/∂θⱼ?

In other words: how sensitive is the loss function to a change in the parameter θⱼ?

why backpropagation?
  we could do this by hand for linear regression...
  but what about complex functions?
  → propagate the error backward (a special case of automatic differentiation)

R. Sennrich MT – 2018 – 02 8 / 21
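
Any gradient can also be sanity-checked numerically with central differences; a minimal sketch (J can be any cost function of a parameter vector, e.g. the one sketched earlier):

import numpy as np

def numerical_gradient(J, theta, eps=1e-6):
    # central differences: dJ/dtheta_j ≈ (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e_j = np.zeros_like(theta)
        e_j[j] = eps
        grad[j] = (J(theta + e_j) - J(theta - e_j)) / (2 * eps)
    return grad

This needs two cost evaluations per parameter and per step, which quickly becomes infeasible; backpropagation instead computes all partial derivatives in one backward pass.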

Page 30

Backpropagation

applying the chain rule:

∂e/∂b = ∂e/∂c · ∂c/∂b + ∂e/∂d · ∂d/∂b = 1 · 2 + 1 · 3 = 5

next, let's use dynamic programming to avoid re-computing intermediate results...

Christopher Olah, http://colah.github.io/posts/2015-08-Backprop/

R. Sennrich MT – 2018 – 02 9 / 21
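
These numbers match the worked example in the linked Olah post, where e = (a + b) · (b + 1) with a = 2 and b = 1 (assuming that is the computational graph in the figure); a minimal sketch of the forward pass and the hand-derived backward pass:

# forward pass on the graph: c = a + b, d = b + 1, e = c * d
a, b = 2.0, 1.0
c = a + b        # 3
d = b + 1        # 2
e = c * d        # 6

# backward pass (reverse mode): push derivatives from e back to the inputs
de_dc = d                            # ∂e/∂c = d = 2
de_dd = c                            # ∂e/∂d = c = 3
de_da = de_dc * 1.0                  # ∂c/∂a = 1
de_db = de_dc * 1.0 + de_dd * 1.0    # ∂c/∂b = ∂d/∂b = 1, so 2 + 3 = 5
print(de_da, de_db)                  # 2.0 5.0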

Page 34

Backpropagation

forward-mode differentiation lets us compute partial derivatives ∂x/∂b for all nodes x
→ still inefficient if you have many inputs

Christopher Olah http://colah.github.io/posts/2015-08-Backprop/

R. Sennrich MT – 2018 – 02 10 / 21

Page 35

Backpropagation

backward-mode differentiation lets us efficiently compute ∂e/∂x for all inputs x in one pass
→ also known as error backpropagation

Christopher Olah http://colah.github.io/posts/2015-08-Backprop/

R. Sennrich MT – 2018 – 02 10 / 21

Page 36

To summarize what we have learned

When approaching a machine learning problem, we need:

a suitable model; (here: a linear model)

a suitable cost (or loss) function; (here: mean square error)

an optimization algorithm; (here: a variant of SGD)

the gradient(s) of the cost function (if required by the optimization algorithm).

R. Sennrich MT – 2018 – 02 11 / 21

Page 42

What is a Neural Network?

A complex non-linear function which:
  is built from simpler units (neurons, nodes, gates, ...)
  maps vectors/matrices to vectors/matrices
  is parameterised by vectors/matrices

Why is this useful?
  very expressive
  can represent (e.g.) parameterised probability distributions
  evaluation and parameter estimation can be built up from components

relationship to linear regression:
  more complex architectures with hidden units (neither input nor output)
  neural networks typically use non-linear activation functions

R. Sennrich MT – 2018 – 02 12 / 21

Page 45

An Artificial Neuron

[Diagram: inputs x₁, x₂, x₃, ..., xₙ feed a unit computing g(w · x + b), which outputs y]

x is a vector input, y is a scalar output

w and b are the parameters (b is a bias term)

g is a (non-linear) activation function

R. Sennrich MT – 2018 – 02 13 / 21
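
A minimal numpy sketch of such a neuron (the weights, bias, and choice of activation are arbitrary illustrations):

import numpy as np

def neuron(x, w, b, g):
    # single artificial neuron: scalar output y = g(w · x + b)
    return g(np.dot(w, x) + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron(x, w, b=0.2, g=sigmoid))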

Page 46

Why Non-linearity?

Functions like XOR cannot be separated by a linear function

XOR truth table:

x₁  x₂  output
0   0   0
0   1   1
1   0   1
1   1   0

[Network diagram: inputs x₁ and x₂ feed hidden units A, B, C, which feed output unit D (= y); the edge weights shown are 1, 1, 1, −2, 1, 0.5, 0.5]

(neurons arranged in layers, and fire if input is ≥ 1)

R. Sennrich MT – 2018 – 02 14 / 21

Page 50

Activation functions

desirable:
  differentiable (for gradient-based training)
  monotonic (for better training stability)
  non-linear (for better expressivity)

[Plot of activation functions over x ∈ [−3, 3]: identity (linear), sigmoid, tanh, rectified linear unit (ReLU)]

R. Sennrich MT – 2018 – 02 15 / 21
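
Minimal numpy definitions of the four activations plotted on the slide (standard definitions, not code from the lecture):

import numpy as np

def identity(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # rectified linear unit
    return np.maximum(0.0, x)

x = np.linspace(-3.0, 3.0, 7)
for g in (identity, sigmoid, np.tanh, relu):
    print(g(x))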

Page 51

A Simple Neural Network: Maths

we can use linear algebra to formalize our neural network:

the network: [diagram of the XOR network from the previous slides, with inputs x₁, x₂, hidden units A, B, C, and output D]

w1 = ⎡ 1   0.5   0 ⎤
     ⎣ 0   0.5   1 ⎦

x = [x₁  x₂]        h1 = [A  B  C]

w2 = [1  −2  1]ᵀ    y = [D]

calculation of x ↦ y:

h1 = ϕ(x · w1)
y = ϕ(h1 · w2)

R. Sennrich MT – 2018 – 02 16 / 21

Page 52

A Simple Neural Network: Python Code

import numpy as np

# activation function: a unit "fires" (outputs 1) if its input is >= 1
def phi(x):
    return np.greater_equal(x, 1).astype(int)

# two-layer feed-forward network: h1 = phi(x · w1), y = phi(h1 · w2)
def nn(x, w1, w2):
    h1 = phi(np.dot(x, w1))
    y = phi(np.dot(h1, w2))
    return y

w1 = np.array([[1, 0.5, 0], [0, 0.5, 1]])
w2 = np.array([[1], [-2], [1]])
x = np.array([1, 0])
print(nn(x, w1, w2))

R. Sennrich MT – 2018 – 02 17 / 21
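
As a quick check (not part of the original slide), running the nn, w1, and w2 defined above on all four XOR inputs reproduces the truth table:

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, nn(np.array(x), w1, w2))   # prints [0], [1], [1], [0]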

Page 53

More Complex Architectures

Convolutional: [Figure from Kalchbrenner et al. (2014), their Figure 3: a Dynamic Convolutional Neural Network (DCNN) for a seven-word input sentence; word embeddings have size d = 4, the network has two wide convolutional layers with two feature maps each (filter widths 3 and 2), folding, and (dynamic) k-max pooling layers with k = 5 and k = 3, feeding a fully connected layer] [Kalchbrenner et al., 2014]

Recurrent: [Figure from Andrej Karpathy, http://karpathy.github.io/2015/05/21/rnn-effectiveness/]

R. Sennrich MT – 2018 – 02 18 / 21

Page 54

Practical Considerations

efficiency:
  GPU acceleration of BLAS operations
  perform SGD in mini-batches (a sketch follows below)

hyperparameters:
  number and size of layers
  minibatch size
  learning rate
  ...

initialisation of weight matrices

stopping criterion

regularization (dropout)

bias units (always-on input)

R. Sennrich MT – 2018 – 02 19 / 21
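
A minimal sketch of the mini-batch SGD loop mentioned above (the update function, batch size, and epoch count are placeholders; any per-batch parameter update can be plugged in):

import numpy as np

def minibatch_sgd(theta, X, Y, update_fn, batch_size=32, epochs=10):
    # one parameter update per slice of the shuffled training data
    m = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(m)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            theta = update_fn(theta, X[idx], Y[idx])
    return theta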

Page 55

Toolkits for Neural Networks

What does a toolkit provide?
  Multi-dimensional matrices (tensors)
  Automatic differentiation
  Efficient GPU routines for tensor operations

Torch http://torch.ch/

TensorFlow https://www.tensorflow.org/

Theano http://deeplearning.net/software/theano/

There are many more!

R. Sennrich MT – 2018 – 02 20 / 21

Page 56

Further Reading

required reading: Koehn (2017), chapter 13.2-3.

further reading on backpropagation: http://colah.github.io/posts/2015-08-Backprop/

R. Sennrich MT – 2018 – 02 21 / 21

Page 57

Slide Credits

some slides borrowed from:

Sennrich, Birch, and Junczys-Dowmunt (2016): Advances in Neural Machine Translation

Sennrich and Haddow (2017): Practical Neural Machine Translation

R. Sennrich MT – 2018 – 02 22 / 21

Page 58

Bibliography I

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

R. Sennrich MT – 2018 – 02 23 / 21

