Page 1

Introduction to Neural Networks

Fundamental ideas behind artificial neural networks

Haidar Khan
Bulent Yener

Page 2

Outline

Introduction
Machine learning framework
Neural networks:
 1. Simple linear models
 2. Nonlinear activations
 3. Gradient descent
Demos

Page 3

What are (artificial) neural networks?

A technique to estimate patterns from data (~1940s)

Also called "multi-layer perceptrons"

"Neural" refers to a very crude mimicry of how real biological neurons work: a large network of simple units produces a complex output.

Page 4

Why do we care about them?

Key ingredient in real AI
Useful for industry problems
Perform best on many important tasks
May yield insights into the biological brain

Page 5

General machine learning framework

Data: an $n \times m$ matrix $X$ whose rows are observations $\boldsymbol{x}_i$ ($1 \times m$)

Data labels: an $n \times 1$ vector $\boldsymbol{y}$

Assume there is some unknown function $f(\cdot)$ that generates the label $y_i$ given $\boldsymbol{x}_i$:

$$f(\boldsymbol{x}_i) = y_i$$

ML problem: estimate $f(\cdot)$ and use it to generate labels for new observations!

(The slide's figure shows the $n \times m$ data matrix $X$ next to the $n \times 1$ label vector $\boldsymbol{y}$.)

Page 6

Some examples…

Problem | Data | Data labels
119 images of cats and dogs (20 × 20 pixels) | 119 × 400 matrix of pixel data (we stretch each image into a long vector) | {Cat, Dog}
A 15-question political poll of 139 residents on recent state legislation | 139 × 15 matrix of answers (A-E) | Party affiliation: {Republican, Democrat, Independent}

Page 7

Recall: Linear regression

Assume the generating function $f(\cdot)$ is linear

Write label $y_i$ as a linear function of $\boldsymbol{x}_i$: $y_i = \boldsymbol{x}_i\boldsymbol{w}$

Matrix form: $\boldsymbol{y} = X\boldsymbol{w}$

What should the $m \times 1$ vector $\boldsymbol{w}$ be? This is the familiar least squares regression:

$$\boldsymbol{w} = (X^T X)^{-1} X^T \boldsymbol{y}$$

We will set up the simplest neural network and show that we arrive at this same solution!

Page 8

Declare a simple neural network

Recall that $\boldsymbol{x}$ is $1 \times m$

One artificial neural unit connects to each input $x_i$ with a weight $w_i$ and produces one output $z$:

$$z = \sum_{i}^{m} x_i w_i$$

(Figure: a single unit computing the weighted sum of its inputs; image from http://nikhilbuduma.com)
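In code, one unit is just a dot product. A minimal sketch (the names and values are illustrative):

```python
import numpy as np

def unit(x, w):
    """One artificial neural unit: the weighted sum of its inputs."""
    return x @ w   # equivalent to sum(x[i] * w[i] over all i)

x = np.array([0.5, -1.0, 2.0])   # one observation (1 x m)
w = np.array([0.1, 0.4, -0.2])   # one weight per input
z = unit(x, w)                   # single scalar output z
```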

Page 9

Set an objective to learn

We want the network outputs $z_i$ to match the labels $y_i$

Choose a loss function $E$ and optimize it with respect to the weights:

$$E = \frac{1}{2}\sum_{i}^{N} (z_i - y_i)^2 = \frac{1}{2}\sum_{i}^{N} (\boldsymbol{x}_i\boldsymbol{w} - y_i)^2$$

How do we minimize $E$ with respect to $\boldsymbol{w}$?
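The same loss in numpy, written over the whole data matrix at once (a sketch; variable names are illustrative):

```python
import numpy as np

def loss(w, X, y):
    """Squared-error loss: E = 1/2 * sum_i (x_i . w - y_i)^2."""
    residuals = X @ w - y           # z_i - y_i for every observation
    return 0.5 * np.sum(residuals ** 2)
```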

Page 10

Equivalence to least squares

Take the derivative and set it to zero:

$$\frac{dE}{d\boldsymbol{w}} = \sum_{i}^{N} (\boldsymbol{x}_i\boldsymbol{w} - y_i)\,\boldsymbol{x}_i^T$$

$$\sum_{i}^{N} \left(\boldsymbol{x}_i^T\boldsymbol{x}_i\boldsymbol{w} - \boldsymbol{x}_i^T y_i\right) = \boldsymbol{0}$$

Written in matrix form this becomes:

$$X^T X\boldsymbol{w} - X^T\boldsymbol{y} = \boldsymbol{0} \quad\Longrightarrow\quad \boldsymbol{w} = (X^T X)^{-1} X^T \boldsymbol{y}$$
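A sketch of this closed form in numpy (solving the normal equations directly rather than forming the inverse, a standard numerical substitution):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.01 * rng.normal(size=100)   # noisy linear labels

# Normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# At the solution the gradient vanishes: X^T X w - X^T y = 0
assert np.allclose(X.T @ X @ w - X.T @ y, 0, atol=1e-8)
```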

Page 11

Key idea: compose simple units

Where do we go from here? Use many of these simple units and compose them in layers

Function composition: $g(h(\cdot))$

Each layer learns a new representation of the data

3-layer network: $z_i = h_3(h_2(h_1(\boldsymbol{x}_i)))$

(Image from http://neuralnetworksanddeeplearning.com)
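Composition is easy to see in code. A sketch with three toy layers (the matrices are placeholders, not learned weights):

```python
import numpy as np

def compose(*layers):
    """Chain layers so data flows through h1, then h2, then h3."""
    def network(x):
        for h in layers:
            x = h(x)   # each layer maps the previous representation to a new one
        return x
    return network

rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))

net = compose(lambda x: W1 @ x, lambda x: W2 @ x, lambda x: W3 @ x)
z = net(rng.normal(size=3))   # z = h3(h2(h1(x)))
```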

Page 12

Drawback to only linear units

Recall our earlier assumption that $f(\cdot)$ is linear; this is a very restrictive assumption

Furthermore, composing strictly linear models is itself linear!

$$z_i = h_3(h_2(h_1(\boldsymbol{x}_i))) = W_3 W_2 W_1 \boldsymbol{x}_i = W_{123}\,\boldsymbol{x}_i$$

XOR problem (Minsky & Papert, 1969)
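A quick numerical check of the collapse (same toy layer shapes as above):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))
x = rng.normal(size=3)

W123 = W3 @ W2 @ W1   # three linear layers collapse into one matrix
assert np.allclose(W3 @ (W2 @ (W1 @ x)), W123 @ x)
```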

Page 13

XOR problem

A simple XOR gate can't be learned using only one straight line: its positive and negative examples are not linearly separable.

Page 14

Key idea: non-linear activations

Solution: add a non-linear function at the output of each layer

What kind of function? At a minimum, a differentiable one:

Hyperbolic tangent: $z = \tanh(\boldsymbol{w}^T\boldsymbol{x}_i)$

Sigmoid: $z = \dfrac{1}{1 + e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}$

Rectified linear: $z = \max(0, \boldsymbol{w}^T\boldsymbol{x}_i)$

Why? Labels $\boldsymbol{y}$ can be a non-linear function of the inputs (like XOR)
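All three activations are one-liners in numpy; a sketch:

```python
import numpy as np

def tanh(a):    return np.tanh(a)                 # squashes to (-1, 1)
def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))   # squashes to (0, 1)
def relu(a):    return np.maximum(0.0, a)         # zero for negative inputs

a = np.linspace(-3.0, 3.0, 7)   # pre-activations w^T x
print(tanh(a))
print(sigmoid(a))
print(relu(a))
```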

Page 15

Examples of non-linear activations

(Plots of the activation functions above; images from http://ufldl.stanford.edu)

Page 16

How do we learn weights now?

With multiple layers and non-linear activation functions, we can't simply take the derivative and set it to zero

We can still set a loss function and then either randomly try different weights or numerically estimate the derivative:

$$f'(x) \approx \frac{f(x + h) - f(x)}{h}$$

Both approaches are terribly inefficient and scale badly with the number of layers…
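A sketch of the numerical estimate, one weight at a time (the function name and step size are illustrative):

```python
import numpy as np

def numeric_grad(E, w, h=1e-6):
    """Estimate dE/dw_i with forward differences, one weight at a time."""
    grad = np.zeros_like(w)
    E0 = E(w)
    for i in range(w.size):      # one extra loss evaluation per weight...
        w_step = w.copy()
        w_step[i] += h
        grad[i] = (E(w_step) - E0) / h
    return grad                  # ...which is why this scales so badly
```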

Page 17

Key idea: gradient descent on loss function

Suppose we could calculate the partial derivative of $E$ with respect to each weight $w_i$: $\frac{\partial E}{\partial w_i}$ (the gradient)

Decrease the loss function $E$ by updating the weights against the gradient, scaled by a small step size $\eta$:

$$w_i \leftarrow w_i - \eta\,\frac{\partial E}{\partial w_i}$$

Repeating this process is called gradient descent

It leads to a set of weights that correspond to a local minimum of the loss function
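A minimal gradient-descent loop for the squared loss from earlier (learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)

w = np.zeros(5)
eta = 0.01                        # learning rate (step size)
for _ in range(500):
    grad = X.T @ (X @ w - y)      # dE/dw for E = 1/2 * sum_i (x_i . w - y_i)^2
    w -= eta * grad               # step against the gradient to decrease E
```

Because this loss is convex, the loop should settle at essentially the same $\boldsymbol{w}$ as the closed-form least squares solution.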

Page 18

Backpropagation to estimate gradients

One of the breakthroughs in neural network research: it allows us to calculate the gradients of the network!

The core idea behind the algorithm is repeated application of the chain rule of derivatives:

$$F(x) = f(g(x)) \quad\Longrightarrow\quad F'(x) = f'(g(x))\,g'(x)$$

Two passes through the network, forward and backward:

Forward: calculate $g(x)$ and then $f(g(x))$
Backward: calculate $f'(g(x))$ and then $g'(x)$
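The two passes for a single composition, as a sketch (the concrete choices of $f$ and $g$ are just for illustration):

```python
import numpy as np

def g(x):       return x ** 2          # inner function
def g_prime(x): return 2 * x
def f(u):       return np.sin(u)       # outer function
def f_prime(u): return np.cos(u)

x = 1.3
u = g(x)                         # forward pass: g(x), then...
F = f(u)                         # ...f(g(x))
dF_dx = f_prime(u) * g_prime(x)  # backward pass: F'(x) = f'(g(x)) * g'(x)
```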

Page 19

Multilayer Backpropagation

Assume we have $o_i$, $o_j$, and $z_j$ from the forward pass, where neuron $j$ receives $z_j = \sum_i w_{ij}\,o_i$ from the neurons $i$ below it and outputs $o_j = f(z_j)$. Work backward from the output of the network:

$$E = \frac{1}{2}\sum_{j\in\text{output}} (y_j - o_j)^2, \qquad \frac{\partial E}{\partial o_j} = -(y_j - o_j) \quad\text{(for output neurons)}$$

$$\frac{\partial E}{\partial z_j} = \frac{\partial o_j}{\partial z_j}\,\frac{\partial E}{\partial o_j} = f'(z_j)\,\frac{\partial E}{\partial o_j}$$

$$\frac{\partial E}{\partial o_i} = \sum_j \frac{\partial z_j}{\partial o_i}\,\frac{\partial E}{\partial z_j} = \sum_j w_{ij}\,f'(z_j)\,\frac{\partial E}{\partial o_j}$$

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}}\,\frac{\partial E}{\partial z_j} = o_i\,f'(z_j)\,\frac{\partial E}{\partial o_j}$$

(Image from http://nikhilbuduma.com)
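These equations are enough to train a network with one hidden layer. A compact sketch (sigmoid activations; the data, sizes, and learning rate are illustrative):

```python
import numpy as np

def f(z):       return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
def f_prime(z): return f(z) * (1.0 - f(z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                       # 8 observations, 3 inputs
y = rng.integers(0, 2, size=(8, 1)).astype(float)

W1 = 0.5 * rng.normal(size=(3, 4))                # input -> hidden weights w_ij
W2 = 0.5 * rng.normal(size=(4, 1))                # hidden -> output weights

eta = 0.5
for _ in range(1000):
    # Forward pass: keep the z's and o's for the backward pass
    z1 = X @ W1;  o1 = f(z1)
    z2 = o1 @ W2; o2 = f(z2)

    # Backward pass: the equations above, vectorized over observations
    dE_dz2 = -(y - o2) * f_prime(z2)              # output neurons
    dE_dz1 = (dE_dz2 @ W2.T) * f_prime(z1)        # sum_j w_ij * dE/dz_j, times f'(z_i)

    # dE/dw_ij = o_i * dE/dz_j, then a gradient descent step
    W2 -= eta * o1.T @ dE_dz2
    W1 -= eta * X.T @ dE_dz1
```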

Page 20

Putting all the pieces together

Three key elements to understanding neural networks:
Composition of units with simple operations (dot products)
Non-linear activation functions at unit outputs
Weights learned using gradient descent

Using neural networks:
Set up the data matrix and label vector: $X$ and $\boldsymbol{y}$
Define a network architecture: number of layers, units per layer
Choose a loss function to minimize: it depends on the task
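Putting the three elements together on the XOR problem from earlier (a small end-to-end sketch; the architecture and hyperparameters are illustrative, and a tanh hidden layer with a linear output is one choice among many):

```python
import numpy as np

# XOR: the dataset a single straight line cannot separate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1 = rng.normal(size=(2, 4))            # input -> hidden
W2 = rng.normal(size=(4, 1))            # hidden -> output

eta = 0.1
for _ in range(5000):
    o1 = np.tanh(X @ W1)                       # hidden layer, non-linear activation
    o2 = o1 @ W2                               # linear output layer
    dE_dz2 = o2 - y                            # squared-loss gradient at the output
    dE_dz1 = (dE_dz2 @ W2.T) * (1 - o1 ** 2)   # tanh'(z) = 1 - tanh(z)^2
    W2 -= eta * o1.T @ dE_dz2
    W1 -= eta * X.T @ dE_dz1

print(o2.round(2))   # should approach [[0], [1], [1], [0]]
```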

Page 21

A couple of demos…

Page 22

Credits

Images from:
http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/
http://ufldl.stanford.edu
http://neuralnetworksanddeeplearning.com

