Introduction to Deep Learning
Standard feed-forward neural network with 3 hidden layers
A convolutional neural network (AlexNet, Krizhevsky et al. '12)
Example applications
● Image classification (AlexNet, Krizhevsky et al. '12)
● Generating image descriptions (Karpathy et al. '15)
● Translation (Wu et al. '15)
● Face generation (Berthelot et al. '17)
Very Brief History
● 90s and earlier: Neural networks
● Until 2012: Support Vector Machines, kernel methods, …
● 2012 (AlexNet) onwards: Deep Neural Networks
Seminar plan
Meeting #   Speaker        Topic
1           David Belius   Intro to Machine Learning
2           Marko Thiel    Intro to Artificial Neural Networks
3           tba            DL Basics: Regularization
4           tba            DL Basics: Optimization
5           tba            DL Basics: Convolutional neural networks
6           tba            DL Basics: Recurrent Neural Networks
7-11        tba            Advanced topics
Introduction to ML
ML: Learn to generalize from data
● MNIST: 60k handwritten digits, 28x28 grayscale pixels
  – x: image, y: digit (e.g. 0, 8, 3, 7)
● CIFAR100: 50k images of objects from 100 classes, 32x32 RGB pixels
  – x: image, y: class (e.g. train, sunflower, elephant, cow)
● Europarl EN-DE: 1.7m sentence pairs, e.g.
  – "Frau Präsidentin, können Sie mir sagen, warum sich dieses Parlament nicht..." / "Madam President, can you tell me why this Parliament does not..."
  – "Deswegen können wir nicht eindeutig ja sagen." / "It is why we cannot say a clear yes."
● Labeled data: data points (x, y)
  – x ∈ X, the space of "inputs" (e.g. the image of a handwritten digit)
  – y ∈ Y, the space of labels/"outputs" (e.g. the digit 0-9)
● Unlabeled data: data points x ∈ X, the space of data
● Supervised learning: learn from labeled data
● Unsupervised learning: learn from unlabeled data, e.g.
  – Clustering
  – Dimensionality reduction
ML: Learn to generalize from data
● Probabilistic model of data: the data set (x_1, y_1), …, (x_n, y_n) consists of iid samples from an unknown probability distribution P on X × Y
● Goal of learning: get information about P
Basic ML tasks: Supervised learning
● Classification, regression
  – Predict y from x, i.e. learn the conditional distribution of Y given X = x
  – Often assume Y is a deterministic function of X
  – Then have a "truth" f : X → Y with Y = f(X)
  – Seek an estimate f̂ ≈ f
  – Classification
    ● Y is a finite set of labels, e.g. Y = {0, 1, …, 9}
    ● Want f̂(x) = f(x) for most x
  – Regression
    ● Y = ℝ (or ℝ^d)
    ● E.g.: predict a price or a temperature
    ● Want |f̂(x) − f(x)| small for most x
Basic ML tasks: Unsupervised learning
● Density estimation
  – p = probability density of P on X
  – Seek an estimate p̂ ≈ p
  – E.g. outlier detection: flag x where p̂(x) is small
● Sampling/synthesis
  – Learn how to simulate a sample from a probability law that approximates P
  – E.g.: learn to generate an image of a realistic-looking human face
Evaluating performance
● Classification
  – True error for fixed f̂: P(f̂(X) ≠ Y)
  – The true error is unknown
  – If (x̃_1, ỹ_1), …, (x̃_m, ỹ_m) are iid samples from P, then
    (1/m) Σ_{i=1}^m 1{f̂(x̃_i) ≠ ỹ_i}
    is an unbiased estimator for the true error
  – Warning: only true if the (x̃_i, ỹ_i) were not used to construct f̂!
● Regression
  – True Mean Squared Error (MSE) for fixed f̂: E[(f̂(X) − Y)²]
  – Not known, but an unbiased estimate is
    (1/m) Σ_{i=1}^m (f̂(x̃_i) − ỹ_i)²
    if the (x̃_i, ỹ_i) were not used to construct f̂!
● Split the data set into
  – Training set (~80%)
  – Test set (~20%)
● Construct f̂ using the training set
● Evaluate performance using the test set
Train data and test data
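The split-and-evaluate protocol can be sketched in numpy; the data set and the classifier `f_hat` below are hypothetical, chosen only to illustrate the unbiased test-error estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled data: x in R^2, y in {0, 1}.
n = 1000
x = rng.normal(size=(n, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

# Split into ~80% training data and ~20% test data.
perm = rng.permutation(n)
x_train, y_train = x[perm[:800]], y[perm[:800]]
x_test, y_test = x[perm[800:]], y[perm[800:]]

# A fixed classifier f_hat (here a hand-picked rule; in practice it would be
# constructed from the training set only, never from the test set).
def f_hat(x):
    return (x[:, 0] > 0).astype(int)

# Unbiased estimate of the true error P(f_hat(X) != Y):
# the fraction of misclassified test points.
test_error = np.mean(f_hat(x_test) != y_test)
print(test_error)
```

Because `f_hat` was not constructed from the test points, the estimate is unbiased for the true error of this classifier.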
How to learn
● Non-parametric algorithms
  – k-nearest neighbours classification, decision trees, k-means clustering
● Parametric algorithms ("fitting")
  – Hypothesis set H = {f_θ : θ ∈ ℝ^p} of potential estimates, parametrized by some number p of real parameters
  – Error of f_θ on the training set (regression):
    L(θ) = (1/n) Σ_{i=1}^n (f_θ(x_i) − y_i)²
  – Learning: find θ̂ with small error on the training set and set f̂ = f_θ̂
● Crucial to restrict the class H somehow: otherwise, e.g., any function that interpolates the training data has zero error on the training set
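A minimal sketch of a non-parametric method, k-nearest-neighbour classification, on hypothetical blob data (the blob means and the choice k=5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: two Gaussian blobs labeled 0 and 1.
x_train = np.concatenate([rng.normal(-2.0, 1.0, size=(50, 2)),
                          rng.normal(+2.0, 1.0, size=(50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

def knn_predict(x_query, x_train, y_train, k=5):
    """Predict by majority vote among the k training points
    closest to each query point (Euclidean distance)."""
    preds = []
    for q in x_query:
        dists = np.linalg.norm(x_train - q, axis=1)
        nearest = np.argsort(dists)[:k]
        preds.append(np.bincount(y_train[nearest]).argmax())
    return np.array(preds)

preds = knn_predict(np.array([[-2.0, -2.0], [2.0, 2.0]]), x_train, y_train)
print(preds)
```

Note that nothing is fitted: the "model" is the training set itself, which is what makes the method non-parametric.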
Example: Linear regression
● X = ℝ, Y = ℝ
● Hypothesis set: f_{a,b}(x) = ax + b
● Find a, b that minimize
  L(a, b) = (1/n) Σ_{i=1}^n (a x_i + b − y_i)²
● Line of best fit
● Recall: there is a closed-form formula for the optimal a, b (least squares, normal equations)
● But also: L(a, b) is a smooth function of (a, b) → loss function
● Furthermore, L is convex
  – Has a unique global minimum
  – which can be found by numerical optimization: gradient descent
● Gradient descent
  – θ_0 arbitrary (random)
  – η small (step size/learning rate)
  – θ_{t+1} = θ_t − η ∇L(θ_t)
● "Always" finds the global minimum of a smooth, convex loss function
● But typically not for a non-convex function
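Gradient descent on the linear-regression loss can be sketched as follows and compared against the closed-form least-squares solution (the data, step size, and iteration count are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: roughly on the line y = 2x + 1, plus noise.
n = 100
x = rng.uniform(-1, 1, size=n)
y = 2 * x + 1 + 0.1 * rng.normal(size=n)

# Gradient of the loss L(a, b) = (1/n) sum_i (a*x_i + b - y_i)^2.
def grad(a, b):
    r = a * x + b - y                  # residuals
    return 2 * np.mean(r * x), 2 * np.mean(r)

a, b = 0.0, 0.0                        # theta_0: arbitrary starting point
eta = 0.1                              # small step size / learning rate
for _ in range(2000):                  # theta_{t+1} = theta_t - eta*grad L(theta_t)
    ga, gb = grad(a, b)
    a, b = a - eta * ga, b - eta * gb

# Closed-form solution via the normal equations (least squares).
A = np.column_stack([x, np.ones(n)])
a_ls, b_ls = np.linalg.lstsq(A, y, rcond=None)[0]
print(a, b, a_ls, b_ls)
```

Because the loss is smooth and convex, gradient descent converges to the same global minimum that the normal equations give in closed form.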
General recipe of parametric ML
1. Define a hypothesis set H = {f_θ}
2. Define a smooth loss function L(θ)
3. Numerically minimize L(θ) to find the estimate θ̂
● Traditional ML: make sure L is convex, to have guarantees for the numerical minimization
● Deep Learning/Neural Networks:
  – L is highly non-convex
  – Somehow, it still works: gradient descent finds "good" minima
Example: Linear regression
● Can fit data that is basically linear
● Can't fit other relationships
● Solution: make the hypothesis set richer!
Example: Polynomial regression
● X = ℝ, Y = ℝ
● Hypothesis set: f_θ(x) = θ_0 + θ_1 x + … + θ_d x^d
● The loss
  L(θ) = (1/n) Σ_{i=1}^n (f_θ(x_i) − y_i)²
  is smooth and convex.
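A sketch of polynomial regression with numpy's least-squares fit; the noisy samples of sin(3x) and the degrees tried are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: noisy samples of a smooth nonlinear function.
n = 30
x = np.sort(rng.uniform(-1, 1, size=n))
y = np.sin(3 * x) + 0.1 * rng.normal(size=n)

# np.polyfit minimizes the squared loss over polynomials of degree d.
mses = {}
for d in [1, 5, 15]:
    theta = np.polyfit(x, y, deg=d)
    mses[d] = np.mean((np.polyval(theta, x) - y) ** 2)
    print(d, mses[d])
```

The training error shrinks as the degree (the capacity) grows: degree 1 cannot fit the curve at all, while high degrees drive the training error down toward the noise floor.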
Capacity, overfitting, underfitting
● Capacity: the "richness" of the hypothesis set H
  – Mathematical definitions exist (e.g. VC-dimension), but capacity is often used as an intuitive notion
● Polynomial regression: capacity grows with the degree d
● Too little capacity: can't fit the train data → underfitting
● Too much capacity: generalizes badly → overfitting

[Figure: polynomial regression, underfitting/overfitting. Credit: Francois Fleuret, EPFL]
● Underfitting: train error large, test error large
● Overfitting: train error small, test error large
● Trade-off: must find the appropriate level of capacity for the data distribution

[Figure: train and test error vs. capacity; the train error decreases with capacity while the test error is U-shaped, with underfitting at low capacity, overfitting at high capacity, and the best compromise in between]
● Traditional ML: low to moderate capacity
● Deep Learning: enormous capacity
  – Millions of parameters (>> # training examples)
  – Still doesn't overfit. Why?
Model selection (Hyperparameter selection)
● If the test set is used to evaluate the performance of different hypothesis sets (different models):
  – Test error is no longer an unbiased estimator of the true error!

[Figure: "good", "still good", and "bad" workflows for using the train and test data. Credit: Francois Fleuret, EPFL]
● Solution: further split the train data into
  – Training set (~80%)
  – Validation set (~20%)
● Pick the algorithm that evaluates best on the validation set
● Report performance on the test set

(Credit: Francois Fleuret, EPFL)
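The training/validation/test workflow can be sketched as follows, using the polynomial degree as the hyperparameter to select (the data and split sizes are hypothetical, following the ~80/20 convention above):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: noisy samples of sin(3x).
n = 200
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.2 * rng.normal(size=n)

# Split indices into training set, validation set, and test set.
perm = rng.permutation(n)
tr, va, te = perm[:128], perm[128:160], perm[160:]

# Model selection: pick the degree with the best validation error.
best_degree, best_val_mse = None, np.inf
for d in range(1, 11):
    theta = np.polyfit(x[tr], y[tr], deg=d)        # fit on training set only
    val_mse = np.mean((np.polyval(theta, x[va]) - y[va]) ** 2)
    if val_mse < best_val_mse:
        best_degree, best_val_mse = d, val_mse

# Report performance of the selected model on the untouched test set.
theta = np.polyfit(x[tr], y[tr], deg=best_degree)
test_mse = np.mean((np.polyval(theta, x[te]) - y[te]) ** 2)
print(best_degree, test_mse)
```

The test set is touched exactly once, so the reported test MSE remains an unbiased estimate of the true error of the selected model.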
Loss for categorical data
● True error for classification: P(f̂(X) ≠ Y)
● The empirical training error
  (1/n) Σ_{i=1}^n 1{f_θ(x_i) ≠ y_i}
  is not a smooth function of θ! It can't be used as a loss for gradient-based numerical optimization.
● Solution: formulate the task as really predicting the conditional distribution of Y given X = x
● Specify a probability distribution on Y = {1, …, K} as a vector of probabilities p = (p_1, …, p_K), with p_k ≥ 0 and Σ_k p_k = 1
● The true conditional distribution is p*(x) = (P(Y = k | X = x))_{k=1,…,K}
● Seek an estimate p̂(x) ≈ p*(x)
● To quantify the error made in the prediction:
  – Use relative entropy/Kullback-Leibler divergence as a distance between probability measures
  – Recall: D_KL(p ‖ q) = Σ_k p_k log(p_k / q_k)
● Error made in the prediction for fixed x: D_KL(p*(x) ‖ p̂(x))
● Unknown true error: E[D_KL(p*(X) ‖ p̂(X))]
● The true error E[D_KL(p*(X) ‖ p̂(X))] is unknown
● Up to an additive constant that does not depend on p̂, it equals E[−log p̂_Y(X)], with unbiased estimator
  −(1/m) Σ_{i=1}^m log p̂_{ỹ_i}(x̃_i)
● Concretely, writing e_{ỹ_i} for the one-hot encoding of ỹ_i, the estimator equals
  −(1/m) Σ_{i=1}^m Σ_k (e_{ỹ_i})_k log p̂_k(x̃_i)
● The training loss function
  L(θ) = −(1/n) Σ_{i=1}^n log p̂_{y_i}(x_i; θ)
  is smooth! → Can use numerical optimization
● Remarks:
  – The MSE loss can be justified in similar terms, as predicting a Gaussian distribution
  – Not all loss functions are derived in such a principled way
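The KL divergence and the resulting cross-entropy training loss can be sketched in numpy; the probability vectors and labels below are hypothetical.

```python
import numpy as np

# Relative entropy / KL divergence D(p || q) = sum_k p_k log(p_k / q_k).
def kl(p, q):
    return np.sum(p * np.log(p / q))

# Cross-entropy training loss: L = -(1/n) sum_i log p_hat[i, y_i],
# a smooth function of the predicted probabilities.
def cross_entropy(p_hat, y):
    return -np.mean(np.log(p_hat[np.arange(len(y)), y]))

# Hypothetical predicted class probabilities for 3 examples, 3 classes.
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])      # true labels

loss = cross_entropy(p_hat, y)
print(kl(np.array([0.5, 0.5]), np.array([0.9, 0.1])), loss)
```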
Example: Logistic regression
● Hypothesis set: p̂(x) = softmax(Wx + b), where softmax(z)_k = e^{z_k} / Σ_j e^{z_j}
● The loss
  L(W, b) = −(1/n) Σ_{i=1}^n log softmax(Wx_i + b)_{y_i}
  is convex in W, b! → Can find the global minimum with gradient descent
● To predict one class: output argmax_k p̂_k(x)
  – Equivalently: output the k with the largest (Wx + b)_k
● Logistic regression can fit linearly separable data

[Figure: decision boundaries; logistic regression can fit a linearly separable data set but can't fit a non-separable one]
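A minimal training loop for logistic regression on hypothetical linearly separable blob data; the blob parameters, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical linearly separable data: two blobs, labels 0 and 1.
n = 200
x = np.concatenate([rng.normal(-2.0, 1.0, size=(n // 2, 2)),
                    rng.normal(+2.0, 1.0, size=(n // 2, 2))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# p_hat(x) = softmax(W x + b); the cross-entropy loss is convex in (W, b),
# so plain gradient descent heads for the global minimum.
W = np.zeros((2, 2))
b = np.zeros(2)
eta = 0.1
for _ in range(500):
    p = softmax(x @ W.T + b)               # predicted probabilities, shape (n, 2)
    g = p.copy()                           # gradient of mean cross-entropy wrt logits:
    g[np.arange(n), y] -= 1                # (p - one_hot(y)) / n
    g /= n
    W -= eta * (g.T @ x)
    b -= eta * g.sum(axis=0)

# Predict the class with the largest probability (equivalently, largest logit).
pred = np.argmax(softmax(x @ W.T + b), axis=1)
acc = np.mean(pred == y)
print(acc)                                 # training accuracy
```

On separable data like this, the learned linear decision boundary classifies essentially every training point correctly.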
● If the data is not linearly separable, one can look for a transformation φ that makes it more so:
  – Train on the data (φ(x_i), y_i)
  – φ(x) = a representation of x
● Traditional ML: construct the representation by hand
● Deep learning: the algorithm finds a good representation during training
● MNIST: 60k handwritten digits, 28x28 grayscale pixels; x = image, y = digit
● Logistic regression on MNIST: test error rate ~7%
Classification: overfitting/underfitting

[Figure: classifier decision boundaries illustrating a good fit, overfitting, and underfitting. Credit: Wikipedia]
Data encoding
● MNIST: x = vector of 28x28 grayscale pixel values, y = digit 0-9 (one-hot)
● CIFAR100: x = vector of 32x32 RGB pixel values, y = one of 100 classes (one-hot)
● Europarl EN-DE: sentences encoded as sequences of one-hot word vectors
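The one-hot encoding of categorical labels can be sketched as follows (the helper name is hypothetical):

```python
import numpy as np

# One-hot encoding: represent a label y in {0, ..., K-1} as the K-vector
# with a 1 in position y and 0 elsewhere.
def one_hot(y, num_classes):
    out = np.zeros((len(y), num_classes))
    out[np.arange(len(y)), y] = 1.0
    return out

# E.g. MNIST digit labels (K = 10):
Y = one_hot(np.array([0, 8, 3, 7]), 10)
print(Y)
```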
Feature engineering / Representation engineering
● Traditional ML: use hand-engineered features as inputs to the algorithms
● Deep Learning: feed the algorithm raw data (pixels, character-level text, …)

Standard data sets: used as benchmarks

[Figure: performance over time on MNIST and on CIFAR10. Credits: https://srconstantin.wordpress.com/ and Francois Fleuret, EPFL]
Collective overfitting of the test set by the ML community
● Recall: the test error is unbiased only if the test set was not used to construct f̂
● For a heavily used data set, the community as a whole has effectively used the test set many times
● Need new data sets to appear periodically

(Credit: Francois Fleuret, EPFL)
Bias-variance tradeoff
● Related to underfitting/overfitting
● Fix one x_0
● The fit f̂(x_0) is a random variable (it depends on the train data)
● Decompose the true MSE error at x_0:
  E[(f̂(x_0) − Y)²] = Var(f̂(x_0)) + (E[f̂(x_0)] − f(x_0))² + Var(Y | X = x_0)
                   = variance + bias² + irreducible error
● Small variance, high bias: underfit
● Large variance, low bias: overfit
● Small bias and small variance is hard → tradeoff
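The decomposition can be illustrated by simulation: repeatedly refit polynomials of low and high degree on freshly drawn training labels and record the fit at a fixed point x_0 (the true function, noise level, and degrees are hypothetical choices). Degree 1 underfits (large bias², small variance); degree 10 has tiny bias but larger variance.

```python
import numpy as np

rng = np.random.default_rng(6)

def f_true(x):
    return np.sin(3 * x)                 # hypothetical "truth" f

# Fix one x0 and a design x; redrawing the training noise makes the
# fit f_hat(x0) a random variable.
n, sigma, x0, trials = 30, 0.1, 0.5, 500
x = rng.uniform(-1, 1, size=n)

variances, biases_sq = {}, {}
for d in [1, 10]:
    fits = []
    for _ in range(trials):
        y = f_true(x) + sigma * rng.normal(size=n)   # fresh training labels
        theta = np.polyfit(x, y, deg=d)
        fits.append(np.polyval(theta, x0))
    fits = np.array(fits)
    variances[d] = fits.var()                        # Var(f_hat(x0))
    biases_sq[d] = (fits.mean() - f_true(x0)) ** 2   # (E f_hat(x0) - f(x0))^2
    print(d, variances[d], biases_sq[d])
```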
Bias-variance tradeoff

[Figure: bias, variance, and test error vs. capacity; bias falls and variance grows with capacity, with underfitting at low capacity and overfitting at high capacity. Credit: Francois Fleuret, EPFL]

● Test error = variance + bias² (+ irreducible error)
Maximum likelihood interpretation
● Example: logistic regression, p̂(y | x; W, b) = softmax(Wx + b)_y
● The true conditional distribution is P(Y = y | X = x)
● Seek an estimate of it
● Likelihood of the train y's given the train x's: Π_{i=1}^n p̂(y_i | x_i; W, b)
● The loss is the negative log-likelihood:
  L(W, b) = −(1/n) Σ_{i=1}^n log p̂(y_i | x_i; W, b)
Bayesian interpretation
● Consider the model parameters θ as random, with a prior distribution p(θ)
● Bayes' rule gives a posterior distribution on the parameters conditioned on the data:
  p(θ | data) ∝ p(data | θ) p(θ)
Deep Learning
● Parametric ML with the hypothesis class given by neural networks f_θ
General references
● Goodfellow, Bengio, Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org
● EPFL course slides and videos
  – Prof. Francois Fleuret
  – https://documents.epfl.ch/users/f/fl/fleuret/www/dlc/
Organisation
Meeting     Speaker          Topic
Today       David Belius     Intro to Machine Learning
Next week   Marko Thiel      Intro to Artificial Neural Networks
March 8     Master Student   Regularization (Bengio Ch 7)
March 15    Master Student   Optimization (Bengio Ch 8)
March 22    Master Student   Convolutional neural networks (Bengio Ch 9)
April 12    Master Student   Recurrent Neural Networks (Bengio Ch 10)
April 26+   PhD Students     Advanced topics
● First four student talks
  – Master student speakers: e-mail me any preferences
  – Set up a preliminary meeting with Marko and me
● Optional practical sessions (programming)
  – E-mail me if interested
● No meeting April 19