Getting Started with Neural Networks and
Fundamentals of Machine Learning
May 1, 2019
http://cross-entropy.net/ML410/Deep_Learning_3.pdf
Agenda for Tonight
• Homework Review
• [DLP] Chapter 2: Getting Started with Neural Networks
• [DLP] Chapter 3: Fundamentals of Machine Learning
Getting Started with Neural Networks
1. Anatomy of a Neural Network
2. Introduction to Keras
3. Setting Up a Deep Learning Workstation
4. Classifying Movie Reviews: a Binary Classification Example
5. Classifying Newswires: a Multiclass Classification Example
6. Predicting House Prices: a Regression Example
7. Chapter Summary
Neural Networks
Training involves …
Anatomy of a Neural Network
Relationship between the Network, Layers, Loss Function, and Optimizer
Anatomy of a Neural Network
Layers as the Lego Bricks of Deep Learning ☺
The first layer requires an input_shape parameter [the dimensions of a single observation]; subsequent layers infer their input shapes from the layer before them
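A minimal sketch (the layer sizes are hypothetical): only the first layer declares the shape of one observation.

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(32, activation='relu', input_shape=(784,)))  # input_shape required here
model.add(layers.Dense(10, activation='softmax'))                   # input shape inferred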
Anatomy of a Neural Network
Networks of Layers
• A deep learning model is a directed, acyclic graph of layers
• Most common instance is a “linear” (simple, sequential) stack of layers
• Other common instances include …
  • Two-branch networks; e.g., a question goes down one branch and a text passage goes down another
  • Multi-head networks; e.g., we have both an image and a text description as inputs
  • Inception blocks; e.g., we want to use a few different convolution (input filtering) approaches in parallel
Anatomy of a Neural Network
Loss Function and Optimizers
• Loss functions for this class include: cross entropy, mean squared error, mean absolute error, content loss, style loss, total variation loss, Kullback-Leibler divergence loss, temporal difference loss, actor loss, and critic loss
• Optimization functions for the class include: Stochastic Gradient Descent (SGD), Root Mean Squared (Gradient) Propagation (RMSProp), and Adaptive Moments (AdaM: RMSProp + Momentum)
Anatomy of a Neural Network
Keras Features
Introduction to Keras
Google Search Interest for Deep Learning Frameworks
Introduction to Keras
Deep Learning Software and Hardware Stack
• Nvidia Graphics Processing Units (GPUs) and Google Tensor Processing Units (TPUs) support efficient deep learning
• Nvidia’s Common Unified Device Architecture (CUDA) Application Programming Interface (API) and the CUDA Deep Neural Network library (cuDNN) provide an interface to Nvidia GPUs
• Eigen library implements the Basic Linear Algebra Subprograms (BLAS) specification, allowing tensor manipulation on Central Processing Units (CPUs)
Theano and CNTK are no longer maintained
Introduction to Keras
Typical Keras Workflow
Introduction to Keras
Network Definition: Sequential Model versus the Functional API
Same model with both methods …
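A sketch of the same two-layer model defined both ways (hypothetical layer sizes):

from keras import models, layers
from keras.models import Model
from keras.layers import Input, Dense

# Sequential model: a linear stack of layers
seq_model = models.Sequential()
seq_model.add(layers.Dense(32, activation='relu', input_shape=(784,)))
seq_model.add(layers.Dense(10, activation='softmax'))

# Functional API: layers are called on tensors, enabling arbitrary graphs of layers
input_tensor = Input(shape=(784,))
x = Dense(32, activation='relu')(input_tensor)
output_tensor = Dense(10, activation='softmax')(x)
func_model = Model(inputs=input_tensor, outputs=output_tensor)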
Introduction to Keras
Model Configuration and Training
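Compilation binds the optimizer, loss, and metrics to the model; fit then trains it. A sketch continuing the functional-API model above (x_train and y_train are hypothetical NumPy arrays):

from keras import optimizers

func_model.compile(optimizer=optimizers.RMSprop(lr=0.001),
                   loss='mse',
                   metrics=['accuracy'])
func_model.fit(x_train, y_train, batch_size=128, epochs=10)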
Introduction to Keras
Two Options for Getting Keras Running
Setting Up a Deep Learning Workstation
Loading the Internet Movie DataBase (IMDB) Sentiment Analysis Data
num_words is the size of the vocabulary
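Along the lines of the book's listing:

from keras.datasets import imdb

# num_words=10000 keeps only the 10,000 most frequent words in the vocabulary
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)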
Classifying Movie Reviews
Decoding a Document
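Along the lines of the book's listing; the indices are offset by 3 because 0, 1, and 2 are reserved for “padding”, “start of sequence”, and “unknown”:

word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])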
Classifying Movie Reviews
Turning Lists of Integers into Tensors
Note: if more than one value in a row is one, we should refer to this as a multi-hot encoding
• one-hot encoding for identifying a class in a dense target vector
• multi-hot encoding to identify the tokens present in a document in a dense input vector
Classifying Movie Reviews
Encoding the Integer Sequences into a Binary Matrix
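Along the lines of the book's listing: each document becomes a 10,000-dimensional multi-hot row vector.

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set every token index present in the document to 1
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')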
Classifying Movie Reviews
Architecture Decisions for a Simple Feedforward Network
• How many layers to use
• How many hidden units to choose for each layer
• Which activation functions to use
  • Do *not* forget to include activation functions: unexplained suboptimality will ensue
Classifying Movie Reviews
Common Activation Functions
Rectified Linear Unit (ReLU): max(0,x)
[no saturation issue]
Sigmoid: 1/(1+exp(-x))
[usually used for output layer]
Classifying Movie Reviews
IMDB Network Architecture
• Features flow from bottom to top
  • An output is called a “head”
• Two hidden layers and an output layer with weights
  • It’s a deep neural network
  • We’ll get to more than one hundred layers soon enough
Classifying Movie Reviews
Model Definition
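The book's model: two 16-unit relu hidden layers, and a sigmoid output for the binary decision.

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))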
Classifying Movie Reviews
Parameters and Outputs for a Dense Layer
• Parameters (worked example below)
  • (Number of Inputs from Previous Layer + 1) * (Number of “Units”)
  • + 1 for bias weights: one for each “unit”
  • We used to refer to “units” as neurons
    • The names have been changed to protect the innocent? Our approach was inspired by neuroscience, but our brains aren’t using RMSProp ☺
  • These are the same weight vectors we’ve come to know and love: projecting inputs to a new representation, one feature at a time [the number of “units” is the number of new features for the new representation]
• Output Shape
  • (Batch Size) x (Number of “Units”)
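For example, the first hidden layer of the IMDB network maps 10,000 inputs to 16 units: (10,000 + 1) * 16 = 160,016 parameters, with output shape (batch_size, 16).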
Classifying Movie Reviews
Why Are Activation Functions Necessary?
Without activation functions, a stack of Dense layers collapses into a single affine transformation, so the network can only learn linear relationships. Try omitting activation functions from 1) the output layer and 2) the hidden layers … so you can recognize this issue later
Classifying Movie Reviews
Compiling the Model
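A sigmoid output for binary classification pairs naturally with binary cross entropy:

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])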
Classifying Movie Reviews
Setting Aside a Validation Set
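Along the lines of the book's listing, the first 10,000 training examples are held out for validation:

x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]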
Classifying Movie Reviews
Training the Model
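The History object returned by fit records per-epoch training and validation metrics:

history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val))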
Classifying Movie Reviews
Plotting the Training and Validation Loss
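Along the lines of the book's listing:

import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')       # dots
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')  # solid line
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()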
Classifying Movie Reviews
Where Do We Start Overfitting?
Classifying Movie Reviews
Plotting the Training and Validation Accuracy
Classifying Movie Reviews
Where Do We Start Overfitting?
Classifying Movie Reviews
Retraining the Model from Scratch
Why are we “retraining the model from scratch”?
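Because the validation curves show overfitting setting in after about the fourth epoch, and the first model's weights have already been trained past that point. So we rebuild with freshly initialized weights, train on all the training data for just 4 epochs, and evaluate once on the test set:

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)  # [test loss, test accuracy]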
Classifying Movie Reviews
Generating Predictions on New Data
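A one-liner: predict returns the sigmoid output, one probability in [0, 1] per review.

model.predict(x_test)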
Classifying Movie Reviews
Ideas for Experiments
Classifying Movie Reviews
Wrapping Up the IMDB Example
Classifying Movie Reviews
Loading the Reuters Dataset
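Along the lines of the book's listing:

from keras.datasets import reuters

# 46 mutually exclusive topics; num_words again caps the vocabulary at 10,000
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)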
Classifying Newswires
Decoding Newswires Back to Text
Classifying Newswires
Preparing the Document Matrices
Classifying Newswires
Preparing the Target Matrices
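One-hot encoding the 46 topic labels, along the lines of the book's listing:

from keras.utils.np_utils import to_categorical

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)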
Classifying Newswires
Defining the Model
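The book's model: wider hidden layers (64 units) because the output space is larger, and a 46-way softmax output.

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])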
Classifying Newswires
Notes About the Architecture
Classifying Newswires
Validating the Approach
Classifying Newswires
Where Do We Start Overfitting?
Classifying Newswires
Where Do We Start Overfitting?
Classifying Newswires
Retraining a Model from Scratch
Classifying Newswires
Why are we “retraining a model from scratch”?
Comparing to Random [and a Majority Classifier]
Nota bene (note well): 813 of the 2,246 test examples belonged to class 3
The accuracy of a majority classifier is therefore 813 / 2,246 ≈ 36.2%
Classifying Newswires
Generating Predictions for New Data
Classifying Newswires
Dense Versus Sparse Labels
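Only the loss changes; the two losses are mathematically identical and differ only in the label interface:

# dense targets: one-hot matrices (the to_categorical output)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# sparse targets: the original integer labels
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])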
Classifying Newswires
Model With an Information Bottleneck
71% accuracy: an 8% absolute drop
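The bottleneck in question: a hidden layer with far fewer units than the 46 output classes, so topic-relevant information gets squeezed out of the representation. A sketch along the lines of the book's experiment:

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))  # 4-dimensional bottleneck
model.add(layers.Dense(46, activation='softmax'))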
Classifying Newswires
Further Experiments
Classifying Newswires
Wrapping Up
Classifying Newswires
Loading the Boston Housing Dataset
1970s home prices in thousands of dollars
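Along the lines of the book's listing:

from keras.datasets import boston_housing

# 404 training and 102 test examples, each with 13 numerical features;
# targets are median home prices in thousands of dollars
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()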
Predicting House Prices
Brief Discussion of Bias
• The Boston Housing dataset has been used by many popular textbooks
• The data explicitly offers a race-related variable for modeling
  • Avoid using proxy variables that lead to discrimination based on race, gender, religion, etc.
  • Example: don’t ask about gender if all you really want to know is whether the candidate can lift X pounds
Predicting House Prices
http://lib.stat.cmu.edu/datasets/boston
Normalizing the Data
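Center-and-scale each feature, using statistics computed on the *training* data only:

mean = train_data.mean(axis=0)
std = train_data.std(axis=0)
train_data = (train_data - mean) / std
test_data = (test_data - mean) / std  # normalized with training-set statistics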
Predicting House Prices
Model Definition
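The book's build_model helper; there is no activation on the output layer because we are regressing to arbitrary values:

from keras import models, layers

def build_model():
    # a fresh, identically configured model for each cross-validation fold
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model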
Predicting House Prices
3-Fold Cross-Validation
Predicting House Prices
4-Fold Cross-Validation Implementation
Predicting House Prices
Cross-Validation Loop
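Along the lines of the book's listing: each fold serves once as validation data while the remaining folds are concatenated for training.

import numpy as np

k = 4
num_val_samples = len(train_data) // k
all_scores = []
for i in range(k):
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]], axis=0)
    model = build_model()  # fresh weights for every fold
    model.fit(partial_train_data, partial_train_targets,
              epochs=100, batch_size=1, verbose=0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)
# np.mean(all_scores) is the cross-validated MAE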
Predicting House Prices
Cross-Validation Results
Predicting House Prices
Alternative Implementation [saved history]
Predicting House Prices
Plotting the Average Mean Absolute Error (MAE)
Predicting House Prices
Visualization Suggestions
Predicting House Prices
Smoothing the Curve
The smoothed_points expression should look familiar: it is an exponential moving average, the same device used by RMSProp (0.9 for the running average of squared gradients) and Adam (0.9 for gradients, 0.999 for squared gradients)
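The book's smoothing helper:

def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            # exponential moving average: keep 90% of the running value
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points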
Predicting House Prices
Plotting the Smoothed MAE
Predicting House Prices
Training the Final Model
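With the epoch count chosen via cross-validation, train on all the training data and evaluate once on the test set, along the lines of the book's listing:

model = build_model()
model.fit(train_data, train_targets, epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)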
Predicting House Prices
Wrapping Up
Predicting House Prices
Chapter Summary
Fundamentals of Machine Learning
1. Four Branches of Machine Learning
2. Evaluating Machine Learning Models
3. Data Preprocessing, Feature Engineering, and Feature Learning
4. Overfitting and Underfitting
5. The Universal Workflow of Machine Learning
Supervised Learning Examples
Four Branches of Machine Learning
Unsupervised Learning
• Dimensionality Reduction
• Clustering
Four Branches of Machine Learning
Self-Supervised Learning
• Learning Without Human Annotated Labels
• Autoencoders
• Trying to predict the next word given previous words
• Trying to predict the next frame given previous frames
Four Branches of Machine Learning
Reinforcement Learning
• Google DeepMind used reinforcement learning to create a model to play Atari games
• AlphaGo was created to play Go
• Occasional rewards
• Examples of possible applications include: self-driving cars, robotics, resource management, and education
Four Branches of Machine Learning
Also from Yann LeCun …
https://t.co/2LSb622114
Four Branches of Machine Learning
Classification and Regression Glossary
Four Branches of Machine Learning
Classification and Regression Glossary
Four Branches of Machine Learning
Simple Hold-out Validation Split
Evaluating Machine Learning Models
Hold-out Validation Implementation [note the concatenation]
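A minimal sketch (holdout_split is a hypothetical helper, not from the book):

import numpy as np

def holdout_split(x, y, num_validation_samples=10000, seed=42):
    indices = np.random.RandomState(seed).permutation(len(x))  # shuffle first
    x, y = x[indices], y[indices]
    x_val, y_val = x[:num_validation_samples], y[:num_validation_samples]
    x_train, y_train = x[num_validation_samples:], y[num_validation_samples:]
    return (x_train, y_train), (x_val, y_val)

# The concatenation: once hyperparameters are tuned, retrain the final model on
# training + validation data before the single test-set evaluation, e.g.:
# x_full = np.concatenate([x_train, x_val])
# y_full = np.concatenate([y_train, y_val])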
Evaluating Machine Learning Models
K-Fold Cross-Validation
Used for smaller data sets
• If K is too small, we’ll experience high bias (underfitting)
• If K is too large, we’ll experience high variance (overfitting)
Evaluating Machine Learning Models
K-Fold Cross Validation Implementation
Evaluating Machine Learning Models
Iterated K-Fold Cross-Validation with Shuffling
history = []
for i in range(iterationCount):
    shuffle(data)
    history.append(crossValidation(data, K = k))
• Requires building iterationCount * K + 1 models
Evaluating Machine Learning Models
Things to Keep in Mind
Evaluating Machine Learning Models
Value Normalization
• Dividing by 255 was an example of min-max normalization:
• value = (value – min(value)) / (max(value) – min(value))
• The max pixel value was 255 and the min pixel value was 0
• Alternatively, you can use center-and-scale normalization:
  • value = (value – mean(value)) / std(value)
  • Scaling into [-1,1] is fine too
Consider removing outliers
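A two-line sketch, assuming x is a 2D NumPy float array with samples on the first axis; each feature ends up with mean 0 and standard deviation 1:

x -= x.mean(axis=0)
x /= x.std(axis=0)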
Data Preprocessing, Feature Engineering, and Feature Learning
Missing Values
• “In general, with neural networks, it’s safe to input missing values as 0, with the condition that zero isn’t a meaningful value”
• It’s possible to add indicator variables: 1 if missing; 0 otherwise
• If you expect missing values at test time, be sure to train with missing values:
  • We train like we deploy, and deploy like we train
Data Preprocessing, Feature Engineering, and Feature Learning
Feature Engineering Example
Three different inputs for the “What time is it?” model …
Why no radius on the polar coordinates?
Data Preprocessing, Feature Engineering, and Feature Learning
Feature Engineering
• Does this mean you don’t have to worry about feature engineering as long as you’re using deep neural networks?
• No …
Data Preprocessing, Feature Engineering, and Feature Learning
Original versus Lower Capacity Model
Original Model: 16 units for each hidden layer
Lower Capacity Model: 4 units for each hidden layer
Overfitting and Underfitting
Original versus Lower Capacity Model
The smaller network starts overfitting later, and its performance degrades more slowly
Overfitting and Underfitting
Original versus Higher Capacity Model: Validation Data
Validation loss is noisier for the higher capacity model (512 versus 16 units for each hidden layer)
Overfitting and Underfitting
Original versus Higher Capacity Model: Training Data
More capacity lets a model fit the training data more quickly, but it also makes the model more susceptible to overfitting
Overfitting and Underfitting
Regularization [for Smaller Weights]
Overfitting and Underfitting
Example for Effect of Weight Regularization
Note: the goal of weight regularization is to improve generalization performance ☺
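A sketch of L2 weight regularization on the IMDB model, along the lines of the book's listing; every coefficient in a regularized layer's weight matrix adds 0.001 * weight ** 2 to the total loss:

from keras import models, layers, regularizers

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))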
Overfitting and Underfitting
Additional Weight Regularizers for Keras
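The alternatives available in keras.regularizers:

from keras import regularizers

regularizers.l1(0.001)                   # L1 regularization
regularizers.l1_l2(l1=0.001, l2=0.001)   # simultaneous L1 and L2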
Overfitting and Underfitting
Adding Dropout (dropoutRate)
Overfitting and Underfitting
Adding Dropout to the IMDB Network
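Along the lines of the book's listing; at training time each Dropout layer zeroes 50% of its inputs at random (nothing is dropped at test time):

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))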
Overfitting and Underfitting
Recap of Most Common Ways to Prevent Overfitting
Overfitting and Underfitting
Define the Problem
Universal Workflow of Machine Learning
Hypothesis
Universal Workflow of Machine Learning
Choosing a Measure of Success
• Accuracy
• Precision and Recall
• Area Under the Receiver Operating Characteristic (ROC) Curve (AUC)
• Maximize Recall subject to a constraint on the False Positive Rate?
• Mean Average Precision
Universal Workflow of Machine Learning
Deciding on an Evaluation Protocol
Universal Workflow of Machine Learning
Preparing Your Data
Universal Workflow of Machine Learning
Key Choices for Your First Iteration
Universal Workflow of Machine Learning
Choosing the Last-Layer Activation and Loss Function
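Paraphrasing the book's guidance table:

Problem type                              Last-layer activation   Loss function
Binary classification                     sigmoid                 binary_crossentropy
Multiclass, single-label classification   softmax                 categorical_crossentropy
Multiclass, multilabel classification     sigmoid                 binary_crossentropy
Regression to arbitrary values            (none)                  mse
Regression to values in [0, 1]            sigmoid                 mse or binary_crossentropy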
Universal Workflow of Machine Learning
How Big Should the Model Be?
Developing a model that overfits …
Universal Workflow of Machine Learning
Regularizing the Model
Universal Workflow of Machine Learning
Tuning the Model
We call these hyperparameters to distinguish them from the parameters of the model, i.e., the weights.
Note: we tune against validation data. Much like a Kaggle “private leaderboard”, we only get one look at test performance.
Universal Workflow of Machine Learning
Hyperas: Keras + Hyperopt
Bayesian optimization [as supported by Hyperas]
• Start with random sets of hyperparameters for evaluation
• Iteratively select new hyperparameters for evaluation based on history
  • Partition the hyperparameter sets into two groups
    • Sets where the evaluated loss is lower than a threshold tau
    • Sets where the evaluated loss is greater than or equal to the threshold tau
• Hyperparameters can be viewed as having a hierarchy; e.g. we only need to optimize the number of hidden units for layer ‘k’ if we decide to add layer ‘k’
• Tree-structured Parzen Estimators are used to select hyperparameter values based on Expected Improvement, which is driven by the density ratio p(x | lessThan) / p(x | greaterThanOrEqual): candidates that look like past good configurations and unlike past bad ones are preferred
Universal Workflow of Machine Learning
https://en.wikipedia.org/wiki/Kernel_density_estimation
https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
Installing Hyperas and Running an Example
• pip install hyperas
• wget https://github.com/maxpumperla/hyperas/blob/master/examples/mnist_readme.py
• python mnist_readme.py
model.add(Dropout({{uniform(0, 1)}}))
model.add(Dense({{choice([256, 512, 1024])}}))
model.add(Activation({{choice(['relu', 'sigmoid'])}}))
if {{choice(['three', 'four'])}} == 'four':
    model.add(Dense(100))
    model.add({{choice([Dropout(0.5), Activation('linear')])}})
optimizer={{choice(['rmsprop', 'adam', 'sgd'])}}
batch_size={{choice([64, 128])}}
Universal Workflow of Machine Learning
Chapter Summary