Statistical learning intro

Introduction to Machine/Statistical Learning

[email protected] Hackerspace, 2014.9.20

The purpose of this talk

• Not to develop a robust understanding of ML algorithms, nor to derive them

• But to provide a sufficient basis for applied predictive modeling

• Our goal is predictive modeling: building accurate models by applying statistical principles, feature engineering, model tuning, appropriate ML algorithms, and error analysis

Preliminary outline

• Model purpose – for prediction, for explanation
• The basic study design of machine learning
  – Model Representation
  – Classification vs. Regression Problems
  – Supervised vs. Unsupervised Learning

• Model Assessment & Selection
  – Interplay between Bias, Variance & Complexity
  – Cross Validation: the wrong/correct way of doing it

• The Single Algorithm Hypothesis & Deep Learning

Ex. Models for Explanation

Wong, P. T. P. (2014). Viktor Frankl’s meaning seeking model and positive psychology.

Coursera Course, Machine learning by Andrew Ng

Andrew Ng: Deep Learning, Self-Taught Learning and Unsupervised Feature Learning

Receptive field in Humans

Independent variables (predictors, features)

Dependent variables (responses)

To recap: some definitions

• Variance – the amount by which the prediction would change if we estimated it using a different training data set
• Bias – the error that is introduced by approximating a real-life problem with a much simpler model
  – more flexible methods result in less bias, but more variance
• Flexibility = degrees of freedom ~ Complexity
  – can be modified by the regularization parameter
  – or by increasing/reducing the number of features
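These definitions can be checked with a small simulation. The sketch below is an illustration only, not part of the original slides: the "true" function `true_f`, the noise level, and the polynomial degrees are all assumptions chosen for the demo. It refits a polynomial on many simulated training sets and measures how far the average prediction is from the truth (squared bias) and how much the predictions move between training sets (variance).

```python
import numpy as np

def true_f(x):
    # Hypothetical "real-life" relationship, assumed only for this demo.
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_train=30, n_repeats=200, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)   # fixed inputs, noisy responses
    x_test = np.linspace(0.05, 0.95, 20)
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Each repeat is a different training data set from the same process.
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, n_train)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coefs, x_test)
    # Bias: gap between the average prediction and the truth.
    bias_sq = float(np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2))
    # Variance: spread of the predictions across training sets.
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

# Degree plays the role of flexibility (more coefficients = more degrees of freedom).
results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print(f"degree={d}: bias^2={b:.4f}, variance={v:.4f}")
```

Under these assumptions the straight line (degree 1) should show the largest bias and the smallest variance, and the degree-9 fit the reverse.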

Study design – training/test sets

An Introduction to Statistical Learning, Ch 5 Resampling Methods

In practice – training/CV/test set

• Training set – used to fit the models

• Validation set – used to estimate prediction error for model selection

• Test set – used for assessment of the generalization error of the final chosen model

The Elements of Statistical Learning ch7. Model Assessment and Selection
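A minimal sketch of the three-way split follows. The 60/20/20 proportions, the synthetic data, and the helper name `three_way_split` are assumptions made for illustration; the slides do not prescribe them. Models are fitted on the training set, one is chosen on the validation set, and the test set is touched only once at the end.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly partition sample indices into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

# Synthetic data, assumed for the demo only.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 100)

train_idx, val_idx, test_idx = three_way_split(len(x))

def mse(coefs, idx):
    return float(np.mean((np.polyval(coefs, x[idx]) - y[idx]) ** 2))

# 1. Fit every candidate model on the training set only.
fits = {d: np.polyfit(x[train_idx], y[train_idx], d) for d in (1, 3, 5)}
# 2. Select the model with the lowest validation error.
best_degree = min(fits, key=lambda d: mse(fits[d], val_idx))
# 3. Report the generalization error of the chosen model on the test set.
test_error = mse(fits[best_degree], test_idx)
print(best_degree, round(test_error, 3))
```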


Source of each quantity    θ               (x(i), y(i))
Training error             Training set    Training set
CV error                   Training set    CV set



The Bias-Variance Trade-Off


Cross validation – single split


Cross validation – K = 10 folds


K-fold cross validation gives a better estimate of the test error than a single split
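As a rough sketch of the K-fold procedure: the data are shuffled and cut into K folds, each fold is held out once while the model is fitted on the remaining folds, and the K held-out errors are averaged. The helper names `kfold_indices` and `kfold_mse`, the synthetic data, and the polynomial models are assumptions for illustration.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and cut them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def kfold_mse(x, y, degree, k=10, seed=0):
    """Average held-out mean squared error of a polynomial fit over k folds."""
    folds = kfold_indices(len(x), k, seed)
    errors = []
    for i, held_out in enumerate(folds):
        # Fit on every fold except fold i, then score on fold i.
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)
        resid = np.polyval(coefs, x[held_out]) - y[held_out]
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))

# Synthetic data, assumed for the demo only.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 100)

cv_error = {d: kfold_mse(x, y, d, k=10) for d in (1, 3)}
print(cv_error)
```

Every sample is used for validation exactly once, so the estimate does not hinge on one particular split.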

Compare these two CV methods: what’s different, and what’s wrong?

1. Screen the predictors
   – find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels
2. Build a multivariate classifier
   – using just this subset of predictors
3. Apply cross-validation
   – to estimate the unknown tuning parameters and to estimate the prediction error of the final model

1. Divide the samples into K cross-validation folds (groups) at random
2. For each fold k = 1, 2, ..., K
   a. Find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels, using all of the samples except those in fold k.
   b. Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold k.
   c. Use the classifier to predict the class labels for the samples in fold k.

The predictors chosen by the first method have an unfair advantage

• They were chosen in step (1) on the basis of all of the samples.

• Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set.

• These predictors “have already seen” the left-out samples.

The Elements of Statistical Learning ch7. Model Assessment and Selection
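The effect can be reproduced on pure noise. In the sketch below the class labels are independent of the predictors, so no classifier can truly do better than a 50% error rate. The sample sizes, the number of predictors kept, and the nearest-centroid classifier are assumptions chosen for a short demo; any classifier would serve.

```python
import numpy as np

def screen(X, y, n_keep):
    """Indices of the n_keep predictors with the largest |correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(corr))[:n_keep]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class whose training centroid is closer."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_error(X, y, k, n_keep, screen_inside_folds, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    if not screen_inside_folds:
        keep_all = screen(X, y, n_keep)  # wrong way: screening sees every sample
    mistakes = 0
    for i, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        if screen_inside_folds:
            keep = screen(X[train], y[train], n_keep)  # correct way: fold k left out
        else:
            keep = keep_all
        pred = nearest_centroid_predict(
            X[train][:, keep], y[train], X[held_out][:, keep])
        mistakes += int((pred != y[held_out]).sum())
    return mistakes / len(y)

rng = np.random.default_rng(42)
n, p = 50, 2000
X = rng.normal(size=(n, p))      # predictors: pure noise
y = np.repeat([0, 1], n // 2)    # labels: unrelated to X

wrong_err = cv_error(X, y, k=5, n_keep=20, screen_inside_folds=False)
right_err = cv_error(X, y, k=5, n_keep=20, screen_inside_folds=True)
print(f"screening outside CV: {wrong_err:.2f}, screening inside CV: {right_err:.2f}")
```

Screening outside the folds should report an error rate well below 50%, which is impossible on noise and therefore exposes the leak; screening inside the folds should land near 50%.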

Recap principles from Statistics – K-fold CV is a form of random sampling

Coursera Course, Data Analysis and Statistical Inference by Dr. Mine Çetinkaya-Rundel

ML algorithm performance is dependent on the underlying data

An Introduction to Statistical Learning, Ch 8 Tree methods

More issues to be covered in next talk

• Remedies for Severe Class Imbalance
• Measuring Predictor Importance
• Factors That Can Affect Model Performance

Back then, the prevailing wisdom

• MIT's Marvin Minsky – a “Society of Mind”
  – To achieve AI, it was believed, engineers would have to build and combine thousands of individual computing units or agents.
  – One group of agents, or module, would handle vision, another language, and so on…

The Single Algorithm Hypothesis

• Human intelligence stems from a single learning algorithm
  – 1978 paper by Vernon Mountcastle: An Organizing Principle for Cerebral Function
  – Jeff Hawkins: “Memory-prediction framework”

• Origin
  – Neuroplasticity during brain development
  – Potential of other cortical areas to take over previously lost function after brain injury (e.g. stroke)

Deep Learning - 1

• Single Algorithm
  – neural networks to mimic human brain behavior
• A basic layer of artificial neurons that can detect simple things like the edges of a particular shape
• The next layer could then piece together these edges to identify the larger shape
• Then the shapes could be strung together to understand an object

• Key: the software does all this on its own
  – give the system a lot of data, so it can discover by itself what some of the concepts in the world are

The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI, Wired
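To make the layering idea concrete, here is a toy sketch with two hand-built layers. It is an assumption-laden illustration, not the system described in the article: the weights are set by hand here, whereas in deep learning they are learned from data. The first layer responds to vertical edges in a tiny image; the second combines those edge responses into a detector for a full-height vertical line.

```python
import numpy as np

def slide_filter(image, kernel):
    """Apply one set of weights at every position of the image (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def relu(z):
    # Simple nonlinearity: keep positive responses, zero out the rest.
    return np.maximum(z, 0.0)

# 6x6 image: dark left half, bright right half, so one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Layer 1: hand-set weights that respond to a dark-to-bright vertical edge.
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)
layer1 = relu(slide_filter(image, vertical_edge))

# Layer 2: hand-set weights that add up edge responses along a column,
# i.e. "piece together" small edges into one longer line.
tall_line = np.ones((4, 1))
layer2 = relu(slide_filter(layer1, tall_line))

print(layer1)
print(layer2)
```

The first layer is active only in the columns next to the edge, and the second layer is active only where those responses line up vertically.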

Deep Learning - 2

• This approach is inspired by how scientists believe that humans learn.
  – The algorithm didn’t know the word “cat” (Ng had to supply that), but over time, it learned to identify the furry creatures we know as cats, all on its own.
  – As babies, we watch our environments and start to understand the structure of objects we encounter, but until a parent tells us what it is, we can’t put a name to it.

• Building High-level Features Using Large Scale Unsupervised Learning

The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI, Wired

Building High-level Features Using Large Scale Unsupervised Learning, QV Le, et al

References

Stanford Andrew Ng course