Advanced Introduction to Machine Learning
— Spring Quarter, Week 2 — https://canvas.uw.edu/courses/1372141
Prof. Jeff Bilmes
University of Washington, Seattle
Departments of: Electrical & Computer Engineering, Computer Science & Engineering
http://melodi.ee.washington.edu/~bilmes
April 6th/8th, 2020
Logistics Review
Announcements
HW1 to be posted this evening, due in 1.5 weeks.
Virtual office hours this week, Thursday night at 10:00pm via Zoom (same link as class).
Class Road Map
W1 (3/30, 4/1): What is ML, Probability, Coins, Gaussians and linear regression, Associative Memories, Supervised Learning
W2 (4/6, 4/8): More supervised, logistic regression, complexity and bias/variance tradeoff
W3 (4/13, 4/15): Bias/Variance, Regularization, Ridge, CrossVal, Multiclass
W4 (4/20, 4/22): Multiclass classification, ERM, Gen/Disc, Naive Bayes
W5 (4/27, 4/29): Lasso, Regularizers, Curse of Dimensionality
W6 (5/4, 5/6): Curse of Dimensionality, Dimensionality Reduction, k-NN
W7 (5/11, 5/13): k-NN, LSH, DTs, Bootstrap/Bagging, Boosting & Random Forests, GBDTs
W8 (5/18, 5/20): Graphs; Graphical Models (Factorization, Inference, MRFs, BNs)
W9 (5/27, 6/1): Learning Paradigms; Clustering; EM Algorithm
W10 (6/3, 6/8): Spectral Clustering, Graph SSL, Deep models, (SVMs, RL); The Future
Last lecture is 6/8 since 5/25 is a holiday (or we could just have lecture on 5/25).
Acknowledgments/References
Some of the below material was drawn from:
Bishop, 1996.
https://courses.cs.washington.edu/courses/cse546/18au/
https://courses.cs.washington.edu/courses/cse546/16au/
https://courses.cs.washington.edu/courses/cse546/14au/
http://cs229.stanford.edu/syllabus.html
https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/
Review
Each day, this is where we review material from the previous lecture.
Some readings
Matrix cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
Linear algebra notes: http://cs229.stanford.edu/section/cs229-linalg.pdf
Writeup on Overfitting and Underfitting on our web page (https://canvas.uw.edu/courses/1372141); see in particular https://canvas.uw.edu/courses/1372141/discussion_topics/5384617
Class (and Machine Learning) overview
1. Introduction: What is ML; What is AI; Why are we so interested in these topics right now?
2. ML Paradigms/Concepts: Overfitting/Underfitting, model complexity, bias/variance; size of data, big data, sample complexity; ERM, loss + regularization, loss functions, regularizers; supervised, unsupervised, and semi-supervised learning; reinforcement learning, RL, multi-agent, planning/control; transfer and multi-task learning; federated and distributed learning; active learning, machine teaching; self-supervised, zero/one-shot, open-set learning
3. Dealing with Features: dimensionality reduction, PCA, LDA, MDS, T-SNE, UMAP; locality sensitive hashing (LSH); feature selection; feature engineering; matrix factorization & feature engineering; representation learning
4. Evaluation: accuracy/error, precision/recall, ROC, likelihood/posterior, cost/utility, margin; train/eval/test data splits; n-fold cross validation; method of the bootstrap
5. Optimization Methods: unconstrained continuous optimization: (stochastic) gradient descent (SGD), adaptive learning rates, conjugate gradient, 2nd-order Newton; constrained continuous optimization: Frank-Wolfe (conditional gradient descent), projected gradient, linear, quadratic, and convex programming; discrete optimization: greedy, beam search, branch-and-bound, submodular optimization
6. Inference Methods: probabilistic inference; MLE, MAP; belief propagation; forward/backpropagation; Monte Carlo methods
7. Models & Representation: linear least squares, linear regression, logistic regression, sparsity, ridge, lasso; generative vs. discriminative models; Naive Bayes; k-nearest neighbors; clustering, k-means, k-medoids, EM & GMMs, single linkage; decision trees and random forests; support vector machines, kernel methods, max margin; perceptron, neural networks, DNNs; Gaussian processes; Bayesian nonparametric methods; ensemble methods; the bootstrap, bagging, and boosting; graphical models; time-series, HMMs, DBNs, RNNs, LSTMs, Attention, Transformers; structured prediction; grammars (as in NLP)
8. Philosophy, Humanity, Spirituality: artificial intelligence (AI); artificial general intelligence (AGI); artificial intelligence vs. science fiction
9. Applications: computational biology; social networks; computer vision; speech recognition; natural language processing; information retrieval; collaborative filtering/matrix factorization
10. Programming: python; libraries (e.g., NumPy, SciPy, matplotlib, scikit-learn (sklearn), pytorch, CNTK, Theano, tensorflow, keras, H2O, etc.); HPC: C/C++, CUDA, vector processing
11. Background: linear algebra; multivariate calculus; probability theory and statistics; information theory; mathematical (e.g., convex) optimization
12. Other Techniques: compressed sensing; submodularity, diversity/homogeneity modeling
Figure: the overview slide also shows a six-node graphical model (x1, ..., x6) whose marginal p(x1, x2) = Σ_{x3} Σ_{x4} ... Σ_{x6} p(x1, x2, ..., x6) is computed by variable elimination over pairwise factors, with the reconstituted graph and per-variable elimination complexity (O(r^2) up to O(r^4)) shown for each step, plus a feed-forward network diagram with an input layer, seven hidden layers, and an output unit.
Strategy
Strategy for the next period of time.
For some topic in (2):
    for subtopic in subset of (6) relevant to topic.
Traditional Computer Programming vs. ML
Let us change our traditional attitude to the construction of programs. Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do. -- Donald Knuth
Figure: a human programming a computer writes an algorithm/computer program; the computer then produces something seemingly useful.
Traditional Computer Programming vs. ML
Machine Learning is the art of repeatedly telling a computer what one wants the computer to tell a second computer about a lot of data. This continues until the second computer gets it right.

Figure: a human programming a computer writes an algorithm/computer program; fed training data, that computer writes a second algorithm/computer program which, run on another computer over test data, produces something seemingly intelligent.
Traditional Computer Programming vs. ML
https://imarticus.org/what-is-machine-learning-and-does-it-matter/
other defs of ML: https://www.kdnuggets.com/2018/12/essence-machine-learning.html
Probability and Uncertainty
Key point: the world is a complicated place, we cannot know everything, and even of what we think we know we cannot (nor should we) always be certain. Uncertainty abounds!
We need a representation of uncertainty.
Probability has a precise mathematical definition (Kolmogorov axioms), but we use it in deference to the inevitable uncertainty surrounding all decisions.
Simple and subjective working definition:

probability = (number of cases something happened) / (number of total cases)   (1.2)

Good for repeatable measurable events (e.g., coin flips, dice, etc.). Harder for future events (probability it will rain tomorrow, probability Manchester City beats Liverpool, etc.).
Despite shortcomings, used as a representation of uncertainty/certainty (e.g., the probability that image x contains the face of person y).
Machine learning often strives for the "best" probabilities on data using learning algorithms.
Coin Flipping and ML
D = {b_1, b_2, ..., b_n} is a series of n independent and identical coin flips, b_i ∈ {H, T}.
k = |{i : b_i = H}| is the count of the number of heads in D.
How true, or likely, is it that θ is the probability of heads?

Pr(D|θ) = θ^k (1 − θ)^{n−k} = likelihood of D given θ   (1.2)

How to find the most likely explanation of D? Maximum likelihood:

θ_MLE = argmax_{θ ∈ [0,1]} Pr(D|θ) = argmax_{θ ∈ [0,1]} log Pr(D|θ)   (1.3)

How to find θ_MLE? Calculus: setting ∂/∂θ log Pr(D|θ) = 0 leads to

θ_MLE = k/n   (1.4)

Thus, computing k and dividing by n is a simple way to learn!
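As a quick aside (not from the slides), here is a minimal NumPy sketch of this estimator on synthetic flips; the sample size and true heads probability below are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                        # assumed "true" heads probability (made up)
flips = rng.random(1000) < theta_true   # D: n i.i.d. coin flips, True = heads

k = int(flips.sum())    # k = number of heads in D
n = flips.size          # n = number of flips
theta_mle = k / n       # theta_MLE = k/n
print(theta_mle)        # close to 0.7 for large n
```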
Learning Gaussians
Given the data sample D, without knowing μ, C, how likely is the sample under some hypothesized parameters μ, C?

log Pr(D|μ, C) = Σ_{i=1}^n log Pr(x_i|μ, C)   (1.3)
               ≜ log likelihood of D given μ, C   (1.4)

How to find the most likely explanation of D? Maximum likelihood:

[μ_MLE, C_MLE] = argmax_{μ ∈ R^m, C ∈ P(m)} log Pr(D|μ, C)   (1.5)

How to find the MLE quantities? Again calculus: ∂/∂μ log Pr(D|μ, C) = 0 and ∂/∂C log Pr(D|μ, C) = 0 lead to

μ_MLE = (1/n) Σ_{i=1}^n x_i   and   C_MLE = (1/n) Σ_{i=1}^n (x_i − μ_MLE)(x_i − μ_MLE)^⊤   (1.6)
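A minimal NumPy sketch of these two estimators, assuming synthetic 2-D data; the "true" parameters below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])                            # invented parameters
C_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, C_true, size=5000)    # rows are samples x_i

mu_mle = X.mean(axis=0)                  # (1/n) sum_i x_i
diff = X - mu_mle
C_mle = diff.T @ diff / X.shape[0]       # (1/n) sum_i (x_i - mu)(x_i - mu)^T
# Note: this is the MLE (divide by n); np.cov defaults to the unbiased 1/(n-1) version.
```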
Associations and Associative Memories
Associative memory, auto-associative memory, or hetero-associative memory. In general, associate x ∈ X to y ∈ Y via h : X → Y.
Examples: memory subsystem (separate address for each x ∈ X), data structures like hash tables, or red-black trees, etc.
Often X, Y are very large, and we have only a sample of associations D = {(x_i, y_i)}_{i=1}^n where x_i ∈ X, y_i ∈ Y, and n ≪ |X|.
We want to build an associative memory that works even outside of D. That is, even for x ∉ {x : x = x_i for some i ∈ [n], (x_i, y_i) ∈ D}.
Why? D might not be complete; variation, noise, or possible data corruption may not be fully captured in D. Also, X might be infinitely large.
Associations and Associative Memories
Machine learning: write an algorithm that, given large enough D, produces a program h that generalizes (works) well on unseen samples. Respond reasonably to variation, noise, and data corruption (be robust). Do this computationally as efficiently as possible, and (ideally) understand it mathematically.
Boils down to finding a good h : X → Y that can do the mapping (association). Sometimes we choose some h ∈ H where H is a large collection of possible associators. More frequently, h is parameterized via some parameters θ and we find a good θ leading to h_θ.
Many ways to do this; it depends on the nature of X, Y, how big the data is (number of samples n), and the available resources (compute, core machine memory/RAM, storage/disk, communication (latency/bandwidth), time, money, energy usage).
Often, x ∈ R^m is an m-dimensional vector of features. In general, x is known as a feature vector.
Statistical parameter estimation
Training data D = {(x^{(i)}, y^{(i)})}_{i=1}^n where (x^{(i)}, y^{(i)}) ~ p(x, y) are drawn from some distribution, x^{(i)} ∈ R^m and y^{(i)} ∈ R.
x^{(i)} is an m-dimensional column vector of features, y^{(i)} is a scalar.
Goal: find h_θ : X → Y with minimum error, where

Error_i = e_i = h_θ(x^{(i)}) − y^{(i)}   (1.3)

E[e^2] = E_{p(x,y)}[(h_θ(x) − y)^2] = ∫ p(x, y)(h_θ(x) − y)^2 dx dy   (1.4)
       = ∫ p(x) ∫ (h_θ(x) − y)^2 p(y|x) dy dx   (1.5)

and θ ∈ R^m is a parameter vector, θ = (θ_1, θ_2, ..., θ_m), θ_i ∈ R.
Taking derivatives and setting to zero, we get the best solution:

h_θ(x) = ∫ y p(y|x) dy = E[Y|x] = best association.   (1.6)

This assumes we have the distribution p and also the resources to compute E[Y|x].
Linear estimator: Objective Optimization
Recall, h_θ(x) ≜ θ^⊤x is parameterized by parameters θ, so

J(θ) = (1/n) Σ_{i=1}^n (h_θ(x^{(i)}) − y^{(i)})^2   (1.3)

Taking the derivative of the error objective J(θ) w.r.t. θ and setting it to zero gives:

∂J/∂θ = (2/n) Σ_{i=1}^n (h_θ(x^{(i)}) − y^{(i)}) ∂h_θ(x^{(i)})/∂θ = 0   (1.4)

The linear assumption h_θ(x) = x^⊤θ yields ∂h_θ(x^{(i)})/∂θ = x^{(i)}.
Linear Least Squares
This gives the objective to be minimized (the smallest, or least, sum of squares of the errors):

∂J(θ)/∂θ = (2/n) Σ_{i=1}^n (x^{(i)⊤}θ − y^{(i)}) x^{(i)} = 0   (1.3)

We simplify this a bit by defining matrices associated with these quantities. First define an n×m design matrix X, whose rows are x^{(1)⊤}, x^{(2)⊤}, ..., x^{(n)⊤}, and a length-n column vector ~y = (y_1, y_2, ..., y_n)^⊤.   (1.4)

The objective, in equivalent matrix-vector form:

J(θ) = (1/2)(Xθ − ~y)^⊤(Xθ − ~y)   (1.5)
Normal Equations
With this, we get the "normal equations":

∇_θ J(θ) = X^⊤(Xθ − ~y) = ~0   (1.3)

i.e., modeling ~y to be in the column space of matrix X (linear combinations of columns of X), when ~y is being approximated by Xθ.
Called normal equations because the column space of X is orthogonal to the residual error E = (~y − Xθ̂), giving the solution θ = θ̂ as shown.

Figure: ~y (what is to be approximated), the error ~y − Xθ̂, and the actual approximation Xθ̂ (the closest point to ~y) lying in the space of possible approximations {y : y = Xθ, θ ∈ R^m}, the column space of X.

If X^⊤X is invertible (typical if n ≥ m), the solution has the form:

θ̂ = (X^⊤X)^{-1} X^⊤ ~y

where (X^⊤X)^{-1} X^⊤ is known as the Moore-Penrose pseudo-inverse of matrix X.
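A small NumPy sketch of solving the normal equations on synthetic data; the dimensions, noise level, and true θ below are assumptions for illustration, and np.linalg.lstsq computes the same least-squares solution more stably.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))                      # design matrix, rows are x^(i)T
theta_true = np.array([0.5, -1.0, 2.0])          # invented ground truth
y = X @ theta_true + 0.1 * rng.normal(size=n)    # noisy targets

# Normal equations: X^T X theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalent, and numerically preferable in practice:
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```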
Gradient Descent, Batch Gradient Descent
Gradient updates for all elements of θ at the same time and for sample pair (x^{(i)}, y^{(i)}):

θ ← θ + α(y^{(i)} − h_θ(x^{(i)})) x^{(i)} = θ + α(y^{(i)} − θ^⊤x^{(i)}) x^{(i)}   (1.9)

move θ in the direction of x^{(i)} weighted by α(y^{(i)} − h_θ(x^{(i)})) ∈ R, α times the error.
Called the LMS (least mean squares) update rule, also called the Widrow-Hoff (early NN folks) learning rule.
Batch Gradient Descent uses J(θ) = (1/n) Σ_{i=1}^n (h_θ(x^{(i)}) − y^{(i)})^2, and since the gradient is a linear operator, this yields the following:

Algorithm 2: Batch Gradient Descent learning
Input: Training data D, learning rate α, initial parameter estimate θ
Output: Learnt model parameters θ
1 for t = 1, ..., T do
2     θ ← θ + α Σ_{i=1}^n (y^{(i)} − h_θ(x^{(i)})) x^{(i)}
Return: the final parameters θ
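A minimal Python sketch of Algorithm 2 for the linear least-squares objective; the learning rate, iteration count, and zero initialization are illustrative choices, not values from the slides.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, T=500):
    """Batch gradient descent for the linear least-squares objective (Algorithm 2).
    X: (n, m) design matrix; y: (n,) targets."""
    n, m = X.shape
    theta = np.zeros(m)
    for t in range(T):
        errors = y - X @ theta                   # y^(i) - h_theta(x^(i)) for all i at once
        theta = theta + alpha * (X.T @ errors)   # sum_i (y^(i) - h_theta(x^(i))) x^(i)
    return theta
```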
More visualization: Batch Gradient Descent
Incremental and Stochastic Gradient Descent
Algorithm 3: Incremental Gradient Descent (IGD) learning
Input: Training data D, learning rate α, initial parameter estimate θ
Output: Learnt model parameters θ
1 for t = 1, ..., T do
2     for i = 1, ..., n do
3         θ ← θ + α(y^{(i)} − h_θ(x^{(i)})) x^{(i)}
Return: the final parameters θ

Optimization folks (e.g., Bertsekas) call this incremental gradient methods. It is called Stochastic Gradient Descent (SGD) if we randomize (with or without replacement) the order of the data items.
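A matching sketch of Algorithm 3, with the data order shuffled each pass so it corresponds to the SGD variant; again the hyperparameters are illustrative.

```python
import numpy as np

def incremental_gradient_descent(X, y, alpha=0.01, T=20, seed=0):
    """Algorithm 3; shuffling the order each pass makes it SGD."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    theta = np.zeros(m)
    for t in range(T):
        for i in rng.permutation(n):             # randomized order -> stochastic gradient descent
            error = y[i] - X[i] @ theta          # scalar error on sample i
            theta = theta + alpha * error * X[i]
    return theta
```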
More visualization: Stochastic Gradient Descent
On Fitting (preview) | Classification, Logistic Regression | Fit Complexity, Bias/Variance
Underfitting vs. Overfitting
Figure: three scatter plots of the same xy data (x from 0 to 7, y from 0 to about 4.5), each overlaid with one of the fitted curves described below.
Fit a model with various input features (values of powers of x); the goal is to predict y based on xy-pair samples D = {(x^{(i)}, y^{(i)})}_i.
Fit models: left y = θ_0 + θ_1 x; middle y = θ_0 + θ_1 x + θ_2 x^2; right y = Σ_{j=0}^5 θ_j x^j.
Both the left and right plots poorly fit the data, but they are poor for different reasons. The left could be underfitting, and the right could be overfitting. The center plot looks better.
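To reproduce the flavor of these three fits, here is a small NumPy sketch; the data-generating curve and noise level are made up for the example, and np.polyfit performs the least-squares polynomial fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 7, size=12))                      # small synthetic sample
y = 1.0 + 0.8 * np.sqrt(x) + 0.2 * rng.normal(size=x.size)   # made-up smooth curve plus noise

for degree in (1, 2, 5):                   # left, middle, right plots
    theta = np.polyfit(x, y, deg=degree)   # least-squares fit in the monomial basis
    y_hat = np.polyval(theta, x)
    print(degree, np.mean((y_hat - y) ** 2))   # training error alone cannot reveal overfitting
```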
Overfitting definition (T. Mitchell)
We say that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).
Definition 2.3.1 (overfitting)
Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H, such that h has smaller error than h′ over the training examples, but h′ has a smaller overall error than h over the entire distribution (or data set) of instances.
We'll visit this topic again when we discuss bias/variance, but first let's discuss a few more models.
Linear Regression
Linear regression involves fitting a model of the form y = h_θ(x) = Σ_i θ_i x_i, where x_i is the i-th input feature and θ_i is the i-th parameter.
The model is linear in the parameters that we "regress" to: h_{αθ + α′θ′}(x) = α h_θ(x) + α′ h_{θ′}(x).
A reasonable starting model for when y ∈ R.
Logistic Regression
What if y is a category? E.g., y ∈ {0, 1, 2, ..., ℓ−1} for ℓ categories.
Simplest case is y ∈ {0, 1}, a binary label (more generally, y could be an integer, or sometimes ∈ {−1, +1}, or a vector of integers).
Categorical prediction, or binary classification: predict which class (zero or one) data object x is in (e.g., buy/sell, good/defective, fresh/stale, fraud/genuine, spam/not-spam, fake news/fake-fake news, etc.).
A linear model can still be used but might not be ideal; we do not need values outside the range [0, 1].
With probabilities, binary prediction can still be presented with a representation of uncertainty; rather than a crisp 0/1 decision, we can look at p(y = 0|x) vs. p(y = 1|x).
Most uncertain about x if p(y = 0|x) = p(y = 1|x) = 0.5. Most certain about x if p(y = 1|x) = 0 or p(y = 1|x) = 1.
Correctness is not the same as certainty!!!
It is harder for a linear model to give such an interpretation since linear model output has no bound, in general.
Making a decision
Given a probability model p(y|x), how do we make a final decision?
Let y ∈ {0, 1} be the true label and ŷ ∈ {0, 1} be a prediction.
Decide true if p(y = 1|x) ≥ τ where τ ∈ [0, 1] is a decision threshold, i.e., ŷ = 1_{p(y=1|x) ≥ τ}. A natural value is τ = 0.5, but other values are also not unreasonable.
Given a validation data set D_va = {(x^{(i)}, y^{(i)})}_{i=1}^n on which a classifier produces predictions {ŷ^{(i)}}_{i=1}^n, we can compute the following quantities:

True positives  TP = Σ_{i=1}^n 1_{ŷ^{(i)}=1 ∧ y^{(i)}=1}
True negatives  TN = Σ_{i=1}^n 1_{ŷ^{(i)}=0 ∧ y^{(i)}=0}
False positives FP = Σ_{i=1}^n 1_{ŷ^{(i)}=1 ∧ y^{(i)}=0}
False negatives FN = Σ_{i=1}^n 1_{ŷ^{(i)}=0 ∧ y^{(i)}=1}

Note that the number of samples n = TP + TN + FP + FN.
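A small Python helper, written for illustration, that computes these four counts from predicted probabilities and a threshold τ; accuracy and error (next slide) follow directly from them.

```python
import numpy as np

def confusion_counts(y_true, p_pos, tau=0.5):
    """TP, TN, FP, FN for predictions y_hat = 1{p(y=1|x) >= tau} (illustrative helper)."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(p_pos) >= tau).astype(int)
    tp = int(np.sum((y_hat == 1) & (y_true == 1)))
    tn = int(np.sum((y_hat == 0) & (y_true == 0)))
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))
    assert tp + tn + fp + fn == y_true.size    # n = TP + TN + FP + FN
    return tp, tn, fp, fn

# Accuracy = (tp + tn) / n and Error = 1 - Accuracy, as defined on the next slide.
```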
Accuracy and Error
Overall accuracy:

Accuracy = (total correct predictions)/(total predictions) = (TP + TN)/(TP + TN + FP + FN)   (2.1)

Error:

Error = 1.0 − Accuracy = (FP + FN)/(TP + TN + FP + FN)   (2.2)
Binary Confusion Matrix
Given n samples in a validation data set, we can plot the relationship between TP, TN, FP, FN as a confusion matrix:

                              Predicted Label
                              Positive    Negative
True Label    Positive          TP          FN        (number of positive samples)
              Negative          FP          TN        (number of negative samples)

Column totals give the number of samples predicted to be positive and the number predicted to be negative.
False Positive and False Negative
https://www.kdnuggets.com/2020/01/guide-precision-recall-confusion-matrix.html
Precision, Recall, and F-Measure
Precision:

P = TP/(TP + FP) = TP/(number of predicted positives)   (2.3)

Recall:

R = TP/(TP + FN) = TP/(number of positives)   (2.4)

F-measure (or F1-score), the harmonic mean of precision and recall:

F-measure = 2/(1/Precision + 1/Recall) = 2 · (Precision × Recall)/(Precision + Recall)   (2.5)
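A direct translation of Eqs. (2.3)-(2.5) into a small Python helper; the zero-denominator guards are an added convenience, not part of the slide definitions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-measure from confusion counts (Eqs. (2.3)-(2.5))."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1
```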
ROC and AUC
Recall, we predict as ŷ = 1_{p(y=1|x) ≥ τ}, but how do TP and FP change as we vary the decision threshold τ?
The Receiver Operating Characteristic (ROC) curve is traced out by varying τ.
The area under the curve (AUC) gives an overall measure of how well the model is doing over all τ. Higher AUC is better.
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
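A sketch of how an ROC curve can be traced by sweeping τ; the 101-point threshold grid is an arbitrary choice, and sklearn.metrics.roc_auc_score (if available) computes the AUC directly.

```python
import numpy as np

def roc_points(y_true, p_pos, n_thresholds=101):
    """Trace an ROC curve by sweeping the decision threshold tau (illustrative sketch)."""
    y_true = np.asarray(y_true)
    p_pos = np.asarray(p_pos)
    points = []
    for tau in np.linspace(1.0, 0.0, n_thresholds):   # from strictest to loosest threshold
        y_hat = (p_pos >= tau).astype(int)
        tp = np.sum((y_hat == 1) & (y_true == 1))
        fp = np.sum((y_hat == 1) & (y_true == 0))
        fn = np.sum((y_hat == 0) & (y_true == 1))
        tn = np.sum((y_hat == 0) & (y_true == 0))
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0   # true positive rate (recall)
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0   # false positive rate
        points.append((fpr, tpr))
    return points

# The AUC can be approximated by integrating these points (e.g., with np.trapz),
# or computed directly with sklearn.metrics.roc_auc_score.
```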
Logistic Regression
Solution for binary classification: use a logistic function, as in

Pr(y = 1|x) = h_θ(x) = g(θ^⊤x) = 1/(1 + exp(−θ^⊤x))   (2.6)

where θ^⊤x = θ_0 + Σ_{i=1}^m θ_i x_i, with x_0 ≡ 1 so that θ_0 is the bias/shift.
g(z) = 1/(1 + e^{−z}) is known as a logistic function.

Figure: plot of the logistic function g(z) = 1/(1 + e^{−z}).

A logistic function is one type of sigmoid function, others being the hyperbolic tangent, arctan, error function, etc. (see https://en.wikipedia.org/wiki/Sigmoid_function).
Logistic with scale parameter σ
Approximate a step function with a scale parameter σ ∈ R_+, giving g_σ(z) = 1/(1 + e^{−σz}).

Figure: the logistic function g_σ(z) plotted over z ∈ [−5, 5] for σ = 0.5, 1.0, 2.0, 5.0, 10.0; larger σ makes the transition sharper.
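A small NumPy sketch of g_σ(z) over the same range and scale values as the figure.

```python
import numpy as np

def logistic(z, sigma=1.0):
    """g_sigma(z) = 1 / (1 + exp(-sigma * z)); sigma = 1 gives the standard logistic."""
    return 1.0 / (1.0 + np.exp(-sigma * z))

z = np.linspace(-5, 5, 201)
curves = {sigma: logistic(z, sigma) for sigma in (0.5, 1.0, 2.0, 5.0, 10.0)}
# Larger sigma gives a sharper, more step-like transition around z = 0.
```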
Gradients of Logistic
A logistic function's gradient is easy to compute. This follows since:

g′(z) = d/dz [ 1/(1 + e^{−z}) ]   (2.7)
      = −[1/(1 + e^{−z})^2] · (−e^{−z})   (2.8)
      = [1/(1 + e^{−z})] · [e^{−z}/(1 + e^{−z})]   (2.9)
      = [1/(1 + e^{−z})] · (1 − 1/(1 + e^{−z}))   (2.10)
      = g(z)(1 − g(z))   (2.11)

Given this, we can derive a gradient descent learning rule, similar to LMS, but for logistic regression.
Fitting Logistic Regression Using Gradient Descent
Training data D = {(x^{(i)}, y^{(i)})}_{i ∈ [n]}, where now y^{(i)} ∈ {0, 1} is a binary label.
Goal: formulate the likelihood (to maximize) in terms of the parameters θ.
Probability model: Pr(y = 1|x; θ) = h_θ(x) = g(θ^⊤x) and Pr(y = 0|x; θ) = 1 − h_θ(x); thus, for y ∈ {0, 1},

Pr(y|x; θ) = (h_θ(x))^y (1 − h_θ(x))^{1−y}   (2.12)

Likelihood L(θ) = Π_i Pr(y^{(i)}|x^{(i)}; θ) and log likelihood

ℓ(θ) = Σ_{i=1}^n [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ]   (2.13)

The negative log likelihood (or cost(θ) = −ℓ(θ)) is convex in θ.
Fitting Logistic Regression Using Gradient Descent
Gradient of the log likelihood on one training pair (x, y):

∂ℓ(θ)/∂θ = (y − h_θ(x)) x.   (2.14)

The derivation of this derivative, via the chain rule, uses the logistic derivative property g′(z) = g(z)(1 − g(z)).
Gradient descent steps:

θ ← θ + α(y^{(i)} − h_θ(x^{(i)})) x^{(i)}   (2.15)

Again, the direction is given by x^{(i)}, by an amount equal to α · error = α(y^{(i)} − h_θ(x^{(i)})).
The error has the same form as in the linear case (the answer y^{(i)} minus the prediction h_θ(x^{(i)})), but the prediction is quite different from before.
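Putting the pieces together, a minimal sketch of logistic-regression training with the update (2.15); the learning rate, epoch count, and the convention of appending a ones column for the bias are assumptions for the example.

```python
import numpy as np

def logistic_regression_sgd(X, y, alpha=0.1, T=100, seed=0):
    """SGD on the logistic-regression log likelihood using the update (2.15).
    Assumes X is (n, m) with a column of ones appended for the bias, and y in {0, 1}."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    theta = np.zeros(m)
    for t in range(T):
        for i in rng.permutation(n):
            h = 1.0 / (1.0 + np.exp(-X[i] @ theta))     # h_theta(x^(i)) = Pr(y = 1 | x^(i))
            theta = theta + alpha * (y[i] - h) * X[i]   # theta <- theta + alpha (y - h) x
    return theta
```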
Fit Linear vs. Logistic Comparison
from https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html
Logistic Regression, Linear Separability, and Overfitting
logistic regression andlinearly separable data.How smooth is thetransition in logistic re-gression?
Logistic regression negative log likelihood (i.e., cost) J(✓)
J(✓) = �mX
i=1
hy(i) log h✓(x
(i)) + (1� y(i)) log(1� h✓(x
(i)))i
(2.16)
to be minimized, where h✓(x) = 11+exp(�✓|x) .
What happens as cost decreases (likelihood improves)?
If y(i) = 1,
h✓(x(i))! 1; if y(i) = 0, h✓(x(i))! 0. Hence, J(✓)! 0. Requires
✓ !1. Should transition be allowed to be arbitrarily sudden?
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F41/67 (pg.78/163)
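As a quick numerical check (my own illustration, under an assumed toy data set), on linearly separable data the cost J(c \theta) keeps shrinking as the scale c of a fixed separating direction grows, so the unregularized optimum pushes the parameter magnitude toward infinity:

```python
import numpy as np

def cost(theta, X, y):
    # J(theta) from eq. (2.16), the negative log likelihood
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Linearly separable 1-D data (bias feature + x); x > 0 exactly when y = 1
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
direction = np.array([0.0, 1.0])          # any separating direction works here

for c in [1, 5, 25, 125]:
    print(c, cost(c * direction, X, y))   # the cost decreases toward 0 as c grows
```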
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Overfitting and large magnitude parameters
Therefore, it seems that even large magnitude parameters can lead to a form of overfitting: overfitting in that the examples i with y^{(i)} = 0 get a perfect zero prediction, and the examples i with y^{(i)} = 1 get a perfect one prediction, where such certainty is probably not warranted.
Better solution: don't overfit; for points close to the decision boundary, allow a gradual prediction transition between 0 and 1 in the region of uncertainty.
This means putting a restriction on \theta (not letting it get too big).
One possible complexity penalty is the 2-norm, \Omega(\theta) = \|\theta\|_2, which prefers "simple" models, which in this case are those with small coefficients.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F42/67 (pg.84/163)
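A small sketch (illustrative, not lecture code) of the penalized objective J(\theta) + \lambda \Omega(\theta) with the 2-norm penalty; lam is an assumed hyperparameter name.

```python
import numpy as np

def regularized_cost(theta, X, y, lam=0.1):
    # Negative log likelihood J(theta) plus a 2-norm complexity penalty lam * ||theta||_2
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    nll = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    return nll + lam * np.linalg.norm(theta, 2)

# With lam > 0, scaling theta up is no longer free: the penalty eventually outweighs
# the shrinking negative log likelihood, so the optimum stays at a finite ||theta||.
```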
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Perceptron and Logistic Regression
[Figure: perceptron diagram with inputs weighted by \theta_0, \theta_1, \ldots, \theta_{m-1}, \theta_m, from https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53]
Model h_\theta(x) = g(\theta^\top x); the perceptron uses a hard activation function
g(z) = \begin{cases} -1 & \text{if } z < 0 \\ +1 & \text{if } z \ge 0 \end{cases}   (2.17)
which leads to the same learning update rule \theta \leftarrow \theta + \alpha (y^{(i)} - h_\theta(x^{(i)})) x^{(i)}.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F43/67 (pg.88/163)
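For comparison with the logistic update above, a minimal perceptron sketch (my own illustration, assumed toy data); note that with the hard activation of (2.17) the labels are taken in \{-1, +1\}.

```python
import numpy as np

def perceptron_epoch(theta, X, y, alpha=1.0):
    # Same update form theta <- theta + alpha * (y_i - h_theta(x_i)) * x_i,
    # but h_theta(x) = sign(theta^T x) and labels y_i are in {-1, +1}.
    for x_i, y_i in zip(X, y):
        pred = 1.0 if x_i @ theta >= 0 else -1.0
        theta = theta + alpha * (y_i - pred) * x_i   # nonzero only on mistakes
    return theta

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([-1, -1, 1, 1])
theta = np.zeros(2)
for _ in range(10):
    theta = perceptron_epoch(theta, X, y)
print(theta)   # a separating theta on this linearly separable toy set
```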
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Perceptron, Linear Models, and Linearly Separable Data
[Figure: logistic regression and the perceptron can both do perfectly when the data is (nicely) linearly separable.]
The line designates the boundary of a "ridge" or "cliff" between the categories.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F44/67 (pg.89/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Review
The next three slides are review from Lecture 1.
Please read the writeup "Underfitting and Overfitting in Machine Learning" to be posted to Canvas.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F45/67 (pg.90/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Underfitting vs. Overfitting
[Figure: three scatter plots of the same (x, y) data with fitted curves: left a degree-1 fit, middle a degree-2 fit, right a degree-5 fit.]
Fit a model with various input features (powers of x); the goal is to predict y based on xy-pair samples D = \{(x^{(i)}, y^{(i)})\}_i.
Fitted models: left y = \theta_0 + \theta_1 x; middle y = \theta_0 + \theta_1 x + \theta_2 x^2; right y = \sum_{j=0}^{5} \theta_j x^j.
Both the left and right plots fit the data poorly, but they are poor for different reasons. The left could be underfitting, and the right could be overfitting. The center plot looks better.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F46/67 (pg.92/163)
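A brief sketch (my own, with an assumed synthetic data-generating process) that reproduces the spirit of the three fits using numpy.polyfit at degrees 1, 2, and 5:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 7, size=8))            # small sample, like the slide's scatter
y = 0.5 * x + np.sin(x) + rng.normal(0, 0.2, 8)   # assumed ground truth: gently nonlinear

for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)             # least-squares fit of a degree-d polynomial
    resid = y - np.polyval(coeffs, x)
    print(degree, np.mean(resid**2))              # training MSE shrinks with degree,
                                                  # but the degree-5 fit is likely overfitting
```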
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Overfitting definition (T. Mitchell)
We say that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).
Definition 2.5.1 (overfitting)
Given a hypothesis space H, a hypothesis h \in H is said to overfit the training data if there exists some alternative hypothesis h' \in H such that h has smaller error than h' over the training examples, but h' has a smaller overall error than h over the entire distribution (or data set) of instances.
We'll visit this topic again when we discuss bias/variance, but first let's discuss a few more models.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F47/67 (pg.93/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Some Definitions
Any data set D = \left( (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)}) \right) is drawn from a given distribution, meaning that (x^{(j)}, y^{(j)}) \sim p(x, y) for all 1 \le j \le n.
Training a model by maximizing accuracy on a training set D_{tr}:
h \in \operatorname*{argmax}_{h \in H} \sum_{(x,y) \in D_{tr}} A(y, h(x)) - \lambda \Omega(h)   (2.18)
Accuracy according to the sample distribution:
accuracy(h) = E_{p(x,y)}[A(y, h(x))] = \int p(x, y) A(y, h(x)) \, dx \, dy   (2.19)
Accuracy of a trained model on a data set D:
accuracy_D(h) = \frac{1}{|D|} \sum_{(x,y) \in D} A(y, h(x))   (2.20)
Training data set D_{tr} and validation (or development) data set D_{va}.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F48/67 (pg.94/163)
[Handwritten note: \lambda is an accuracy-regularization tradeoff coefficient, a hyperparameter. A(y, h(x)) is big if h(x) is a good predictor of y, and is small if h(x) is a poor predictor of y.]
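A small sketch (illustrative; the 0/1 accuracy function A and the names below are assumptions, not from the slides) of the empirical accuracy (2.20) and the regularized training objective (2.18) over a tiny hypothesis set:

```python
import numpy as np

def A(y, pred):
    # 0/1 accuracy: 1 if the prediction matches the label, else 0
    return float(y == pred)

def empirical_accuracy(h, data):
    # accuracy_D(h) = (1/|D|) * sum over (x, y) in D of A(y, h(x))   (eq. 2.20)
    return np.mean([A(y, h(x)) for x, y in data])

def regularized_objective(h, omega_h, data, lam=0.1):
    # sum of accuracies minus lam * Omega(h)   (eq. 2.18), to be maximized over h in H
    return sum(A(y, h(x)) for x, y in data) - lam * omega_h

# Toy usage: compare two threshold classifiers on assumed data
data = [(-2.0, 0), (-0.5, 0), (0.3, 1), (1.7, 1)]
h_simple = lambda x: int(x > 0)        # assumed complexity score Omega = 1
h_complex = lambda x: int(0 < x < 1)   # assumed complexity score Omega = 2
print(empirical_accuracy(h_simple, data))
print(regularized_objective(h_simple, 1, data), regularized_objective(h_complex, 2, data))
```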
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Overfitting
Definition 2.5.1 (overfitting)
We say that h \in H overfits the training data D_{tr} if there exists h' \in H such that
accuracy_{D_{tr}}(h) > accuracy_{D_{tr}}(h') and accuracy(h) < accuracy(h').   (2.21)
Since we can't compute accuracy(h) or accuracy(h') as mentioned above, a practical definition of overfitting changes this to:
accuracy_{D_{tr}}(h) > accuracy_{D_{tr}}(h') and accuracy_{D_{va}}(h) < accuracy_{D_{va}}(h').   (2.22)
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F49/67 (pg.99/163)
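A tiny sketch (my own illustration, with assumed model names and made-up accuracy numbers) of the practical test (2.22): compare two hypotheses on the training and validation splits.

```python
def overfits(h, h_prime, accuracy_train, accuracy_val):
    # Practical definition (2.22): h overfits (relative to h') when it beats h' on
    # the training split but loses to h' on the validation split.
    return (accuracy_train[h] > accuracy_train[h_prime]
            and accuracy_val[h] < accuracy_val[h_prime])

# Assumed measured accuracies for a degree-5 and a degree-2 polynomial model
accuracy_train = {"poly5": 0.99, "poly2": 0.92}
accuracy_val = {"poly5": 0.71, "poly2": 0.90}
print(overfits("poly5", "poly2", accuracy_train, accuracy_val))   # True
```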
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Underfitting
Definition 2.5.2 (underfitting)
We say that h \in H underfits the training data D_{tr} if there exists h'' \in H such that
accuracy_{D_{tr}}(h) < accuracy_{D_{tr}}(h'') and accuracy(h) < accuracy(h'').   (2.23)
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F50/67 (pg.100/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Overfitting and Underfitting
[Figure: three panels plotting accuracy vs. the complexity (or capability, or capacity) of the model, each showing a training set accuracy curve, a validation set accuracy curve, and underfitting/fitting/overfitting ranges.
(a) The highly typical case is when the training set accuracy is higher than the validation set accuracy.
(b) It is possible, but somewhat unlikely, that the validation set accuracy crosses the training set accuracy. The overfitting/underfitting ranges are still the same.
(c) It is possible, but very unlikely, that the validation set accuracy is higher than the training set accuracy, as shown in this plot. But the overfitting/underfitting ranges are still the same.]
Figure: Overfitting and underfitting shown as a function of model complexity for a fixed training set size n_{tr}. Any h in the red region overfits the training set. Any h in the yellow region underfits the training set. Any h in the green (middle) region properly fits the training set. The regions are all based on the accuracy accuracy_{D_{va}}(h) computed on a validation data set D_{va}, but the same principle would hold if it were possible to measure the accuracy accuracy(h) on the entire distribution.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F51/67 (pg.101/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Learning Curves
How does overfitting/underfitting depend on the size of the training data set? A given model (with fixed complexity) will tend to overfit a small training data set and underfit a large training data set.
[Figure: two learning-curve panels of accuracy vs. the number of training samples n_{tr}, each showing training set and validation set accuracy, with high variance at small n_{tr} and low variance at large n_{tr}. (a) Learning curve with a low complexity (i.e., \Omega(h) small) model. (b) Learning curve with a high complexity (i.e., \Omega(h) large) model.]
On the left, the model underfits since it does not have much capability even with much training data. On the right, the model starts out overfitting but eventually fits with enough training data.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F52/67 (pg.102/163)
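A sketch (my own, with an assumed synthetic data process) of how one might trace a learning curve: fit a fixed-complexity model on growing training sets and record the training and validation MSE at each size.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 7, n)
    y = np.sin(x) + rng.normal(0, 0.3, n)      # assumed ground-truth process
    return x, y

x_val, y_val = make_data(200)                  # held-out validation set
degree = 5                                     # fixed model complexity

for n_tr in (8, 16, 32, 64, 128):
    x_tr, y_tr = make_data(n_tr)
    coeffs = np.polyfit(x_tr, y_tr, degree)    # train on n_tr samples
    mse_tr = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
    mse_va = np.mean((y_val - np.polyval(coeffs, x_val)) ** 2)
    # With few samples the train/validation gap (overfitting) is large; it shrinks as n_tr grows.
    print(n_tr, round(mse_tr, 3), round(mse_va, 3))
```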
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Real-world consequences of overfitting
The Fukushima nuclear power plant disaster in 2011 was caused by overfitting a model to data!! See https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/ and https://mpra.ub.uni-muenchen.de/69383/1/MPRA_paper_69383.pdf
[Figure: earthquake data going back 400 years and the model fit.]
A non-linear model fit predicts a magnitude 9 earthquake about every \approx 13000 years. This was used to design the Fukushima Daiichi nuclear power plant to withstand an 8.6 magnitude earthquake and a tsunami of 5.7 meters. In 2011: a 9.0 earthquake and a 14 meter tsunami!!
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F53/67 (pg.105/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Real-world consequences of overfitting
The log-linear model shows a different trend:
A linear model fit predicts a magnitude 9 earthquake about every 300 years, and would have led to quite a different reactor design.
Moral: Overfitting (or underfitting) can have huge consequences!
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F54/67 (pg.110/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Overfitting slider on the web
See "Visualization of the bias-variance tradeoff" at the following link entitled "The Bias-Variance Dilemma": https://medium.com/@ml.at.berkeley/machine-learning-crash-course-part-4-the-bias-variance-dilemma-a94e60ec1d3
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F55/67 (pg.113/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
The random process of training
Recall the training data D = \left( (x^{(i)}, y^{(i)}) \right)_{i=1}^{n}, where (x^{(i)}, y^{(i)}) \sim p(x, y) are drawn from some distribution, x^{(i)} \in \mathbb{R}^m and y^{(i)} \in \mathbb{R}.
The training data is a random sample, and is itself random.
We fit a model h_\theta, where \theta is derived from the training procedure.
We can think of \theta(D) as the parameters of the model obtained via the process of model fitting. \theta(D) is a random variable, since it is a deterministic function of the random sample D of size n.
Each time we draw a different training set, we might get a (hopefully only very slightly) different \theta.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F56/67 (pg.114/163)
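A short Monte Carlo sketch (my own illustration, under an assumed linear-Gaussian data model) of \theta(D) as a random variable: redraw the training set several times, refit, and watch the fitted coefficients vary.

```python
import numpy as np

rng = np.random.default_rng(2)
true_theta = np.array([1.0, 2.0])           # assumed ground-truth linear model

def draw_dataset(n):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # bias + one feature
    y = X @ true_theta + rng.normal(0, 0.5, n)
    return X, y

def fit(X, y):
    # Least-squares fit: theta(D) is a deterministic function of the random sample D
    return np.linalg.lstsq(X, y, rcond=None)[0]

thetas = np.array([fit(*draw_dataset(30)) for _ in range(5)])
print(thetas)        # each row is a slightly different theta, one per redrawn D
```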
[Handwritten note: \Pr(D) = \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}), so D is itself a random variable.]
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance Intuition (on one slide).
When we are fitting a model h_\theta to data D, we optimize over (or can choose) any model in the model family (e.g., the linear model family, the logistic model, whether we include higher order polynomial features, etc.).
High bias, low variance (underfitting): if the model family is too simple (low complexity), it will never match important characteristics in the data; it is, in such a case, biased. Even if we train on multiple different data sets, we'll get approximately the same model h_\theta since the model family is not that capable or flexible. All are wrong in the same way. Thus, the variance of the random variable h_\theta(D) is low!
Low bias, high variance (overfitting): if the model family is too complex, it will match unimportant idiosyncrasies in the data. The bias is low, since it matches the data quite well. When we train on multiple different data sets, each with its own (often random) idiosyncrasies, we'll get very different models h_\theta since the model family can match any idiosyncrasies in the data. Thus, the variance of the random variable h_\theta(D) is high!
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F57/67 (pg.119/163)
[Handwritten note: "inductive bias"; e.g., the constant model h(x) = c.]
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance
Recall the lecture 1 goal: find h_\theta : X \to Y with minimum error.
The best solution, \operatorname*{argmin}_h E_{p(x,y)}[(y - h(x))^2], leads to
h^*(x) = \int y \, p(y|x) \, dy = E[Y|x] = \text{best association},   (2.24)
assuming we have the distribution p, the resources to compute E[Y|x], and a model family spanned by h that includes the functional form of E[Y|x] (i.e., E[Y|x] is realizable, as opposed to the agnostic case, which is not realizable). This is the best we can do in theory.
In practice, we have training data D and a limited model family H (e.g., linear models), and instead do:
h \in \operatorname*{argmin}_{h \in H} \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - h(x^{(i)}))^2   (2.25)
Two reasons we might not do well: (a) the wrong model family H, and (b) not enough data (n too small).
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F58/67 (pg.128/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Validation and Bias/Variance
How do we measure how good we are? We could consider measuring on the same training data we've got:
\frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - h(x^{(i)}))^2 = \frac{1}{|D|} \sum_{(x,y) \in D} (y - h(x))^2   (2.26)
This is already intuitively bad for the bias/variance reasons we've discussed: each D gives a different solution, and each h_\theta(D) can look good on its own data.
The ideal approach tests on all data (e.g., it includes future samples we didn't train on):
E_{p(x,y)}[(y - h(x))^2]   (2.27)
This is bad since it is impractical at best and (more likely) impossible.
The typical approach: draw a separate validation data set D_{va}, with D_{va} \cap D = \emptyset, and to try to get at the generalization error, compute:
\frac{1}{|D_{va}|} \sum_{(x,y) \in D_{va}} (y - h(x))^2   (2.28)
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F59/67 (pg.132/163)
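A short sketch (my own, with an assumed polynomial model and synthetic data) contrasting the training MSE (2.26) with the held-out validation MSE (2.28):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 7, 40)
y = np.sin(x) + rng.normal(0, 0.3, 40)

# Disjoint training and validation splits (D_tr and D_va)
x_tr, y_tr = x[:20], y[:20]
x_va, y_va = x[20:], y[20:]

coeffs = np.polyfit(x_tr, y_tr, 9)                        # deliberately high-degree fit
mse_tr = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)  # eq. (2.26): optimistic on its own data
mse_va = np.mean((y_va - np.polyval(coeffs, x_va)) ** 2)  # eq. (2.28): estimates generalization error
print(mse_tr, mse_va)                                     # typically mse_va is much larger here
```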
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance, Overall Error
Ideal solution h^*(x) = E[Y|x], and the random estimate from a data set is
h_D(x) \triangleq h_{\theta(D)} = \operatorname*{argmin}_{h \in H} \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - h(x^{(i)}))^2.
Let's consider measuring the overall error for any h:
error(h) = E_{p(x,y)}[(h(x) - y)^2] = \int (h(x) - y)^2 p(y|x) p(x) \, dx \, dy   (2.29)
  = E_{p(x,y)}[(h(x) - E[Y|x] + E[Y|x] - y)^2]   (2.30)
  = E_{p(x,y)} \big[ (h(x) - E[Y|x])^2   (2.31)
    + 2 (h(x) - E[Y|x])(E[Y|x] - y)   (2.32)
    + (E[Y|x] - y)^2 \big]   (2.33)
  = E_{p(x,y)}[(h(x) - E[Y|x])^2]   (2.34)
    + 2 (E[h(x)] - E[Y|X])(E[Y|x] - E[Y|x])   (2.35)
    + E_{p(x,y)}[(E[Y|x] - y)^2]   (2.36)
Note: the 2nd term cancels out.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F60/67 (pg.135/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
2nd Term Cancels Out
Why does the second term cancel out?
E[Y|x] = \int y \, p(y|x) \, dy = best association, and it is a deterministic function of x: the expected value of the random variable Y under the conditional distribution p(y|x) for a given x. Therefore,
E_{p(x,y)} \big[ 2 (h(x) - E[Y|x])(E[Y|x] - y) \big]   (2.37)
  = 2 \int \int p(x) p(y|x) \big[ (h(x) - E[Y|x])(E[Y|x] - y) \big] \, dy \, dx   (2.38)
  = 2 \int p(x) (h(x) - E[Y|x]) \Big[ \int p(y|x) (E[Y|x] - y) \, dy \Big] \, dx   (2.39)
  = 2 \int p(x) (h(x) - E[Y|x]) \big[ E[Y|x] - E[Y|x] \big] \, dx = 0   (2.40)
Aside: E[Y|X] = \int y \, p(y|X) \, dy is a deterministic function of the r.v. X, and hence E[Y|X] is itself a r.v. with mean E[E[Y|X]].
Thus, E[E[Y|X]] = \int E[Y|x] p(x) \, dx = \int \big[ \int y \, p(y|x) p(x) \, dy \big] dx = \int y \big[ \int p(y|x) p(x) \, dx \big] dy = \int y \, p(y) \, dy = E[Y].
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F61/67 (pg.137/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance, the r.v. h_D(x)
Thus the error has only two terms:
error(h) = E_{p(x,y)}[(h(x) - E[Y|x])^2] + E_{p(x,y)}[(E[Y|x] - y)^2]   (2.41)
The second term has nothing to do with the model h whose error we are measuring; it is the inherent error, due to the random process and label noise. We can neglect this term in our study of bias/variance.
The first term can be simplified to E_{p(x)}[(h(x) - E[Y|x])^2] and is the learning error (or MSE). It is zero when h(x) = h^*(x) = E[Y|x].
h_D(x) is a (random) learnt model from the (random variable) data set D. Hence, h_D(x) is a random variable (a deterministic function, namely the learning/optimization process, of the random data set D), and it has a mean E_D[h_D(x)] and a variance E_D\big[ (h_D(x) - E_{D'}[h_{D'}(x)])^2 \big].
To clarify notation, we'll use D and D' to express things like h_D(x)'s variance E_D\big[ (h_D(x) - E_{D'}[h_{D'}(x)])^2 \big], where D and D' are two independent and identically distributed (iid) random variables over data sets.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F62/67 (pg.142/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance Breakdown
We further analyze only this first term, and take ensemble averages over D, the random data sample. h_D(x) is a (random) learnt model from the (random variable) data set D.
For a given x, we have

E_D[(h_D(x) − E[Y|x])²] = E_D[(h_D(x) − E_D[h_D(x)] + E_D[h_D(x)] − E[Y|x])²]
                        = E_D[ (h_D(x) − E_D[h_D(x)])²                                      (2.42)
                             + 2(h_D(x) − E_D[h_D(x)])(E_D[h_D(x)] − E[Y|x])
                             + (E_D[h_D(x)] − E[Y|x])² ]                                    (2.43)
                        = E_D[(h_D(x) − E_D[h_D(x)])²]                                      (2.44)
                             + (E_D[h_D(x)] − E[Y|x])²                                      (2.45)
                        = variance(x) + bias²(x)                                            (2.46)

(The cross term in (2.43) vanishes in expectation: E_D[h_D(x)] − E[Y|x] is a constant with respect to D, and E_D[h_D(x) − E_D[h_D(x)]] = 0.)
We then take E_{p(x)}[·] to get the overall bias and variance over all x.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F63/67 (pg.147/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Error (MSE) and Bias/Variance Breakdown
MSE = E_{p(x)}[E_D[(h_D(x) − E[Y|x])²]]                                                     (2.47)
    = E_{p(x)}[E_D[(h_D(x) − E_D[h_D(x)])²]] + E_{p(x)}[(E_D[h_D(x)] − E[Y|x])²]
    = variance + bias²                                                                      (2.48)

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F64/67 (pg.150/163)
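As a sanity check on eqs. (2.42)-(2.48), here is a minimal Monte Carlo sketch (not from the slides). It again assumes a toy process y = sin(2πx) + Gaussian noise, and deliberately uses an underfitting model class (degree-1 polynomials) so both terms are visibly nonzero; the empirical MSE should equal variance + bias².

# Sketch (assumptions noted above): MSE = variance + bias^2.
import numpy as np

rng = np.random.default_rng(1)
sigma, n_train, n_datasets = 0.3, 30, 2000

def f_star(x):
    # E[Y|x] for the assumed toy process.
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.0, 1.0, 200)          # stand-in for E_{p(x)}[.]
preds = np.empty((n_datasets, x_test.size))
for i in range(n_datasets):                  # independent draws of D
    x = rng.uniform(0.0, 1.0, size=n_train)
    y = f_star(x) + rng.normal(0.0, sigma, size=n_train)
    coeffs = np.polyfit(x, y, 1)             # h_D: degree-1 (underfitting) fit
    preds[i] = np.polyval(coeffs, x_test)

mean_pred = preds.mean(axis=0)                           # E_D[h_D(x)]
variance  = ((preds - mean_pred) ** 2).mean()            # eq. (2.44), averaged over x
bias_sq   = ((mean_pred - f_star(x_test)) ** 2).mean()   # eq. (2.45), averaged over x
mse       = ((preds - f_star(x_test)) ** 2).mean()       # left-hand side, eq. (2.47)

print(mse, variance + bias_sq)               # the two numbers agree: the decomposition holds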
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance, Unbiased estimator
To be an unbiased estimation procedure means that E_D[h_D(x)] = E[Y|x], i.e., that the bias is zero.
We do sometimes have zero bias together with a variance that depends on the size of the data.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F65/67 (pg.151/163)
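A minimal illustration (not from the slides) of an unbiased procedure whose variance depends on the data size: the sample mean of n iid N(μ, σ²) draws has zero bias and variance σ²/n, which shrinks as n grows.

# Sketch (assumptions noted above): the sample mean as an unbiased estimator.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.5, 2.0

for n in (10, 100, 1000):
    # 5000 independent datasets of size n; estimate mu by the sample mean.
    estimates = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    print(n, estimates.mean() - mu, estimates.var(), sigma ** 2 / n)
    # bias ~ 0 for every n, while the empirical variance tracks sigma^2 / n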
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance for Linear Least Squares
Linear model with noise: y = h_{θ*}(x) + ε = θ*ᵀx + ε, where ε ~ N(0, σ²). Suppose this is the true generative process for some θ*.
Given training data D with corresponding n × m design matrix X and length-n column vector ~y, we have the relationship ~y = Xθ* + ~ε, where ~ε is a length-n vector of Gaussians, ε_i ~ N(0, σ²).
MLE parameter estimate: θ̂ = (XᵀX)⁻¹Xᵀ~y = θ* + (XᵀX)⁻¹Xᵀ~ε, a noisy (r.v.) version of θ*.
Best estimate: E[Y|x] = θ*ᵀx.
Recall the error, which has only two terms: the model error (which can be broken into the bias squared and the variance) and the inherent error:

error(h) = E_{p(x,y)}[(h(x) − E[Y|x])²] + E_{p(x,y)}[(E[Y|x] − y)²]                         (2.49)

In the current case, the inherent error is

E_{p(x,y)}[(E[Y|x] − y)²] = σ²                                                              (2.50)

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F66/67 (pg.153/163)
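A small numerical sketch (not from the slides) of the identity above, θ̂ = (XᵀX)⁻¹Xᵀ~y = θ* + (XᵀX)⁻¹Xᵀ~ε, under the assumed Gaussian-noise linear model; the particular dimensions, noise level, and Gaussian design are arbitrary choices made only for illustration.

# Sketch (assumptions noted above): theta_hat = theta* + (X^T X)^{-1} X^T eps.
import numpy as np

rng = np.random.default_rng(3)
n, m, sigma = 200, 5, 0.5
theta_star = rng.normal(size=m)

X = rng.normal(size=(n, m))                          # n x m design matrix
eps = rng.normal(0.0, sigma, size=n)
y = X @ theta_star + eps                             # ~y = X theta* + ~eps

theta_hat  = np.linalg.solve(X.T @ X, X.T @ y)       # (X^T X)^{-1} X^T ~y
noise_term = np.linalg.solve(X.T @ X, X.T @ eps)     # (X^T X)^{-1} X^T ~eps

# theta_hat is theta* plus a zero-mean random perturbation driven by eps:
print(np.allclose(theta_hat, theta_star + noise_term))   # True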
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance for Linear Least Squares
Random model h_θ̂ based on the random sample D, so E_D[h_θ̂] = E_D[h_{θ* + (XᵀX)⁻¹Xᵀ~ε}] = h_{θ*}, since E_D[~ε] = 0.
Thus, the bias squared at x is

(E_D[h_D(x)] − E[Y|x])² = 0                                                                 (2.51)

The variance increases with m and decreases with n (the sample size):

E_D[(h_D(x) − E_D[h_D(x)])²] = σ²m/n                                                        (2.52)

A famous result (the Gauss-Markov theorem) states that among all linear unbiased estimators, the linear least squares (LLS) estimator has the smallest variance, and hence the smallest (mean squared) error of all unbiased linear estimators! I.e., Var(h_{θ_LLS}(x)) ≤ Var(h_{θ_any linear unbiased}(x)).
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F67/67 (pg.160/163)
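A Monte Carlo sketch (not from the slides) of eqs. (2.51)-(2.52): it refits the LLS estimator on many independently drawn training sets, with test points drawn from the same input distribution as the training inputs, and compares the empirical bias² (≈ 0) and variance (≈ σ²m/n). The Gaussian design and the specific n, m, σ are assumptions made only for the illustration.

# Sketch (assumptions noted above): empirical bias^2 and variance of LLS.
import numpy as np

rng = np.random.default_rng(4)
n, m, sigma, n_trials = 200, 5, 0.5, 3000
theta_star = rng.normal(size=m)

x_test = rng.normal(size=(1000, m))          # test points x ~ p(x)
preds = np.empty((n_trials, x_test.shape[0]))
for t in range(n_trials):                    # independent training sets D
    X = rng.normal(size=(n, m))
    y = X @ theta_star + rng.normal(0.0, sigma, size=n)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    preds[t] = x_test @ theta_hat            # h_D(x) evaluated on the test points

mean_pred = preds.mean(axis=0)                               # E_D[h_D(x)]
bias_sq   = ((mean_pred - x_test @ theta_star) ** 2).mean()  # eq. (2.51): ~ 0
variance  = ((preds - mean_pred) ** 2).mean()                # eq. (2.52)

print("bias^2  :", bias_sq)
print("variance:", variance, "vs sigma^2 * m / n =", sigma ** 2 * m / n)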