Advanced Introduction to Machine Learning
— Spring Quarter, Week 2 — https://canvas.uw.edu/courses/1372141
Prof. Jeff Bilmes
University of Washington, Seattle
Departments of: Electrical & Computer Engineering, Computer Science & Engineering
http://melodi.ee.washington.edu/~bilmes
April 6th/8th, 2020
Logistics Review
Announcements
HW1 to be posted this evening, due in 1.5 weeks.
Virtual office hours this week, Thursday night at 10:00pm via Zoom (same link as class).
Class Road Map
W1 (3/30, 4/1): What is ML, Probability, Coins, Gaussians and linear regression, Associative Memories, Supervised Learning
W2 (4/6, 4/8): More supervised, logistic regression, complexity and bias/variance tradeoff
W3 (4/13, 4/15): Bias/Variance, Regularization, Ridge, CrossVal, Multiclass
W4 (4/20, 4/22): Multiclass classification, ERM, Gen/Disc, Naive Bayes
W5 (4/27, 4/29): Lasso, Regularizers, Curse of Dimensionality
W6 (5/4, 5/6): Curse of Dimensionality, Dimensionality Reduction, k-NN
W7 (5/11, 5/13): k-NN, LSH, DTs, Bootstrap/Bagging, Boosting & Random Forests, GBDTs
W8 (5/18, 5/20): Graphs; Graphical Models (Factorization, Inference, MRFs, BNs)
W9 (5/27, 6/1): Learning Paradigms; Clustering; EM Algorithm
W10 (6/3, 6/8): Spectral Clustering, Graph SSL, Deep models, (SVMs, RL); The Future
Last lecture is 6/8 since 5/25 is a holiday (or we could just have lecture on 5/25).
Acknowledgments/References
Some of the below material was drawn from:
Bishop, 1996.
https://courses.cs.washington.edu/courses/cse546/18au/
https://courses.cs.washington.edu/courses/cse546/16au/
https://courses.cs.washington.edu/courses/cse546/14au/
http://cs229.stanford.edu/syllabus.html
https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/
Review
Each day, this is where we review material from the previous lecture.
Some readings
Matrix cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
Linear algebra notes: http://cs229.stanford.edu/section/cs229-linalg.pdf
Writeup on Overfitting and Underfitting on our web page (https://canvas.uw.edu/courses/1372141); see in particular https://canvas.uw.edu/courses/1372141/discussion_topics/5384617
Class (and Machine Learning) overview
1. Introduction: What is ML; What is AI; Why are we so interested in these topics right now?
2. ML Paradigms/Concepts: Overfitting/Underfitting, model complexity, bias/variance; size of data, big data, sample complexity; ERM, loss + regularization, loss functions, regularizers; supervised, unsupervised, and semi-supervised learning; reinforcement learning, RL, multi-agent, planning/control; transfer and multi-task learning; federated and distributed learning; active learning, machine teaching; self-supervised, zero/one-shot, open-set learning
3. Dealing with Features: dimensionality reduction, PCA, LDA, MDS, T-SNE, UMAP; locality sensitive hashing (LSH); feature selection; feature engineering; matrix factorization & feature engineering; representation learning
4. Evaluation: accuracy/error, precision/recall, ROC, likelihood/posterior, cost/utility, margin; train/eval/test data splits; n-fold cross validation; method of the bootstrap
5. Optimization Methods: unconstrained continuous optimization: (stochastic) gradient descent (SGD), adaptive learning rates, conjugate gradient, 2nd-order Newton; constrained continuous optimization: Frank-Wolfe (conditional gradient descent), projected gradient, linear, quadratic, and convex programming; discrete optimization: greedy, beam search, branch-and-bound, submodular optimization
6. Inference Methods: probabilistic inference; MLE, MAP; belief propagation; forward/backpropagation; Monte Carlo methods
7. Models & Representation: linear least squares, linear regression, logistic regression, sparsity, ridge, lasso; generative vs. discriminative models; Naive Bayes; k-nearest neighbors; clustering, k-means, k-medoids, EM & GMMs, single linkage; decision trees and random forests; support vector machines, kernel methods, max margin; perceptron, neural networks, DNNs; Gaussian processes; Bayesian nonparametric methods; ensemble methods; the bootstrap, bagging, and boosting; graphical models; time-series, HMMs, DBNs, RNNs, LSTMs, Attention, Transformers; structured prediction; grammars (as in NLP)
8. Philosophy, Humanity, Spirituality: artificial intelligence (AI); artificial general intelligence (AGI); artificial intelligence vs. science fiction
9. Applications: computational biology; social networks; computer vision; speech recognition; natural language processing; information retrieval; collaborative filtering/matrix factorization
10. Programming: python; libraries (e.g., NumPy, SciPy, matplotlib, scikit-learn (sklearn), pytorch, CNTK, Theano, tensorflow, keras, H2O, etc.); HPC: C/C++, CUDA, vector processing
11. Background: linear algebra; multivariate calculus; probability theory and statistics; information theory; mathematical (e.g., convex) optimization
12. Other Techniques: compressed sensing; submodularity, diversity/homogeneity modeling
Figure: the overview slide also shows a six-node graphical model (x1, ..., x6) whose marginal p(x1, x2) = Σ_{x3} Σ_{x4} ... Σ_{x6} p(x1, x2, ..., x6) is computed by variable elimination over pairwise factors, with the reconstituted graph and per-variable elimination complexity (O(r^2) up to O(r^4)) shown for each step, plus a feed-forward network diagram with an input layer, seven hidden layers, and an output unit.
Strategy
Strategy for the next period of time.
For some topic in (2):
    for subtopic in subset of (6) relevant to topic.
Traditional Computer Programming vs. ML
Let us change our traditional attitude to the construction of programs. Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do. -- Donald Knuth
Figure: a human programming a computer writes an algorithm/computer program; the computer then produces something seemingly useful.
Traditional Computer Programming vs. ML
Machine Learning is the art of repeatedly telling a computer what one wants the computer to tell a second computer about a lot of data. This continues until the second computer gets it right.

Figure: a human programming a computer writes an algorithm/computer program; fed training data, that computer writes a second algorithm/computer program which, run on another computer over test data, produces something seemingly intelligent.
Traditional Computer Programming vs. ML
https://imarticus.org/what-is-machine-learning-and-does-it-matter/
other defs of ML: https://www.kdnuggets.com/2018/12/essence-machine-learning.html
Probability and Uncertainty
Key point: the world is a complicated place, we cannot know everything, and even of what we think we know we cannot (nor should we) always be certain. Uncertainty abounds!
We need a representation of uncertainty.
Probability has a precise mathematical definition (Kolmogorov axioms), but we use it in deference to the inevitable uncertainty surrounding all decisions.
Simple and subjective working definition:

probability = (number of cases something happened) / (number of total cases)   (1.2)

Good for repeatable measurable events (e.g., coin flips, dice, etc.). Harder for future events (probability it will rain tomorrow, probability Manchester City beats Liverpool, etc.).
Despite shortcomings, used as a representation of uncertainty/certainty (e.g., the probability that image x contains the face of person y).
Machine learning often strives for the "best" probabilities on data using learning algorithms.
Coin Flipping and ML
D = {b_1, b_2, ..., b_n} is a series of n independent and identical coin flips, b_i ∈ {H, T}.
k = |{i : b_i = H}| is the count of the number of heads in D.
How true, or likely, is it that θ is the probability of heads?

Pr(D|θ) = θ^k (1 − θ)^{n−k} = likelihood of D given θ   (1.2)

How to find the most likely explanation of D? Maximum likelihood:

θ_MLE = argmax_{θ ∈ [0,1]} Pr(D|θ) = argmax_{θ ∈ [0,1]} log Pr(D|θ)   (1.3)

How to find θ_MLE? Calculus: setting ∂/∂θ log Pr(D|θ) = 0 leads to

θ_MLE = k/n   (1.4)

Thus, computing k and dividing by n is a simple way to learn!
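As a quick aside (not from the slides), here is a minimal NumPy sketch of this estimator on synthetic flips; the sample size and true heads probability below are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                        # assumed "true" heads probability (made up)
flips = rng.random(1000) < theta_true   # D: n i.i.d. coin flips, True = heads

k = int(flips.sum())    # k = number of heads in D
n = flips.size          # n = number of flips
theta_mle = k / n       # theta_MLE = k/n
print(theta_mle)        # close to 0.7 for large n
```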
Learning Gaussians
Given the data sample D, without knowing μ, C, how likely is the sample under some hypothesized parameters μ, C?

log Pr(D|μ, C) = Σ_{i=1}^n log Pr(x_i|μ, C)   (1.3)
               ≜ log likelihood of D given μ, C   (1.4)

How to find the most likely explanation of D? Maximum likelihood:

[μ_MLE, C_MLE] = argmax_{μ ∈ R^m, C ∈ P(m)} log Pr(D|μ, C)   (1.5)

How to find the MLE quantities? Again calculus: ∂/∂μ log Pr(D|μ, C) = 0 and ∂/∂C log Pr(D|μ, C) = 0 lead to

μ_MLE = (1/n) Σ_{i=1}^n x_i   and   C_MLE = (1/n) Σ_{i=1}^n (x_i − μ_MLE)(x_i − μ_MLE)^⊤   (1.6)
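A minimal NumPy sketch of these two estimators, assuming synthetic 2-D data; the "true" parameters below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])                            # invented parameters
C_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, C_true, size=5000)    # rows are samples x_i

mu_mle = X.mean(axis=0)                  # (1/n) sum_i x_i
diff = X - mu_mle
C_mle = diff.T @ diff / X.shape[0]       # (1/n) sum_i (x_i - mu)(x_i - mu)^T
# Note: this is the MLE (divide by n); np.cov defaults to the unbiased 1/(n-1) version.
```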
Associations and Associative Memories
Associative memory, auto-associative memory, or hetero-associative memory. In general, associate x ∈ X to y ∈ Y via h : X → Y.
Examples: memory subsystem (separate address for each x ∈ X), data structures like hash tables, or red-black trees, etc.
Often X, Y are very large, and we have only a sample of associations D = {(x_i, y_i)}_{i=1}^n where x_i ∈ X, y_i ∈ Y, and n ≪ |X|.
We want to build an associative memory that works even outside of D. That is, even for x ∉ {x : x = x_i for some i ∈ [n], (x_i, y_i) ∈ D}.
Why? D might not be complete; variation, noise, or possible data corruption may not be fully captured in D. Also, X might be infinitely large.
Associations and Associative Memories
Machine learning: write an algorithm that, given large enough D, produces a program h that generalizes (works) well on unseen samples. Respond reasonably to variation, noise, and data corruption (be robust). Do this computationally as efficiently as possible, and (ideally) understand it mathematically.
Boils down to finding a good h : X → Y that can do the mapping (association). Sometimes we choose some h ∈ H where H is a large collection of possible associators. More frequently, h is parameterized via some parameters θ and we find a good θ leading to h_θ.
Many ways to do this; it depends on the nature of X, Y, how big the data is (number of samples n), and the available resources (compute, core machine memory/RAM, storage/disk, communication (latency/bandwidth), time, money, energy usage).
Often, x ∈ R^m is an m-dimensional vector of features. In general, x is known as a feature vector.
Statistical parameter estimation
Training data D = {(x^{(i)}, y^{(i)})}_{i=1}^n where (x^{(i)}, y^{(i)}) ~ p(x, y) are drawn from some distribution, x^{(i)} ∈ R^m and y^{(i)} ∈ R.
x^{(i)} is an m-dimensional column vector of features, y^{(i)} is a scalar.
Goal: find h_θ : X → Y with minimum error, where

Error_i = e_i = h_θ(x^{(i)}) − y^{(i)}   (1.3)

E[e^2] = E_{p(x,y)}[(h_θ(x) − y)^2] = ∫ p(x, y)(h_θ(x) − y)^2 dx dy   (1.4)
       = ∫ p(x) ∫ (h_θ(x) − y)^2 p(y|x) dy dx   (1.5)

and θ ∈ R^m is a parameter vector, θ = (θ_1, θ_2, ..., θ_m), θ_i ∈ R.
Taking derivatives and setting to zero, we get the best solution:

h_θ(x) = ∫ y p(y|x) dy = E[Y|x] = best association.   (1.6)

This assumes we have the distribution p and also the resources to compute E[Y|x].
Linear estimator: Objective Optimization
Recall, h_θ(x) ≜ θ^⊤x is parameterized by parameters θ, so

J(θ) = (1/n) Σ_{i=1}^n (h_θ(x^{(i)}) − y^{(i)})^2   (1.3)

Taking the derivative of the error objective J(θ) w.r.t. θ and setting it to zero gives:

∂J/∂θ = (2/n) Σ_{i=1}^n (h_θ(x^{(i)}) − y^{(i)}) ∂h_θ(x^{(i)})/∂θ = 0   (1.4)

The linear assumption h_θ(x) = x^⊤θ yields ∂h_θ(x^{(i)})/∂θ = x^{(i)}.
Linear Least Squares
This gives the objective to be minimized (the smallest, or least, sum of squares of the errors):

∂J(θ)/∂θ = (2/n) Σ_{i=1}^n (x^{(i)⊤}θ − y^{(i)}) x^{(i)} = 0   (1.3)

We simplify this a bit by defining matrices associated with these quantities. First define an n×m design matrix X, whose rows are x^{(1)⊤}, x^{(2)⊤}, ..., x^{(n)⊤}, and a length-n column vector ~y = (y_1, y_2, ..., y_n)^⊤.   (1.4)

The objective, in equivalent matrix-vector form:

J(θ) = (1/2)(Xθ − ~y)^⊤(Xθ − ~y)   (1.5)
Normal Equations
With this, we get the "normal equations":

∇_θ J(θ) = X^⊤(Xθ − ~y) = ~0   (1.3)

i.e., modeling ~y to be in the column space of matrix X (linear combinations of columns of X), when ~y is being approximated by Xθ.
Called normal equations because the column space of X is orthogonal to the residual error E = (~y − Xθ̂), giving the solution θ = θ̂ as shown.

Figure: ~y (what is to be approximated), the error ~y − Xθ̂, and the actual approximation Xθ̂ (the closest point to ~y) lying in the space of possible approximations {y : y = Xθ, θ ∈ R^m}, the column space of X.

If X^⊤X is invertible (typical if n ≥ m), the solution has the form:

θ̂ = (X^⊤X)^{-1} X^⊤ ~y

where (X^⊤X)^{-1} X^⊤ is known as the Moore-Penrose pseudo-inverse of matrix X.
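A small NumPy sketch of solving the normal equations on synthetic data; the dimensions, noise level, and true θ below are assumptions for illustration, and np.linalg.lstsq computes the same least-squares solution more stably.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))                      # design matrix, rows are x^(i)T
theta_true = np.array([0.5, -1.0, 2.0])          # invented ground truth
y = X @ theta_true + 0.1 * rng.normal(size=n)    # noisy targets

# Normal equations: X^T X theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalent, and numerically preferable in practice:
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```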
Gradient Descent, Batch Gradient Descent
Gradient updates for all elements of θ at the same time and for sample pair (x^{(i)}, y^{(i)}):

θ ← θ + α(y^{(i)} − h_θ(x^{(i)})) x^{(i)} = θ + α(y^{(i)} − θ^⊤x^{(i)}) x^{(i)}   (1.9)

move θ in the direction of x^{(i)} weighted by α(y^{(i)} − h_θ(x^{(i)})) ∈ R, α times the error.
Called the LMS (least mean squares) update rule, also called the Widrow-Hoff (early NN folks) learning rule.
Batch Gradient Descent uses J(θ) = (1/n) Σ_{i=1}^n (h_θ(x^{(i)}) − y^{(i)})^2, and since the gradient is a linear operator, this yields the following:

Algorithm 2: Batch Gradient Descent learning
Input: Training data D, learning rate α, initial parameter estimate θ
Output: Learnt model parameters θ
1 for t = 1, ..., T do
2     θ ← θ + α Σ_{i=1}^n (y^{(i)} − h_θ(x^{(i)})) x^{(i)}
Return: the final parameters θ
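A minimal Python sketch of Algorithm 2 for the linear least-squares objective; the learning rate, iteration count, and zero initialization are illustrative choices, not values from the slides.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, T=500):
    """Batch gradient descent for the linear least-squares objective (Algorithm 2).
    X: (n, m) design matrix; y: (n,) targets."""
    n, m = X.shape
    theta = np.zeros(m)
    for t in range(T):
        errors = y - X @ theta                   # y^(i) - h_theta(x^(i)) for all i at once
        theta = theta + alpha * (X.T @ errors)   # sum_i (y^(i) - h_theta(x^(i))) x^(i)
    return theta
```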
More visualization: Batch Gradient Descent
Incremental and Stochastic Gradient Descent
Algorithm 3: Incremental Gradient Descent (IGD) learning
Input: Training data D, learning rate α, initial parameter estimate θ
Output: Learnt model parameters θ
1 for t = 1, ..., T do
2     for i = 1, ..., n do
3         θ ← θ + α(y^{(i)} − h_θ(x^{(i)})) x^{(i)}
Return: the final parameters θ

Optimization folks (e.g., Bertsekas) call this incremental gradient methods. It is called Stochastic Gradient Descent (SGD) if we randomize (with or without replacement) the order of the data items.
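A matching sketch of Algorithm 3, with the data order shuffled each pass so it corresponds to the SGD variant; again the hyperparameters are illustrative.

```python
import numpy as np

def incremental_gradient_descent(X, y, alpha=0.01, T=20, seed=0):
    """Algorithm 3; shuffling the order each pass makes it SGD."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    theta = np.zeros(m)
    for t in range(T):
        for i in rng.permutation(n):             # randomized order -> stochastic gradient descent
            error = y[i] - X[i] @ theta          # scalar error on sample i
            theta = theta + alpha * error * X[i]
    return theta
```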
More visualization: Stochastic Gradient Descent
On Fitting (preview) | Classification, Logistic Regression | Fit Complexity, Bias/Variance
Underfitting vs. Overfitting
Figure: three scatter plots of the same xy data (x from 0 to 7, y from 0 to about 4.5), each overlaid with one of the fitted curves described below.
Fit a model with various input features (values of powers of x); the goal is to predict y based on xy-pair samples D = {(x^{(i)}, y^{(i)})}_i.
Fit models: left y = θ_0 + θ_1 x; middle y = θ_0 + θ_1 x + θ_2 x^2; right y = Σ_{j=0}^5 θ_j x^j.
Both the left and right plots poorly fit the data, but they are poor for different reasons. The left could be underfitting, and the right could be overfitting. The center plot looks better.
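To reproduce the flavor of these three fits, here is a small NumPy sketch; the data-generating curve and noise level are made up for the example, and np.polyfit performs the least-squares polynomial fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 7, size=12))                      # small synthetic sample
y = 1.0 + 0.8 * np.sqrt(x) + 0.2 * rng.normal(size=x.size)   # made-up smooth curve plus noise

for degree in (1, 2, 5):                   # left, middle, right plots
    theta = np.polyfit(x, y, deg=degree)   # least-squares fit in the monomial basis
    y_hat = np.polyval(theta, x)
    print(degree, np.mean((y_hat - y) ** 2))   # training error alone cannot reveal overfitting
```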
Overfitting definition (T. Mitchell)
We say that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).
Definition 2.3.1 (overfitting)
Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H, such that h has smaller error than h′ over the training examples, but h′ has a smaller overall error than h over the entire distribution (or data set) of instances.
We'll visit this topic again when we discuss bias/variance, but first let's discuss a few more models.
Linear Regression
Linear regression involves fitting a model of the form y = h_θ(x) = Σ_i θ_i x_i, where x_i is the i-th input feature and θ_i is the i-th parameter.
The model is linear in the parameters that we "regress" to: h_{αθ + α′θ′}(x) = α h_θ(x) + α′ h_{θ′}(x).
A reasonable starting model for when y ∈ R.
Logistic Regression
What if y is a category? E.g., y ∈ {0, 1, 2, ..., ℓ−1} for ℓ categories.
Simplest case is y ∈ {0, 1}, a binary label (more generally, y could be an integer, or sometimes ∈ {−1, +1}, or a vector of integers).
Categorical prediction, or binary classification: predict which class (zero or one) data object x is in (e.g., buy/sell, good/defective, fresh/stale, fraud/genuine, spam/not-spam, fake news/fake-fake news, etc.).
A linear model can still be used but might not be ideal; we do not need values outside the range [0, 1].
With probabilities, binary prediction can still be presented with a representation of uncertainty; rather than a crisp 0/1 decision, we can look at p(y = 0|x) vs. p(y = 1|x).
Most uncertain about x if p(y = 0|x) = p(y = 1|x) = 0.5. Most certain about x if p(y = 1|x) = 0 or p(y = 1|x) = 1.
Correctness is not the same as certainty!!!
It is harder for a linear model to give such an interpretation since linear model output has no bound, in general.
Making a decision
Given a probability model p(y|x), how do we make a final decision?
Let y ∈ {0, 1} be the true label and ŷ ∈ {0, 1} be a prediction.
Decide true if p(y = 1|x) ≥ τ where τ ∈ [0, 1] is a decision threshold, i.e., ŷ = 1_{p(y=1|x) ≥ τ}. A natural value is τ = 0.5, but other values are also not unreasonable.
Given a validation data set D_va = {(x^{(i)}, y^{(i)})}_{i=1}^n on which a classifier produces predictions {ŷ^{(i)}}_{i=1}^n, we can compute the following quantities:

True positives  TP = Σ_{i=1}^n 1_{ŷ^{(i)}=1 ∧ y^{(i)}=1}
True negatives  TN = Σ_{i=1}^n 1_{ŷ^{(i)}=0 ∧ y^{(i)}=0}
False positives FP = Σ_{i=1}^n 1_{ŷ^{(i)}=1 ∧ y^{(i)}=0}
False negatives FN = Σ_{i=1}^n 1_{ŷ^{(i)}=0 ∧ y^{(i)}=1}

Note that the number of samples n = TP + TN + FP + FN.
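A small Python helper, written for illustration, that computes these four counts from predicted probabilities and a threshold τ; accuracy and error (next slide) follow directly from them.

```python
import numpy as np

def confusion_counts(y_true, p_pos, tau=0.5):
    """TP, TN, FP, FN for predictions y_hat = 1{p(y=1|x) >= tau} (illustrative helper)."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(p_pos) >= tau).astype(int)
    tp = int(np.sum((y_hat == 1) & (y_true == 1)))
    tn = int(np.sum((y_hat == 0) & (y_true == 0)))
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))
    assert tp + tn + fp + fn == y_true.size    # n = TP + TN + FP + FN
    return tp, tn, fp, fn

# Accuracy = (tp + tn) / n and Error = 1 - Accuracy, as defined on the next slide.
```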
Accuracy and Error
Overall accuracy:

Accuracy = (total correct predictions)/(total predictions) = (TP + TN)/(TP + TN + FP + FN)   (2.1)

Error:

Error = 1.0 − Accuracy = (FP + FN)/(TP + TN + FP + FN)   (2.2)
Binary Confusion Matrix
Given n samples in a validation data set, we can plot the relationship between TP, TN, FP, FN as a confusion matrix:

                              Predicted Label
                              Positive    Negative
True Label    Positive          TP          FN        (number of positive samples)
              Negative          FP          TN        (number of negative samples)

Column totals give the number of samples predicted to be positive and the number predicted to be negative.
False Positive and False Negative
https://www.kdnuggets.com/2020/01/guide-precision-recall-confusion-matrix.html
Precision, Recall, and F-Measure
Precision:

P = TP/(TP + FP) = TP/(number of predicted positives)   (2.3)

Recall:

R = TP/(TP + FN) = TP/(number of positives)   (2.4)

F-measure (or F1-score), the harmonic mean of precision and recall:

F-measure = 2/(1/Precision + 1/Recall) = 2 · (Precision × Recall)/(Precision + Recall)   (2.5)
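A direct translation of Eqs. (2.3)-(2.5) into a small Python helper; the zero-denominator guards are an added convenience, not part of the slide definitions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-measure from confusion counts (Eqs. (2.3)-(2.5))."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1
```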
ROC and AUC
Recall, we predict as ŷ = 1_{p(y=1|x) ≥ τ}, but how do TP and FP change as we vary the decision threshold τ?
The Receiver Operating Characteristic (ROC) curve is traced out by varying τ.
The area under the curve (AUC) gives an overall measure of how well the model is doing over all τ. Higher AUC is better.
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
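A sketch of how an ROC curve can be traced by sweeping τ; the 101-point threshold grid is an arbitrary choice, and sklearn.metrics.roc_auc_score (if available) computes the AUC directly.

```python
import numpy as np

def roc_points(y_true, p_pos, n_thresholds=101):
    """Trace an ROC curve by sweeping the decision threshold tau (illustrative sketch)."""
    y_true = np.asarray(y_true)
    p_pos = np.asarray(p_pos)
    points = []
    for tau in np.linspace(1.0, 0.0, n_thresholds):   # from strictest to loosest threshold
        y_hat = (p_pos >= tau).astype(int)
        tp = np.sum((y_hat == 1) & (y_true == 1))
        fp = np.sum((y_hat == 1) & (y_true == 0))
        fn = np.sum((y_hat == 0) & (y_true == 1))
        tn = np.sum((y_hat == 0) & (y_true == 0))
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0   # true positive rate (recall)
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0   # false positive rate
        points.append((fpr, tpr))
    return points

# The AUC can be approximated by integrating these points (e.g., with np.trapz),
# or computed directly with sklearn.metrics.roc_auc_score.
```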
Logistic Regression
Solution for binary classification: use a logistic function, as in

Pr(y = 1|x) = h_θ(x) = g(θ^⊤x) = 1/(1 + exp(−θ^⊤x))   (2.6)

where θ^⊤x = θ_0 + Σ_{i=1}^m θ_i x_i, with x_0 ≡ 1 so that θ_0 is the bias/shift.
g(z) = 1/(1 + e^{−z}) is known as a logistic function.

Figure: plot of the logistic function g(z) = 1/(1 + e^{−z}).

A logistic function is one type of sigmoid function, others being the hyperbolic tangent, arctan, error function, etc. (see https://en.wikipedia.org/wiki/Sigmoid_function).
Logistic with scale parameter σ
Approximate a step function with a scale parameter σ ∈ R_+, giving g_σ(z) = 1/(1 + e^{−σz}).

Figure: the logistic function g_σ(z) plotted over z ∈ [−5, 5] for σ = 0.5, 1.0, 2.0, 5.0, 10.0; larger σ makes the transition sharper.
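A small NumPy sketch of g_σ(z) over the same range and scale values as the figure.

```python
import numpy as np

def logistic(z, sigma=1.0):
    """g_sigma(z) = 1 / (1 + exp(-sigma * z)); sigma = 1 gives the standard logistic."""
    return 1.0 / (1.0 + np.exp(-sigma * z))

z = np.linspace(-5, 5, 201)
curves = {sigma: logistic(z, sigma) for sigma in (0.5, 1.0, 2.0, 5.0, 10.0)}
# Larger sigma gives a sharper, more step-like transition around z = 0.
```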
Gradients of Logistic
A logistic function's gradient is easy to compute. This follows since:

g′(z) = d/dz [ 1/(1 + e^{−z}) ]   (2.7)
      = −[1/(1 + e^{−z})^2] · (−e^{−z})   (2.8)
      = [1/(1 + e^{−z})] · [e^{−z}/(1 + e^{−z})]   (2.9)
      = [1/(1 + e^{−z})] · (1 − 1/(1 + e^{−z}))   (2.10)
      = g(z)(1 − g(z))   (2.11)

Given this, we can derive a gradient descent learning rule, similar to LMS, but for logistic regression.
Fitting Logistic Regression Using Gradient Descent
Training data D = {(x^{(i)}, y^{(i)})}_{i ∈ [n]}, where now y^{(i)} ∈ {0, 1} is a binary label.
Goal: formulate the likelihood (to maximize) in terms of the parameters θ.
Probability model: Pr(y = 1|x; θ) = h_θ(x) = g(θ^⊤x) and Pr(y = 0|x; θ) = 1 − h_θ(x); thus, for y ∈ {0, 1},

Pr(y|x; θ) = (h_θ(x))^y (1 − h_θ(x))^{1−y}   (2.12)

Likelihood L(θ) = Π_i Pr(y^{(i)}|x^{(i)}; θ) and log likelihood

ℓ(θ) = Σ_{i=1}^n [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ]   (2.13)

The negative log likelihood (or cost(θ) = −ℓ(θ)) is convex in θ.
Fitting Logistic Regression Using Gradient Descent
Gradient of the log likelihood on one training pair (x, y):

∂ℓ(θ)/∂θ = (y − h_θ(x)) x.   (2.14)

The derivation of this derivative, via the chain rule, uses the logistic derivative property g′(z) = g(z)(1 − g(z)).
Gradient descent steps:

θ ← θ + α(y^{(i)} − h_θ(x^{(i)})) x^{(i)}   (2.15)

Again, the direction is given by x^{(i)}, by an amount equal to α · error = α(y^{(i)} − h_θ(x^{(i)})).
The error has the same form as in the linear case (the answer y^{(i)} minus the prediction h_θ(x^{(i)})), but the prediction is quite different from before.
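Putting the pieces together, a minimal sketch of logistic-regression training with the update (2.15); the learning rate, epoch count, and the convention of appending a ones column for the bias are assumptions for the example.

```python
import numpy as np

def logistic_regression_sgd(X, y, alpha=0.1, T=100, seed=0):
    """SGD on the logistic-regression log likelihood using the update (2.15).
    Assumes X is (n, m) with a column of ones appended for the bias, and y in {0, 1}."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    theta = np.zeros(m)
    for t in range(T):
        for i in rng.permutation(n):
            h = 1.0 / (1.0 + np.exp(-X[i] @ theta))     # h_theta(x^(i)) = Pr(y = 1 | x^(i))
            theta = theta + alpha * (y[i] - h) * X[i]   # theta <- theta + alpha (y - h) x
    return theta
```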
Fit Linear vs. Logistic Comparison
from https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html
Logistic Regression, Linear Separability, and Overfitting
logistic regression andlinearly separable data.How smooth is thetransition in logistic re-gression?
Logistic regression negative log likelihood (i.e., cost) J(✓)
J(✓) = �mX
i=1
hy(i) log h✓(x
(i)) + (1� y(i)) log(1� h✓(x
(i)))i
(2.16)
to be minimized, where h✓(x) = 11+exp(�✓|x) .
What happens as cost decreases (likelihood improves)?
If y(i) = 1,
h✓(x(i))! 1; if y(i) = 0, h✓(x(i))! 0. Hence, J(✓)! 0. Requires
✓ !1. Should transition be allowed to be arbitrarily sudden?
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F41/67 (pg.78/163)
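As a quick numerical check (my own illustration, under an assumed toy data set), on linearly separable data the cost J(c \theta) keeps shrinking as the scale c of a fixed separating direction grows, so the unregularized optimum pushes the parameter magnitude toward infinity:

```python
import numpy as np

def cost(theta, X, y):
    # J(theta) from eq. (2.16), the negative log likelihood
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Linearly separable 1-D data (bias feature + x); x > 0 exactly when y = 1
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
direction = np.array([0.0, 1.0])          # any separating direction works here

for c in [1, 5, 25, 125]:
    print(c, cost(c * direction, X, y))   # the cost decreases toward 0 as c grows
```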
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Overfitting and large magnitude parameters
Therefore, it seems that even large magnitude parameters can lead to a form of overfitting: overfitting in that the examples i with y^{(i)} = 0 get a perfect zero prediction, and the examples i with y^{(i)} = 1 get a perfect one prediction, where such certainty is probably not warranted.
Better solution: don't overfit; for points close to the decision boundary, allow a gradual prediction transition between 0 and 1 in the region of uncertainty.
This means putting a restriction on \theta (not letting it get too big).
One possible complexity penalty is the 2-norm, \Omega(\theta) = \|\theta\|_2, which prefers "simple" models, which in this case are those with small coefficients.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F42/67 (pg.84/163)
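A small sketch (illustrative, not lecture code) of the penalized objective J(\theta) + \lambda \Omega(\theta) with the 2-norm penalty; lam is an assumed hyperparameter name.

```python
import numpy as np

def regularized_cost(theta, X, y, lam=0.1):
    # Negative log likelihood J(theta) plus a 2-norm complexity penalty lam * ||theta||_2
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    nll = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    return nll + lam * np.linalg.norm(theta, 2)

# With lam > 0, scaling theta up is no longer free: the penalty eventually outweighs
# the shrinking negative log likelihood, so the optimum stays at a finite ||theta||.
```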
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Perceptron and Logistic Regression
[Figure: perceptron diagram with inputs weighted by \theta_0, \theta_1, \ldots, \theta_{m-1}, \theta_m, from https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53]
Model h_\theta(x) = g(\theta^\top x); the perceptron uses a hard activation function
g(z) = \begin{cases} -1 & \text{if } z < 0 \\ +1 & \text{if } z \ge 0 \end{cases}   (2.17)
which leads to the same learning update rule \theta \leftarrow \theta + \alpha (y^{(i)} - h_\theta(x^{(i)})) x^{(i)}.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F43/67 (pg.88/163)
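For comparison with the logistic update above, a minimal perceptron sketch (my own illustration, assumed toy data); note that with the hard activation of (2.17) the labels are taken in \{-1, +1\}.

```python
import numpy as np

def perceptron_epoch(theta, X, y, alpha=1.0):
    # Same update form theta <- theta + alpha * (y_i - h_theta(x_i)) * x_i,
    # but h_theta(x) = sign(theta^T x) and labels y_i are in {-1, +1}.
    for x_i, y_i in zip(X, y):
        pred = 1.0 if x_i @ theta >= 0 else -1.0
        theta = theta + alpha * (y_i - pred) * x_i   # nonzero only on mistakes
    return theta

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([-1, -1, 1, 1])
theta = np.zeros(2)
for _ in range(10):
    theta = perceptron_epoch(theta, X, y)
print(theta)   # a separating theta on this linearly separable toy set
```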
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Perceptron, Linear Models, and Linearly Separable Data
[Figure: logistic regression and the perceptron can both do perfectly when the data is (nicely) linearly separable.]
The line designates the boundary of a "ridge" or "cliff" between the categories.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F44/67 (pg.89/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Review
The next three slides are review from Lecture 1.
Please read the writeup "Underfitting and Overfitting in Machine Learning" to be posted to Canvas.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F45/67 (pg.90/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Underfitting vs. Overfitting
[Figure: three scatter plots of the same (x, y) data with fitted curves: left a degree-1 fit, middle a degree-2 fit, right a degree-5 fit.]
Fit a model with various input features (powers of x); the goal is to predict y based on xy-pair samples D = \{(x^{(i)}, y^{(i)})\}_i.
Fitted models: left y = \theta_0 + \theta_1 x; middle y = \theta_0 + \theta_1 x + \theta_2 x^2; right y = \sum_{j=0}^{5} \theta_j x^j.
Both the left and right plots fit the data poorly, but they are poor for different reasons. The left could be underfitting, and the right could be overfitting. The center plot looks better.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F46/67 (pg.92/163)
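A brief sketch (my own, with an assumed synthetic data-generating process) that reproduces the spirit of the three fits using numpy.polyfit at degrees 1, 2, and 5:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 7, size=8))            # small sample, like the slide's scatter
y = 0.5 * x + np.sin(x) + rng.normal(0, 0.2, 8)   # assumed ground truth: gently nonlinear

for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)             # least-squares fit of a degree-d polynomial
    resid = y - np.polyval(coeffs, x)
    print(degree, np.mean(resid**2))              # training MSE shrinks with degree,
                                                  # but the degree-5 fit is likely overfitting
```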
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Overfitting definition (T. Mitchell)
We say that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).
Definition 2.5.1 (overfitting)
Given a hypothesis space H, a hypothesis h \in H is said to overfit the training data if there exists some alternative hypothesis h' \in H such that h has smaller error than h' over the training examples, but h' has a smaller overall error than h over the entire distribution (or data set) of instances.
We'll visit this topic again when we discuss bias/variance, but first let's discuss a few more models.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F47/67 (pg.93/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Some Definitions
Any data set D = \left( (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)}) \right) is drawn from a given distribution, meaning that (x^{(j)}, y^{(j)}) \sim p(x, y) for all 1 \le j \le n.
Training a model by maximizing accuracy on a training set D_{tr}:
h \in \operatorname*{argmax}_{h \in H} \sum_{(x,y) \in D_{tr}} A(y, h(x)) - \lambda \Omega(h)   (2.18)
Accuracy according to the sample distribution:
accuracy(h) = E_{p(x,y)}[A(y, h(x))] = \int p(x, y) A(y, h(x)) \, dx \, dy   (2.19)
Accuracy of a trained model on a data set D:
accuracy_D(h) = \frac{1}{|D|} \sum_{(x,y) \in D} A(y, h(x))   (2.20)
Training data set D_{tr} and validation (or development) data set D_{va}.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F48/67 (pg.94/163)
[Handwritten note: \lambda is an accuracy-regularization tradeoff coefficient, a hyperparameter. A(y, h(x)) is big if h(x) is a good predictor of y, and is small if h(x) is a poor predictor of y.]
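A small sketch (illustrative; the 0/1 accuracy function A and the names below are assumptions, not from the slides) of the empirical accuracy (2.20) and the regularized training objective (2.18) over a tiny hypothesis set:

```python
import numpy as np

def A(y, pred):
    # 0/1 accuracy: 1 if the prediction matches the label, else 0
    return float(y == pred)

def empirical_accuracy(h, data):
    # accuracy_D(h) = (1/|D|) * sum over (x, y) in D of A(y, h(x))   (eq. 2.20)
    return np.mean([A(y, h(x)) for x, y in data])

def regularized_objective(h, omega_h, data, lam=0.1):
    # sum of accuracies minus lam * Omega(h)   (eq. 2.18), to be maximized over h in H
    return sum(A(y, h(x)) for x, y in data) - lam * omega_h

# Toy usage: compare two threshold classifiers on assumed data
data = [(-2.0, 0), (-0.5, 0), (0.3, 1), (1.7, 1)]
h_simple = lambda x: int(x > 0)        # assumed complexity score Omega = 1
h_complex = lambda x: int(0 < x < 1)   # assumed complexity score Omega = 2
print(empirical_accuracy(h_simple, data))
print(regularized_objective(h_simple, 1, data), regularized_objective(h_complex, 2, data))
```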
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Overfitting
Definition 2.5.1 (overfitting)
We say that h \in H overfits the training data D_{tr} if there exists h' \in H such that
accuracy_{D_{tr}}(h) > accuracy_{D_{tr}}(h') and accuracy(h) < accuracy(h').   (2.21)
Since we can't compute accuracy(h) or accuracy(h') as mentioned above, a practical definition of overfitting changes this to:
accuracy_{D_{tr}}(h) > accuracy_{D_{tr}}(h') and accuracy_{D_{va}}(h) < accuracy_{D_{va}}(h').   (2.22)
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F49/67 (pg.99/163)
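A tiny sketch (my own illustration, with assumed model names and made-up accuracy numbers) of the practical test (2.22): compare two hypotheses on the training and validation splits.

```python
def overfits(h, h_prime, accuracy_train, accuracy_val):
    # Practical definition (2.22): h overfits (relative to h') when it beats h' on
    # the training split but loses to h' on the validation split.
    return (accuracy_train[h] > accuracy_train[h_prime]
            and accuracy_val[h] < accuracy_val[h_prime])

# Assumed measured accuracies for a degree-5 and a degree-2 polynomial model
accuracy_train = {"poly5": 0.99, "poly2": 0.92}
accuracy_val = {"poly5": 0.71, "poly2": 0.90}
print(overfits("poly5", "poly2", accuracy_train, accuracy_val))   # True
```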
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Underfitting
Definition 2.5.2 (underfitting)
We say that h \in H underfits the training data D_{tr} if there exists h'' \in H such that
accuracy_{D_{tr}}(h) < accuracy_{D_{tr}}(h'') and accuracy(h) < accuracy(h'').   (2.23)
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F50/67 (pg.100/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Overfitting and Underfitting
[Figure: three panels plotting accuracy vs. the complexity (or capability, or capacity) of the model, each showing a training set accuracy curve, a validation set accuracy curve, and underfitting/fitting/overfitting ranges.
(a) The highly typical case is when the training set accuracy is higher than the validation set accuracy.
(b) It is possible, but somewhat unlikely, that the validation set accuracy crosses the training set accuracy. The overfitting/underfitting ranges are still the same.
(c) It is possible, but very unlikely, that the validation set accuracy is higher than the training set accuracy, as shown in this plot. But the overfitting/underfitting ranges are still the same.]
Figure: Overfitting and underfitting shown as a function of model complexity for a fixed training set size n_{tr}. Any h in the red region overfits the training set. Any h in the yellow region underfits the training set. Any h in the green (middle) region properly fits the training set. The regions are all based on the accuracy accuracy_{D_{va}}(h) computed on a validation data set D_{va}, but the same principle would hold if it were possible to measure the accuracy accuracy(h) on the entire distribution.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F51/67 (pg.101/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Learning Curves
How does overfitting/underfitting depend on the size of the training data set? A given model (with fixed complexity) will tend to overfit a small training data set and underfit a large training data set.
[Figure: two learning-curve panels of accuracy vs. the number of training samples n_{tr}, each showing training set and validation set accuracy, with high variance at small n_{tr} and low variance at large n_{tr}. (a) Learning curve with a low complexity (i.e., \Omega(h) small) model. (b) Learning curve with a high complexity (i.e., \Omega(h) large) model.]
On the left, the model underfits since it does not have much capability even with much training data. On the right, the model starts out overfitting but eventually fits with enough training data.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F52/67 (pg.102/163)
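A sketch (my own, with an assumed synthetic data process) of how one might trace a learning curve: fit a fixed-complexity model on growing training sets and record the training and validation MSE at each size.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 7, n)
    y = np.sin(x) + rng.normal(0, 0.3, n)      # assumed ground-truth process
    return x, y

x_val, y_val = make_data(200)                  # held-out validation set
degree = 5                                     # fixed model complexity

for n_tr in (8, 16, 32, 64, 128):
    x_tr, y_tr = make_data(n_tr)
    coeffs = np.polyfit(x_tr, y_tr, degree)    # train on n_tr samples
    mse_tr = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
    mse_va = np.mean((y_val - np.polyval(coeffs, x_val)) ** 2)
    # With few samples the train/validation gap (overfitting) is large; it shrinks as n_tr grows.
    print(n_tr, round(mse_tr, 3), round(mse_va, 3))
```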
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Real-world consequences of overfitting
The Fukushima nuclear power plant disaster in 2011 was caused by overfitting a model to data!! See https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/ and https://mpra.ub.uni-muenchen.de/69383/1/MPRA_paper_69383.pdf
[Figure: earthquake data going back 400 years and the model fit.]
A non-linear model fit predicts a magnitude 9 earthquake about every \approx 13000 years. This was used to design the Fukushima Daiichi nuclear power plant to withstand an 8.6 magnitude earthquake and a tsunami of 5.7 meters. In 2011: a 9.0 earthquake and a 14 meter tsunami!!
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F53/67 (pg.105/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Real-world consequences of overfitting
The log-linear model shows a different trend:
A linear model fit predicts a magnitude 9 earthquake about every 300 years, and would have led to quite a different reactor design.
Moral: Overfitting (or underfitting) can have huge consequences!
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F54/67 (pg.110/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Overfitting slider on the web
See "Visualization of the bias-variance tradeoff" at the following link entitled "The Bias-Variance Dilemma": https://medium.com/@ml.at.berkeley/machine-learning-crash-course-part-4-the-bias-variance-dilemma-a94e60ec1d3
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F55/67 (pg.113/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
The random process of training
Recall the training data D = \left( (x^{(i)}, y^{(i)}) \right)_{i=1}^{n}, where (x^{(i)}, y^{(i)}) \sim p(x, y) are drawn from some distribution, x^{(i)} \in \mathbb{R}^m and y^{(i)} \in \mathbb{R}.
The training data is a random sample, and is itself random.
We fit a model h_\theta, where \theta is derived from the training procedure.
We can think of \theta(D) as the parameters of the model obtained via the process of model fitting. \theta(D) is a random variable, since it is a deterministic function of the random sample D of size n.
Each time we draw a different training set, we might get a (hopefully only very slightly) different \theta.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F56/67 (pg.114/163)
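A short Monte Carlo sketch (my own illustration, under an assumed linear-Gaussian data model) of \theta(D) as a random variable: redraw the training set several times, refit, and watch the fitted coefficients vary.

```python
import numpy as np

rng = np.random.default_rng(2)
true_theta = np.array([1.0, 2.0])           # assumed ground-truth linear model

def draw_dataset(n):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # bias + one feature
    y = X @ true_theta + rng.normal(0, 0.5, n)
    return X, y

def fit(X, y):
    # Least-squares fit: theta(D) is a deterministic function of the random sample D
    return np.linalg.lstsq(X, y, rcond=None)[0]

thetas = np.array([fit(*draw_dataset(30)) for _ in range(5)])
print(thetas)        # each row is a slightly different theta, one per redrawn D
```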
[Handwritten note: \Pr(D) = \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}), so D is itself a random variable.]
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance Intuition (on one slide).
When we are fitting a model h_\theta to data D, we optimize over (or can choose) any model in the model family (e.g., the linear model family, the logistic model, whether we include higher order polynomial features, etc.).
High bias, low variance (underfitting): if the model family is too simple (low complexity), it will never match important characteristics in the data; it is, in such a case, biased. Even if we train on multiple different data sets, we'll get approximately the same model h_\theta since the model family is not that capable or flexible. All are wrong in the same way. Thus, the variance of the random variable h_\theta(D) is low!
Low bias, high variance (overfitting): if the model family is too complex, it will match unimportant idiosyncrasies in the data. The bias is low, since it matches the data quite well. When we train on multiple different data sets, each with its own (often random) idiosyncrasies, we'll get very different models h_\theta since the model family can match any idiosyncrasies in the data. Thus, the variance of the random variable h_\theta(D) is high!
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F57/67 (pg.119/163)
[Handwritten note: "inductive bias"; e.g., the constant model h(x) = c.]
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance
Recall the lecture 1 goal: find h_\theta : X \to Y with minimum error.
The best solution, \operatorname*{argmin}_h E_{p(x,y)}[(y - h(x))^2], leads to
h^*(x) = \int y \, p(y|x) \, dy = E[Y|x] = \text{best association},   (2.24)
assuming we have the distribution p, the resources to compute E[Y|x], and a model family spanned by h that includes the functional form of E[Y|x] (i.e., E[Y|x] is realizable, as opposed to the agnostic case, which is not realizable). This is the best we can do in theory.
In practice, we have training data D and a limited model family H (e.g., linear models), and instead do:
h \in \operatorname*{argmin}_{h \in H} \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - h(x^{(i)}))^2   (2.25)
Two reasons we might not do well: (a) the wrong model family H, and (b) not enough data (n too small).
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F58/67 (pg.128/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Validation and Bias/Variance
How do we measure how good we are? We could consider measuring on the same training data we've got:
\frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - h(x^{(i)}))^2 = \frac{1}{|D|} \sum_{(x,y) \in D} (y - h(x))^2   (2.26)
This is already intuitively bad for the bias/variance reasons we've discussed: each D gives a different solution, and each h_\theta(D) can look good on its own data.
The ideal approach tests on all data (e.g., it includes future samples we didn't train on):
E_{p(x,y)}[(y - h(x))^2]   (2.27)
This is bad since it is impractical at best and (more likely) impossible.
The typical approach: draw a separate validation data set D_{va}, with D_{va} \cap D = \emptyset, and to try to get at the generalization error, compute:
\frac{1}{|D_{va}|} \sum_{(x,y) \in D_{va}} (y - h(x))^2   (2.28)
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F59/67 (pg.132/163)
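A short sketch (my own, with an assumed polynomial model and synthetic data) contrasting the training MSE (2.26) with the held-out validation MSE (2.28):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 7, 40)
y = np.sin(x) + rng.normal(0, 0.3, 40)

# Disjoint training and validation splits (D_tr and D_va)
x_tr, y_tr = x[:20], y[:20]
x_va, y_va = x[20:], y[20:]

coeffs = np.polyfit(x_tr, y_tr, 9)                        # deliberately high-degree fit
mse_tr = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)  # eq. (2.26): optimistic on its own data
mse_va = np.mean((y_va - np.polyval(coeffs, x_va)) ** 2)  # eq. (2.28): estimates generalization error
print(mse_tr, mse_va)                                     # typically mse_va is much larger here
```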
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance, Overall Error
Ideal solution h^*(x) = E[Y|x], and the random estimate from a data set is
h_D(x) \triangleq h_{\theta(D)} = \operatorname*{argmin}_{h \in H} \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - h(x^{(i)}))^2.
Let's consider measuring the overall error for any h:
error(h) = E_{p(x,y)}[(h(x) - y)^2] = \int (h(x) - y)^2 p(y|x) p(x) \, dx \, dy   (2.29)
  = E_{p(x,y)}[(h(x) - E[Y|x] + E[Y|x] - y)^2]   (2.30)
  = E_{p(x,y)} \big[ (h(x) - E[Y|x])^2   (2.31)
    + 2 (h(x) - E[Y|x])(E[Y|x] - y)   (2.32)
    + (E[Y|x] - y)^2 \big]   (2.33)
  = E_{p(x,y)}[(h(x) - E[Y|x])^2]   (2.34)
    + 2 (E[h(x)] - E[Y|X])(E[Y|x] - E[Y|x])   (2.35)
    + E_{p(x,y)}[(E[Y|x] - y)^2]   (2.36)
Note: the 2nd term cancels out.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F60/67 (pg.135/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
2nd Term Cancels Out
Why does the second term cancel out?
E[Y|x] = \int y \, p(y|x) \, dy = best association, and it is a deterministic function of x: the expected value of the random variable Y under the conditional distribution p(y|x) for a given x. Therefore,
E_{p(x,y)} \big[ 2 (h(x) - E[Y|x])(E[Y|x] - y) \big]   (2.37)
  = 2 \int \int p(x) p(y|x) \big[ (h(x) - E[Y|x])(E[Y|x] - y) \big] \, dy \, dx   (2.38)
  = 2 \int p(x) (h(x) - E[Y|x]) \Big[ \int p(y|x) (E[Y|x] - y) \, dy \Big] \, dx   (2.39)
  = 2 \int p(x) (h(x) - E[Y|x]) \big[ E[Y|x] - E[Y|x] \big] \, dx = 0   (2.40)
Aside: E[Y|X] = \int y \, p(y|X) \, dy is a deterministic function of the r.v. X, and hence E[Y|X] is itself a r.v. with mean E[E[Y|X]].
Thus, E[E[Y|X]] = \int E[Y|x] p(x) \, dx = \int \big[ \int y \, p(y|x) p(x) \, dy \big] dx = \int y \big[ \int p(y|x) p(x) \, dx \big] dy = \int y \, p(y) \, dy = E[Y].
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F61/67 (pg.137/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance, the r.v. h_D(x)
Thus the error has only two terms:
error(h) = E_{p(x,y)}[(h(x) - E[Y|x])^2] + E_{p(x,y)}[(E[Y|x] - y)^2]   (2.41)
The second term has nothing to do with the model h whose error we are measuring; it is the inherent error, due to the random process and label noise. We can neglect this term in our study of bias/variance.
The first term can be simplified to E_{p(x)}[(h(x) - E[Y|x])^2] and is the learning error (or MSE). It is zero when h(x) = h^*(x) = E[Y|x].
h_D(x) is a (random) learnt model from the (random variable) data set D. Hence, h_D(x) is a random variable (a deterministic function, namely the learning/optimization process, of the random data set D), and it has a mean E_D[h_D(x)] and a variance E_D\big[ (h_D(x) - E_{D'}[h_{D'}(x)])^2 \big].
To clarify notation, we'll use D and D' to express things like h_D(x)'s variance E_D\big[ (h_D(x) - E_{D'}[h_{D'}(x)])^2 \big], where D and D' are two independent and identically distributed (iid) random variables over data sets.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F62/67 (pg.142/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance Breakdown
We further analyze only this first term, and take ensemble averages over D, the random data sample. h_D(x) is a (random) learnt model from the (random variable) data set D.
For a given x, we have

E_D[(h_D(x) − E[Y|x])²] = E_D[(h_D(x) − E_D[h_D(x)] + E_D[h_D(x)] − E[Y|x])²]
                        = E_D[ (h_D(x) − E_D[h_D(x)])²                                      (2.42)
                             + 2(h_D(x) − E_D[h_D(x)])(E_D[h_D(x)] − E[Y|x])
                             + (E_D[h_D(x)] − E[Y|x])² ]                                    (2.43)
                        = E_D[(h_D(x) − E_D[h_D(x)])²]                                      (2.44)
                             + (E_D[h_D(x)] − E[Y|x])²                                      (2.45)
                        = variance(x) + bias²(x)                                            (2.46)

(The cross term in (2.43) vanishes in expectation: E_D[h_D(x)] − E[Y|x] is a constant with respect to D, and E_D[h_D(x) − E_D[h_D(x)]] = 0.)
We then take E_{p(x)}[·] to get the overall bias and variance over all x.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F63/67 (pg.147/163)
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Error (MSE) and Bias/Variance Breakdown
MSE = E_{p(x)}[E_D[(h_D(x) − E[Y|x])²]]                                                     (2.47)
    = E_{p(x)}[E_D[(h_D(x) − E_D[h_D(x)])²]] + E_{p(x)}[(E_D[h_D(x)] − E[Y|x])²]
    = variance + bias²                                                                      (2.48)

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F64/67 (pg.150/163)
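As a sanity check on eqs. (2.42)-(2.48), here is a minimal Monte Carlo sketch (not from the slides). It again assumes a toy process y = sin(2πx) + Gaussian noise, and deliberately uses an underfitting model class (degree-1 polynomials) so both terms are visibly nonzero; the empirical MSE should equal variance + bias².

# Sketch (assumptions noted above): MSE = variance + bias^2.
import numpy as np

rng = np.random.default_rng(1)
sigma, n_train, n_datasets = 0.3, 30, 2000

def f_star(x):
    # E[Y|x] for the assumed toy process.
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.0, 1.0, 200)          # stand-in for E_{p(x)}[.]
preds = np.empty((n_datasets, x_test.size))
for i in range(n_datasets):                  # independent draws of D
    x = rng.uniform(0.0, 1.0, size=n_train)
    y = f_star(x) + rng.normal(0.0, sigma, size=n_train)
    coeffs = np.polyfit(x, y, 1)             # h_D: degree-1 (underfitting) fit
    preds[i] = np.polyval(coeffs, x_test)

mean_pred = preds.mean(axis=0)                           # E_D[h_D(x)]
variance  = ((preds - mean_pred) ** 2).mean()            # eq. (2.44), averaged over x
bias_sq   = ((mean_pred - f_star(x_test)) ** 2).mean()   # eq. (2.45), averaged over x
mse       = ((preds - f_star(x_test)) ** 2).mean()       # left-hand side, eq. (2.47)

print(mse, variance + bias_sq)               # the two numbers agree: the decomposition holds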
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance, Unbiased estimator
To be an unbiased estimation procedure means that E_D[h_D(x)] = E[Y|x], i.e., that the bias is zero.
We do sometimes have zero bias together with a variance that depends on the size of the data.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F65/67 (pg.151/163)
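A minimal illustration (not from the slides) of an unbiased procedure whose variance depends on the data size: the sample mean of n iid N(μ, σ²) draws has zero bias and variance σ²/n, which shrinks as n grows.

# Sketch (assumptions noted above): the sample mean as an unbiased estimator.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.5, 2.0

for n in (10, 100, 1000):
    # 5000 independent datasets of size n; estimate mu by the sample mean.
    estimates = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    print(n, estimates.mean() - mu, estimates.var(), sigma ** 2 / n)
    # bias ~ 0 for every n, while the empirical variance tracks sigma^2 / n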
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance for Linear Least Squares
Linear model with noise: y = h_{θ*}(x) + ε = θ*ᵀx + ε, where ε ~ N(0, σ²). Suppose this is the true generative process for some θ*.
Given training data D with corresponding n × m design matrix X and length-n column vector ~y, we have the relationship ~y = Xθ* + ~ε, where ~ε is a length-n vector of Gaussians, ε_i ~ N(0, σ²).
MLE parameter estimate: θ̂ = (XᵀX)⁻¹Xᵀ~y = θ* + (XᵀX)⁻¹Xᵀ~ε, a noisy (r.v.) version of θ*.
Best estimate: E[Y|x] = θ*ᵀx.
Recall the error, which has only two terms: the model error (which can be broken into the bias squared and the variance) and the inherent error:

error(h) = E_{p(x,y)}[(h(x) − E[Y|x])²] + E_{p(x,y)}[(E[Y|x] − y)²]                         (2.49)

In the current case, the inherent error is

E_{p(x,y)}[(E[Y|x] − y)²] = σ²                                                              (2.50)

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F66/67 (pg.153/163)
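A small numerical sketch (not from the slides) of the identity above, θ̂ = (XᵀX)⁻¹Xᵀ~y = θ* + (XᵀX)⁻¹Xᵀ~ε, under the assumed Gaussian-noise linear model; the particular dimensions, noise level, and Gaussian design are arbitrary choices made only for illustration.

# Sketch (assumptions noted above): theta_hat = theta* + (X^T X)^{-1} X^T eps.
import numpy as np

rng = np.random.default_rng(3)
n, m, sigma = 200, 5, 0.5
theta_star = rng.normal(size=m)

X = rng.normal(size=(n, m))                          # n x m design matrix
eps = rng.normal(0.0, sigma, size=n)
y = X @ theta_star + eps                             # ~y = X theta* + ~eps

theta_hat  = np.linalg.solve(X.T @ X, X.T @ y)       # (X^T X)^{-1} X^T ~y
noise_term = np.linalg.solve(X.T @ X, X.T @ eps)     # (X^T X)^{-1} X^T ~eps

# theta_hat is theta* plus a zero-mean random perturbation driven by eps:
print(np.allclose(theta_hat, theta_star + noise_term))   # True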
On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance
Bias/Variance for Linear Least Squares
Random model h_θ̂ based on the random sample D, so E_D[h_θ̂] = E_D[h_{θ* + (XᵀX)⁻¹Xᵀ~ε}] = h_{θ*}, since E_D[~ε] = 0.
Thus, the bias squared at x is

(E_D[h_D(x)] − E[Y|x])² = 0                                                                 (2.51)

The variance increases with m and decreases with n (the sample size):

E_D[(h_D(x) − E_D[h_D(x)])²] = σ²m/n                                                        (2.52)

A famous result (the Gauss-Markov theorem) states that among all linear unbiased estimators, the linear least squares (LLS) estimator has the smallest variance, and hence the smallest (mean squared) error of all unbiased linear estimators! I.e., Var(h_{θ_LLS}(x)) ≤ Var(h_{θ_any linear unbiased}(x)).
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 2 - April 6th/8th, 2020 F67/67 (pg.160/163)
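A Monte Carlo sketch (not from the slides) of eqs. (2.51)-(2.52): it refits the LLS estimator on many independently drawn training sets, with test points drawn from the same input distribution as the training inputs, and compares the empirical bias² (≈ 0) and variance (≈ σ²m/n). The Gaussian design and the specific n, m, σ are assumptions made only for the illustration.

# Sketch (assumptions noted above): empirical bias^2 and variance of LLS.
import numpy as np

rng = np.random.default_rng(4)
n, m, sigma, n_trials = 200, 5, 0.5, 3000
theta_star = rng.normal(size=m)

x_test = rng.normal(size=(1000, m))          # test points x ~ p(x)
preds = np.empty((n_trials, x_test.shape[0]))
for t in range(n_trials):                    # independent training sets D
    X = rng.normal(size=(n, m))
    y = X @ theta_star + rng.normal(0.0, sigma, size=n)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    preds[t] = x_test @ theta_hat            # h_D(x) evaluated on the test points

mean_pred = preds.mean(axis=0)                               # E_D[h_D(x)]
bias_sq   = ((mean_pred - x_test @ theta_star) ** 2).mean()  # eq. (2.51): ~ 0
variance  = ((preds - mean_pred) ** 2).mean()                # eq. (2.52)

print("bias^2  :", bias_sq)
print("variance:", variance, "vs sigma^2 * m / n =", sigma ** 2 * m / n)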