CSE 446: Machine Learning
Linear Regression: Model and Algorithms
Emily Fox, University of Washington
January 9, 2017
©2017 Emily Fox
Linear regression: The model
How much is my house worth?

I want to list my house for sale.

$$ ????
Data

(x1 = sq.ft., y1 = $)
(x2 = sq.ft., y2 = $)
(x3 = sq.ft., y3 = $)
(x4 = sq.ft., y4 = $)
(x5 = sq.ft., y5 = $)
…

Input (x) vs. Output (y):
• y is the quantity of interest
• assume y can be predicted from x
Model – How we assume the world works

[Figure: scatter plot of price ($, y-axis) vs. square feet (sq.ft., x-axis)]

Regression model:
"Essentially, all models are wrong, but some are useful." (George Box, 1987)
Simple linear regression model

[Figure: price ($) vs. square feet (sq.ft.) with fitted line f(x)]

yi = w0 + w1 xi + εi
f(x) = w0 + w1 x

parameters w0, w1: regression coefficients
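As a concrete illustration, here is a minimal NumPy sketch that fits this simple model by least squares; the house data values are invented for the example.

```python
import numpy as np

# Invented example data: square footage and sale price of five houses
sqft = np.array([1000.0, 1500.0, 1800.0, 2200.0, 3000.0])
price = np.array([250e3, 330e3, 370e3, 450e3, 580e3])

# Least-squares fit of price = w0 + w1 * sqft
# np.polyfit returns highest-degree coefficient first: [w1, w0]
w1, w0 = np.polyfit(sqft, price, deg=1)

print(f"f(x) = {w0:.0f} + {w1:.1f} x")
print("predicted price of a 2640 sq.ft. house:", w0 + w1 * 2640)
```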
What about a quadratic function?

[Figure: price ($) vs. square feet (sq.ft.) with quadratic fit]

f(x) = w0 + w1 x + w2 x^2
Even higher order polynomial

[Figure: price ($) vs. square feet (sq.ft.) with high-order polynomial fit]

f(x) = w0 + w1 x + w2 x^2 + … + wp x^p
Polynomial regression

Model: yi = w0 + w1 xi + w2 xi^2 + … + wp xi^p + εi

Treat the powers of x as different features:
feature 1 = 1 (constant)  ↔  parameter 1 = w0
feature 2 = x             ↔  parameter 2 = w1
feature 3 = x^2           ↔  parameter 3 = w2
…
feature p+1 = x^p         ↔  parameter p+1 = wp
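To make "treat as different features" concrete, here is a small sketch (mine, not from the slides) that builds the polynomial feature matrix; np.vander with increasing=True puts the constant column first.

```python
import numpy as np

def polynomial_features(x, p):
    """Columns [1, x, x^2, ..., x^p] for a 1-D input array x."""
    return np.vander(x, N=p + 1, increasing=True)

x = np.array([1.0, 2.0, 3.0])
print(polynomial_features(x, p=2))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```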
Generic basis expansion

Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σ_{j=0}^{D} wj hj(xi) + εi

wj: jth regression coefficient or weight
hj: jth feature

feature 1 = h0(x) … often 1 (constant)
feature 2 = h1(x) … e.g., x
feature 3 = h2(x) … e.g., x^2 or sin(2πx/12)
…
feature D+1 = hD(x) … e.g., x^p
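Here is a small illustrative sketch (my own, with made-up basis functions) of the generic expansion: pass any list of hj functions and stack their outputs into a feature matrix.

```python
import numpy as np

def design_matrix(x, basis):
    """Stack h_j(x) column-by-column: H[i, j] = h_j(x_i)."""
    return np.column_stack([h(x) for h in basis])

# Example basis: constant, linear, and a seasonal term sin(2*pi*x/12)
basis = [
    lambda x: np.ones_like(x),
    lambda x: x,
    lambda x: np.sin(2 * np.pi * x / 12),
]

x = np.linspace(0, 24, 5)
H = design_matrix(x, basis)
print(H.shape)  # (5, 3): one row per observation, one column per feature
```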
Predictions just based on house size

[Figure: price ($) vs. square feet (sq.ft.) with fitted line]

Only 1 bathroom! Not the same as my 3 bathrooms.
Add more inputs

[Figure: 3-D plot of price (y) vs. square feet (x[1]) and # bathrooms (x[2]) with fitted plane]

f(x) = w0 + w1 sq.ft. + w2 #bath
Many possible inputs
- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …
General notation

Output: y  (scalar)
Inputs: x = (x[1], x[2], …, x[d])  (d-dim vector)

Notational conventions:
x[j]  = jth input (scalar)
hj(x) = jth feature (scalar)
xi    = input of ith data point (vector)
xi[j] = jth input of ith data point (scalar)
Generic linear regression model

Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σ_{j=0}^{D} wj hj(xi) + εi

feature 1 = h0(x) … e.g., 1
feature 2 = h1(x) … e.g., x[1] = sq. ft.
feature 3 = h2(x) … e.g., x[2] = #bath, or log(x[7])·x[2] = log(#bed) × #bath
…
feature D+1 = hD(x) … some other function of x[1], …, x[d]
Fitting the linear regression model
Step 1: Rewrite the regression model
Rewrite in matrix notation

For observation i:

yi = Σ_{j=0}^{D} wj hj(xi) + εi
   = h(xi)T w + εi

where w = [w0, w1, …, wD]T and h(xi) = [h0(xi), h1(xi), …, hD(xi)]T.
Rewrite in matrix notation

For all observations together:

y = Hw + ε

where y = [y1, …, yN]T, ε = [ε1, …, εN]T, and H is the N×(D+1) matrix whose ith row is h(xi)T.
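A quick sketch (data simulated, coefficients invented) showing the matrix form in NumPy: with H built column-by-column from the hj, the whole model is one matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5
x = rng.uniform(1000, 3000, size=N)          # invented sq.ft. values

# H: one row per observation; columns h0(x)=1, h1(x)=x, h2(x)=(x/1000)^2
H = np.column_stack([np.ones(N), x, (x / 1000) ** 2])

w_true = np.array([50e3, 100.0, 20e3])        # made-up coefficients
eps = rng.normal(0, 10e3, size=N)             # Gaussian noise

y = H @ w_true + eps                          # y = Hw + ε, all at once
print(y)
```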
Step 2: Compute the cost
"Cost" of using a given line

[Figure: price ($) vs. square feet (sq.ft.) with a line and vertical residuals]

Residual sum of squares (RSS):

RSS(w0, w1) = Σ_{i=1}^{N} (yi - [w0 + w1 xi])^2
RSS for multiple regression

[Figure: 3-D plot of price (y) vs. x[1], x[2] with fitted plane and residuals]

RSS(w) = Σ_{i=1}^{N} (yi - h(xi)T w)^2
       = (y - Hw)T (y - Hw)
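Both forms of RSS should agree; here is a quick self-contained check (my own sketch, with random placeholder arrays rather than house data):

```python
import numpy as np

def rss_loop(w, H, y):
    """RSS as an explicit sum over observations."""
    return sum((y[i] - H[i] @ w) ** 2 for i in range(len(y)))

def rss_matrix(w, H, y):
    """RSS in matrix form: (y - Hw)^T (y - Hw)."""
    r = y - H @ w
    return r @ r

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 3))
y = rng.normal(size=6)
w = rng.normal(size=3)
print(np.isclose(rss_loop(w, H, y), rss_matrix(w, H, y)))  # True
```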
Step 3: Take the gradient
Gradient of RSS

∇RSS(w) = ∇[(y - Hw)T (y - Hw)]
        = -2HT (y - Hw)

Why? By analogy to the 1-D case: d/dw (y - hw)^2 = -2h (y - hw).
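To convince yourself of the -2HT(y - Hw) formula, here is a small numerical check (my own sketch) comparing it against finite differences:

```python
import numpy as np

def rss(w, H, y):
    r = y - H @ w
    return r @ r

def rss_grad(w, H, y):
    """Analytic gradient: -2 H^T (y - Hw)."""
    return -2 * H.T @ (y - H @ w)

rng = np.random.default_rng(2)
H = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

# Central finite-difference gradient, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (rss(w + eps * np.eye(3)[j], H, y) - rss(w - eps * np.eye(3)[j], H, y)) / (2 * eps)
    for j in range(3)
])
print(np.allclose(numeric, rss_grad(w, H, y)))  # True
```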
Step 4, Approach 1: Set the gradient = 0
Closed-form solution

∇RSS(w) = -2HT (y - Hw) = 0

Solve for w:  HT Hw = HT y
Closed-form solution

ŵ = (HT H)^-1 HT y

Invertible if: the columns of H are linearly independent (full column rank; in particular, N ≥ D+1).
Complexity of inverse: O(D^3) to invert the (D+1)×(D+1) matrix HT H, plus O(N D^2) to form it.
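A minimal sketch (mine) of the closed-form fit; in practice one solves the normal equations rather than explicitly inverting HT H, which is cheaper and numerically safer:

```python
import numpy as np

def fit_closed_form(H, y):
    """Solve the normal equations H^T H w = H^T y for w."""
    return np.linalg.solve(H.T @ H, H.T @ y)

rng = np.random.default_rng(3)
H = rng.normal(size=(20, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = H @ w_true + rng.normal(0, 0.01, size=20)

w_hat = fit_closed_form(H, y)
print(np.round(w_hat, 2))  # close to w_true
```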
Step 4, Approach 2: Gradient descent
Gradient descent

while not converged:
  w(t+1) ← w(t) - η ∇RSS(w(t))

where ∇RSS(w) = -2HT (y - Hw).
Interpreting elementwise

Update to jth feature weight:

wj(t+1) ← wj(t) + 2η Σ_{i=1}^{N} hj(xi) (yi - ŷi(w(t)))

[Figure: 3-D plot of price (y) vs. x[1], x[2]]
Summary of gradient descent for multiple regression

init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇RSS(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = -2 Σ_{i=1}^{N} hj(xi) (yi - ŷi(w(t)))
    wj(t+1) ← wj(t) - η partial[j]
  t ← t + 1
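Here is a runnable version of this pseudocode (my sketch; the step size and tolerance are arbitrary choices), vectorized so the inner j-loop becomes one matrix product:

```python
import numpy as np

def gradient_descent(H, y, eta=1e-3, tol=1e-6, max_iter=100_000):
    """Minimize RSS(w) = ||y - Hw||^2 by gradient descent."""
    w = np.zeros(H.shape[1])                # init w = 0
    for _ in range(max_iter):
        grad = -2 * H.T @ (y - H @ w)       # all partial[j] at once
        if np.linalg.norm(grad) < tol:      # ||grad RSS|| > eps test
            break
        w -= eta * grad
    return w

rng = np.random.default_rng(4)
H = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = H @ w_true + rng.normal(0, 0.01, size=50)
print(np.round(gradient_descent(H, y), 2))  # close to w_true
```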
Why min RSS?
Assuming Gaussian noise

[Figure: price ($) vs. square feet (sq.ft.) with fitted line and Gaussian error distributions around it]

Model for εi:  εi ~ N(0, σ^2)
Implied distribution on yi:  yi | xi ~ N(Σ_{j=0}^{D} wj hj(xi), σ^2)
Maximum likelihood estimate of params

Maximize log-likelihood wrt w:

ln p(D | w, σ) = ln [ (1 / (σ√(2π)))^N Π_{i=1}^{N} exp( -(yi - Σ_j wj hj(xi))^2 / (2σ^2) ) ]
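Expanding the log turns the product into a sum, which makes the connection to RSS explicit; a short derivation (standard, filled in here since the slide leaves the steps as an exercise):

```latex
\ln p(\mathcal{D} \mid w, \sigma)
  = -N \ln\!\left(\sigma\sqrt{2\pi}\right)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{N}
      \Bigl(y_i - \sum_{j=0}^{D} w_j h_j(x_i)\Bigr)^{2}
  = \text{const} - \frac{1}{2\sigma^2}\,\mathrm{RSS}(w)
```

So maximizing the likelihood over w is exactly minimizing RSS(w): least squares is the MLE under Gaussian noise.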
Interpreting the fitted function
Interpreting the coefficients – Simple linear regression

[Figure: price ($) vs. square feet (sq.ft.) with fitted line and slope triangle]

ŷ = ŵ0 + ŵ1 x

ŵ1 = predicted change in $ per 1 sq. ft. increase
Interpreting the coefficients – Two linear features

[Figure: 3-D plot of price (y) vs. sq.ft. (x[1]) and # bathrooms (x[2]); slice with x[2] fixed]

ŷ = ŵ0 + ŵ1 x[1] + ŵ2 x[2]
Interpreting the coefficients – Two linear features

[Figure: price ($) vs. # bathrooms (x[2]) for a fixed # sq.ft.]

ŷ = ŵ0 + ŵ1 x[1] + ŵ2 x[2]

ŵ2 = predicted change in $ per 1 additional bathroom, for fixed # sq.ft.!
Interpreting the coefficients – Multiple linear features

ŷ = ŵ0 + ŵ1 x[1] + … + ŵj x[j] + … + ŵd x[d]

[Figure: 3-D plot of price (y) vs. x[1], x[2]]

ŵj = predicted change in ŷ per unit change in x[j], holding all other features fixed.
Interpreting the coefficients – Polynomial regression

ŷ = ŵ0 + ŵ1 x + … + ŵj x^j + … + ŵp x^p

[Figure: price ($) vs. square feet (sq.ft.) with polynomial fit]

Can't hold the other features fixed! (All features are functions of the same input x, so ŵj has no simple per-unit interpretation.)
Recap of concepts
What you can do now…

• Describe polynomial regression
• Write a regression model using multiple inputs or features thereof
• Cast both polynomial regression and regression with multiple inputs as regression with multiple features
• Calculate a goodness-of-fit metric (e.g., RSS)
• Estimate model parameters of a general multiple regression model to minimize RSS:
  - In closed form
  - Using an iterative gradient descent algorithm
• Interpret the coefficients of a non-featurized multiple regression fit
• Exploit the estimated model to form predictions