Regression
CS294 Practical Machine Learning
Romain Thibaux, 09/18/06
Outline
• Ordinary Least Squares regression
  – Derivation from minimizing the sum of squares
  – Probabilistic interpretation
  – Online version (LMS)
• Overfitting and Regularization
• Numerical stability
• L1 Regression
• Kernel Regression, Spline Regression
• Multivariate Adaptive Regression Splines (MARS)
Classification (reminder)

X → Y

X: anything
• continuous (ℝ, ℝᵈ, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …

Y: discrete
– {0,1} binary
– {1,…,k} multi-class
– tree, etc. structured
Perceptron
Logistic Regression
Support Vector Machine
Decision Tree
Random Forest
Kernel trick
Regression

X → Y

X: anything
• continuous (ℝ, ℝᵈ, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …

Y: continuous
– ℝ, ℝᵈ
Examples
• Voltage → Temperature
• Processes, memory → Power consumption
• Protein structure → Energy [next week]
• Robot arm controls → Torque at effector
• Location, industry, past losses → Premium
Linear regression
[Figure: Temperature measurements, 3D and 2D scatter plots]
[start Matlab demo lecture2.m]
Given examples (xᵢ, yᵢ), i = 1, …, n
Predict yₙ₊₁ given a new point xₙ₊₁

Linear regression

Prediction: ŷ = w₀ + w₁x₁ (a line; with two inputs, ŷ = w₀ + w₁x₁ + w₂x₂, a plane)

[Figure: fitted line (2D) and fitted plane (3D) through the temperature data]
Ordinary Least Squares (OLS)

[Figure: fitted line with the residuals drawn between each observation and its prediction]

Error, or "residual": εᵢ = yᵢ − ŷ(xᵢ)  (observation minus prediction)

Sum squared error: E(w) = Σᵢ (yᵢ − wᵀxᵢ)²

Minimize the sum squared error: setting ∇E(w) = 0 gives a linear
equation in w, i.e. a linear system to solve.
Alternative derivation

Stack the n examples (each xᵢ ∈ ℝᵈ) into a matrix X (n×d) and a
vector y (n×1), and minimize ‖Xw − y‖². The gradient vanishes at
the normal equations XᵀX w = Xᵀy.
Solve the system (it's better not to invert the matrix).
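A minimal NumPy sketch of this solve (the toy data and variable names are mine, not from the lecture):

import numpy as np

# Toy data: y is roughly linear in x, plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

# Design matrix with a constant column for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X'X w = X'y; solving beats inverting.
w = np.linalg.solve(X.T @ X, X.T @ y)
# More numerically stable alternative: w, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w)  # close to [1.0, 2.0]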
LMS Algorithm (Least Mean Squares)

Online algorithm: update w after seeing each example (xᵢ, yᵢ):

w ← w + α (yᵢ − wᵀxᵢ) xᵢ

where α is the step size (learning rate).
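A sketch of the online update on the same kind of toy data (α is hand-picked, not tuned):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=500)
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
alpha = 0.005  # too large diverges, too small converges slowly
for xi, yi in zip(X, y):
    w += alpha * (yi - w @ xi) * xi  # Widrow-Hoff / LMS update

print(w)  # drifts toward [1.0, 2.0]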
Beyond lines and planes

Everything is the same with features φ(x), e.g. φ(x) = (1, x, x²):
the model ŷ = wᵀφ(x) is still linear in w (see the sketch below).

[Figure: polynomial fit to the 1D data]
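The same normal-equations solve with a polynomial basis (the degree and data are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(0, 0.1, size=100)

# Polynomial features phi(x) = (1, x, x^2, ..., x^degree).
degree = 3
Phi = np.vander(x, degree + 1, increasing=True)

# Still linear in w, so the solve is unchanged.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)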
Geometric interpretation

The prediction Xw is the orthogonal projection of y onto the span
of the columns of X: the residual y − Xw is perpendicular to every
column, which is exactly the normal equations Xᵀ(y − Xw) = 0.

[Matlab demo]

[Figure: y and its projection onto the column space of X]
Ordinary Least Squares [summary]

Given examples (xᵢ, yᵢ), i = 1, …, n, with xᵢ ∈ ℝᵈ.

Let X be the n×d matrix with rows xᵢᵀ, and y = (y₁, …, yₙ)ᵀ.
For example, xᵢ may itself be a feature vector φ(zᵢ) = (1, zᵢ, zᵢ², …).

Minimize ‖Xw − y‖² by solving XᵀX w = Xᵀy.

Predict ŷ = wᵀx for a new point x.
Probabilistic interpretation

Model y = wᵀx + ε with Gaussian noise ε ~ N(0, σ²).

[Figure: fitted line with a Gaussian centered on the prediction at each x]

Likelihood: p(y₁, …, yₙ | x₁, …, xₙ, w) = ∏ᵢ N(yᵢ; wᵀxᵢ, σ²).
Maximizing the likelihood minimizes the sum squared error.
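The one-line derivation behind that equivalence (standard algebra, not copied from the slides):

−log p(y | X, w) = (1/2σ²) Σᵢ (yᵢ − wᵀxᵢ)² + (n/2) log(2πσ²)

For fixed σ, maximizing the likelihood is exactly minimizing Σᵢ (yᵢ − wᵀxᵢ)².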
Assumptions vs. Reality

[Figure: empirical distributions of Voltage and Temperature in the Intel sensor network data; neither matches the Gaussian assumption exactly]
Overfitting
[Figure: degree 15 polynomial fit oscillating wildly between the data points]

[Matlab demo]
Ridge Regression (Regularization)

[Figure: effect of regularization on a degree 19 polynomial fit]

Minimize ‖Xw − y‖² + λ‖w‖² with λ "small",
by solving (XᵀX + λI) w = Xᵀy.
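A sketch of the regularized solve (the λ value is illustrative):

import numpy as np

def ridge_fit(X, y, lam=1e-3):
    # Solve (X'X + lam I) w = X'y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)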
Probabilistic interpretation

Likelihood: p(y | X, w) = ∏ᵢ N(yᵢ; wᵀxᵢ, σ²)
Prior: w ~ N(0, τ²I)
Posterior: p(w | X, y) ∝ p(y | X, w) p(w)
The MAP estimate (maximizing the posterior) is ridge regression with λ = σ²/τ².
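Spelled out (standard algebra, not from the slides):

−log p(w | X, y) = (1/2σ²) Σᵢ (yᵢ − wᵀxᵢ)² + (1/2τ²) ‖w‖² + const

which is the ridge objective scaled by 1/σ², with λ = σ²/τ².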
Numerical Accuracy

The condition number of XᵀX (ratio of its largest to smallest
eigenvalue) controls how much error is amplified when solving the
system: a well-conditioned vs. a nearly singular XᵀX.

We want covariates as perpendicular as possible, and of roughly the same scale:
• Regularization
• Preconditioning
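A small synthetic demonstration of how regularization improves conditioning:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
# Two nearly collinear covariates make X'X nearly singular.
X = np.column_stack([x, x + 1e-4 * rng.normal(size=100)])

A = X.T @ X
print(np.linalg.cond(A))                     # huge condition number
print(np.linalg.cond(A + 1e-3 * np.eye(2)))  # far smaller after regularization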
Errors in Variables (Total Least Squares)

OLS assumes the noise is only in y; total least squares also
allows noise in x, measuring the error perpendicular to the
fitted line rather than vertically.

[Figure: perpendicular vs. vertical residuals]
Sensitivity to outliers

High weight is given to outliers: with squared error the
influence of a point grows linearly with its residual, so a
single bad measurement can drag the fit far away.

[Figure: temperature at noon; one outlier pulls the least-squares line]

Influence function: for squared loss, ψ(r) ∝ r (unbounded).
L1 Regression

Minimize Σᵢ |yᵢ − wᵀxᵢ| instead of the sum of squares. This can
be solved as a linear program (see the sketch below), and the
influence function is bounded, ψ(r) ∝ sign(r), so each outlier
pulls with at most constant force.
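A minimal sketch of the linear-program formulation, assuming SciPy is available (the slack variables t bound the absolute residuals):

import numpy as np
from scipy.optimize import linprog

def l1_fit(X, y):
    # Minimize sum_i |y_i - w.x_i| as an LP over (w, t):
    # minimize sum(t)  subject to  -t <= y - Xw <= t.
    n, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(n)])  # objective: sum of t
    A_ub = np.block([[X, -np.eye(n)],              #  Xw - t <=  y
                     [-X, -np.eye(n)]])            # -Xw - t <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * d + [(0, None)] * n  # w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d]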
Kernel Regression

Predict with a locally weighted average of the training targets
(the Nadaraya-Watson estimator):

ŷ(x) = Σᵢ K(x, xᵢ) yᵢ / Σᵢ K(x, xᵢ),  e.g. K(x, xᵢ) = exp(−(x − xᵢ)²/2σ²)

[Figure: kernel regression fit with σ = 1]
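A sketch with a Gaussian kernel (the bandwidth is illustrative):

import numpy as np

def kernel_regression(x_train, y_train, x_query, sigma=1.0):
    # Nadaraya-Watson: kernel-weighted average of the training targets.
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    K = np.exp(-d2 / (2 * sigma ** 2))
    return (K @ y_train) / K.sum(axis=1)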
Spline Regression: regression on each interval

[Figure: independent fits on each interval, discontinuous at the knots]

Spline Regression: with equality constraints

[Figure: pieces constrained to agree at the knots, giving a continuous fit]

Spline Regression: with L1 cost

[Figure: spline fit with L1 cost, robust to outliers]
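One simple way to get the continuity constraint for free is a hinge basis, each piece adding a max(0, x − knot) term; a sketch, not necessarily the lecture's exact construction:

import numpy as np

def linear_spline_fit(x, y, knots):
    # Fit w0 + w1*x + sum_j c_j * max(0, x - knot_j):
    # piecewise linear and continuous by construction.
    hinges = np.maximum(0.0, x[:, None] - np.asarray(knots)[None, :])
    X = np.column_stack([np.ones_like(x), x, hinges])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ w  # fitted values at the training points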
Heteroscedasticity

[Figure: #requests per minute vs. time (days); the noise variance changes over time]
MARS: Multivariate Adaptive Regression Splines
…on the board…
Further topics
• Generalized Linear Models
• Gaussian process regression
• Local Linear regression
• Feature Selection [next class]