Data Analysis Overview

[Figure: an experimental environment (prototype/real system, execution-driven simulation, trace-driven simulation, or stochastic simulation), fed by workload parameters, system-configuration parameters, and factor levels, produces raw data; different experiments produce different sets of samples.]
Data Analysis Overview

[Figure: the same pipeline, with one set of experiments and a repeated set.]

Comparison of Alternatives
• Common case – one sample point for each alternative
• Conclusions apply only to this set of experiments
Data Analysis Overview

[Figure: the same pipeline, with one repeated experiment producing multiple sets of samples.]

Characterizing this sample data set
• Central tendency – mean, mode, median
• Variability – range, std dev, COV, quantiles
• Fit to known distribution

Sample data vs. population
• Confidence interval for the mean
• Significance level
• Sample size n given r% accuracy
Data Analysis Overview

[Figure: the same pipeline, with one experiment per alternative (A and B) in one set of experiments.]

Comparison of Alternatives: Paired Observations
• Treat as one sample of pairwise differences a_i − b_i
• Compute a confidence interval for the mean difference
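A minimal Python sketch of the paired-difference interval (the measurements below are hypothetical placeholders, not data from the lecture):

    # 90% confidence interval for the mean of paired differences a_i - b_i.
    from scipy import stats

    a = [5.4, 16.2, 9.8, 12.7, 8.4]   # alternative A (hypothetical values)
    b = [4.7, 15.9, 11.0, 11.3, 8.2]  # alternative B (hypothetical values)
    d = [ai - bi for ai, bi in zip(a, b)]

    n = len(d)
    mean = sum(d) / n
    sd = (sum((x - mean) ** 2 for x in d) / (n - 1)) ** 0.5
    t = stats.t.ppf(0.95, n - 1)      # two-sided 90% level, n-1 DOF
    half = t * sd / n ** 0.5
    print(mean - half, mean + half)   # interval containing 0 => no real difference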
Data Analysis Overview

[Figure: the same pipeline, one experiment per alternative in one set of experiments; each alternative yields its own statistics: x̄_a, s_a, CI_a and x̄_b, s_b, CI_b.]

Unpaired Observations
• Treat as multiple samples – compare sample means and look for overlapping CIs
• t-test on the mean difference x̄_a − x̄_b
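A minimal sketch of the unpaired case using SciPy's t-test (again with hypothetical samples):

    # t-test on the difference of means of two unpaired samples.
    from scipy import stats

    xa = [1.2, 1.5, 1.1, 1.4, 1.3]    # samples of system A (hypothetical)
    xb = [1.6, 1.8, 1.5, 1.9, 1.7]    # samples of system B (hypothetical)

    t_stat, p = stats.ttest_ind(xa, xb, equal_var=False)  # Welch's t-test
    print(t_stat, p)                  # p < 0.10 => means differ at the 90% level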
Data Analysis Overview

[Figure: the same pipeline; predictor values x and factor levels produce samples of the response y_1, y_2, …, y_n.]

Regression models
• response var = f(predictors)
© 1998, Geoff Kuenning
Linear Regression Models
• What is a (good) model?
• Estimating model parameters
• Allocating variation (R²)
• Confidence intervals for regressions
• Verifying assumptions visually
Confidence Intervals for Regressions

• Regression is done from a single sample (size n)
– A different sample might give different results
– True model is $y = \beta_0 + \beta_1 x$
– Parameters b0 and b1 are really means taken from a population sample
Calculating Intervals for Regression Parameters

• Standard deviations of the parameters:

    $s_{b_0} = s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum x^2 - n\bar{x}^2}}$

    $s_{b_1} = \frac{s_e}{\sqrt{\sum x^2 - n\bar{x}^2}}$

• Confidence intervals are $b_i \pm t\,s_{b_i}$
• where t has n − 2 degrees of freedom
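These formulas are easy to check in code. A minimal Python sketch, plugging in the summary statistics of the example that follows:

    # Standard deviations and 90% confidence intervals for b0 and b1.
    from scipy import stats

    se, n, sum_x2, xbar = 0.13, 5, 264.0, 6.8
    b0, b1 = 0.35, 0.29                     # fitted parameters from the example

    denom = sum_x2 - n * xbar ** 2          # sum(x^2) - n*xbar^2 = 32.8
    s_b0 = se * (1 / n + xbar ** 2 / denom) ** 0.5
    s_b1 = se / denom ** 0.5
    t = stats.t.ppf(0.95, n - 2)            # 2.353 for 3 degrees of freedom

    for name, b, s in (("b0", b0, s_b0), ("b1", b1, s_b1)):
        print(name, (b - t * s, b + t * s))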
Example of Regression Confidence Intervals

• Recall s_e = 0.13, n = 5, Σx² = 264, x̄ = 6.8
• So

    $s_{b_0} = 0.13\sqrt{\frac{1}{5} + \frac{(6.8)^2}{264 - 5(6.8)^2}} = 0.16$

    $s_{b_1} = \frac{0.13}{\sqrt{264 - 5(6.8)^2}} = 0.023$

• Using a 90% confidence level, t_{0.95;3} = 2.353
Regression Confidence Example, cont’d

• Thus, the b0 interval is

    0.35 ± 2.353(0.16) = (−0.03, 0.73)

– Not significant at 90%
• And b1 is

    0.29 ± 2.353(0.023) = (0.24, 0.34)

– Significant at 90% (and would survive even a 99% test)
Confidence Intervals for Predictions

• Previous confidence intervals are for parameters
– How certain can we be that the parameters are correct?
• Purpose of regression is prediction
– How accurate are the predictions?
– Regression gives the mean of the predicted response, based on the sample we took
Predicting m Samples

• Standard deviation for the mean of a future sample of m observations at x_p is

    $s_{\hat{y}_{mp}} = s_e \sqrt{\frac{1}{m} + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum x^2 - n\bar{x}^2}}$

• Note that the deviation drops as m → ∞
• Variance is minimal at x_p = x̄
• Use t-quantiles with n − 2 DOF for the interval
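A minimal Python sketch of this formula (the quantities below are the ones used in the example that follows):

    # Prediction interval for the mean of m future observations at x_p.
    from scipy import stats

    def prediction_interval(xp, m, b0, b1, se, n, sum_x2, xbar, conf=0.90):
        s = se * (1/m + 1/n + (xp - xbar) ** 2 / (sum_x2 - n * xbar ** 2)) ** 0.5
        t = stats.t.ppf(1 - (1 - conf) / 2, n - 2)
        yhat = b0 + b1 * xp
        return yhat, (yhat - t * s, yhat + t * s)

    # One run (m = 1) of 8 loops, with the example's parameters:
    print(prediction_interval(8, 1, 0.35, 0.29, 0.13, 5, 264.0, 6.8))
    # -> 2.67, roughly (2.34, 3.00)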
Example of Confidence of Predictions

• Using the previous equation, what is the predicted time for a single run of 8 loops?
• Time = 0.35 + 0.29(8) = 2.67
• Standard deviation of errors s_e = 0.13

    $s_{\hat{y}_{1p}} = 0.13\sqrt{1 + \frac{1}{5} + \frac{(8 - 6.8)^2}{264 - 5(6.8)^2}} = 0.14$

• The 90% interval is then

    2.67 ± 2.353(0.14) = (2.34, 3.00)
Other Regression Methods

• Multiple linear regression – more than one predictor variable
• Categorical predictors – some of the predictors aren’t quantitative but represent categories
• Curvilinear regression – nonlinear relationship
• Transformations – when errors are not normally distributed or variance is not constant
• Handling outliers
• Common mistakes in regression analysis
Multiple Linear Regression

• Models with more than one predictor variable
• But each predictor variable has a linear relationship to the response variable
• Conceptually, plotting a regression line in n-dimensional space, instead of 2-dimensional
Basic Multiple Linear Regression Formula

• Response y is a function of k predictor variables x1, x2, …, xk:

    $y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k + e$
A Multiple Linear Regression Model

Given a sample of n observations

    $\{(x_{11}, x_{21}, \ldots, x_{k1}, y_1), \ldots, (x_{1n}, x_{2n}, \ldots, x_{kn}, y_n)\}$

the model consists of n equations (note typo in book):

    $y_1 = b_0 + b_1 x_{11} + b_2 x_{21} + \cdots + b_k x_{k1} + e_1$
    $y_2 = b_0 + b_1 x_{12} + b_2 x_{22} + \cdots + b_k x_{k2} + e_2$
    $\vdots$
    $y_n = b_0 + b_1 x_{1n} + b_2 x_{2n} + \cdots + b_k x_{kn} + e_n$
Looks Like It’s Matrix Arithmetic Time

    y = Xb + e

$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix}
1 & x_{11} & x_{21} & \cdots & x_{k1} \\
1 & x_{12} & x_{22} & \cdots & x_{k2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{1n} & x_{2n} & \cdots & x_{kn}
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{pmatrix} +
\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$
Analysis of Multiple Linear Regression

• Listed in box 15.1 of Jain
• Not terribly important (for our purposes) how they were derived
– This isn’t a class on statistics
• But you need to know how to use them
• Mostly matrix analogs to simple linear regression results
Example of Multiple Linear Regression

• Internet Movie Database keeps popularity ratings of movies (in numerical form)
• Postulate that the popularity of Academy Award winning films is based on two factors:
– Age
– Running time
• Produce a regression:

    rating = b0 + b1(age) + b2(length)
Some Sample Data

Title                   Age  Length  Rating
Silence of the Lambs      5     118     8.1
Terms of Endearment      13     132     6.8
Rocky                    20     119     7.0
Oliver!                  28     153     7.4
Marty                    41      91     7.7
Gentleman’s Agreement    49     118     7.5
Mutiny on the Bounty     61     132     7.6
It Happened One Night    62     105     8.0
Now for Some Tedious Matrix Arithmetic

• We need to calculate X, Xᵀ, XᵀX, (XᵀX)⁻¹, and Xᵀy
• Because

    $b = (X^T X)^{-1} X^T y$

• We will see that b = (8.373, 0.005, −0.009)
• Meaning the regression predicts:

    rating = 8.373 + 0.005·age − 0.009·length
X Matrix for Example

$X = \begin{pmatrix}
1 & 5 & 118 \\
1 & 13 & 132 \\
1 & 20 & 119 \\
1 & 28 & 153 \\
1 & 41 & 91 \\
1 & 49 & 118 \\
1 & 61 & 132 \\
1 & 62 & 105
\end{pmatrix}$
Transpose to Get Xᵀ

$X^T = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
5 & 13 & 20 & 28 & 41 & 49 & 61 & 62 \\
118 & 132 & 119 & 153 & 91 & 118 & 132 & 105
\end{pmatrix}$
Multiply to Get XᵀX

$X^T X = \begin{pmatrix}
8 & 279 & 968 \\
279 & 13025 & 33045 \\
968 & 33045 & 119572
\end{pmatrix}$
Invert to Get (XᵀX)⁻¹

$(X^T X)^{-1} = \begin{pmatrix}
7.7134 & -0.0227 & -0.0562 \\
-0.0227 & 0.0003 & 0.0001 \\
-0.0562 & 0.0001 & 0.0004
\end{pmatrix}$
Multiply to Get Xᵀy

$X^T y = \begin{pmatrix} 60.1 \\ 2118.9 \\ 7247.5 \end{pmatrix}$
Multiply (XᵀX)⁻¹(Xᵀy) to Get b

$b = \begin{pmatrix} 8.37 \\ 0.005 \\ -0.009 \end{pmatrix}$
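All of the tedious arithmetic above reduces to a few lines of NumPy. A minimal sketch using the data from the “Some Sample Data” slide:

    import numpy as np

    age    = [5, 13, 20, 28, 41, 49, 61, 62]
    length = [118, 132, 119, 153, 91, 118, 132, 105]
    rating = [8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0]

    X = np.column_stack([np.ones(len(age)), age, length])
    y = np.array(rating)

    b = np.linalg.inv(X.T @ X) @ X.T @ y    # b = (X^T X)^-1 X^T y
    print(b.round(3))                       # ~[8.373, 0.005, -0.01]

(In practice, np.linalg.lstsq or np.linalg.solve is numerically preferable to forming the inverse explicitly; the inverse is shown only to mirror the slides.)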
How Good Is This Regression Model?

• How accurately does the model predict the rating of a film based on its age and running time?
• Best way to determine this analytically is to calculate the errors

    $SSE = \sum e_i^2$

or

    $SSE = y^T y - b^T X^T y$
Calculating the Errors

Rating  Age  Length  Estimated Rating    e_i    e_i²
 8.1      5    118        7.4          −0.71   0.51
 6.8     13    132        7.3           0.51   0.26
 7.0     20    119        7.4           0.45   0.21
 7.4     28    153        7.2          −0.20   0.04
 7.7     41     91        7.8           0.10   0.01
 7.5     49    118        7.6           0.11   0.01
 7.6     61    132        7.5          −0.05   0.00
 8.0     62    105        7.8          −0.21   0.04
Calculating the Errors, Continued

• So SSE = 1.08
• SSY = $\sum y_i^2$ = 452.9
• SS0 = $n\bar{y}^2$ = 451.5
• SST = SSY − SS0 = 452.9 − 451.5 = 1.4
• SSR = SST − SSE = .33

    $R^2 = \frac{SSR}{SST} = \frac{0.33}{1.4} = 0.23$

• In other words, this regression stinks
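A minimal sketch of the same bookkeeping in NumPy (setup repeated so the sketch runs on its own):

    import numpy as np

    age    = [5, 13, 20, 28, 41, 49, 61, 62]
    length = [118, 132, 119, 153, 91, 118, 132, 105]
    y = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])
    X = np.column_stack([np.ones(8), age, length])

    b, sse, *_ = np.linalg.lstsq(X, y, rcond=None)  # sse = sum of squared errors
    sst = y @ y - len(y) * y.mean() ** 2            # SSY - SS0
    print(sse[0], 1 - sse[0] / sst)                 # ~1.08 and R^2 ~0.23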
Why Does It Stink?

• Let’s look at the properties of the regression parameters

    $s_e = \sqrt{\frac{SSE}{n-3}} = \sqrt{\frac{1.08}{5}} = 0.465$

• Now calculate the standard deviations of the regression parameters
Calculating STDEV of Regression Parameters

• Estimations only, since we’re working with a sample
• Estimated stdev of each b_j uses the diagonal terms c_jj of (XᵀX)⁻¹:

    $s_{b_0} = s_e\sqrt{c_{00}} = 0.465\sqrt{7.7134} = 1.29$
    $s_{b_1} = s_e\sqrt{c_{11}} = 0.465\sqrt{0.0003} = 0.008$
    $s_{b_2} = s_e\sqrt{c_{22}} = 0.465\sqrt{0.0004} = 0.010$
Calculating Confidence Intervals

• At the 90% level, for instance, t_{0.95;5} = 2.015
• Confidence intervals for the parameters:

    b0 = 8.37 ± (2.015)(1.29) = (5.77, 10.97)
    b1 = .005 ± (2.015)(.008) = (−.01, .02)
    b2 = −.009 ± (2.015)(.010) = (−.03, .01)

• Only b0 is significant, at this level
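A minimal NumPy/SciPy sketch of these interval calculations, using the diagonal elements of (XᵀX)⁻¹:

    import numpy as np
    from scipy import stats

    age    = [5, 13, 20, 28, 41, 49, 61, 62]
    length = [118, 132, 119, 153, 91, 118, 132, 105]
    y = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])
    X = np.column_stack([np.ones(8), age, length])

    n, k = len(y), 2
    C = np.linalg.inv(X.T @ X)
    b = C @ X.T @ y
    e = y - X @ b
    se = np.sqrt(e @ e / (n - k - 1))       # ~0.465
    t = stats.t.ppf(0.95, n - k - 1)        # 2.015
    for bj, cjj in zip(b, np.diag(C)):
        s = se * np.sqrt(cjj)
        print(f"{bj:+.3f} +/- {t * s:.3f}")
    # Only b0's interval excludes zero.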
Analysis of Variance

• So, can we really say that none of the predictor variables are significant?
– Not yet; predictors may be correlated
• F-test can be used for this purpose
– E.g., to determine if the SSR is significantly higher than the SSE
– Equivalent to testing that y does not depend on any of the predictor variables
Running an F-Test

• Need to calculate SSR and SSE
• From those, calculate mean squares of the regression (MSR) and the errors (MSE)
• MSR/MSE has an F distribution
• If MSR/MSE > the F-table value, predictors explain a significant fraction of response variation
• Note typos in book’s table 15.3
– SSR has k degrees of freedom
– SST should match $\sum (y_i - \bar{y})^2$
F-Test for Our Example

• SSR = .33
• SSE = 1.08
• MSR = SSR/k = .33/2 = .16
• MSE = SSE/(n−k−1) = 1.08/(8 − 2 − 1) = .22
• F-computed = MSR/MSE = .76
• F[90; 2,5] = 3.78 (at 90%)
• So it fails the F-test at 90% (miserably)
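A minimal sketch of the test, with SciPy supplying the F quantile (numbers from the slides):

    from scipy import stats

    SSR, SSE, n, k = 0.33, 1.08, 8, 2
    MSR = SSR / k                              # ~0.16
    MSE = SSE / (n - k - 1)                    # ~0.22
    F = MSR / MSE                              # ~0.76
    F_table = stats.f.ppf(0.90, k, n - k - 1)  # ~3.78
    print(F > F_table)                         # False: fails at the 90% level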
Multicollinearity

• If two predictor variables are linearly dependent, they are collinear
– Meaning they are related
– And thus the second variable does not improve the regression
– In fact, it can make it worse
• Typical symptom is inconsistent results from various significance tests
Finding Multicollinearity

• Must test correlation between predictor variables
• If it’s high, eliminate one and repeat the regression without it
• If the significance of the regression improves, the problem was probably collinearity between the variables
Is Multicollinearity a Problem in Our Example?

• Probably not, since the significance tests are consistent
• But let’s check, anyway
• Calculate the correlation of age and length
• After tedious calculation, it’s −.25
– Not especially correlated
• Important point – adding a predictor variable does not always improve a regression
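The check is one call in NumPy:

    import numpy as np

    age    = [5, 13, 20, 28, 41, 49, 61, 62]
    length = [118, 132, 119, 153, 91, 118, 132, 105]
    print(np.corrcoef(age, length)[0, 1])   # ~ -0.25: not especially correlated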
Why Didn’t Regression Work Well Here?

• Check the scatter plots
– Rating vs. age
– Rating vs. length
• Regardless of how good or bad regressions look, always check the scatter plots
Rating vs. Length

[Figure: scatter plot of Rating (6–9) against Length (80–160).]
Rating vs. Age

[Figure: scatter plot of Rating (6–9) against Age (0–80).]
Regression With Categorical Predictors

• Regression methods discussed so far assume numerical variables
• What if some of your variables are categorical in nature?
• Use techniques discussed later in the class if all predictors are categorical
• Levels – number of values a category can take
Handling Categorical Predictors

• If only two levels, define b_i as follows
– b_i = 0 for the first value
– b_i = 1 for the second value
• This definition is missing from the book in section 15.2
• Can use +1 and −1 as values, instead
• Need k−1 predictor variables for k levels
– To avoid implying an order in the categories
Categorical Variables Example

• Which is a better predictor of a high rating in the movie database: winning an Oscar, winning the Golden Palm at Cannes, or winning the New York Critics Circle?
Choosing Variables

• Categories are not mutually exclusive
• x1 = 1 if Oscar, 0 otherwise
• x2 = 1 if Golden Palm, 0 otherwise
• x3 = 1 if Critics Circle Award, 0 otherwise
• y = b0 + b1x1 + b2x2 + b3x3
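With this 0/1 coding, the fit is ordinary least squares. A minimal NumPy sketch using the award data shown on the next slide (award columns as reconstructed there):

    import numpy as np

    #        Oscar  Palm  NYC  rating
    films = [(1, 0, 1, 7.5),   # Gentleman's Agreement
             (1, 0, 0, 7.6),   # Mutiny on the Bounty
             (1, 1, 1, 7.4),   # Marty
             (0, 1, 0, 7.8),   # If
             (0, 1, 0, 8.1),   # La Dolce Vita
             (0, 1, 0, 8.2),   # Kagemusha
             (0, 0, 1, 7.5),   # The Defiant Ones
             (0, 0, 1, 6.6),   # Reds
             (0, 0, 1, 8.1)]   # High Noon
    data = np.array(films)
    X = np.column_stack([np.ones(len(films)), data[:, :3]])
    b, *_ = np.linalg.lstsq(X, data[:, 3], rcond=None)
    print(b.round(1))          # ~[7.8, -0.1, 0.2, -0.4]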
A Few Data Points

Title                  Rating  Oscar  Palm  NYC
Gentleman’s Agreement    7.5     X           X
Mutiny on the Bounty     7.6     X
Marty                    7.4     X     X     X
If                       7.8           X
La Dolce Vita            8.1           X
Kagemusha                8.2           X
The Defiant Ones         7.5                 X
Reds                     6.6                 X
High Noon                8.1                 X
And Regression Says . . .

• $\hat{y} = 7.8 - 0.1 x_1 + 0.2 x_2 - 0.4 x_3$
• How good is that?
• R² is 34% of the variation
– Better than age and length
– But still no great shakes
• Are the regression parameters significant at the 90% level?
Curvilinear Regression

• Linear regression assumes a linear relationship between predictor and response
• What if it isn’t linear?
• You need to fit some other type of function to the relationship
When To Use Curvilinear Regression

• Easiest to tell by sight – make a scatter plot
– If the plot looks non-linear, try curvilinear regression
• Or if a non-linear relationship is suspected for other reasons
• The relationship should be convertible to a linear form
Types of Curvilinear Regression

• Many possible types, based on a variety of relationships:

    $y = b x^a$
    $y = a b^x$
    $y = a + b/x$

• Many others
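For instance, the power relationship y = bxᵃ becomes linear after taking logs: ln y = ln b + a·ln x. A minimal sketch with synthetic data (the constants 3 and 0.5 below are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
    y = 3.0 * x ** 0.5 * np.exp(rng.normal(0, 0.05, x.size))  # noisy y = 3*x^0.5

    A = np.column_stack([np.ones(x.size), np.log(x)])         # regress ln y on ln x
    (lnb, a), *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    print(np.exp(lnb), a)      # close to 3 and 0.5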
Transform Them to Linear Forms

• Apply logarithms, multiplication, division, whatever, to produce something in linear form
• I.e., y = a + b·(something)
• Or a similar form
• If the predictor appears in more than one transformed predictor variable, correlation is likely
Transformations
• Using some function of the response variable y in place of y itself
• Curvilinear regression is one example of transformation
• But techniques are more generally applicable
When To Transform?
• If known properties of the measured system suggest it
• If the data’s range covers several orders of magnitude
• If the homogeneous variance assumption of the residuals is violated
Transforming Due to Heteroscedasticity

• If the spread in the scatter plot of residuals vs. predicted response is not homogeneous
• Then the residuals are still functions of the predictor variables
• Transformation of the response may solve the problem
What Transformation To Use?

• Compute the standard deviation of the residuals
• Plot it as a function of the mean of the observations
– Assuming multiple experiments for a single set of predictor values
• Check for linearity – if it is linear, use a log transform
Other Tests for Transformations
• If variance against mean of observations is linear, use square root transform
• If standard deviation against mean squared is linear, use inverse transform
• If standard deviation against mean to a power is linear, use a power transform
• More covered in the book
General Transformation Principle

For some observed relation

    $s = g(\bar{y})$

transform to

    $w = h(y)$, where $h(y) = \int \frac{1}{g(y)}\,dy$
For Example,

• A log transformation:
• If the standard deviation against the mean is linear, then g(y) = ay
• So

    $h(y) = \int \frac{1}{ay}\,dy = \frac{1}{a}\ln y$
Outliers

• Atypical observations might be outliers
– Measurements that are not truly characteristic
– By chance, several standard deviations out
– Or mistakes might have been made in measurement
• Which leads to a problem: do you include outliers in the analysis or not?
Deciding How To Handle Outliers

1. Find them (by looking at the scatter plot)
2. Check carefully for experimental error
3. Repeat experiments at the predictor values for the outlier
4. Decide whether or not to include the outliers
– Or do the analysis both ways

Question: Is the first point in the example an outlier on the rating vs. age plot?
Common Mistakes in Regression

• Generally based on taking shortcuts
• Or not being careful
• Or not understanding some fundamental principles of statistics
Not Verifying Linearity

• Draw the scatter plot
• If it isn’t linear, check for curvilinear possibilities
• Using linear regression when the relationship isn’t linear is misleading
Relying on Results Without Visual Verification

• Always check the scatter plot as part of regression
– Examine the line the regression predicts vs. the actual points
• Particularly important if the regression is done automatically
Attaching Importance to Values of Parameters

• Numerical values of regression parameters depend on the scale of the predictor variables
• So just because a particular parameter’s value seems “small” or “large,” that is not necessarily an indication of importance
• E.g., converting seconds to microseconds doesn’t change anything fundamental
– But the magnitude of the associated parameter changes
Not Specifying Confidence Intervals

• Samples of observations are random
• Thus, a regression performed on them yields parameters with random properties
• Without a confidence interval, it’s impossible to understand what a parameter really means
Not Calculating the Coefficient of Determination

• Without R², it is difficult to determine how much of the variance is explained by the regression
• Even if R² looks good, safest to also perform an F-test
• The extra amount of effort isn’t that large, anyway
Using the Coefficient of Correlation Improperly

• Coefficient of determination is R²
• Coefficient of correlation is R
• R² gives the percentage of variance explained by the regression, not R
• E.g., if R is .5, R² is .25
– And the regression explains 25% of the variance
– Not 50%
Using Highly Correlated Predictor Variables

• If two predictor variables are highly correlated, using both degrades the regression
• E.g., there is likely to be a correlation between an executable’s on-disk size and in-core size
– So don’t use both as predictors of run time
• Which means you need to understand your predictor variables as well as possible
Using Regression Beyond the Range of Observations

• Regression is based on observed behavior in a particular sample
• Most likely to predict accurately within the range of that sample
– Far outside the range, who knows?
• E.g., a regression on the run time of executables that are smaller than main memory may not predict the performance of executables that require much VM activity
Using Too Many Predictor Variables

• Adding more predictors does not necessarily improve the model
• More likely to run into multicollinearity problems
• So what variables to choose?
– Subject of much of this course
Measuring Too Little of the Range

• Regression only predicts well near the range of the observations
• If you don’t measure the commonly used range, the regression won’t predict much
• E.g., if many programs are bigger than main memory, only measuring those that are smaller is a mistake
Assuming a Good Predictor Is a Good Controller

• Correlation isn’t necessarily control
• Just because variable A is related to variable B, you may not be able to control the values of B by varying A
• E.g., if the number of hits on a Web page and server bandwidth are correlated, you might not increase hits by increasing bandwidth
• Often, a goal of regression is finding control variables
For Discussion Today

Project Proposal
1. Statement of hypothesis
2. Workload decisions
3. Metrics to be used
4. Method