
Announcements: Next Homework is on the Web –Due next Tuesday.

Date post: 20-Dec-2015
Page 1:

Announcements:

• Next Homework is on the Web (due next Tuesday)

Page 2:

• Mosquito repellent experiment:
• 30 people were recruited for an experiment.
• Groups of 10 were randomly assigned to one of three repellent types.
• They were put into a mosquito-filled room for 10 minutes (and told not to kill the mosquitoes!).
• Total number of bites in each group was counted after the experiment.

(Source: Steve Gulyas, CRC Testing)

Page 3:

Estimate of total variability in the data:

$S^2 = [(X_1 - \bar{X})^2 + \dots + (X_{30} - \bar{X})^2] / (30 - 1) = 11561.37 / 29 = 398.67$

where $\bar{X} = (X_1 + \dots + X_{30}) / 30$.

Page 4:

[Figure: same data, grouped by repellent type]

Grouping the data by treatment explains some of the variability! (Analysis of variance makes this explanation more precise.)

Page 5:

ANOVA table:

Source of Variation    df    Sum of Squares    Mean Square     F        P
Repellent               2        6952.3           3476.1      20.4    0.0000
Error                  27        4609.1            170.7
Total                  29       11561.4

• Total sum of squares (11561.4): total variability in the data is proportional to this.
• Error sum of squares (4609.1): variance of counts within each repellent type is proportional to this.
• Error mean square (170.7): estimate of the average variance of counts across the repellent types.

Sum of squares treatment + sum of squares error = sum of squares total: 6952.3 + 4609.1 = 11561.4

$R^2$ = SSTreat / SSTotal = 0.6013 is the fraction of variability accounted for by treatment.

For test:H0: C

Page 6:

Explaining why ANOVA is an analysis of variance:

MST = 6952.3 / 2 = 3476.1. Sqrt(MST) describes the standard deviation among the repellents.

MSE = 4609.1 / 27 = 170.7. Sqrt(MSE) describes the standard deviation of the counts within each repellent type.

F = MST / MSE = 20.4. It makes sense that this is large, and that the p-value $= \Pr(F_{3-1,\,30-3} > 20.4) \approx 0$ is small, because the variance "among treatments" is much larger than the variance within the units that get each treatment.

(Note that the F test assumes the counts are independent and normally distributed with the same variance.)
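The arithmetic on this slide can be checked in a few lines of Python. This is only a sketch (the lecture itself uses Minitab): the variable names are ours, the sums of squares are copied from the ANOVA table, and the p-value relies on a closed-form identity that holds only when the numerator degrees of freedom equal 2, namely $\Pr(F_{2,d} > f) = (1 + 2f/d)^{-d/2}$.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_treat, df_treat = 6952.3, 2    # repellent: 3 groups - 1
ss_error, df_error = 4609.1, 27   # 30 people - 3 groups

ss_total = ss_treat + ss_error    # decomposition: SSTreat + SSE = SSTotal
mst = ss_treat / df_treat         # mean square for treatment
mse = ss_error / df_error         # mean square error
f_stat = mst / mse
r_squared = ss_treat / ss_total

# Special case only: with 2 numerator degrees of freedom the upper tail of
# the F distribution has a closed form, so no statistics library is needed.
p_value = (1 + 2 * f_stat / df_error) ** (-df_error / 2)

print(f"MST = {mst:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
print(f"R^2 = {r_squared:.4f}, p-value = {p_value:.2e}")
```

The p-value comes out on the order of a few millionths, which is why the table prints it as 0.0000. For other degrees of freedom, use an F-distribution routine from a statistics package instead of this shortcut.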

For test:H0: C

Page 7:

It turns out that ANOVA is a special case of regression. We'll come back to that in a class or two. First, let's learn about regression (chapters 12 and 13).

• Simple Linear Regression example:

Ingrid is a small business owner who wants to buy a fleet of Mitsubishi Sigmas. To save money, she decides to buy second-hand cars and wants to estimate how much to pay. To do this, she asks one of her employees to collect data on how much people have paid for these cars recently. (From Matt Wand)

Page 8:

[Figure: Regression Plot. Price ($) versus Age (years). Data: each point is a car.]

Page 9:

• Plot suggests a simple model:

Price of car = intercept + slope times car's age + error

or

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \dots, 39$.

Estimate $\beta_0$ and $\beta_1$.

Outline for Regression:

1. Estimating the regression parameters and ANOVA tables for regression
2. Testing and confidence intervals
3. Multiple regression models & ANOVA
4. Regression diagnostics

Page 10:

• Plot suggests a model:

Price of car = intercept + slope times car's age + error

or

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \dots, 39$.

Estimate $\beta_0$ and $\beta_1$ with $b_0$ and $b_1$. Find these with "least squares".

In other words, find $b_0$ and $b_1$ to minimize the sum of squared errors:

$SSE = \{y_1 - (b_0 + b_1 x_1)\}^2 + \dots + \{y_n - (b_0 + b_1 x_n)\}^2$

See green line on next page.

Each term is the squared difference between an observed $y$ and the regression line $(b_0 + b_1 x)$.
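To make "least squares" concrete, here is a minimal Python sketch. It uses the standard closed-form solution for a single predictor ($b_1 = S_{xy}/S_{xx}$, $b_0 = \bar{y} - b_1\bar{x}$), which the slides do not state, and the ages and prices below are made up for illustration; they are not the 39 cars from the lecture.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical ages (years) and prices ($); NOT the lecture's data.
ages = [6, 8, 9, 11, 12, 14, 15]
prices = [6200, 5100, 5400, 3900, 3600, 2500, 2900]

b0, b1 = least_squares(ages, prices)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"SSE at the least-squares line: {sse(ages, prices, b0, b1):.1f}")
# Moving either estimate away from the solution can only increase SSE.
print(f"SSE with intercept shifted:    {sse(ages, prices, b0 + 50, b1):.1f}")
print(f"SSE with slope shifted:        {sse(ages, prices, b0, b1 + 5):.1f}")
```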

Page 11:

[Figure: Regression Plot. Price versus Age with the fitted line; a vertical segment joins one data point to the line.]

This segment has length $y_i - b_0 - b_1 x_i$ for some $i$. Its squared length contributes one term to the Sum of Squared Errors (SSE).

Price = 8198.25 - 385.108 Age

S = 1075.07, R-Sq = 43.8%, R-Sq(adj) = 42.2%

Page 12:

[Figure: Regression Plot. Price ($) versus Age (years) with the fitted line.]

General Model: Price = $\beta_0 + \beta_1$ Age + error

Fitted Model: Price = 8198.25 - 385.108 Age

S = 1075.07, R-Sq = 43.8%, R-Sq(adj) = 42.2%

Do Minitab example

Page 13:

Regression parameter estimates, $b_0$ and $b_1$, minimize

$SSE = \{y_1 - (b_0 + b_1 x_1)\}^2 + \dots + \{y_n - (b_0 + b_1 x_n)\}^2$

Full model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.

Suppose the errors ($\varepsilon_i$'s) are independent $N(0, \sigma^2)$. What do you think a good estimate of $\sigma^2$ is?

$MSE = SSE/(n-2)$ is an estimate of $\sigma^2$. Note how SSE looks like the numerator in $s^2$.

Page 14:

(I divided price by $1000. Think about why this doesn't matter.)

Source            DF      SS        MS        F        P
Regression         1    33.274    33.274    28.79    0.000
Residual Error    37    42.763     1.156
Total             38    76.038

Sum of Squares Total $= \{y_1 - \bar{y}\}^2 + \dots + \{y_{39} - \bar{y}\}^2 = 76.038$

Sum of Squared Errors $= \{y_1 - (b_0 + b_1 x_1)\}^2 + \dots + \{y_{39} - (b_0 + b_1 x_{39})\}^2 = 42.763$

Sum of Squares for Regression = SSTotal - SSE

What do these mean?

Page 15:

[Figure: Regression Plot. Price versus Age, showing the regression line and a horizontal line at the overall mean of $3,656.]

Price = 8198.25 - 385.108 Age

S = 1075.07, R-Sq = 43.8%, R-Sq(adj) = 42.2%

Page 16:

(I divided price by $1000. Think about why this doesn't really matter.)

Source            DF            SS        MS        F        P
Regression         1 = p-1    33.274    33.274    28.79    0.000
Residual Error    37 = n-p    42.763     1.156
Total             38 = n-1    76.038

p is the number of regression parameters (2 for now)

SSTotal $= \{y_1 - \bar{y}\}^2 + \dots + \{y_{39} - \bar{y}\}^2 = 76.038$

SSTotal / 38 is an estimate of the variance around the overall mean (i.e. variance in the data without doing regression).

SSE $= \{y_1 - (b_0 + b_1 x_1)\}^2 + \dots + \{y_{39} - (b_0 + b_1 x_{39})\}^2 = 42.763$

MSE = SSE / 37 is an estimate of the variance around the line (i.e. variance that is not explained by the regression).

SSR = SSTotal - SSE

MSR = SSR / 1 is the variance in the data that is "explained by the regression".
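As a check on the table, the mean squares and F statistic can be rebuilt from the two sums of squares alone. A small Python sketch, with variable names of our choosing and the numbers copied from the table (price in $1000s):

```python
import math

n, p = 39, 2          # 39 cars; 2 regression parameters (intercept and slope)
ss_total = 76.038     # from the table, price in $1000s
ss_error = 42.763

ss_reg = ss_total - ss_error      # sum of squares explained by the regression
ms_reg = ss_reg / (p - 1)
ms_error = ss_error / (n - p)     # estimate of the variance around the line
f_stat = ms_reg / ms_error
s = math.sqrt(ms_error)           # residual standard deviation, in $1000s

print(f"SSR = {ss_reg:.3f}, MSR = {ms_reg:.3f}, MSE = {ms_error:.3f}")
print(f"F = {f_stat:.2f}, S = {1000 * s:.2f} dollars")
```

Up to rounding in the printed table, S here matches the "S = 1075.07" reported on the regression plots, which is one way to see that dividing price by $1000 only rescales the sums of squares and leaves F unchanged.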

Page 17:

(I divided price by $1000. Think about why this doesn't really matter.)

Source            DF            SS        MS        F        P
Regression         1 = p-1    33.274    33.274    28.79    0.000
Residual Error    37 = n-p    42.763     1.156
Total             38 = n-1    76.038

p is the number of regression parameters

A test of $H_0: \beta_1 = 0$ versus $H_A$: $\beta_1$ is not 0

Reject if the variance explained by the regression is high compared to the unexplained variability in the data. Reject if F is large.

F = MSR / MSE

p-value is $\Pr(F_{p-1,\,n-p} > MSR / MSE)$

Reject $H_0$ at any significance level $\alpha$ greater than the p-value (i.e. when p-value < $\alpha$).

(See Minitab example and confidence intervals for estimated parameters.)

(Assuming errors are independent and normal.)

Page 18:

$R^2$

• Another summary of a regression is:

$R^2$ = (Sum of Squares for Regression) / (Sum of Squares Total)

$0 \le R^2 \le 1$

This is the proportion of the variation in the data that is described by the regression.
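For the car example, $R^2$ follows directly from the ANOVA sums of squares. The sketch below also computes an adjusted $R^2$ using the usual mean-square definition; we are assuming, not confirming from the slides, that this is what Minitab reports as R-Sq(adj).

```python
ss_total, ss_error = 76.038, 42.763   # car example, price in $1000s
n, p = 39, 2                          # observations, regression parameters

# R^2: fraction of the total sum of squares accounted for by the regression.
r_sq = (ss_total - ss_error) / ss_total

# Adjusted R^2 compares mean squares rather than sums of squares
# (standard definition; assumed to match Minitab's R-Sq(adj)).
r_sq_adj = 1 - (ss_error / (n - p)) / (ss_total / (n - 1))

print(f"R-Sq = {100 * r_sq:.1f}%, R-Sq(adj) = {100 * r_sq_adj:.1f}%")
```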

Page 19:

Two different ways to assess "worth" of a regression

1. Absolute size of slope: bigger = better

2. Size of error variance: smaller = better
   1. $R^2$ close to one
   2. Large F statistic

Page 20:

Multiple Regression

• Cheese Example:

In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia, samples of cheese were analyzed to determine the amount of acetic acid and hydrogen sulfide they contained.

• Overall scores for each cheese were obtained by combining the scores from several tasters.

• The goal is to predict the taste score based on the acetic acid and hydrogen sulfide content.

(From Matt Wand)

Page 21:

Model: A simple model for taste is:

$\text{Taste}_i = \beta_0 + \beta_1 \text{acetic}_i + \beta_2 \text{H2S}_i + \text{error}_i$, $i = 1, \dots, n = 30$

Again the intercept and slopes are selected to minimize the error sum of squares:

$SSE = \{\text{taste}_1 - (b_0 + b_1 \text{acetic}_1 + b_2 \text{H2S}_1)\}^2 + \dots + \{\text{taste}_{30} - (b_0 + b_1 \text{acetic}_{30} + b_2 \text{H2S}_{30})\}^2$

Geometrically: The simple linear model estimated a line. A model with an intercept and 2 slopes estimates a surface.

Note that you could add more predictors too…
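One way to see what "minimize the error sum of squares" means with two predictors is to solve the normal equations $(X^T X) b = X^T y$ directly. The slides leave the computation to Minitab, so the sketch below is only an illustration: the helper functions are ours, and the acetic, H2S and taste values are invented, not the lecture's 30 cheeses.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(rows, y):
    """Least-squares coefficients [b0, b1, b2, ...] for the predictors in `rows`."""
    design = [[1.0] + list(r) for r in rows]          # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical (acetic, H2S) values and taste scores; NOT the lecture's data.
predictors = [(4.5, 3.1), (5.2, 5.0), (5.6, 7.5), (4.9, 3.9), (6.1, 8.2), (5.8, 6.0)]
taste = [12.3, 20.9, 39.0, 15.2, 47.9, 25.9]

b0, b1, b2 = fit(predictors, taste)
sse = sum((t - (b0 + b1 * a + b2 * h)) ** 2 for (a, h), t in zip(predictors, taste))
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}, SSE = {sse:.2f}")
```

Adding more predictors only adds columns to each row of `predictors`; the same two functions apply.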

Page 22:

Minitab:
• Stat: Regression: Regression
– Response is taste
– Predictors are acetic and h2s

• Output:

The regression equation is
taste = - 34.0 - 7.57 H2S + 14.8 acetic

Predictor      Coef     SE Coef       T        P
Constant     -33.99      26.53     -1.28    0.211
H2S           -7.570      3.474    -2.18    0.038
acetic        14.763      4.242     3.48    0.002

S = 12.98   R-Sq = 40.6%   R-Sq(adj) = 36.2%

Analysis of Variance

Source            DF      SS        MS        F        P
Regression         2    3114.0    1557.0     9.24    0.001
Residual Error    27    4548.9     168.5
Total             29    7662.9

Page 23:

Minitab:

The regression equation is
taste = - 34.0 - 7.57 H2S + 14.8 acetic

Predictor      Coef     SE Coef       T        P
Constant     -33.99      26.53     -1.28    0.211
H2S           -7.570      3.474    -2.18    0.038
acetic        14.763      4.242     3.48    0.002

T = Coef / SE Coef (the test statistic)

P-value is for test: $H_0$: Coef = 0, $H_A$: Coef is not 0 (if p-value < $\alpha$, then reject $H_0$)

$1 - \alpha$ CI for Coef: Coef +/- SE Coef $\times\ t_{\alpha/2,\ df = \text{error df}}$
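The T column and the confidence intervals can be reproduced from the Coef and SE Coef columns. In this sketch the critical value 2.052 is our assumption: it is the tabulated $t_{0.025,\,27}$ for a 95% interval with 27 error degrees of freedom, taken from a standard t table rather than from the slides.

```python
# Coefficients and standard errors copied from the Minitab output.
coefs = {
    "Constant": (-33.99, 26.53),
    "H2S": (-7.570, 3.474),
    "acetic": (14.763, 4.242),
}

# Assumption: tabulated t critical value t_{0.025, 27} (95% CI, error df = 27).
T_CRIT = 2.052

results = {}
for name, (coef, se) in coefs.items():
    t_stat = coef / se                    # T = Coef / SE Coef
    half_width = T_CRIT * se              # CI is Coef +/- SE Coef * t
    results[name] = (t_stat, coef - half_width, coef + half_width)
    print(f"{name:8s} T = {t_stat:6.2f}   95% CI = "
          f"({coef - half_width:7.2f}, {coef + half_width:7.2f})")
```

The interval for the constant covers 0 while the two slope intervals do not, which agrees with the p-values of 0.211, 0.038 and 0.002 in the output.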

Page 24:

Minitab:

The regression equation is
taste = - 34.0 - 7.57 H2S + 14.8 acetic

Model is regression equation + error:
taste = - 34.0 - 7.57 H2S + 14.8 acetic + error

Analysis of Variance

Source            DF      SS        MS        F        P
Regression         2    3114.0    1557.0     9.24    0.001
Residual Error    27    4548.9     168.5
Total             29    7662.9

MSE = 168.5 = estimated variance of the error.

F stat = MSR / MSE (this is the test statistic)

P-value is for test: $H_0: \beta_1 = \beta_2 = 0$ (both slopes = 0) versus $H_A$: at least one is not 0

This is a test of the "usefulness of regression": an overall test of whether or not the regression is useful.

Page 25:

Using the regression equation:

taste = - 34.0 - 7.57 H2S + 14.8 acetic

If H2S = 3 and acetic = 5, then what is the expected taste score? (NOTE that this is not an extrapolation…)

For the value, just plug H2S = 3 and acetic = 5 into the equation.

For the "confidence interval" (CI):

Stat: Regression: Regression, Options button: prediction intervals for new observations (enter the values in the order in which the predictors appear in the regression equation)

New Obs     Fit     SE Fit        95.0% CI               95.0% PI
1          17.11     3.17     (10.60, 23.63)      (-10.30, 44.53)

Prediction interval: wider than the CI, since prediction includes "error" variability as well as variability in estimating the parameters.
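The Fit, CI and PI columns can be approximately reconstructed from numbers already shown. This sketch uses the three-decimal coefficients from the Minitab coefficient table rather than the rounded equation, and again assumes the tabulated value $t_{0.025,\,27} = 2.052$. The prediction standard error $\sqrt{MSE + (\text{SE Fit})^2}$ is the standard formula, not something stated on the slide.

```python
import math

# Coefficients from the Minitab coefficient table (more digits than the equation).
b0, b_h2s, b_acetic = -33.99, -7.570, 14.763
h2s, acetic = 3, 5

fit = b0 + b_h2s * h2s + b_acetic * acetic    # plug the new values into the equation

se_fit = 3.17     # "SE Fit" from the Minitab output
mse = 168.5       # residual mean square from the ANOVA table
T_CRIT = 2.052    # assumption: tabulated t_{0.025, 27}

# CI: uncertainty in the estimated mean taste score at these predictor values.
ci = (fit - T_CRIT * se_fit, fit + T_CRIT * se_fit)

# PI: a new observation adds its own error variance (estimated by MSE).
se_pred = math.sqrt(mse + se_fit ** 2)
pi = (fit - T_CRIT * se_pred, fit + T_CRIT * se_pred)

print(f"Fit = {fit:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

Small differences from the Minitab row in the second decimal place are expected, because the inputs here are already rounded.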

Page 26:

Dummy (or indicator) variables:

• When some predictor variables are categorical, then regression can still be used.

• Dummy variables are used to indicate fabric of each observation…

Page 27:

Regression Model for Burn Time Data

Burn time = $\beta_1$ (if fabric 1) + $\beta_2$ (if fabric 2) + $\beta_3$ (if fabric 3) + $\beta_4$ (if fabric 4) + error

or

$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \varepsilon_i$

(x's are "indicator variables")

$x_{1i}$ = 1 if observation i is fabric 1 and 0 otherwise
$x_{2i}$ = 1 if observation i is fabric 2 and 0 otherwise
$x_{3i}$ = 1 if observation i is fabric 3 and 0 otherwise
$x_{4i}$ = 1 if observation i is fabric 4 and 0 otherwise

The $\beta$'s are fabric-specific means. The model does not have an intercept.
(Stat: Regression: Regression, Options: "Fit intercept" button)

Page 28:

An Equivalent Model:

$y_i = \beta_0 + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \varepsilon_i$

$x_{2i}$ = 1 if observation i is fabric 2 and 0 otherwise
$x_{3i}$ = 1 if observation i is fabric 3 and 0 otherwise
$x_{4i}$ = 1 if observation i is fabric 4 and 0 otherwise

Fabric 1 mean = $\beta_0$
Fabric 2 mean = $\beta_0 + \beta_2$
Fabric 3 mean = $\beta_0 + \beta_3$
Fabric 4 mean = $\beta_0 + \beta_4$

This model does have an intercept.

$\beta_0$ is the mean for fabric 1.

The rest of the $\beta$'s are "offsets".
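The two codings can be compared side by side in Python. For a one-way layout like this, the least-squares estimates reduce to group means (a standard result, not derived on the slides), so no solver is needed. The burn times below are invented for illustration and are not the lecture's data.

```python
from statistics import mean

# Hypothetical burn times for four fabrics; NOT the lecture's data.
burn = {
    1: [17.8, 16.2, 17.5, 16.0],
    2: [11.2, 11.4, 10.1, 11.1],
    3: [11.8, 11.0, 10.0, 9.2],
    4: [14.9, 10.8, 12.8, 10.7],
}


def indicators(fabric, baseline=None):
    """Dummy variables for one observation.

    With baseline=None there is one indicator per fabric (no-intercept model).
    With a baseline, that fabric's indicator is dropped to make room for an intercept.
    """
    return [1 if fabric == k else 0 for k in sorted(burn) if k != baseline]


# Model without intercept: each beta is a fabric-specific mean.
cell_means = {k: mean(v) for k, v in burn.items()}

# Equivalent model with intercept: beta0 is the fabric 1 mean,
# and the other betas are offsets from it.
beta0 = cell_means[1]
offsets = {k: cell_means[k] - beta0 for k in (2, 3, 4)}

print("x's for a fabric 2 observation, no intercept:", indicators(2))
print("x's for a fabric 2 observation, baseline 1:  ", indicators(2, baseline=1))
for k in (2, 3, 4):
    print(f"fabric {k}: mean = {cell_means[k]:.2f} = {beta0:.2f} + ({offsets[k]:.2f})")
```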

Page 29:

The regression equation is
Burn Time = 16.9 - 5.90 Fabric 2 - 6.35 Fabric 3 - 5.85 Fabric 4

Predictor       Coef      SE Coef       T        P
Constant      16.8500     0.5806     29.02    0.000
Fabric 2      -5.9000     0.8211     -7.19    0.000
Fabric 3      -6.3500     0.8211     -7.73    0.000
Fabric 4      -5.8500     0.8211     -7.12    0.000

S = 1.161   R-Sq = 87.2%   R-Sq(adj) = 83.9%

Analysis of Variance (Note that this is the same as before!)

Source            DF       SS        MS        F        P
Regression         3    109.810    36.603    27.15    0.000
Residual Error    12     16.180     1.348
Total             15    125.990

95% CI's for fabric means:

(Point estimate of mean) +/- $t_{0.025,\,12} \sqrt{MSE / 4}$

Fabric 2: (16.85 - 5.90) +/- 2.179 $\sqrt{1.348 / 4}$

10.95 +/- 2.179 (0.5806)

(0.5806 is the std dev of the estimate of $\beta_0 + \beta_2$.)

(As usual, we're assuming the errors are independent and normal with constant variance.)
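The same interval can be worked out for every fabric at once. A short Python sketch using only numbers from the output above; the critical value 2.179 is the $t_{0.025,\,12}$ used on the slide, and 4 is the number of observations per fabric (16 observations over 4 fabrics).

```python
import math

intercept = 16.85                          # estimated mean for fabric 1
offsets = {2: -5.90, 3: -6.35, 4: -5.85}   # coefficients for fabrics 2 to 4
mse = 1.348                                # residual mean square
n_per_fabric = 4
T_CRIT = 2.179                             # t_{0.025, 12}, as on the slide

se_mean = math.sqrt(mse / n_per_fabric)    # std dev of each estimated fabric mean

means = {1: intercept}
means.update({k: intercept + d for k, d in offsets.items()})

intervals = {k: (m - T_CRIT * se_mean, m + T_CRIT * se_mean) for k, m in means.items()}

for k in sorted(means):
    lo, hi = intervals[k]
    print(f"Fabric {k}: mean = {means[k]:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```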

Page 30:

Back to cheese

• Suppose the cheeses come from two regions of Australia and we want to include that info in the model:

• $\text{Taste}_i = \beta_0 + \beta_1 \text{acetic}_i + \beta_2 \text{H2S}_i + \beta_3 \text{Region}_i + \text{error}_i$, $i = 1, \dots, n = 30$

$\text{Region}_i$ = 1 if the ith sample comes from region 1 and 0 otherwise. $\beta_3$ is the effect of region 1…

If $b_3 > 0$, then region 1 tends to increase the mean score (and vice versa).

