Page 1:

Stat 112: Notes 1

• Main topics of course:
– Simple Regression
– Multiple Regression
– Analysis of Variance
– Chapters 3-9 of textbook

• Readings for Notes 1: Chapter 3.1-3.2. Also, Chapter 2 contains review of material from Stat 111.

Page 2:

Monitoring Tiger Prey Abundance

• The Siberian (Amur) tiger is a subspecies of tiger found in the Russian Far East.

• Tigers in general are in trouble. At the beginning of the 20th century, there were around 100,000 tigers. Today, there are fewer than 6,000 tigers in the world, and only about 400 of them are Siberian tigers.

• The Sika deer is a staple of the Siberian tiger diet. It is also hunted by the local people.

• To balance the needs of the local people and at the same time ensure that there is adequate prey for the tigers, local government managers need accurate estimates of the number of Sika deer in an area.

Page 3:

Estimating Deer Abundance
• Two Methods:

– Counting Method: The number of deer in a plot can be determined accurately but with considerable time and work. It requires 3-5 expert field workers to monitor the plot and to classify whether deer tracks are moving into or out of the plot.

– Total Tracks Counted: Count the total number of deer tracks along transects (fixed paths that the observer walks along). Total tracks counted requires much less work.

• Can total tracks counted be used to accurately predict the density (per km^2) of deer obtained from the counting method?

Page 4:

Deer Density vs. Tracks Counted

• A study was done in which density was determined by expert field workers over a range of plots.

• How would we estimate the deer density if we counted 1 track per square kilometer?

[Figure: scatterplot of Density (per km^2) vs. Tracks Counted (per km^2)]

Page 5:

Simple Regression Model

• How would we estimate the deer density if we counted 1 track per square kilometer?

• Idea: Estimate the mean deer density when we count 1 track per square kilometer.

• Simple Regression Setup:
– Y = outcome (density per km^2)
– X = explanatory variable (tracks counted per km^2)
– Note: the outcome is sometimes called the dependent variable, and the explanatory variable is sometimes called the independent or predictor variable.

• Simple Regression Model: Model for the mean (expected value) of Y given X, denoted μY|X or E(Y|X).

Page 6:

Simple Linear Regression Model
Simple Linear Regression Model: E(Y|X) = β0 + β1X

β1 = Slope = Change in mean of Y for each one-unit change in X = change in mean density for each one-unit increase in tracks counted

β0 = Intercept = Mean of Y for X = 0 = mean density for zero tracks counted (X = 0)

[Figure: scatterplot of Density (per km^2) vs. Tracks Counted (per km^2) with linear fit]

Linear Fit Density (per km^2) = -0.053472 + 1.9085392*Tracks Counted (per km^2)

Page 7:

Using the Simple Linear Regression Model for Estimating Deer Density

Mean density when Tracks = 0.5: E(Y|X=0.5) = -0.053 + 1.909*0.5 = 0.90
Mean density when Tracks = 1: E(Y|X=1) = -0.053 + 1.909*1 = 1.86
Mean density when Tracks = 1.5: E(Y|X=1.5) = -0.053 + 1.909*1.5 = 2.81
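These calculations are easy to reproduce in code. The sketch below is a minimal illustration that plugs values into the fitted line reported on this slide, using the rounded coefficients; the function name is chosen only for illustration.

```python
# Minimal sketch: estimated mean deer density from the fitted line above
# (rounded coefficients -0.053 and 1.909, as used on this slide).
def estimated_mean_density(tracks, b0=-0.053, b1=1.909):
    """Return the estimated mean density E(Y|X) = b0 + b1*X."""
    return b0 + b1 * tracks

for tracks in (0.5, 1.0, 1.5):
    print(f"Tracks = {tracks}: estimated mean density = {estimated_mean_density(tracks):.2f}")
# Prints approximately 0.90, 1.86, and 2.81, matching the values above.
```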

[Figure: scatterplot of Density (per km^2) vs. Tracks Counted (per km^2) with linear fit]

Linear Fit Density (per km^2) = -0.053472 + 1.9085392*Tracks Counted (per km^2)

Page 8:

Estimating the Slope and Intercept

[Figure: scatterplot of Density (per km^2) vs. Tracks Counted (per km^2)]

For the population, the line that best predicts Y based on X (in terms of minimizing squared prediction error) is the line E(Y|X) = β0 + β1X.

The Least Squares Principle: We want to choose the line for the sample data (X1, Y1), ..., (Xn, Yn) that minimizes the sum of squared prediction errors in the sample. The least squares line is the line Ê(Y|X) = b0 + b1X that minimizes the sum of squared prediction errors in the data, Σ_{i=1}^{n} (Yi - b0 - b1Xi)^2.
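As a minimal sketch of the least squares principle (with made-up data, not the actual deer sample), the code below computes b0 and b1 from the standard closed-form formulas and cross-checks them against numpy's degree-1 polynomial fit.

```python
import numpy as np

# Hypothetical sample data (illustrative only, not the deer data).
x = np.array([0.3, 0.8, 1.2, 1.9, 2.5])
y = np.array([0.5, 1.4, 2.3, 3.5, 4.8])

# Closed-form least squares estimates minimizing sum_i (y_i - b0 - b1*x_i)^2:
# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Cross-check with numpy's degree-1 polynomial fit (also least squares).
b1_np, b0_np = np.polyfit(x, y, 1)

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"np.polyfit: b1 = {b1_np:.4f}, b0 = {b0_np:.4f}")
```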

Page 9:

Simple Linear Regression Using JMP

• Use Analyze, Fit Y by X. Put response variable in Y and explanatory variable in X (make sure X is continuous by clicking on the X column, clicking Cols and Column Info and checking that the Modeling Type is Continuous).

• Click on Fit Line under the red triangle next to Bivariate Fit of Y by X.
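For readers who want the same fit outside JMP, here is a rough Python analogue under the assumption that the two columns are available as arrays; the data values below are placeholders, not the actual study data.

```python
import numpy as np
from scipy import stats

# Placeholder arrays standing in for the Tracks Counted (X) and Density (Y) columns.
tracks = np.array([0.4, 0.9, 1.3, 1.8, 2.4, 2.9])
density = np.array([0.7, 1.6, 2.5, 3.4, 4.5, 5.5])

# Simple linear regression of density on tracks (least squares fit).
fit = stats.linregress(tracks, density)
print(f"intercept = {fit.intercept:.4f}, slope = {fit.slope:.4f}, R^2 = {fit.rvalue**2:.4f}")
```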

Page 10:

Residuals
When we use the mean density E(Y|X) (or estimated mean density Ê(Y|X) = b0 + b1X) to predict the true density in a plot based on the tracks X, we will typically make some error. The error for a given observation (Xi, Yi) is

ei = Yi - Ê(Y|Xi) = Yi - (b0 + b1Xi)

ei = residual for observation i

For observation 85, X85 = 2.391 and Y85 = 5.571:
Ê(Y|X = 2.391) = -0.053 + 1.909*2.391 = 4.511
e85 = 5.571 - 4.511 = 1.060

[Figure: scatterplot of Density (per km^2) vs. Tracks Counted (per km^2) with observations 26 and 85 labeled]

For observation 26, X26 = 2.174 and Y26 = 2.612:
Ê(Y|X = 2.174) = -0.053 + 1.909*2.174 = 4.097
e26 = 2.612 - 4.097 = -1.485
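The two residual calculations above can be verified with a few lines of code; the sketch below simply reuses the rounded coefficients and the (Xi, Yi) values reported on this slide.

```python
# Residual e_i = Y_i - (b0 + b1*X_i), using the rounded coefficients from the slide.
b0, b1 = -0.053, 1.909

observations = {85: (2.391, 5.571), 26: (2.174, 2.612)}  # i: (X_i, Y_i)
for i, (x_i, y_i) in observations.items():
    fitted = b0 + b1 * x_i
    residual = y_i - fitted
    print(f"Observation {i}: fitted = {fitted:.3f}, residual = {residual:.3f}")
# Observation 85: fitted ≈ 4.511, residual ≈ 1.060
# Observation 26: fitted ≈ 4.097, residual ≈ -1.485
```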

Page 11:

Root Mean Square ErrorThe root mean square error (RMSE) is the “average” absolute error (absolute residual). The book calls the RMSE the standard error of regression. The RMSE is obtained by the formula

RMSE = sqrt( (1/(n-2)) * Σ_{i=1}^{n} (Yi - Ê(Y|Xi))^2 ) = sqrt( (1/(n-2)) * Σ_{i=1}^{n} (Yi - (b0 + b1Xi))^2 )

The root mean square error is found in the Summary of Fit in the JMP output for a simple linear regression analysis.

Bivariate Fit of Density (per km^2) By Tracks Counted (per km^2)
Linear Fit: Density (per km^2) = -0.053472 + 1.9085392*Tracks Counted (per km^2)
Summary of Fit
RSquare 0.901652
RSquare Adj 0.900649
Root Mean Square Error 0.409303
Mean of Response 1.821665
Observations (or Sum Wgts) 100

RMSE = 0.41. On average, we will make an absolute error of about 0.41 when we predict the density of deer based on the tracks.

Technical Note: RMSE^2 is the average squared residual. The RMSE is close to, but not exactly equal to, the average absolute residual.
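A minimal sketch of the RMSE formula, using a handful of hypothetical residuals rather than the actual 100 plots, shows how the n - 2 divisor enters and how the RMSE compares with the average absolute residual mentioned in the technical note.

```python
import numpy as np

# Hypothetical residuals from a simple linear regression fit (illustrative only).
residuals = np.array([0.3, -0.5, 0.2, 0.6, -0.4, -0.2])
n = len(residuals)

rmse = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # note the n - 2 divisor from the formula above
mean_abs_residual = np.mean(np.abs(residuals))

print(f"RMSE = {rmse:.3f}, average absolute residual = {mean_abs_residual:.3f}")
# The two are similar but not identical, consistent with the technical note.
```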

Page 12:

Application of Deer Density Regression for Tiger Conservation

• Government managers do not need to know the density of deer in an area exactly -- they only need a reasonable estimate.

• An average absolute error of 0.41 in the estimate of density based on tracks counted is tolerable.

• Because the Siberian tiger’s habitat is so vast, it would be enormously costly to have expert field workers count the deer in each area of interest.

• Counting tracks and using regression to estimate the deer density based on the tracks counted provides a basis for estimating deer density across the different areas of the tiger habitat. This enables government managers to balance the needs of the local people and tigers.

Page 13:

Poverty and MDs
• Do states with more poverty tend to have fewer doctors? Which states have an unusually high number of doctors given their poverty rate, or an unusually low number of doctors given their poverty rate?

[Figure: Bivariate Fit of MDs per 100,000 By Poverty Percent, with California, New York, and Pennsylvania labeled]

Page 14:

Simple Linear Regression Model
Y = M.D.'s per 100,000 residents
X = Poverty percent

E(Y|X) = β0 + β1X

β1 = Slope = Change in mean of Y for each one-unit change in X = change in mean M.D.'s per 100,000 residents for a 1% increase in poverty

β0 = Intercept = Mean of Y for X = 0 = mean M.D.'s per 100,000 residents for 0% poverty.

[Figure: scatterplot of MDs per 100,000 vs. Poverty Percent with linear fit; California, New York, and Pennsylvania labeled]

Linear Fit: MDs per 100,000 = 286.84208 - 4.3292991*Poverty Percent
The mean MDs per 100,000 residents is estimated to decrease by about 4.3 for an increase of 1 in the poverty percentage.
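Plugging poverty percentages into the reported fit makes the roughly 4.3-per-point decrease concrete; the sketch below uses the coefficients from the JMP output above, and the chosen poverty values are just examples.

```python
# Estimated mean MDs per 100,000 residents from the fitted line above.
def estimated_mds(poverty_percent, b0=286.84208, b1=-4.3292991):
    return b0 + b1 * poverty_percent

for p in (10, 11, 15):
    print(f"Poverty = {p}%: estimated MDs per 100,000 = {estimated_mds(p):.1f}")
# Each additional percentage point of poverty lowers the estimate by about 4.3 MDs per 100,000.
```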

Page 15:

Residuals in JMP
• Saving the residuals in JMP:
– To save the residuals, after fitting the line using Fit Y by X, click the red triangle next to Linear Fit and click Save Residuals. A column with the residuals is created in the data spreadsheet.
• Sorting the residuals:
– Click the Tables menu, then click Sort, click the name of the column with the residuals, click By, and then click Sort.
• Labeling observations:
– To label an observation in the graph, click the row with the observation and then click the Rows menu and Label. By default, JMP will use the observation number to label the observation. To make JMP use the state to label the observation, click the state column, click the Cols menu, and click Label.

Page 16:

Residuals for Poverty-MD Data
Five Largest Positive Residuals:
1. Massachusetts +172.35
2. New York +168.13
3. Maryland +120.06
4. Connecticut +103.52
5. Rhode Island +100.51
These states all have more doctors per resident than their poverty rate would predict.
Five Most Negative Residuals:
1. Alaska -82.61
2. Iowa -76.18
3. Idaho -72.66
4. Nevada -66.22
5. Wyoming -64.32
These states all have fewer doctors per resident than their poverty rate would predict.
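Outside JMP, the same ranking of residuals can be done with pandas; the data frame below is a hypothetical stand-in for the saved residual column, populated with a few of the values listed above.

```python
import pandas as pd

# Hypothetical stand-in for the JMP table after saving residuals (a few states only).
resid = pd.DataFrame({
    "state": ["Massachusetts", "New York", "Maryland", "Alaska", "Iowa", "Idaho"],
    "residual": [172.35, 168.13, 120.06, -82.61, -76.18, -72.66],
})

# Sort from most positive (more MDs than predicted) to most negative (fewer than predicted).
print(resid.sort_values("residual", ascending=False))
```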

Page 17:

Summary for Notes 1

• Regression Model: Model for the mean of an outcome Y given a value of the explanatory variable X, E(Y|X).

• Simple Linear Regression Model: E(Y|X) = β0 + β1X
• Regression Models are useful for:
– Predicting Y from X
– Understanding the association between Y and X
– Identifying observations that are unusual in their relationship between Y and X (large magnitude of residuals)

