Unit 6: Simple Linear Regression
Lecture: Introduction to SLR

Statistics 101

Thomas Leininger

June 17, 2013

Outline

1 Recap: Chi-square test of independence
  Ball throwing
  Expected counts in two-way tables

2 Modeling numerical variables

3 Correlation

4 Fitting a line by least squares regression
  Residuals
  Best line
  The least squares line
  Prediction & extrapolation
  Conditions for the least squares line
  R²
  Categorical explanatory variables



Recap: Chi-square test of independence Ball throwing

Does ball-throwing ability vary by major?

Going back to our carnival game, should I be worried if a bus-load of public policy majors show up at my booth?

The hypotheses are:

H0: Ball-throwing ability and major are independent. Ball-throwing skills do not vary by major.

HA: Ball-throwing ability and major are dependent. Ball-throwing skills vary by major.

(Image: archery target, https://commons.wikimedia.org/wiki/File:Archery_Target_80cm.svg)

Major           Public Policy   Undeclared   Other   Total
Hit target                 40           10      10      60
Missed target              20           30      30      80
Total                      60           40      40     140

Note: I multiplied the numbers by 10 to meet our expected cell counts conditions.


Recap: Chi-square test of independence Ball throwing

Chi-square test of independence

The test statistic is calculated as

χ²_df = Σ_{i=1}^{k} (O − E)² / E,  with df = (R − 1) × (C − 1),

where k is the number of cells, R is the number of rows, and C is the number of columns.

Note: We calculate df differently for one-way and two-way tables.

Expected counts in two-way tables

Expected Count = (row total) × (column total) / (table total)



Recap: Chi-square test of independence Expected counts in two-way tables

Expected counts in two-way tables

Major           Public Policy   Undeclared   Other   Total
Hit target                 40           10      10      60
Missed target              20           30      30      80
Total                      60           40      40     140

df = (R − 1) × (C − 1) = (2 − 1) × (3 − 1) = 2

χ²_df = Σ_{i=1}^{k} (O − E)² / E = (40 − 25.7)² / 25.7 + · · · + (30 − 22.857)² / 22.857 = 24.306

p-value: smaller than 0.001

Upper tail    0.3    0.2    0.1    0.05   0.02   0.01   0.005   0.001
df = 1       1.07   1.64   2.71   3.84   5.41   6.63    7.88   10.83
df = 2       2.41   3.22   4.61   5.99   7.82   9.21   10.60   13.82
df = 3       3.66   4.64   6.25   7.81   9.84  11.34   12.84   16.27
df = 4       4.88   5.99   7.78   9.49  11.67  13.28   14.86   18.47
df = 5       6.06   7.29   9.24  11.07  13.39  15.09   16.75   20.52

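The same test can be reproduced in R. This is a minimal sketch: the 2x3 table is typed in from the slide, and chisq.test() returns the expected counts, the chi-square statistic, and the p-value (no continuity correction is applied for tables larger than 2x2).

    # Observed counts: rows = hit / missed target, columns = major
    throws <- matrix(c(40, 10, 10,
                       20, 30, 30),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(result = c("hit", "missed"),
                                     major  = c("public policy", "undeclared", "other")))

    test <- chisq.test(throws)
    test$expected   # expected counts, e.g. 60 * 60 / 140 = 25.7 for the first cell
    test            # X-squared = 24.3, df = 2, p-value < 0.001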


Modeling numerical variables

Modeling numerical variables

So far we have worked with:
1 numerical variable (Z, T)
1 categorical variable (χ²)
1 numerical and 1 categorical variable (2-sample Z/T, ANOVA)
2 categorical variables (χ² test for independence)

Next up: relationships between two numerical variables, as well as modeling numerical response variables using a numerical or categorical explanatory variable.

Wed–Friday: modeling numerical variables using many explanatory variables at once.


Modeling numerical variables

Poverty vs. HS graduate rate

The scatterplot below shows the relationship between the HS graduation rate in all 50 US states and DC and the % of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

[Scatterplot: % in poverty vs. % HS grad]

Response? % in poverty

Explanatory? % HS grad

Relationship? linear, negative, moderately strong



Correlation

Quantifying the relationship

Correlation describes the strength of the linear association between two variables.

It takes values between −1 (perfect negative) and +1 (perfect positive).

A value of 0 indicates no linear association.

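The correlation coefficient is a one-line computation in R. A minimal sketch, assuming a data frame called states with columns hs_grad and poverty (hypothetical names for the state-level data used in these slides):

    # scatterplot and correlation for the poverty vs. HS graduation rate example
    plot(poverty ~ hs_grad, data = states,
         xlab = "% HS grad", ylab = "% in poverty")
    cor(states$hs_grad, states$poverty)   # about -0.75 for these data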

Correlation

Guessing the correlation

Question

Which of the following is the best guess for the correlation between % in poverty and % HS grad?

(a) 0.6
(b) -0.75
(c) -0.1
(d) 0.02
(e) -1.5

[Scatterplot: % in poverty vs. % HS grad]


Correlation

Guessing the correlation

Question

Which of the following is the best guess for the correlation between % in poverty and % female householder (no husband present)?

(a) 0.1
(b) -0.6
(c) -0.4
(d) 0.9
(e) 0.5

[Scatterplot: % in poverty vs. % female householder, no husband present]


Correlation

Assessing the correlation

Question

Which of the following has the strongest correlation, i.e. a correlation coefficient closest to +1 or -1?

[Four scatterplots labeled (a), (b), (c), and (d)]

Answer: (b). Correlation measures linear association.



Fitting a line by least squares regression Residuals

Residuals

Residuals are the leftovers from the model fit: Data = Fit + Residual

[Scatterplot: % in poverty vs. % HS grad with the least squares line]


Residuals (cont.)

Residual
A residual is the difference between the observed and the predicted value of y:

e_i = y_i − ŷ_i

[Scatterplot: % in poverty vs. % HS grad with the least squares line; the residuals for DC and RI are marked]

The % living in poverty in DC is 5.44% more than predicted.

The % living in poverty in RI is 4.16% less than predicted.



Fitting a line by least squares regression Best line

A measure for the best line

We want a line that has small residuals:

1 Option 1: Minimize the sum of magnitudes (absolute values) of the residuals:

  |e1| + |e2| + · · · + |en|

2 Option 2: Minimize the sum of squared residuals (least squares):

  e1² + e2² + · · · + en²

(A short sketch below compares the two criteria on the same residuals.)

Why least squares?
1 Most commonly used
2 Easier to compute by hand and using software
3 In many applications, a residual twice as large as another is more than twice as bad

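Both criteria are easy to compute once the residuals of a candidate line are in hand. A minimal sketch, assuming numeric vectors x and y and a candidate intercept b0 and slope b1 (hypothetical names):

    e <- y - (b0 + b1 * x)   # residuals of the candidate line
    sum(abs(e))              # Option 1: sum of absolute residuals
    sum(e^2)                 # Option 2: sum of squared residuals (what least squares minimizes)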

Fitting a line by least squares regression Best line

The least squares line

ŷ = β0 + β1 x

where ŷ is the predicted response, β0 is the intercept, β1 is the slope, and x is the explanatory variable.

Notation:
Intercept: parameter β0, point estimate b0
Slope: parameter β1, point estimate b1


Fitting a line by least squares regression The least squares line

Given...

[Scatterplot: % in poverty vs. % HS grad]

               % HS grad (x)    % in poverty (y)
mean           x̄ = 86.01        ȳ = 11.35
sd             sx = 3.73         sy = 3.1
correlation    R = −0.75


Slope

The slope of the regression can be calculated as

b1 = (sy / sx) × R

In context:

b1 = (3.1 / 3.73) × (−0.75) = −0.62

Interpretation: For each percentage point increase in the HS graduation rate, we would expect the % living in poverty to decrease on average by 0.62 percentage points.


Intercept

The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that the regression line always passes through (x̄, ȳ):

b0 = ȳ − b1 x̄

[Scatterplot: % in poverty vs. % HS grad with the x-axis extended to 0, showing where the least squares line crosses the y-axis]

b0 = 11.35 − (−0.62) × 86.01 = 64.68

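Both point estimates follow directly from the summary statistics given earlier. A quick check in R, rounding the slope to two decimals as the slide does:

    R_xy  <- -0.75;  s_x <- 3.73;  s_y <- 3.1
    x_bar <- 86.01;  y_bar <- 11.35

    b1 <- round((s_y / s_x) * R_xy, 2)   # slope: -0.62
    b0 <- y_bar - b1 * x_bar             # intercept: 11.35 - (-0.62) * 86.01 = 64.68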

Fitting a line by least squares regression The least squares line

Interpret b0

Question

How do we interpret the intercept? (b0 = 64.68)

[Scatterplot: % in poverty vs. % HS grad with the x-axis extended to 0, showing the intercept]

States with no HS graduates are expected on average to have 64.68% of residents living below the poverty line.


Fitting a line by least squares regression The least squares line

Recap: Interpretation of slope and intercept

Intercept: When x = 0, y is expected to equal the value of the intercept.

Slope: For each unit increase in x, y is expected to increase/decrease on average by the value of the slope.


Regression line

predicted % in poverty = 64.68 − 0.62 × % HS grad

[Scatterplot: % in poverty vs. % HS grad with the least squares line]


Fitting a line by least squares regression Prediction & extrapolation

Prediction

Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction: simply plug the value of x into the linear model equation.

There will be some uncertainty associated with the predicted value; we'll talk about this next time.

[Scatterplot: % in poverty vs. % HS grad with the least squares line]

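In practice the line is fit with lm(), and a prediction is just the fitted equation evaluated at a new x. A minimal sketch, again assuming the hypothetical states data frame:

    m <- lm(poverty ~ hs_grad, data = states)
    coef(m)   # intercept and slope (about 64.68 and -0.62 here)

    # predicted % in poverty for a state with an 82% HS graduation rate
    predict(m, newdata = data.frame(hs_grad = 82))
    # plugging in an x far outside the observed range would be extrapolation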

Extrapolation

Applying a model estimate to values outside of the realm of the original data is called extrapolation.

Sometimes the intercept might be an extrapolation.

[Scatterplot: % in poverty vs. % HS grad with the x-axis extended to 0; the intercept lies well outside the range of the observed data]

Examples of extrapolation

1 http://www.colbertnation.com/the-colbert-report-videos/269929

2 Sprinting: [figure not shown]


Fitting a line by least squares regression Conditions for the least squares line

Conditions for the least squares line

1 Linearity

2 Nearly normal residuals

3 Constant variability


Fitting a line by least squares regression Conditions for the least squares line

Conditions: (1) Linearity

The relationship between the explanatory and the response variable should be linear.

Methods for fitting a model to non-linear relationships exist, but are beyond the scope of this class.

Check using a scatterplot of the data, or a residuals plot.

[Example scatterplots of y vs. x with the corresponding residual plots]


Fitting a line by least squares regression Conditions for the least squares line

Anatomy of a residuals plot

[Scatterplot of % in poverty vs. % HS grad with the least squares line, and below it the corresponding residuals plot; RI and DC are highlighted]

RI:
% HS grad = 81, % in poverty = 10.3
predicted % in poverty = 64.68 − 0.62 × 81 = 14.46
e = % in poverty − predicted % in poverty = 10.3 − 14.46 = −4.16

DC:
% HS grad = 86, % in poverty = 16.8
predicted % in poverty = 64.68 − 0.62 × 86 = 11.36
e = % in poverty − predicted % in poverty = 16.8 − 11.36 = 5.44

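The two residuals highlighted on the plot can be verified directly from the fitted equation:

    pred_RI <- 64.68 - 0.62 * 81    # 14.46; RI's observed % in poverty is 10.3
    10.3 - pred_RI                  # residual = -4.16

    pred_DC <- 64.68 - 0.62 * 86    # 11.36; DC's observed % in poverty is 16.8
    16.8 - pred_DC                  # residual = 5.44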

Fitting a line by least squares regression Conditions for the least squares line

Conditions: (2) Nearly normal residuals

The residuals should be nearly normal.

This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data.

Check using a histogram or a normal probability plot of the residuals.

[Histogram of the residuals and a normal Q-Q plot of the residuals]


Fitting a line by least squares regression Conditions for the least squares line

Conditions: (3) Constant variability

[Scatterplot with the least squares line and the corresponding residuals plot]

The variability of points around the least squares line should be roughly constant.

This implies that the variability of residuals around the 0 line should be roughly constant as well.

Also called homoscedasticity.

Check using a residuals plot.

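All three conditions are usually checked from the residuals of the fitted model. A minimal sketch with base R graphics, assuming the model m fit above:

    e <- resid(m)

    plot(fitted(m), e)   # linearity & constant variability: look for no pattern
    abline(h = 0)        # and a roughly even spread around the 0 line
    hist(e)              # nearly normal residuals
    qqnorm(e); qqline(e) # normal probability (Q-Q) plot of the residuals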

Fitting a line by least squares regression Conditions for the least squares line

Checking conditions

Question

What condition is this linear model obviously violating?

(a) Constant variability
(b) Linear relationship
(c) Non-normal residuals
(d) No extreme outliers

[Scatterplot of y vs. x with a fitted line, and the corresponding residuals plot]


Fitting a line by least squares regression Conditions for the least squares line

Checking conditions

Question

What condition is this linear model obviously violating?

(a) Constant variability
(b) Linear relationship
(c) Non-normal residuals
(d) No extreme outliers

[A different scatterplot of y vs. x with a fitted line, and the corresponding residuals plot]



Fitting a line by least squares regression R2

R²

The strength of the fit of a linear model is most commonly evaluated using R².

R² is calculated as the square of the correlation coefficient.

It tells us what percent of the variability in the response variable is explained by the model.

The remainder of the variability is explained by variables not included in the model.

For the model we've been working with, R² = (−0.75)² ≈ 0.56.

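R² can be read off the model summary or computed directly as the squared correlation coefficient:

    summary(m)$r.squared                     # R^2 reported for the fitted model
    cor(states$hs_grad, states$poverty)^2    # the same value: the squared correlation
                                             # (about 0.56 when R = -0.75)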

Fitting a line by least squares regression R2

Interpretation of R2

Question

Which of the following is the correct interpretation of R = −0.75, R² ≈ 0.56?

(a) 56% of the variability in the % of HS graduates among the 51 states is explained by the model.

(b) 56% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

(c) 56% of the time % HS graduates predict % living in poverty correctly.

(d) 44% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

[Scatterplot: % in poverty vs. % HS grad]



Fitting a line by least squares regression Categorical explanatory variables

Poverty vs. region (east, west)

predicted poverty = 11.17 + 0.38 × west

Explanatory variable: region, reference level: east

Intercept: The estimated average poverty percentage in eastern states is 11.17%. This is the value we get if we plug in 0 for the explanatory variable.

Slope: The estimated average poverty percentage in western states is 0.38% higher than in eastern states, i.e. 11.17 + 0.38 = 11.55%. This is the value we get if we plug in 1 for the explanatory variable.

This is called using a dummy variable.

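The same lm() call handles a categorical explanatory variable: R converts the factor to a 0/1 dummy variable behind the scenes, using its first level as the reference. A minimal sketch, assuming states$region is a factor with levels "east" and "west" (hypothetical column name):

    m_region <- lm(poverty ~ region, data = states)
    coef(m_region)
    # (Intercept) ~ 11.17 : estimated average % in poverty for eastern states
    # regionwest  ~  0.38 : how much higher the western average is estimated to be

    # estimated average % in poverty for western states: 11.17 + 0.38 = 11.55
    predict(m_region, newdata = data.frame(region = "west"))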