Unit 6: Simple Linear Regression
Lecture: Introduction to SLR

Statistics 101

Thomas Leininger

June 17, 2013

Outline

1 Recap: Chi-square test of independence
  Ball throwing
  Expected counts in two-way tables

2 Modeling numerical variables

3 Correlation

4 Fitting a line by least squares regression
  Residuals
  Best line
  The least squares line
  Prediction & extrapolation
  Conditions for the least squares line
  R²
  Categorical explanatory variables



Recap: Chi-square test of independence Ball throwing

Does ball-throwing ability vary by major?

Going back to our carnival game, should I be worried if a bus-load of public policy majors show up at my booth?

The hypotheses are:

H0: Ball-throwing ability and major are independent. Ball-throwing skills do not vary by major.

HA: Ball-throwing ability and major are dependent. Ball-throwing skills vary by major.

(Image: archery target, https://commons.wikimedia.org/wiki/File:Archery_Target_80cm.svg)

Major           Public Policy   Undeclared   Other   Total
Hit target                 40           10      10      60
Missed target              20           30      30      80
Total                      60           40      40     140

Note: I multiplied the numbers by 10 to meet our expected cell counts conditions.


Recap: Chi-square test of independence Ball throwing

Chi-square test of independence

The test statistic is calculated as

χ²_df = Σ_{i=1}^{k} (O − E)² / E,  with df = (R − 1) × (C − 1),

where k is the number of cells, R is the number of rows, and C is the number of columns.

Note: We calculate df differently for one-way and two-way tables.

Expected counts in two-way tables

Expected Count = (row total) × (column total) / (table total)



Recap: Chi-square test of independence Expected counts in two-way tables

Expected counts in two-way tables

Major           Public Policy   Undeclared   Other   Total
Hit target                 40           10      10      60
Missed target              20           30      30      80
Total                      60           40      40     140

df = (R − 1) × (C − 1) = (2 − 1) × (3 − 1) = 2

χ²_df = Σ_{i=1}^{k} (O − E)² / E = (40 − 25.7)² / 25.7 + · · · + (30 − 22.857)² / 22.857 = 24.306

p-value: smaller than 0.001

Upper tail    0.3    0.2    0.1    0.05   0.02   0.01   0.005   0.001
df = 1       1.07   1.64   2.71   3.84   5.41   6.63    7.88   10.83
df = 2       2.41   3.22   4.61   5.99   7.82   9.21   10.60   13.82
df = 3       3.66   4.64   6.25   7.81   9.84  11.34   12.84   16.27
df = 4       4.88   5.99   7.78   9.49  11.67  13.28   14.86   18.47
df = 5       6.06   7.29   9.24  11.07  13.39  15.09   16.75   20.52

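The same test can be reproduced in R. This is a minimal sketch: the 2x3 table is typed in from the slide, and chisq.test() returns the expected counts, the chi-square statistic, and the p-value (no continuity correction is applied for tables larger than 2x2).

    # Observed counts: rows = hit / missed target, columns = major
    throws <- matrix(c(40, 10, 10,
                       20, 30, 30),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(result = c("hit", "missed"),
                                     major  = c("public policy", "undeclared", "other")))

    test <- chisq.test(throws)
    test$expected   # expected counts, e.g. 60 * 60 / 140 = 25.7 for the first cell
    test            # X-squared = 24.3, df = 2, p-value < 0.001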


Modeling numerical variables

Modeling numerical variables

So far we have worked with:
1 numerical variable (Z, T)
1 categorical variable (χ²)
1 numerical and 1 categorical variable (2-sample Z/T, ANOVA)
2 categorical variables (χ² test for independence)

Next up: relationships between two numerical variables, as well as modeling numerical response variables using a numerical or categorical explanatory variable.

Wed–Friday: modeling numerical variables using many explanatory variables at once.


Modeling numerical variables

Poverty vs. HS graduate rate

The scatterplot below shows the relationship between the HS graduation rate in all 50 US states and DC and the % of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

[Scatterplot: % in poverty vs. % HS grad]

Response? % in poverty

Explanatory? % HS grad

Relationship? linear, negative, moderately strong



Correlation

Quantifying the relationship

Correlation describes the strength of the linear association between two variables.

It takes values between −1 (perfect negative) and +1 (perfect positive).

A value of 0 indicates no linear association.

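The correlation coefficient is a one-line computation in R. A minimal sketch, assuming a data frame called states with columns hs_grad and poverty (hypothetical names for the state-level data used in these slides):

    # scatterplot and correlation for the poverty vs. HS graduation rate example
    plot(poverty ~ hs_grad, data = states,
         xlab = "% HS grad", ylab = "% in poverty")
    cor(states$hs_grad, states$poverty)   # about -0.75 for these data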

Correlation

Guessing the correlation

Question

Which of the following is the best guess for the correlation between % in poverty and % HS grad?

(a) 0.6
(b) -0.75
(c) -0.1
(d) 0.02
(e) -1.5

[Scatterplot: % in poverty vs. % HS grad]


Correlation

Guessing the correlation

Question

Which of the following is the best guess for the correlation between % in poverty and % female householder (no husband present)?

(a) 0.1
(b) -0.6
(c) -0.4
(d) 0.9
(e) 0.5

[Scatterplot: % in poverty vs. % female householder, no husband present]


Correlation

Assessing the correlation

Question

Which of the following has the strongest correlation, i.e. a correlation coefficient closest to +1 or -1?

[Four scatterplots labeled (a), (b), (c), and (d)]

Answer: (b). Correlation measures linear association.



Fitting a line by least squares regression Residuals

Residuals

Residuals are the leftovers from the model fit: Data = Fit + Residual

[Scatterplot: % in poverty vs. % HS grad with the least squares line]


Residuals (cont.)

Residual
A residual is the difference between the observed and the predicted value of y:

e_i = y_i − ŷ_i

[Scatterplot: % in poverty vs. % HS grad with the least squares line; the residuals for DC and RI are marked]

The % living in poverty in DC is 5.44% more than predicted.

The % living in poverty in RI is 4.16% less than predicted.



Fitting a line by least squares regression Best line

A measure for the best line

We want a line that has small residuals:

1 Option 1: Minimize the sum of magnitudes (absolute values) of the residuals:

  |e1| + |e2| + · · · + |en|

2 Option 2: Minimize the sum of squared residuals (least squares):

  e1² + e2² + · · · + en²

(A short sketch below compares the two criteria on the same residuals.)

Why least squares?
1 Most commonly used
2 Easier to compute by hand and using software
3 In many applications, a residual twice as large as another is more than twice as bad

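Both criteria are easy to compute once the residuals of a candidate line are in hand. A minimal sketch, assuming numeric vectors x and y and a candidate intercept b0 and slope b1 (hypothetical names):

    e <- y - (b0 + b1 * x)   # residuals of the candidate line
    sum(abs(e))              # Option 1: sum of absolute residuals
    sum(e^2)                 # Option 2: sum of squared residuals (what least squares minimizes)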

Fitting a line by least squares regression Best line

The least squares line

ŷ = β0 + β1 x

where ŷ is the predicted response, β0 is the intercept, β1 is the slope, and x is the explanatory variable.

Notation:
Intercept: parameter β0, point estimate b0
Slope: parameter β1, point estimate b1


Fitting a line by least squares regression The least squares line

Given...

[Scatterplot: % in poverty vs. % HS grad]

               % HS grad (x)    % in poverty (y)
mean           x̄ = 86.01        ȳ = 11.35
sd             sx = 3.73         sy = 3.1
correlation    R = −0.75


Slope

The slope of the regression can be calculated as

b1 = (sy / sx) × R

In context:

b1 = (3.1 / 3.73) × (−0.75) = −0.62

Interpretation: For each percentage point increase in the HS graduation rate, we would expect the % living in poverty to decrease on average by 0.62 percentage points.


Intercept

The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that the regression line always passes through (x̄, ȳ):

b0 = ȳ − b1 x̄

[Scatterplot: % in poverty vs. % HS grad with the x-axis extended to 0, showing where the least squares line crosses the y-axis]

b0 = 11.35 − (−0.62) × 86.01 = 64.68

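Both point estimates follow directly from the summary statistics given earlier. A quick check in R, rounding the slope to two decimals as the slide does:

    R_xy  <- -0.75;  s_x <- 3.73;  s_y <- 3.1
    x_bar <- 86.01;  y_bar <- 11.35

    b1 <- round((s_y / s_x) * R_xy, 2)   # slope: -0.62
    b0 <- y_bar - b1 * x_bar             # intercept: 11.35 - (-0.62) * 86.01 = 64.68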

Fitting a line by least squares regression The least squares line

Interpret b0

Question

How do we interpret the intercept? (b0 = 64.68)

[Scatterplot: % in poverty vs. % HS grad with the x-axis extended to 0, showing the intercept]

States with no HS graduates are expected on average to have 64.68% of residents living below the poverty line.


Fitting a line by least squares regression The least squares line

Recap: Interpretation of slope and intercept

Intercept: When x = 0, y is expected to equal the value of the intercept.

Slope: For each unit increase in x, y is expected to increase/decrease on average by the value of the slope.


Regression line

predicted % in poverty = 64.68 − 0.62 × % HS grad

[Scatterplot: % in poverty vs. % HS grad with the least squares line]


Fitting a line by least squares regression Prediction & extrapolation

Prediction

Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction: simply plug the value of x into the linear model equation.

There will be some uncertainty associated with the predicted value; we'll talk about this next time.

[Scatterplot: % in poverty vs. % HS grad with the least squares line]

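In practice the line is fit with lm(), and a prediction is just the fitted equation evaluated at a new x. A minimal sketch, again assuming the hypothetical states data frame:

    m <- lm(poverty ~ hs_grad, data = states)
    coef(m)   # intercept and slope (about 64.68 and -0.62 here)

    # predicted % in poverty for a state with an 82% HS graduation rate
    predict(m, newdata = data.frame(hs_grad = 82))
    # plugging in an x far outside the observed range would be extrapolation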

Extrapolation

Applying a model estimate to values outside of the realm of the original data is called extrapolation.

Sometimes the intercept might be an extrapolation.

[Scatterplot: % in poverty vs. % HS grad with the x-axis extended to 0; the intercept lies well outside the range of the observed data]

Examples of extrapolation

1 http://www.colbertnation.com/the-colbert-report-videos/269929

2 Sprinting: [figure not shown]


Fitting a line by least squares regression Conditions for the least squares line

Conditions for the least squares line

1 Linearity

2 Nearly normal residuals

3 Constant variability


Fitting a line by least squares regression Conditions for the least squares line

Conditions: (1) Linearity

The relationship between the explanatory and the response variable should be linear.

Methods for fitting a model to non-linear relationships exist, but are beyond the scope of this class.

Check using a scatterplot of the data, or a residuals plot.

[Example scatterplots of y vs. x with the corresponding residual plots]


Fitting a line by least squares regression Conditions for the least squares line

Anatomy of a residuals plot

[Scatterplot of % in poverty vs. % HS grad with the least squares line, and below it the corresponding residuals plot; RI and DC are highlighted]

RI:
% HS grad = 81, % in poverty = 10.3
predicted % in poverty = 64.68 − 0.62 × 81 = 14.46
e = % in poverty − predicted % in poverty = 10.3 − 14.46 = −4.16

DC:
% HS grad = 86, % in poverty = 16.8
predicted % in poverty = 64.68 − 0.62 × 86 = 11.36
e = % in poverty − predicted % in poverty = 16.8 − 11.36 = 5.44

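The two residuals highlighted on the plot can be verified directly from the fitted equation:

    pred_RI <- 64.68 - 0.62 * 81    # 14.46; RI's observed % in poverty is 10.3
    10.3 - pred_RI                  # residual = -4.16

    pred_DC <- 64.68 - 0.62 * 86    # 11.36; DC's observed % in poverty is 16.8
    16.8 - pred_DC                  # residual = 5.44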

Fitting a line by least squares regression Conditions for the least squares line

Conditions: (2) Nearly normal residuals

The residuals should be nearly normal.

This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data.

Check using a histogram or a normal probability plot of the residuals.

[Histogram of the residuals and a normal Q-Q plot of the residuals]


Fitting a line by least squares regression Conditions for the least squares line

Conditions: (3) Constant variability

[Scatterplot with the least squares line and the corresponding residuals plot]

The variability of points around the least squares line should be roughly constant.

This implies that the variability of residuals around the 0 line should be roughly constant as well.

Also called homoscedasticity.

Check using a residuals plot.

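All three conditions are usually checked from the residuals of the fitted model. A minimal sketch with base R graphics, assuming the model m fit above:

    e <- resid(m)

    plot(fitted(m), e)   # linearity & constant variability: look for no pattern
    abline(h = 0)        # and a roughly even spread around the 0 line
    hist(e)              # nearly normal residuals
    qqnorm(e); qqline(e) # normal probability (Q-Q) plot of the residuals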

Fitting a line by least squares regression Conditions for the least squares line

Checking conditions

Question

What condition is this linear model obviously violating?

(a) Constant variability
(b) Linear relationship
(c) Non-normal residuals
(d) No extreme outliers

[Scatterplot of y vs. x with a fitted line, and the corresponding residuals plot]


Fitting a line by least squares regression Conditions for the least squares line

Checking conditions

Question

What condition is this linear model obviously violating?

(a) Constant variability
(b) Linear relationship
(c) Non-normal residuals
(d) No extreme outliers

[A different scatterplot of y vs. x with a fitted line, and the corresponding residuals plot]



Fitting a line by least squares regression R2

R²

The strength of the fit of a linear model is most commonly evaluated using R².

R² is calculated as the square of the correlation coefficient.

It tells us what percent of the variability in the response variable is explained by the model.

The remainder of the variability is explained by variables not included in the model.

For the model we've been working with, R² = (−0.75)² ≈ 0.56.

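R² can be read off the model summary or computed directly as the squared correlation coefficient:

    summary(m)$r.squared                     # R^2 reported for the fitted model
    cor(states$hs_grad, states$poverty)^2    # the same value: the squared correlation
                                             # (about 0.56 when R = -0.75)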

Fitting a line by least squares regression R2

Interpretation of R2

Question

Which of the following is the correct interpretation of R = −0.75, R² ≈ 0.56?

(a) 56% of the variability in the % of HS graduates among the 51 states is explained by the model.

(b) 56% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

(c) 56% of the time % HS graduates predict % living in poverty correctly.

(d) 44% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

[Scatterplot: % in poverty vs. % HS grad]



Fitting a line by least squares regression Categorical explanatory variables

Poverty vs. region (east, west)

predicted poverty = 11.17 + 0.38 × west

Explanatory variable: region, reference level: east

Intercept: The estimated average poverty percentage in eastern states is 11.17%. This is the value we get if we plug in 0 for the explanatory variable.

Slope: The estimated average poverty percentage in western states is 0.38% higher than in eastern states, i.e. 11.17 + 0.38 = 11.55%. This is the value we get if we plug in 1 for the explanatory variable.

This is called using a dummy variable.

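The same lm() call handles a categorical explanatory variable: R converts the factor to a 0/1 dummy variable behind the scenes, using its first level as the reference. A minimal sketch, assuming states$region is a factor with levels "east" and "west" (hypothetical column name):

    m_region <- lm(poverty ~ region, data = states)
    coef(m_region)
    # (Intercept) ~ 11.17 : estimated average % in poverty for eastern states
    # regionwest  ~  0.38 : how much higher the western average is estimated to be

    # estimated average % in poverty for western states: 11.17 + 0.38 = 11.55
    predict(m_region, newdata = data.frame(region = "west"))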