
Intermediate Data Collection & Analysis

Page 1: Intermediate Data Collection & Analysis

Intermediate Data Collection & Analysis

Steven A. Allshouse, Coordinator of Research and Analysis

November 5, 2008

Page 2: Intermediate Data Collection & Analysis

Organization of the Class

• Part I – Discussion of Correlation and Causation.

• Part II – Quantitative Examples of Correlation and Causation.

• Part III – How to Measure Correlation (OLS Method).

• Part IV – Common Pitfalls of the OLS Method.

• Part V – MS Excel Exercise.

Page 3: Intermediate Data Collection & Analysis

Part I – Qualitative Examples of Correlation and Causation

Page 4: Intermediate Data Collection & Analysis

Correlation

• A situation in which one variable or set of variables tends to be associated with a second variable or set of variables, but is not thought to bring about that second variable or set of variables.

• Examples: The size of a person’s left foot and the size of his or her right foot; women’s hemlines and the performance of the stock market; and the number of cavities in elementary school children and the size of their vocabulary.

• Note: Correlation can be positive or negative; positive means as X increases, so does Y; negative means as X increases, Y decreases.

Page 5: Intermediate Data Collection & Analysis

Causation

• A situation in which one variable or set of variables is thought to bring about, or help bring about, a second variable or set of variables.

• Examples: Alcohol consumption/traffic accidents; average daily temperatures/heating oil consumption.

• Notes: Causation usually implies correlation: if X causes Y, then where we see X we would expect to see Y. Causation can be positive or negative: an increase in X can cause an increase or a decrease in Y. The direction of causation can run one way or both ways: X causes Y, but Y might or might not cause X.

Page 6: Intermediate Data Collection & Analysis

A Case of Causation?

• There is a strong positive correlation between the number of fire engines that respond to a fire and the number of fatalities in that fire, i.e., the greater the number of fire engines, the greater the number of deaths.

• Question: Does this fact mean that Albemarle County could save lives by decreasing the number of fire engines sent to a given fire?

Page 7: Intermediate Data Collection & Analysis

Additional Notes about Correlation & Causation

• The direction of causation usually determines which variables we identify as “independent” and “dependent”: the independent variable X causes the dependent variable Y. X and Y are correlated, but Y does not cause X.

• Identification problem: Smoke actually does not cause the fire alarm to be pulled; fire is the underlying cause. Similarly, an increase in, say, education can be seen as causing an increase in income, but educational attainment might just be a “signal” of some underlying ability.

Page 8: Intermediate Data Collection & Analysis

Part II – Quantitative Examples of Correlation and Causation

Page 9: Intermediate Data Collection & Analysis

Example #1

Number of Years of Education and Annual Income ($) in Hooville County

(According to Survey Taken in 2007)

Person    No. of Yrs. of Educ. Completed    Annual Income ($)
Joe       12                                20,000
Bill      14                                24,000
Sara      16                                28,000
Steve      9                                14,000
Lori      17                                30,000
Sean      10                                16,000
Robert     8                                12,000
Andy       7                                10,000
Susan     11                                18,000
Wendy     15                                26,000
Joyce     13                                22,000

Page 10: Intermediate Data Collection & Analysis

Graph I -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

Page 11: Intermediate Data Collection & Analysis

Graph I -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

Page 12: Intermediate Data Collection & Analysis

Example #2

Number of Years of Education and Annual Income ($) in Hooville County

(According to Survey Taken in 2007)

Person    No. of Yrs. of Educ. Completed    Annual Income ($)
Joe       12                                20,000
Bill      14                                24,000
Sara      16                                28,000
Steve      9                                14,000
Lori      17                                30,000
Sean      10                                16,000
Robert     8                                12,000
Andy       7                                10,000
Susan     11                                18,000
Wendy     15                                26,000
Joyce     13                                22,000
Ron       14                                23,000
Alice     14                                27,000
Brenda    16                                25,000
Laura     10                                18,000
Bob       17                                29,000
Bryan     10                                12,000
Tom        9                                 7,000
Lee        7                                14,000
Lisa      15                                23,000
Meagan    15                                35,000
Judy      13                                18,000

Page 13: Intermediate Data Collection & Analysis

Graph II(a) -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

Page 14: Intermediate Data Collection & Analysis

Graph II(b) -- Number of Years of Education, Annual Income, and Average Annual Income by Years of Education (Hooville County, 2007)

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000). Legend: Avg. Inc. ($) by Yrs. of Educ.]

Page 15: Intermediate Data Collection & Analysis

Graph II(c) -- Number of Years of Education and Average Annual Income by Years of Education (Hooville County, 2007)

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Average Annual Income ($) (0 to 40,000). Legend: Avg. Inc. ($) by Yrs. of Educ.]

Page 16: Intermediate Data Collection & Analysis

Graph II(d) -- Number of Years of Education and Median Annual Income by Years of Education (Hooville County, 2007)

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Median Annual Income ($) (0 to 40,000). Legend: Med. Inc. ($) by Yrs. of Educ.]
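Graphs II(b) through II(d) summarize the Example #2 survey by the average and median income at each level of education. The slides do not show how those summaries were calculated; the following is a minimal Python sketch of one way to do it, using the Example #2 table as input. The variable names are illustrative and are not part of the original presentation.

```python
from collections import defaultdict
from statistics import mean, median

# Example #2 data: (years of education completed, annual income in $).
data = [
    (12, 20_000), (14, 24_000), (16, 28_000), (9, 14_000), (17, 30_000),
    (10, 16_000), (8, 12_000), (7, 10_000), (11, 18_000), (15, 26_000),
    (13, 22_000), (14, 23_000), (14, 27_000), (16, 25_000), (10, 18_000),
    (17, 29_000), (10, 12_000), (9, 7_000), (7, 14_000), (15, 23_000),
    (15, 35_000), (13, 18_000),
]

# Group the incomes by number of years of education.
incomes_by_years = defaultdict(list)
for years, income in data:
    incomes_by_years[years].append(income)

# Print the average and median income for each education level.
for years in sorted(incomes_by_years):
    incomes = incomes_by_years[years]
    print(f"{years:>2} yrs: mean = {mean(incomes):>9,.0f}, "
          f"median = {median(incomes):>9,.0f}")
```

Collapsing several observations into one summary point per education level hides the scatter among individuals, which is worth keeping in mind when comparing these graphs with Graph II(a).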

Page 17: Intermediate Data Collection & Analysis

Graph II(e) -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

Page 18: Intermediate Data Collection & Analysis

Graph II(f) -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

Page 19: Intermediate Data Collection & Analysis

Graph II(g) -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

Page 20: Intermediate Data Collection & Analysis

Example #3

Number of Years of Education and Annual Income ($) in Hooville County

(According to Survey Taken in 2007)

Person    No. of Yrs. of Educ. Completed    Annual Income ($)
Joe       12                                19,500
Bill      14                                27,000
Sara      16                                32,000
Steve      9                                13,000
Lori      17                                33,000
Sean      10                                15,000
Robert     8                                12,700
Andy       7                                 9,000
Susan     11                                18,000
Wendy     15                                18,000
Joyce     13                                24,000
Ron       14                                29,000
Alice     14                                30,000
Brenda    16                                25,000
Laura     10                                22,000
Bob       17                                29,000
Bryan     10                                12,000
Tom        9                                 7,000
Lee        7                                14,000
Lisa      15                                23,000
Meagan    15                                35,000
Judy      13                                16,000

Page 21: Intermediate Data Collection & Analysis

Graph III(a) -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

Page 22: Intermediate Data Collection & Analysis

Graph III(b) -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000). Legend: Bill's Guess at the Functional Relation.]

Page 23: Intermediate Data Collection & Analysis

Graph III(c) -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000). Legend: Joe's Guess at the Functional Relation; Bill's Guess at the Functional Relation.]

Page 24: Intermediate Data Collection & Analysis

Graph III(d) -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000). Legend: Joe's Guess at the Functional Relation; The Mathematically-Derived Estimate of the Functional Relation; Bill's Guess at the Functional Relation.]

Page 25: Intermediate Data Collection & Analysis

Part III – How to Estimate Correlation

Page 26: Intermediate Data Collection & Analysis

Ordinary Least Squares (OLS) Method

• OLS is a mathematical technique that estimates the correlation between two or more variables. Usually, however, if we are measuring correlation, we already are assuming causation.

• The OLS technique renders two items:

• (1) A formula whose graphical representation (a “regression” or “trend” line) best “fits” the observed data; and

• (2) A number (R²) whose value describes how “tightly” the data fits around the regression line.

Page 27: Intermediate Data Collection & Analysis

The “Regression” or “Trend” Line

• Data is plotted in a “scatter” diagram. The horizontal axis contains the “x” values (independent variable) and the vertical axis contains the “y” values (dependent variable).

• The Regression or Trend line is expressed in the form y = mx + b, where m is the slope of the line and b is its y-intercept.

• The terms “regression” line and “trend” line frequently are used interchangeably but, usually, a “trend” line pertains to data where the value of the dependent variable changes with time.
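The slides do not show how OLS arrives at the line. As background that is not taken from the presentation: for a single independent variable, the OLS line is the one that minimizes the sum of squared vertical distances between the observed points and the line. Writing the n observations as (x_i, y_i) and their averages as x̄ and ȳ (notation assumed here), the standard closed-form result for the slope m and intercept b in y = mx + b is:

```latex
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
```

The second equation implies that the fitted line always passes through the point of averages (x̄, ȳ).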

Page 28: Intermediate Data Collection & Analysis

Graph I -- Number of Years of Education and Annual Income in Hooville County in 2007

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

Page 29: Intermediate Data Collection & Analysis

Graph I -- Number of Years of Education and Annual Income in Hooville County in 2007

y = 2,000x - 4,000

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

OLS Regression Formula Generated by MS Excel

This formula means that, for each additional year of education (the "x" value), the annual income of a resident of Hooville County (the "y" value) is estimated to increase by $2,000. The "- 4,000" means that, if a resident had zero years of education, the formula would predict an annual income of negative $4,000.
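The presentation obtains this formula from MS Excel. As a cross-check, here is a minimal Python sketch that applies the standard closed-form OLS calculation to the Example #1 data. The function name `ols_line` is hypothetical and chosen only for illustration; this is not how Excel is implemented internally.

```python
# Example #1 data: (years of education completed, annual income in $).
data = [
    (12, 20_000), (14, 24_000), (16, 28_000), (9, 14_000), (17, 30_000),
    (10, 16_000), (8, 12_000), (7, 10_000), (11, 18_000), (15, 26_000),
    (13, 22_000),
]

def ols_line(points):
    """Return (slope, intercept) of the least-squares line y = m*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    # Intercept: chosen so the line passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = ols_line(data)
print(f"slope = {slope:,.0f} dollars per year of education")
print(f"intercept = {intercept:,.0f} dollars")
```

Because every point in Example #1 lies exactly on one straight line, the fitted slope and intercept match the 2,000 and -4,000 shown on the slide.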

Page 30: Intermediate Data Collection & Analysis

The R² Number

• Has a value anywhere from 0 to 1.

• An R² value of 0 means that there is no linear correlation between the independent and dependent variables.

• An R² value of 1 means that there is a perfectly deterministic correlation between the independent and dependent variables.

• The R² number tells us how much of the variation in the dependent variable is “explained” by variation in the independent variable.

• Example: If R² equals 0.70, that means that 70% of the variation in the dependent variable is “explained” by variation in the independent variable.
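The slides describe this number in words only. One common way to define it, stated here as background rather than as the presentation's own formula, compares the scatter left over around the fitted line with the total scatter around the average. With ŷ_i = m x_i + b denoting the fitted value for observation i and ȳ the average of the observed y values (notation assumed here):

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

If every point lies on the line, the numerator is zero and the value is 1. If the line does no better than simply predicting the average for everyone, the fraction equals 1 and the value is 0.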

Page 31: Intermediate Data Collection & Analysis

Graph I -- Number of Years of Education and Annual Income in Hooville County in 2007

y = 2,000x - 4,000

R² = 1

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

The MS Excel-generated R² Value. In this case R² = 1.00. This means that, in this sample of residents, 100% of the variation in annual income is "explained" by variation in the number of years of education.

Page 32: Intermediate Data Collection & Analysis

Graph III(a) -- Number of Years of Education and Annual Income in Hooville County in 2007

y = 2,154.69x - 5,585.26

R² = 0.71

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]
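The equation and R² value on this slide can be reproduced from the Example #3 survey data. The following Python sketch is one way to do so without Excel; it uses the standard closed-form OLS calculation and computes R² as one minus the ratio of residual to total sum of squares. The function name `ols_fit` is hypothetical.

```python
# Example #3 data: (years of education completed, annual income in $).
data = [
    (12, 19_500), (14, 27_000), (16, 32_000), (9, 13_000), (17, 33_000),
    (10, 15_000), (8, 12_700), (7, 9_000), (11, 18_000), (15, 18_000),
    (13, 24_000), (14, 29_000), (14, 30_000), (16, 25_000), (10, 22_000),
    (17, 29_000), (10, 12_000), (9, 7_000), (7, 14_000), (15, 23_000),
    (15, 35_000), (13, 16_000),
]

def ols_fit(points):
    """Return (slope, intercept, r_squared) for the line y = m*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Residual scatter around the line versus total scatter around the mean.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return slope, intercept, 1 - ss_res / ss_tot

slope, intercept, r2 = ols_fit(data)
print(f"y = {slope:,.2f}x + ({intercept:,.2f})")
print(f"R^2 = {r2:.2f}")
```

Unlike Example #1, the points here do not fall on a single line, so the fit leaves roughly 29% of the variation in income unexplained by years of education.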

Page 33: Intermediate Data Collection & Analysis

Graph III(a) -- Number of Years of Education and Annual Income in Hooville County in 2007

y = 148.241x + 16,894.472

R² = 0.001

[Chart: x-axis, Number of Years of Education Completed (6 to 18); y-axis, Annual Income ($) (0 to 40,000).]

Page 34: Intermediate Data Collection & Analysis

Example of a Trend Line Analysis

Page 35: Intermediate Data Collection & Analysis

Graph IV -- Percentage of SFD Building Permits in Albemarle County that Were Issued in the Growth Areas (1991-2006)

y = -0.006x + 12.669

R² = 0.246

[Chart: x-axis, Calendar Year (1991 to 2006); y-axis, Percentage of All SFD Building Permits (30% to 65%).]

Page 36: Intermediate Data Collection & Analysis

Part IV – Some Common Pitfalls of Regression / Trend Line Analysis

Page 37: Intermediate Data Collection & Analysis

• Pitfall #1: The Regression or Trend Line that is derived from the OLS method might be meaningful only for a limited range of numbers.

• Pitfall #2: The most valid Regression or Trend Line for a particular set of data might not necessarily be linear.

• Pitfall #3: Usually, a dependent variable is a function of several independent variables, not just one independent variable.

Page 38: Intermediate Data Collection & Analysis

Graph IV -- Percentage of SFD Building Permits in Albemarle County that Were Issued in the Growth Areas (1991-2006)

y = -0.006x + 12.669

R² = 0.246

[Chart: x-axis, Calendar Year (1991 to 2006); y-axis, Percentage of All SFD Building Permits (30% to 65%).]

Page 39: Intermediate Data Collection & Analysis

Graph IV -- Percentage of SFD Building Permits in Albemarle County that Were Issued in the Growth Areas (1991-2006)

y = -0.006x + 12.669

R² = 0.246

y = -0.002x² + 7.270x - 7257.795

R² = 0.617

[Chart: x-axis, Calendar Year (1991 to 2006); y-axis, Percentage of All SFD Building Permits (30% to 65%).]

Page 40: Intermediate Data Collection & Analysis

Graph V

Scatter Diagram for Regression #2 (1988-1993 Population Change and 1988-1993 Tax Rate Change -- All Jurisdictions)

y = 0.2544x + 83.933

R² = 0.0256

[Chart: x-axis, 1993 Population as a % of 1988 Population (60.00 to 160.00); y-axis, 1993 Tax Rate as a % of 1988 Tax Rate (40.00 to 180.00).]

Page 41: Intermediate Data Collection & Analysis

Questions?

Page 42: Intermediate Data Collection & Analysis

Part V – MS Excel Exercise

Page 43: Intermediate Data Collection & Analysis

Background

• You work in the Planning Department; your boss comes to you with historical development data showing growth in the square footage of non-residential space.

• An intern has compiled the data, and has calculated the increase in square footage, by type of non-residential space, that occurred during a twenty-year period.

• The intern has taken the twenty-year increase and divided that number by twenty in order to derive an average annual increase in each type of square footage.

• Your boss has used this average annual increase to estimate the number of square feet, by non-residential type, that the County can expect over the course of the next ten years.

Page 44: Intermediate Data Collection & Analysis

Table IV

Estimated Total Square Footage of Non-Residential Space in Albemarle County (1983-2003) & Estimated Ten Year Supply of Non-Residential Space

Year                 Retail SF  Tax. Office SF  Industrial SF  Institutional SF    Total SF
1983                 1,573,000         177,000      2,420,000         2,125,000   6,295,000
1984                 1,681,000         177,000      2,455,000         2,138,000   6,451,000
1985                 1,808,000         270,000      2,625,000         2,153,000   6,856,000
1986                 1,932,000         395,000      2,685,000         2,323,000   7,335,000
1987                 2,176,000         454,000      2,840,000         2,412,000   7,882,000
1988                 2,584,000         504,000      3,066,000         2,417,000   8,571,000
1989                 3,521,000         605,000      3,206,000         3,167,000  10,499,000
1990                 3,734,000         735,000      3,116,000         3,210,000  10,795,000
1991                 4,006,000         760,000      3,116,000         3,347,000  11,229,000
1992                 4,072,000         760,000      3,116,000         3,422,000  11,370,000
1993                 4,087,000         950,000      3,171,000         3,521,000  11,729,000
1994                 4,109,000       1,025,000      3,241,000         3,614,000  11,989,000
1995                 4,178,000       1,238,000      3,241,000         3,635,000  12,292,000
1996                 4,294,000       1,318,000      3,266,000         3,830,000  12,708,000
1997                 4,451,000       1,479,000      3,266,000         3,914,000  13,110,000
1998                 4,652,219       1,716,708      3,290,000         4,093,757  13,752,684
1999                 4,851,503       1,989,368      3,312,510         4,109,717  14,263,098
2000                 4,968,326       2,249,268      3,312,510         4,281,088  14,811,192
2001                 5,550,755       2,296,396      3,504,050         5,000,310  16,351,511
2002                 5,664,335       2,296,396      3,524,950         5,289,208  16,774,889
2003                 5,873,712       2,296,396      3,584,945         5,299,188  17,054,241

20 Yr. Avg. Grth.      215,036         105,970         58,247           158,709     537,962
Est. 10 Yr. SF       2,150,356       1,059,698        582,473         1,587,094   5,379,621
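The bottom two rows of Table IV follow from the intern's method described in the Background: subtract the 1983 figure from the 2003 figure, divide by twenty, then multiply by ten. A short Python sketch of that arithmetic, using only the first and last rows of the table (the dictionary names are illustrative):

```python
# First (1983) and last (2003) rows of Table IV, in square feet.
sf_1983 = {"Retail": 1_573_000, "Tax. Office": 177_000,
           "Industrial": 2_420_000, "Institutional": 2_125_000}
sf_2003 = {"Retail": 5_873_712, "Tax. Office": 2_296_396,
           "Industrial": 3_584_945, "Institutional": 5_299_188}

YEARS_ELAPSED = 20   # 1983 to 2003
YEARS_AHEAD = 10     # projection horizon used by the boss

# Average annual growth: total twenty-year increase divided by twenty.
avg_growth = {k: (sf_2003[k] - sf_1983[k]) / YEARS_ELAPSED for k in sf_1983}
# Ten-year estimate: average annual growth carried forward ten years.
ten_year = {k: YEARS_AHEAD * g for k, g in avg_growth.items()}

for k in sf_1983:
    print(f"{k:<14} avg. annual growth = {avg_growth[k]:>11,.1f} SF, "
          f"10-yr estimate = {ten_year[k]:>13,.1f} SF")
```

This method uses only two of the twenty-one observations for each type of space; every year in between has no effect on the estimate.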

Page 45: Intermediate Data Collection & Analysis

Background (Cont.)

• You are somewhat suspicious of the ten year projection for industrial space, since the County had a net loss of jobs in the manufacturing sector during the course of the twenty years.

• Assignment:

• (a) Take the historical data for the industrial square footage and use MS Excel to derive an OLS trend line that fits this data;

• (b) Graph the trend line, the trend line equation, and the R² value; and

• (c) Using the trend line equation, project the total new industrial square footage that the County can expect during the course of the next ten years.
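The exercise is meant to be done in MS Excel. For readers who want to cross-check their Excel result, the following Python sketch fits an OLS trend line to the industrial column of Table IV. It rests on two assumptions that are not stated in the slides: the calendar year is used directly as the x value, and "new square footage over the next ten years" is taken to be ten times the fitted slope. The function name `ols_fit` is hypothetical.

```python
# Industrial square footage from Table IV, 1983 through 2003.
years = list(range(1983, 2004))
industrial_sf = [
    2_420_000, 2_455_000, 2_625_000, 2_685_000, 2_840_000, 3_066_000,
    3_206_000, 3_116_000, 3_116_000, 3_116_000, 3_171_000, 3_241_000,
    3_241_000, 3_266_000, 3_266_000, 3_290_000, 3_312_510, 3_312_510,
    3_504_050, 3_524_950, 3_584_945,
]

def ols_fit(xs, ys):
    """Return (slope, intercept, r_squared) for the line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

slope, intercept, r2 = ols_fit(years, industrial_sf)

# Assumption: ten-year new space = ten years of growth at the trend slope.
projected_new_sf = 10 * slope

print(f"Trend slope: {slope:,.0f} SF per year")
print(f"R^2: {r2:.3f}")
print(f"Projected new industrial SF over ten years: {projected_new_sf:,.0f}")
```

An alternative reading of part (c) is to evaluate the trend line ten years past 2003 and subtract the actual 2003 figure; the two approaches differ by however far the 2003 observation sits from the line.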

Page 46: Intermediate Data Collection & Analysis

Assignment (Cont.)

• Question: Is your estimate different from the estimate that your boss derived? If so, how large is the gap (both in absolute square footage and percentage terms)?

• How “tightly” does the data fit around the trend line that you have derived? Do you have much confidence in your trend line?

Page 47: Intermediate Data Collection & Analysis

Conclusion

