
Exploratory Data Analysis Hal Varian 20 March 2006.

Page 1: Exploratory Data Analysis Hal Varian 20 March 2006.

Exploratory Data Analysis

Hal Varian
20 March 2006

Page 2: Exploratory Data Analysis Hal Varian 20 March 2006.

What is EDA?
Goals
  Examine and summarize data
  Look for patterns and suggest hypotheses
  Provide guidance for more systematic analysis
Methods of analysis
  Primarily graphics and tables
Online reference
  http://www.itl.nist.gov/div898/handbook/eda/eda.htm
  http://www.math.yorku.ca/SCS/Courses/eda/

Page 3: Exploratory Data Analysis Hal Varian 20 March 2006.

Tools for EDA
We will use R = open source S
  Very widely used by statisticians
  Libraries for all sorts of things are available
  Download from
    cran.stat.ucla.edu
    http://www.r-project.org/
Recommend ESS (= Emacs Speaks Statistics) for interactive use
Windows interface is not bad

Page 4: Exploratory Data Analysis Hal Varian 20 March 2006.

Interactive R session

> library("foreign")                     # provides read.spss
> dat <- read.spss("GSS93 subset.sav")   # read the SPSS data file
> attach(dat)                            # make variables such as AGE visible by name
> summary(AGE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   18.0    33.0    43.0    46.4    59.0    99.0
> hist(AGE)

Page 5: Exploratory Data Analysis Hal Varian 20 March 2006.

Histogram of age
[Figure: Histogram of AGE; x-axis AGE (20 to 100), y-axis Frequency (0 to 200)]

Page 6: Exploratory Data Analysis Hal Varian 20 March 2006.

Recode missing data
AGE[AGE>90] <- NA
plot(density(AGE,na.rm=T))

# plot both together
hist(AGE,freq=F)
lines(density(AGE,na.rm=T))
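The recoding and overlay steps can be tried without the GSS file. What follows is a minimal sketch on simulated data: the vector `age`, the use of 99 as a missing-value code, and the sample sizes are all assumptions made for illustration, not the survey data.

```r
# Simulated stand-in for the survey variable (an assumption, not GSS data):
# 500 plausible ages plus 5 entries of 99 acting as a missing-value code.
set.seed(1)
age <- c(round(runif(500, min = 18, max = 89)), rep(99, 5))

# Recode the missing-value codes so they do not distort summaries
age[age > 90] <- NA

# Histogram on the density scale with a kernel density estimate drawn on top
hist(age, freq = FALSE, main = "Histogram of age", xlab = "age")
lines(density(age, na.rm = TRUE))

summary(age)
```

Recoding before plotting matters here: left in place, the 99 codes would show up as a spurious bump at the right edge of both the histogram and the density.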

Page 7: Exploratory Data Analysis Hal Varian 20 March 2006.

Density and density + hist
[Figure, two panels. Left: density(x = AGE, na.rm = T), N = 1495, Bandwidth = 3.633; x-axis 20 to 100, y-axis Density (0.000 to 0.025). Right: Histogram of AGE with the density curve overlaid; x-axis AGE (20 to 80), y-axis Density (0.000 to 0.025).]

Page 8: Exploratory Data Analysis Hal Varian 20 March 2006.

Boxplot
[Figure: annotated boxplot, axis 20 to 100. Labels: Outlier; 1.5 interquartile range; 3rd quartile; Median; 1st quartile; Smallest value]

Page 9: Exploratory Data Analysis Hal Varian 20 March 2006.

Boxplot enhancements
  Notches: confidence interval for median
  Varwidth=T: width of box is proportional to sqrt(n)
  Useful for comparisons
[Figure: notched boxplot, axis 20 to 100]
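The quantities a boxplot draws can be inspected directly with `boxplot.stats` from base R. The sketch below uses a small made-up sample and two simulated groups; the data and group names are assumptions for illustration only.

```r
# Small illustrative sample with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

bs <- boxplot.stats(x, coef = 1.5)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points more than 1.5 box lengths beyond the hinges
bs$conf   # notch limits: median -/+ 1.58 * (hinge spread) / sqrt(n)

# Notched, variable-width boxplots for two simulated groups of unequal size
set.seed(2)
g <- factor(rep(c("a", "b"), times = c(200, 50)))
y <- rnorm(250, mean = ifelse(g == "a", 40, 45), sd = 10)
boxplot(y ~ g, notch = TRUE, varwidth = TRUE)
```

With `varwidth = TRUE` the box for group "a" is drawn wider than the box for group "b", since it holds four times as many observations.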

Page 10: Exploratory Data Analysis Hal Varian 20 March 2006.

Comparing distributions
boxplot(AGE~RACE)
boxplot(AGE~RACE,notch=T,varwidth=T)
Doesn’t seem to be a big difference in the age distribution
[Figure: boxplots of AGE by RACE (white, black, other); axis 20 to 90]

Page 11: Exploratory Data Analysis Hal Varian 20 March 2006.

EDUC v RACE
boxplot(EDUC[EDUC<90]~RACE[EDUC<90],notch=T,varwidth=T)
[Figure: notched boxplots of EDUC by RACE (other, black, white); axis 0 to 20]

Page 12: Exploratory Data Analysis Hal Varian 20 March 2006.

Violin plot
  Combines density plot and boxplot
  Good for weird-shaped distributions…

Page 13: Exploratory Data Analysis Hal Varian 20 March 2006.

Back to Back Histogram
library("Hmisc")
histbackback(EDUC[RACE=="black"],EDUC[RACE=="white"],probability=T)
[Figure: back-to-back histogram; left side EDUC[RACE == "black"], right side EDUC[RACE == "white"]; horizontal axis proportion (0.2 to 0.0 to 0.2), vertical axis EDUC bins (2 to 18)]

Page 14: Exploratory Data Analysis Hal Varian 20 March 2006.

Two-way table
GT12 <- EDUC>12
temp <- table(GT12,RACE)

GT12    white black other
  FALSE   614   100    37
  TRUE    640    67    38

prop.table(temp,2)
GT12        white     black     other
  FALSE 0.4896332 0.5988024 0.4933333
  TRUE  0.5103668 0.4011976 0.5066667
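The column proportions can be reproduced from the printed counts alone. This sketch rebuilds the table by hand rather than from the GSS file; the object names `col_share` and `by_hand` are introduced here for illustration.

```r
# Counts copied from the slide: rows are GT12 (EDUC > 12), columns are RACE
temp <- as.table(matrix(c(614, 100, 37,
                          640,  67, 38),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(GT12 = c("FALSE", "TRUE"),
                                        RACE = c("white", "black", "other"))))

# margin = 2 divides each cell by its column total
col_share <- prop.table(temp, 2)
col_share

# The same calculation spelled out, to show what prop.table is doing
by_hand <- sweep(temp, 2, colSums(temp), "/")
```

Each column of `col_share` sums to 1, so the rows can be read as the share of each group with more than 12 years of education and the share without.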

Page 15: Exploratory Data Analysis Hal Varian 20 March 2006.

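Comparing distributions
qqplot = quantile-quantile plot
  Plots quantiles of x against quantiles of y: for each fraction, the value k with that fraction of the data below it in x, and the matching value in y
Shapes
  Straight line with slope 1 through the origin: same distribution
  Vertical intercepts differ: different mean
  Slopes differ: different variance
Reference distribution can be a theoretical distribution
  qqnorm: compare to standardized normal
  With sample quantiles on the vertical axis (as qqnorm draws them):
    Skew to right: both tails above the straight line
    Heavy tails: lower tail below, upper tail above the line
  The directions reverse if the sample is plotted on the horizontal axis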
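The reading of intercepts and slopes can be checked on simulated samples. This is a sketch under stated assumptions: the three normal samples and the fitted-line summary are illustrative choices, and the line fit is only a convenient way to put numbers on what the plot shows.

```r
set.seed(3)
x <- rnorm(1000)              # reference sample
y <- rnorm(1000, mean = 2)    # same shape, shifted mean
z <- rnorm(1000, sd = 2)      # same mean, larger spread

# qqplot() returns the paired, sorted quantiles; plot.it = FALSE skips drawing
qq_shift <- qqplot(x, y, plot.it = FALSE)
qq_scale <- qqplot(x, z, plot.it = FALSE)

# A fitted line summarizes each plot: intercept near 2 and slope near 1 for
# the shifted sample, intercept near 0 and slope near 2 for the scaled one
coef_shift <- coef(lm(qq_shift$y ~ qq_shift$x))
coef_scale <- coef(lm(qq_scale$y ~ qq_scale$x))

# Compare a sample with the normal distribution and add a reference line
qqnorm(y)
qqline(y)
```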

Page 16: Exploratory Data Analysis Hal Varian 20 March 2006.

qqplot(x,y) examples
[Figure: four Q-Q plots. Three qqplot(x,y) panels labeled "identical", "Mean1=0, Mean2=2" and "1=1, 2=2" (symbol missing in the source, likely a standard deviation); one Normal Q-Q Plot (Theoretical Quantiles v Sample Quantiles) labeled "Sample v N(0,1), with refline"]

Page 17: Exploratory Data Analysis Hal Varian 20 March 2006.

More qqnorm examples
[Figure: two qqnorm panels: Skewed to right; Heavy tails]
www.maths.murdoch.edu.au/units/statsnotes/samplestats/qqplot.html

Page 18: Exploratory Data Analysis Hal Varian 20 March 2006.

Pairs of variables
Is one variable related to another?
Scatterplot
  Basic: plot(x,y)
  Enhanced from library("car"): scatterplot(x,y)
Scatterplot matrix
  Basic: pairs(data.frame(x,y,z))
  Enhanced: scatterplot.matrix(data.frame(x,y,z)) (named scatterplotMatrix in current versions of car)
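The basic versions need only base R. A minimal sketch on simulated data follows; the variables and their relationship are assumptions chosen so that one pair is clearly related and the others are not. The enhanced functions from car are left out so the example runs without extra packages.

```r
set.seed(4)
x <- rnorm(100)
y <- 2 * x + rnorm(100)      # related to x
z <- rnorm(100)              # unrelated to x and y
d <- data.frame(x, y, z)

plot(d$x, d$y)               # basic scatterplot
pairs(d)                     # basic scatterplot matrix
round(cor(d), 2)             # numeric companion to the matrix
```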

Page 19: Exploratory Data Analysis Hal Varian 20 March 2006.

Basic and enhanced scatterplot

Page 20: Exploratory Data Analysis Hal Varian 20 March 2006.

Scatterplot matrix

Page 21: Exploratory Data Analysis Hal Varian 20 March 2006.

Labeling points in scatterplots
  identify(x,y,labels="foo")
  Color is also useful
[Figure: scatterplot of y against x with points 90, 98, 110 and 175 labeled]

Page 22: Exploratory Data Analysis Hal Varian 20 March 2006.

Cigarettes and taxes
  Discussant on paper by Austan Goolsbee, “Playing with Fire”
  Question: did Internet purchases of cigarettes affect state tobacco tax revenues?

Page 23: Exploratory Data Analysis Hal Varian 20 March 2006.

Cigarette Prices in 1990s
[Figure: Price of cigarettes by year, 1990 to 2000; vertical axis 150 to 400]

Page 24: Exploratory Data Analysis Hal Varian 20 March 2006.

Internet usage
[Figure: Internet usage by year, 1990 to 2000; vertical axis 0.0 to 0.5]

Page 25: Exploratory Data Analysis Hal Varian 20 March 2006.

Price elasticity of use/sales
Across all states and years
  Taxable sales elasticity: -0.802
  Use elasticity: -0.440
Sales are much more responsive to price than usage, suggesting that there is some cross-border trade (aka “buttlegging”)
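The slide does not say how these elasticities were estimated. One common approach, shown here purely as an illustration, is a log-log regression whose slope is the price elasticity. The data are simulated with a known elasticity of -0.8, and every number and variable name is made up.

```r
# Simulated prices and quantities with a true elasticity of -0.8
# (all numbers here are invented for illustration)
set.seed(5)
p <- runif(200, min = 150, max = 400)
q <- exp(10 - 0.8 * log(p) + rnorm(200, sd = 0.05))

# In log(q) = a + b * log(p), the slope b is the price elasticity
fit <- lm(log(q) ~ log(p))
elasticity <- unname(coef(fit)["log(p)"])
elasticity
```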

Page 26: Exploratory Data Analysis Hal Varian 20 March 2006.

Use vs Sales in 2000
[Figure: scatterplot of ciguse.p[year == 2000] (vertical axis, 3 to 6) against q.p[year == 2000] (horizontal axis, 40 to 160); labeled points: DE, KY, NH, CA, UT]

Page 27: Exploratory Data Analysis Hal Varian 20 March 2006.

Reduced form
  dp = log(p2001) - log(p1995)
  dq = log(q2001) - log(q1995)
  Regress dq/dp on internet penetration in 2000
  See next slide for result
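The reduced-form steps translate directly into a few lines of R. This sketch uses a hypothetical five-state data frame: the state labels, prices, quantities and penetration rates are all invented, and only the definitions of dp, dq and the regression follow the slide.

```r
# Hypothetical state-level data: every value below is invented
states <- data.frame(
  state    = c("A", "B", "C", "D", "E"),
  p1995    = c(180, 200, 170, 210, 190),   # price in 1995
  p2001    = c(320, 300, 330, 310, 340),   # price in 2001
  q1995    = c(100,  90, 110,  95, 105),   # quantity in 1995
  q2001    = c( 70,  75,  72,  80,  68),   # quantity in 2001
  internet = c(0.30, 0.35, 0.40, 0.28, 0.45)
)

# Log differences, as defined on the slide
states$dp <- log(states$p2001) - log(states$p1995)
states$dq <- log(states$q2001) - log(states$q1995)

# Ratio of log changes: a state-by-state elasticity estimate
states$elast <- states$dq / states$dp

# Regress the elasticity on Internet penetration
fit <- lm(elast ~ internet, data = states)
coef(fit)
```

A negative coefficient on `internet` would mean that quantities responded more strongly to price in states with higher Internet penetration.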

Page 28: Exploratory Data Analysis Hal Varian 20 March 2006.

Elasticity v Internet penetration
[Figure: scatterplot of dq/dp (vertical axis, -0.8 to 0.2) against i (horizontal axis, 0.25 to 0.45); labeled points: CA, DC, DE, MI, NH, NY, OK, WA]

Page 29: Exploratory Data Analysis Hal Varian 20 March 2006.

What is Internet providing?
  It was always a good deal for some to buy cigarettes out-of-state (in high tax states)
  Mail order has been around for a long time and is certainly cost-effective
  Internet makes it easier to find merchants: just type into search engine
  Internet is great at matching buyers and sellers

Page 30: Exploratory Data Analysis Hal Varian 20 March 2006.

Price of a match
  Google doesn’t accept cigarette advertisements, but Overture does
  Price for top listing: $1.20 per click
  Avg price for click on Overture is 40 cents
  Conversion rates might be 5%, so advertiser is paying $1.20 / 0.05 = $24 for introduction
  But think of lifetime value…
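The $24 figure is the click price divided by the conversion rate. A short worked version follows; the variable names are introduced here, and the 5% rate is the slide's own guess.

```r
cost_per_click  <- 1.20   # price for top listing, from the slide
conversion_rate <- 0.05   # the slide's guess of 5%

# One customer needs 1 / conversion_rate clicks on average, so:
cost_per_customer <- cost_per_click / conversion_rate
cost_per_customer   # 24 dollars per introduction

# Same calculation at the 40 cent average click price
avg_cost_per_customer <- 0.40 / conversion_rate
```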

Page 31: Exploratory Data Analysis Hal Varian 20 March 2006.

Value of a match
  Google doesn’t accept cigarette advertisements, but Overture does
  Price for top listing: $1.20 per click
  Avg price for click on Overture is 40 cents
  Conversion rates might be 5%, so advertiser is paying $1.20 / 0.05 = $24 for introduction
  But think of lifetime value…

Page 32: Exploratory Data Analysis Hal Varian 20 March 2006.

Straightening out and scaling data
Find transform so that data looks linear, or normal, or fits on same scale
  Log10 (easier to interpret than log)
  Square root
  Reciprocal
  Box-Cox transform (x^r - 1)/r, which combines many of the above; the limit as r goes to 0 is log
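A small function makes the link between the Box-Cox family and the other transforms concrete. This is a sketch: the name `box_cox` and the sample vector are introduced here, and the function assumes strictly positive data.

```r
# Box-Cox transform: (x^r - 1) / r, with log(x) as the limit at r = 0
box_cox <- function(x, r) {
  stopifnot(all(x > 0))
  if (r == 0) log(x) else (x^r - 1) / r
}

x <- c(1, 10, 100, 1000)
box_cox(x, 1)     # x - 1: a shift only, shape unchanged
box_cox(x, 0.5)   # 2 * (sqrt(x) - 1): square root up to shift and scale
box_cox(x, 0)     # log(x)
box_cox(x, -1)    # 1 - 1/x: reciprocal up to shift and sign
```

Varying r between these values moves smoothly from one transform to the next, which is why a single parameter can be tuned to straighten or normalize the data.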

Page 33: Exploratory Data Analysis Hal Varian 20 March 2006.

City sizes: regular & log10

[Figure: Histogram of log10(pop1980); x-axis log10(pop1980) (3.5 to 7.0), y-axis Density (0.0 to 0.8)]

