1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 3a, February 4, 2014, SAGE 3101 Preliminary Analysis, Interpretation, Detailed Analysis, Assessment
Page 1:

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 3a, February 4, 2014, SAGE 3101

Preliminary Analysis, Interpretation, Detailed Analysis, Assessment

Page 2:

Contents
• PDA
• Interpreting what you get back (the stats, the plots)
• Detailed analyses/fitting – a start
• How to assess/intercompare

Page 3:

Preliminary Data Analysis
• Relates to the sample v. population (for Big Data) discussion last week
• Also called Exploratory DA
– “EDA is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe will be there” (John Tukey)
• Distribution analysis and comparison, visual ‘analysis’, model testing, i.e. pretty much the things you did last Friday!
• Thus we are going to review those results

Page 4:

Patterns and Relationships
• Stepping from elementary/distribution analysis to algorithmic-based analysis
• I.e. pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines, non-parametric models
• Relations – associations between/among populations
• Outcome: model and an evaluation of its fitness for purpose

Page 5:

Models
• Assumptions are often used when considering models, e.g. as being representative of the population – since they are so often derived from a sample – this should be starting to make sense (a bit)
• Two key topics:
– N=all and the open world assumption
– Model of the thing of interest versus model of the data (data model; structural form)
• “All models are wrong but some are useful” (generally attributed to the statistician George Box)

Page 6:

Conceptual, logical and physical models

Applied to a database:

However our models will be mathematical, statistical, or a combination.

The concept of the model comes from the hypothesis

The implementation of the physical model comes from the data ;-)

Page 7:

Art or science?
• The form of the model, incorporating the hypothesis, determines a “form”
• Thus, as much art as science because it depends both on your world view and what the data is telling you (or not)
• We will, however, be giving the models nice mathematical properties: orthogonal/orthonormal basis functions, etc.

Page 8:

Exploring the distribution
> summary(EPI) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
32.10 48.60 59.20 58.37 67.60 93.50 68
> boxplot(EPI)
> fivenum(EPI,na.rm=TRUE)
[1] 32.1 48.6 59.2 67.6 93.5
Tukey: min, lower hinge, median, upper hinge, max
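The EPI data file is not reproduced in this transcript, so the following is only a minimal sketch of the same exploration on simulated values. The vector `epi_sim` and its parameters are assumptions chosen for illustration; they are not the course data.

```r
# Simulated stand-in for the EPI column (hypothetical values, not the real data)
set.seed(42)
epi_sim <- c(rnorm(160, mean = 58, sd = 12), rep(NA, 8))

# summary() reports quartiles and counts the missing values
print(summary(epi_sim))

# fivenum() returns Tukey's five numbers:
# min, lower hinge, median, upper hinge, max
five <- fivenum(epi_sim, na.rm = TRUE)
print(five)

# Hinges and quartiles are defined differently, so they can differ slightly
quart <- quantile(epi_sim, probs = c(0.25, 0.75), na.rm = TRUE)
print(quart)

# boxplot() draws the box from the hinges
boxplot(epi_sim, main = "Simulated EPI-like values")
```

Comparing `five[c(2, 4)]` with `quart` shows why `summary()` and `fivenum()` may disagree in the last digits on real data.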

Page 9:

Stem and leaf plot
> stem(EPI) # like a histogram
The decimal point is 1 digit(s) to the right of the | - but the scale of the stem is 10… watch carefully..
3 | 234
3 | 66889
4 | 00011112222223344444
4 | 5555677788888999
5 | 0000111111111244444
5 | 55666677778888999999
6 | 000001111111222333344444
6 | 5555666666677778888889999999
7 | 000111233333334
7 | 5567888
8 | 11
8 | 669
9 | 4

Page 10:

Histogram
> hist(EPI) #defaults

Page 11:

Distributions
• Shape
• Character
• Parameter(s)
• Which one fits?

Page 12:

> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw=1.))
> rug(EPI)
or
> lines(density(EPI,na.rm=TRUE,bw="SJ"))
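To see what the two bandwidth choices do, here is a small sketch on a simulated sample. The sample `x` is an assumption standing in for EPI; `bw = "SJ"` selects the Sheather-Jones data-driven bandwidth in base R's `density()`.

```r
set.seed(1)
x <- rnorm(200, mean = 58, sd = 12)   # hypothetical EPI-like sample

# Histogram on the density scale so the curves are comparable to the bars
hist(x, breaks = seq(floor(min(x)), ceiling(max(x)), by = 1), prob = TRUE)

# Fixed bandwidth of 1 versus the Sheather-Jones bandwidth
d_fixed <- density(x, bw = 1)
d_sj    <- density(x, bw = "SJ")
lines(d_fixed, lty = 2)
lines(d_sj)
rug(x)

# The bandwidth actually used is stored in the result
print(c(fixed = d_fixed$bw, SJ = d_sj$bw))
```

A small fixed bandwidth follows individual bars closely, while a data-driven bandwidth usually gives a smoother curve.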

Page 13:

> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw="SJ"))

Page 14:

Why are histograms so unsatisfying?

Page 15:

> xn<-seq(30,95,1)
> qn<-dnorm(xn,mean=63, sd=5,log=FALSE)
> lines(xn,qn)
> lines(xn,.4*qn)
> ln<-dnorm(xn,mean=44, sd=5,log=FALSE)
> lines(xn,.26*ln)
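The commands above overlay scaled normal curves by eye. As a sketch only, the two scaled components can also be summed into a single two-component curve. The sum `mix` is an assumption added here for illustration and does not appear on the slide; the means, standard deviation and weights are the slide's.

```r
xn <- seq(30, 95, 1)

# Two normal components with the means and sd used on the slide
comp_hi <- dnorm(xn, mean = 63, sd = 5)
comp_lo <- dnorm(xn, mean = 44, sd = 5)

# Hand-chosen weights from the slide (0.4 and 0.26); they do not sum to 1
mix <- 0.4 * comp_hi + 0.26 * comp_lo

# Standalone plot; on the slide the curves are added to an existing histogram
plot(xn, mix, type = "l", xlab = "value", ylab = "density")
lines(xn, 0.4 * comp_hi, lty = 2)
lines(xn, 0.26 * comp_lo, lty = 3)
```

Because the weights are picked by eye, this is a visual aid and not a fitted mixture model.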

Page 16:

Eland ~ EPI!Landlock
> hist(ELand, seq(30., 95., 1.0), prob=TRUE); lines …

Page 17:

No surface water

Page 18:

> EPIreg<-EPI_data$EPI[EPI_data$EPI_regions=="Europe"]
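The subsetting idiom can be tried on a toy data frame. This is a sketch under stated assumptions: the rows and region labels below are made up, and only the column names `EPI` and `EPI_regions` come from the slide.

```r
# Hypothetical miniature of the EPI data frame; values and regions are invented
EPI_data <- data.frame(
  EPI = c(70.1, 55.3, NA, 62.8, 48.9),
  EPI_regions = c("Europe", "South Asia", "Europe", "Europe",
                  "Sub-Saharan Africa"),
  stringsAsFactors = FALSE
)

# Logical indexing keeps the elements where the condition is TRUE
EPIreg <- EPI_data$EPI[EPI_data$EPI_regions == "Europe"]
print(EPIreg)

# Drop missing values before plotting or testing
EPIreg_clean <- EPIreg[!is.na(EPIreg)]
print(EPIreg_clean)
```

Missing values survive the regional filter, so they need to be removed explicitly or handled with `na.rm=TRUE` later.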

Page 19:

Exploring other distributions
> summary(DALY) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 37.19 60.35 53.94 71.97 91.50 39
> fivenum(DALY,na.rm=TRUE)
[1] 0.000 36.955 60.350 72.320 91.500
(Figure: panels labelled EPI and DALY)

Page 20:

Stem and leaf plot
> stem(DALY)
The decimal point is 1 digit(s) to the right of the |
0 | 0000111244
0 | 567899
1 | 0234
1 | 56688
2 | 000123
2 | 5667889
3 | 00001134
3 | 5678899
4 | 00011223444
4 | 555799
5 | 12223344
5 | 556667788999999
6 | 0000011111222233334444
6 | 6666666677788889999
7 | 00000000223333444
7 | 66888999
8 | 1113333333
8 | 555557777777777799999
9 | 222

Page 21:

DALY
> hist(DALY, seq(0., 99., 1.0), prob=TRUE)
> lines(density(DALY, na.rm=TRUE,bw=1.))
> lines(density(DALY, na.rm=TRUE,bw="SJ"))

Page 22:

Beyond histograms
• Cumulative distribution function: probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
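A short sketch of what `ecdf()` returns, using a simulated sample (the vector `x` and the reference normal parameters are assumptions for illustration):

```r
set.seed(7)
x <- rnorm(150, mean = 58, sd = 12)   # hypothetical sample

Fn <- ecdf(x)            # ecdf() returns a step function
print(Fn(58))            # proportion of observations <= 58
print(mean(x <= 58))     # the same quantity computed directly

# Empirical CDF with a reference normal CDF overlaid
plot(Fn, do.points = FALSE, verticals = TRUE)
curve(pnorm(x, mean = 58, sd = 12), add = TRUE, lty = 2)
```

Unlike a histogram, the empirical CDF needs no bin width, which is one reason it is a useful companion plot.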

Page 23:

Beyond histograms
• Quantile ~ inverse cumulative distribution function – points taken at regular intervals from the CDF, e.g. 2-quantiles=median, 4-quantiles=quartiles
• Quantile-Quantile (versus default=normal dist.)
> par(pty="s")
> qqnorm(EPI); qqline(EPI)
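The relationship between quantiles and the CDF can be checked numerically. This is a minimal sketch on simulated data; the sample `x` and the normal parameters are assumptions.

```r
set.seed(3)
x <- rnorm(200, mean = 58, sd = 12)   # hypothetical sample

# 2-quantile (median) and 4-quantiles (quartiles)
print(quantile(x, probs = 0.5))
print(quantile(x, probs = c(0.25, 0.5, 0.75)))

# For a continuous distribution the quantile function inverts the CDF
p <- 0.9
q <- qnorm(p, mean = 58, sd = 12)
print(pnorm(q, mean = 58, sd = 12))   # recovers p

# Normal Q-Q plot on a square plotting region
par(pty = "s")
qqnorm(x); qqline(x)
```

Points close to the `qqline()` reference line suggest the sample quantiles track the normal quantiles.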

Page 24:

Beyond histograms
• Simulated data from t-distribution (random):
> x <- rt(250, df = 5)
> qqnorm(x); qqline(x)

Page 25:

Beyond histograms
• Q-Q plot against the generating distribution:
> x<-seq(30,95,1)
> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
> qqline(x)
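One detail worth illustrating: `qqline()` draws its reference line against normal quantiles unless told otherwise. The sketch below assumes the sample is the simulated t-distributed vector from the previous slide (not the `seq(30,95,1)` grid) and passes the matching quantile function through the `distribution` argument.

```r
set.seed(11)
x <- rt(250, df = 5)                       # simulated t-distributed sample

theo <- qt(ppoints(length(x)), df = 5)     # theoretical t quantiles
qqplot(theo, x,
       xlab = "Theoretical t quantiles (df = 5)",
       ylab = "Sample quantiles",
       main = "Q-Q plot for t dsn")

# Reference line through the quartiles of the t distribution, not the normal
qqline(x, distribution = function(p) qt(p, df = 5))
```

With the matching distribution the points should fall near the line, including in the tails where a normal Q-Q plot of the same sample bends away.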

Page 26:

DALY (ecdf and qqplot)

Page 27:

Weibull qqplot……..

Page 28:

Testing the fits
• shapiro.test(EPI) # null hypothesis – normal?
Shapiro-Wilk normality test
data: EPI
W = 0.9866, p-value = 0.1188
Interpretation: W and probability-value
Reject null hypothesis or not? Here.. ~ NO.
DALY: W = 0.9365, p-value = 1.891e-07 (reject)
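The decision rule can be written out explicitly. This is a sketch on two simulated samples (both are assumptions, chosen so that one is roughly normal and the other clearly is not); the 0.05 threshold is the conventional level used elsewhere in these slides.

```r
set.seed(5)
normal_like <- rnorm(160, mean = 58, sd = 12)   # hypothetical, roughly normal
skewed      <- rexp(160, rate = 1 / 20)         # hypothetical, clearly skewed

res_n <- shapiro.test(normal_like)
res_s <- shapiro.test(skewed)

alpha <- 0.05
# Null hypothesis: the sample comes from a normal distribution
print(c(W = unname(res_n$statistic), p = res_n$p.value))
print(c(W = unname(res_s$statistic), p = res_s$p.value))

# TRUE means reject normality at the chosen level
print(res_s$p.value < alpha)
```

W close to 1 with a p-value above the threshold means the test found no evidence against normality, which is weaker than showing the data are normal.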

Page 29:

Kolmogorov–Smirnov
• One-sided or two-sided:
> ks.test(EPI,seq(30.,95.,1.0))
Two-sample Kolmogorov-Smirnov test
data: EPI and seq(30, 95, 1)
D = 0.2507, p-value = 0.005451
alternative hypothesis: two-sided
Warning message:
In ks.test(EPI, seq(30, 95, 1)) :
p-value will be approximate in the presence of ties
D = distance between the ECDF of the sample (blue) and the CDF (red) for one-sided. The p-value is what matters: do not reject the null hypothesis if p-value > 0.05.
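Base R's `ks.test()` also distinguishes a one-sample form (sample against a named CDF) from the two-sample form used above. The sketch below shows both on simulated data and recomputes the two-sample D by hand; the samples `x` and `y` and the normal parameters are assumptions.

```r
set.seed(9)
x <- rnorm(150, mean = 58, sd = 12)   # hypothetical sample
y <- runif(150, min = 30, max = 95)   # hypothetical comparison sample

# One-sample: compare the ECDF of x with a fully specified normal CDF
one <- ks.test(x, "pnorm", mean = 58, sd = 12)

# Two-sample: compare the ECDFs of x and y
two <- ks.test(x, y)

print(c(D = unname(one$statistic), p = one$p.value))
print(c(D = unname(two$statistic), p = two$p.value))

# D is the largest vertical gap between the two curves being compared
Fx <- ecdf(x); Fy <- ecdf(y)
grid <- sort(c(x, y))
d_manual <- max(abs(Fx(grid) - Fy(grid)))
print(d_manual)
```

Continuous simulated samples have no ties, so the warning seen on the slide does not appear here.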

Page 30:

Variability in normal distributions

Page 31:

F-test
$F = S_1^2 / S_2^2$
where $S_1^2$ and $S_2^2$ are the sample variances.
The more this ratio deviates from 1, the stronger the evidence for unequal population variances.

Page 32:

> var.test(EPI,DALY)
F test to compare two variances
data: EPI and DALY
F = 0.2393, num df = 162, denom df = 191, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1781283 0.3226470
sample estimates:
ratio of variances
0.2392948
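The F statistic reported by `var.test()` is the ratio of the two sample variances, which can be verified by hand. This sketch uses two simulated samples; their sizes are chosen only to reproduce the degrees of freedom shown above, and the values themselves are assumptions.

```r
set.seed(13)
a <- rnorm(163, mean = 58, sd = 12)   # hypothetical sample, smaller spread
b <- rnorm(192, mean = 54, sd = 25)   # hypothetical sample, larger spread

res <- var.test(a, b)

f_manual <- var(a) / var(b)           # ratio of sample variances
print(res$statistic)                  # F reported by var.test()
print(f_manual)                       # same ratio computed by hand
print(res$parameter)                  # degrees of freedom: n1 - 1 and n2 - 1
print(res$conf.int)                   # 95% CI for the ratio of variances
```

A confidence interval that excludes 1 corresponds to rejecting equal variances at the 5% level.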

Page 33:

T-test

Page 34:

Comparing distributions
> t.test(EPI,DALY)
Welch Two Sample t-test
data: EPI and DALY
t = 2.1361, df = 286.968, p-value = 0.03352
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3478545 8.5069998
sample estimates:
mean of x mean of y
58.37055 53.94313
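`t.test()` runs the Welch variant by default, which does not assume equal variances. As a sketch on simulated samples (both are assumptions), the statistic can be reproduced from the means and variances:

```r
set.seed(17)
a <- rnorm(163, mean = 58, sd = 12)   # hypothetical sample
b <- rnorm(192, mean = 54, sd = 25)   # hypothetical sample

res <- t.test(a, b)                   # Welch test by default (var.equal = FALSE)

# Welch statistic by hand: difference in means over its standard error
se <- sqrt(var(a) / length(a) + var(b) / length(b))
t_manual <- (mean(a) - mean(b)) / se

print(c(t = unname(res$statistic), manual = t_manual))
print(res$parameter)                  # Welch-Satterthwaite degrees of freedom
print(res$conf.int)                   # 95% CI for the difference in means
```

The non-integer degrees of freedom in the output above (286.968) come from the Welch-Satterthwaite approximation.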

Page 35:

Comparing distributions
> boxplot(EPI,DALY)

Page 36:

CDF for EPI and DALY
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
> plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)

Page 37:

qqplot(EPI,DALY)

Page 38:

Oooppss did we forget?

Page 39:

Goal?
• Find the single most important factor in increasing the EPI in a given region
• Preceding table gives a nested conceptual model
• Examine distributions down to the leaf nodes and build up an EPI “model”

Page 40:

boxplot(ENVHEALTH,ECOSYSTEM)

Page 41:

qqplot(ENVHEALTH,ECOSYSTEM)

Page 42:

ENVHEALTH/ECOSYSTEM
> shapiro.test(ENVHEALTH)
Shapiro-Wilk normality test
data: ENVHEALTH
W = 0.9161, p-value = 1.083e-08 ------- Reject.
> shapiro.test(ECOSYSTEM)
Shapiro-Wilk normality test
data: ECOSYSTEM
W = 0.9813, p-value = 0.02654 ----- ~reject

Page 43:

Kolmogorov-Smirnov (KS) test
> ks.test(EPI,DALY)
Two-sample Kolmogorov-Smirnov test
data: EPI and DALY
D = 0.2331, p-value = 0.0001382
alternative hypothesis: two-sided
Warning message:
In ks.test(EPI, DALY) : p-value will be approximate in the presence of ties

Page 44:

Page 45:

How are the software installs going?
• R/Scipy (et al)/Matlab – getting comfortable?
• Data infrastructure …
• http://hyperpolyglot.org/numerical-analysis (Matlab, R, scipy/numpy) table comparison

Page 46:

Tentative assignments
• Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7) 10% (lab; individual);
• Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual);
• Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual);
• Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual);
• Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);
• Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).

Page 47:

Admin info (keep/print this slide)
• Class: ITWS-4963/ITWS 6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.

