Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 3a, February 4, 2014, SAGE 3101
Preliminary Analysis, Interpretation, Detailed Analysis,
Assessment
Contents
• PDA
• Interpreting what you get back (the stats, the plots)
• Detailed analyses/fitting – a start
• How to assess/intercompare
Preliminary Data Analysis
• Relates to the sample v. population (for Big Data) discussion last week
• Also called Exploratory DA
– “EDA is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe will be there” (John Tukey)
• Distribution analysis and comparison, visual ‘analysis’, model testing, i.e. pretty much the things you did last Friday!
• Thus we are going to review those results
Patterns and Relationships
• Stepping from elementary/distribution analysis to algorithmic-based analysis
• I.e. pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines, non-parametric models
• Relations – associations between/among populations
• Outcome: model and an evaluation of its fitness for purpose
Models
• Assumptions are often used when considering models, e.g. that a model is representative of the population even though it is so often derived from a sample; this should be starting to make sense (a bit)
• Two key topics:
– N=all and the open world assumption
– Model of the thing of interest versus model of the data (data model; structural form)
• “All models are wrong but some are useful” (generally attributed to the statistician George Box)
Conceptual, logical and physical models
Applied to a database:
However our models will be mathematical, statistical, or a combination.
The concept of the model comes from the hypothesis
The implementation of the physical model comes from the data ;-)
Art or science?
• The model, incorporating the hypothesis, determines a “form”
• Thus, as much art as science, because it depends both on your world view and on what the data is telling you (or not)
• We will, however, be giving the models nice mathematical properties: orthogonal/orthonormal basis functions, etc.
Exploring the distribution
> summary(EPI) # stats
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  32.10   48.60   59.20   58.37   67.60   93.50      68
> boxplot(EPI)
> fivenum(EPI,na.rm=TRUE)
[1] 32.1 48.6 59.2 67.6 93.5
Tukey: min, lower hinge, median, upper hinge, max
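The relationship between summary() and fivenum() can be checked on a small synthetic sample. This is a minimal sketch: the vector `epi_like` is made up for illustration and is not the course EPI data. Hinges and quartiles may differ slightly, while min, median and max agree.

```r
# Synthetic stand-in for EPI (assumption: NOT the course data)
set.seed(42)
epi_like <- c(round(rnorm(160, mean = 58, sd = 12), 1), rep(NA, 8))

# summary() reports quartiles (R's default quantile type) and counts the NA's
s <- summary(epi_like)
print(s)

# fivenum() reports Tukey's numbers: min, lower hinge, median, upper hinge, max
f <- fivenum(epi_like, na.rm = TRUE)
print(f)

# Min, median and max from quantile() for comparison with fivenum()
q <- quantile(epi_like, probs = c(0, 0.5, 1), na.rm = TRUE, names = FALSE)
print(q)
```

In the DALY output later in these slides the hinges (36.955, 72.320) do differ from the quartiles (37.19, 71.97), which is the effect this sketch is meant to make visible.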
Stem and leaf plot
> stem(EPI) # like a histogram
The decimal point is 1 digit(s) to the right of the |, but the scale of the stem is 10, so watch carefully.
3 | 234
3 | 66889
4 | 00011112222223344444
4 | 5555677788888999
5 | 0000111111111244444
5 | 55666677778888999999
6 | 000001111111222333344444
6 | 5555666666677778888889999999
7 | 000111233333334
7 | 5567888
8 | 11
8 | 669
9 | 4
Histogram
> hist(EPI) # defaults
Distributions
• Shape
• Character
• Parameter(s)
• Which one fits?
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw=1.))
> rug(EPI)
or
> lines(density(EPI,na.rm=TRUE,bw="SJ"))
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw="SJ"))
Why are histograms so unsatisfying?
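One reason histograms can be unsatisfying is that the apparent shape depends on the choice of bins. The sketch below uses a synthetic bimodal sample (an assumption, not the EPI data) to compare two bin widths and two kernel density bandwidths, including the Sheather-Jones ("SJ") selector used above.

```r
# Synthetic bimodal sample (assumption: illustrative values only)
set.seed(1)
x <- c(rnorm(120, mean = 63, sd = 5), rnorm(60, mean = 44, sd = 5))

# Breaks derived from the data range so every value falls inside a bin
brk1 <- seq(floor(min(x)), ceiling(max(x)), by = 1)
brk5 <- seq(floor(min(x) / 5) * 5, ceiling(max(x) / 5) * 5, by = 5)

# Same data, two bin widths: compare the counts without plotting
h1 <- hist(x, breaks = brk1, plot = FALSE)
h5 <- hist(x, breaks = brk5, plot = FALSE)

# Kernel density estimates: fixed bandwidth vs. Sheather-Jones
d_fixed <- density(x, bw = 1)
d_sj <- density(x, bw = "SJ")

# Overlay both estimates on the fine-binned histogram
hist(x, breaks = brk1, prob = TRUE)
lines(d_fixed)
lines(d_sj, lty = 2)
rug(x)
```

Comparing `h1$counts` with `h5$counts`, or the two density curves, shows how much of the visible structure comes from the smoothing choice rather than from the data.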
> xn<-seq(30,95,1)
> qn<-dnorm(xn,mean=63, sd=5,log=FALSE)
> lines(xn,qn)
> lines(xn,.4*qn)
> ln<-dnorm(xn,mean=44, sd=5,log=FALSE)
> lines(xn,.26*ln)
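The hand-scaled normal curves above amount to sketching a two-component mixture by eye. A minimal sketch of the same idea as a proper mixture density follows; the weights 0.6 and 0.4 are assumptions chosen only so that they sum to 1, while the means and standard deviations are taken from the commands above.

```r
xn <- seq(30, 95, 1)

# Mixture weights: assumed values for illustration; they must sum to 1
w <- c(0.6, 0.4)

# Weighted sum of the two normal components used on the slide
mix_fun <- function(x) {
  w[1] * dnorm(x, mean = 63, sd = 5) + w[2] * dnorm(x, mean = 44, sd = 5)
}
mix <- mix_fun(xn)

# Numerical check that the mixture is a density: Riemann sum on a fine grid
step <- 0.01
grid <- seq(0, 120, by = step)
area <- sum(mix_fun(grid)) * step

plot(xn, mix, type = "l", xlab = "x", ylab = "density")
```

Scaling factors that do not sum to 1, such as .4 and .26, can still be useful for matching individual peaks visually, but the resulting curve is then not itself a probability density.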
ELand ~ EPI!Landlock
> hist(ELand, seq(30., 95., 1.0), prob=TRUE); lines …
No surface water
EPIreg<-EPI_data$EPI[EPI_data$EPI_regions=="Europe"]
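Subsetting by a condition behaves in a particular way when the condition column contains NA. The sketch below uses a hypothetical miniature data frame: the column names follow the command above, and all values are made up.

```r
# Hypothetical miniature of the EPI data frame (values are invented)
EPI_data <- data.frame(
  Country = c("A", "B", "C", "D", "E"),
  EPI_regions = c("Europe", "Europe", "Asia", NA, "Europe"),
  EPI = c(70.1, 65.3, 48.2, 55.0, NA)
)

# Logical indexing: an NA in the condition produces an NA in the result
EPIreg <- EPI_data$EPI[EPI_data$EPI_regions == "Europe"]
print(EPIreg)

# which() keeps only rows where the condition is TRUE; then drop missing EPI values
EPIreg_clean <- EPI_data$EPI[which(EPI_data$EPI_regions == "Europe")]
EPIreg_clean <- EPIreg_clean[!is.na(EPIreg_clean)]
print(EPIreg_clean)
```

Functions such as summary() and fivenum(..., na.rm=TRUE) tolerate the extra NA values, but counts and lengths will differ between the two versions.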
Exploring other distributions
> summary(DALY) # stats
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   37.19   60.35   53.94   71.97   91.50      39
> fivenum(DALY,na.rm=TRUE)
[1]  0.000 36.955 60.350 72.320 91.500
EPI DALY
Stem and leaf plot
> stem(DALY)
The decimal point is 1 digit(s) to the right of the |
0 | 0000111244
0 | 567899
1 | 0234
1 | 56688
2 | 000123
2 | 5667889
3 | 00001134
3 | 5678899
4 | 00011223444
4 | 555799
5 | 12223344
5 | 556667788999999
6 | 0000011111222233334444
6 | 6666666677788889999
7 | 00000000223333444
7 | 66888999
8 | 1113333333
8 | 555557777777777799999
9 | 22
DALY
> hist(DALY, seq(0., 99., 1.0), prob=TRUE)
> lines(density(DALY, na.rm=TRUE,bw=1.))
> lines(density(DALY, na.rm=TRUE,bw="SJ"))
Beyond histograms
• Cumulative distribution function: probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
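The empirical CDF can be evaluated as well as plotted. A minimal sketch on synthetic data (the sample and its parameters are assumptions, not the EPI values):

```r
# Synthetic sample (assumption: illustrative only)
set.seed(7)
x <- rnorm(200, mean = 58, sd = 12)

# ecdf() returns a step function that can be called like any other function
Fn <- ecdf(x)
p_ecdf <- Fn(58)          # estimated P(X <= 58)
p_direct <- mean(x <= 58) # the same proportion computed directly
print(c(p_ecdf, p_direct))

# Plot the ECDF and overlay the CDF of the generating distribution
plot(Fn, do.points = FALSE, verticals = TRUE)
curve(pnorm(x, mean = 58, sd = 12), add = TRUE, lty = 2)
```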
Beyond histograms
• Quantile ~ inverse cumulative distribution function – points taken at regular intervals from the CDF, e.g. 2-quantiles=median, 4-quantiles=quartiles
• Quantile-Quantile (versus default=normal dist.)
> par(pty="s")
> qqnorm(EPI); qqline(EPI)
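The link between quantiles and the CDF can be checked numerically. The sketch below uses synthetic data (an assumption) and shows that the quartiles returned by quantile() map back to roughly 0.25, 0.5 and 0.75 under the empirical CDF.

```r
# Synthetic sample (assumption: illustrative only)
set.seed(11)
x <- rnorm(200, mean = 58, sd = 12)

# 4-quantiles (quartiles); the middle one is the median
qs <- quantile(x, probs = c(0.25, 0.5, 0.75))
print(qs)
print(median(x))

# Applying the ECDF to the quantiles approximately recovers the probabilities
Fn <- ecdf(x)
back <- Fn(qs)
print(back)

# Normal Q-Q plot on a square plotting region, as on the slide
par(pty = "s")
qqnorm(x); qqline(x)
```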
Beyond histograms
• Simulated data from t-distribution (random):
> x <- rt(250, df = 5)
> qqnorm(x); qqline(x)
Beyond histograms
• Q-Q plot against the generating distribution:
> x <- seq(30,95,1)
> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
> qqline(x)
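For a sample that really is drawn from a t-distribution, a Q-Q plot against t quantiles should be close to a straight line. A minimal sketch, assuming a simulated sample `y` and using the `distribution` argument of qqline() so that the reference line is based on the t quantiles rather than the default normal:

```r
# Simulated sample from a t-distribution with 5 degrees of freedom
set.seed(3)
y <- rt(250, df = 5)

# Theoretical t quantiles at the standard plotting positions
theo <- qt(ppoints(250), df = 5)

qqplot(theo, y,
       xlab = "Theoretical t quantiles (df = 5)",
       ylab = "Sample quantiles")

# Reference line through the quartiles of the t-distribution, not the normal
qqline(y, distribution = function(p) qt(p, df = 5))

# Correlation between sorted sample and theoretical quantiles as a rough linearity check
r <- cor(theo, sort(y))
print(r)
```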
DALY (ecdf and qqplot)
Weibull qqplot…
Testing the fits
> shapiro.test(EPI) # null hypothesis: the data are normally distributed
Shapiro-Wilk normality test
data: EPI
W = 0.9866, p-value = 0.1188
Interpretation: W and probability-value
Reject null hypothesis or not? Here: ~no (p-value = 0.1188 > 0.05).
DALY: W = 0.9365, p-value = 1.891e-07 (reject)
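The decision rule can be written down explicitly. The sketch below assumes a significance level of 0.05 and a hypothetical helper `decide()`; the two samples are synthetic, and the last two calls simply apply the rule to the p-values reported above.

```r
# Synthetic samples (assumption: illustrative only)
set.seed(5)
norm_sample <- rnorm(150, mean = 58, sd = 12)  # drawn from a normal
skew_sample <- rexp(150, rate = 1 / 50)        # strongly skewed, not normal

sw_norm <- shapiro.test(norm_sample)
sw_skew <- shapiro.test(skew_sample)
print(sw_norm)
print(sw_skew)

# Hypothetical helper: turn a p-value into a decision at level alpha
decide <- function(p, alpha = 0.05) {
  if (p < alpha) "reject normality" else "do not reject normality"
}

decide(sw_skew$p.value)
decide(0.1188)     # EPI p-value from the slide
decide(1.891e-07)  # DALY p-value from the slide
```

Failing to reject does not show that the data are normal; it only means this test found no strong evidence against normality.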
Kolmogorov–Smirnov
• One-sided or two-sided:
> ks.test(EPI,seq(30.,95.,1.0))
Two-sample Kolmogorov-Smirnov test
data: EPI and seq(30, 95, 1)
D = 0.2507, p-value = 0.005451
alternative hypothesis: two-sided
Warning message:
In ks.test(EPI, seq(30, 95, 1)) :
p-value will be approximate in the presence of ties
D = maximum distance between the ECDF of the sample (blue) and the reference CDF (red) in the one-sample case. The p-value is what matters: do not reject the null hypothesis if p-value > 0.05.
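The D statistic can be reproduced by hand for the one-sample case. The sketch below uses synthetic data; the reference parameters (mean 58, sd 12) are assumptions fixed in advance rather than estimated from the sample, which is what the standard one-sample test expects.

```r
# Synthetic samples (assumption: illustrative only)
set.seed(9)
x <- rnorm(150, mean = 58, sd = 12)
y <- runif(150, min = 0, max = 92)

# One-sample test: ECDF of x against a fully specified normal CDF
ks_one <- ks.test(x, "pnorm", mean = 58, sd = 12)
print(ks_one)

# Two-sample test: ECDF of x against ECDF of y
ks_two <- ks.test(x, y)
print(ks_two)

# D by hand: largest gap between the reference CDF and the ECDF,
# checked just before and just after each jump of the ECDF
xs <- sort(x)
n <- length(xs)
cdf <- pnorm(xs, mean = 58, sd = 12)
D_manual <- max(c(cdf - (0:(n - 1)) / n, (1:n) / n - cdf))
print(D_manual)
```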
Variability in normal distributions
F-test
$F = S_1^2 / S_2^2$
where $S_1^2$ and $S_2^2$ are the sample variances.
The more this ratio deviates from 1, the stronger the evidence for unequal population variances.
> var.test(EPI,DALY)
F test to compare two variances
data: EPI and DALY
F = 0.2393, num df = 162, denom df = 191, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1781283 0.3226470
sample estimates:
ratio of variances
0.2392948
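The F statistic and its two-sided p-value can be recomputed from the sample variances. In this sketch the sample sizes 163 and 192 are chosen to match the degrees of freedom reported above; the values themselves are synthetic assumptions.

```r
# Synthetic samples (assumption: sizes match the slide's df, values are invented)
set.seed(13)
a <- rnorm(163, mean = 58, sd = 12)
b <- rnorm(192, mean = 54, sd = 25)

vt <- var.test(a, b)
print(vt)

# F is the ratio of the sample variances
F_manual <- var(a) / var(b)

# Degrees of freedom are n1 - 1 and n2 - 1
df1 <- length(a) - 1
df2 <- length(b) - 1

# Two-sided p-value: twice the smaller tail probability of the F distribution
tail_low <- pf(F_manual, df1, df2)
p_manual <- 2 * min(tail_low, 1 - tail_low)
print(c(F_manual, p_manual))
```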
T-test
Comparing distributions
> t.test(EPI,DALY)
Welch Two Sample t-test
data: EPI and DALY
t = 2.1361, df = 286.968, p-value = 0.03352
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3478545 8.5069998
sample estimates:
mean of x mean of y
58.37055 53.94313
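The non-integer degrees of freedom in the output come from the Welch correction, which t.test() applies by default (var.equal = FALSE). A minimal sketch on synthetic samples (assumed values) that recomputes the statistic and the Welch-Satterthwaite degrees of freedom:

```r
# Synthetic samples with unequal variances (assumption: illustrative only)
set.seed(17)
a <- rnorm(163, mean = 58, sd = 12)
b <- rnorm(192, mean = 54, sd = 25)

tt <- t.test(a, b)  # Welch two-sample t-test by default
print(tt)

# Per-sample variance of the mean
va <- var(a) / length(a)
vb <- var(b) / length(b)

# t statistic: difference in means over its standard error
t_manual <- (mean(a) - mean(b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (va + vb)^2 /
  (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))
print(c(t_manual, df_manual))
```

Given the strongly unequal variances reported by the F-test, the Welch form is the natural choice here rather than the pooled-variance test.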
Comparing distributions
> boxplot(EPI,DALY)
CDF for EPI and DALY
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
> plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)
> qqplot(EPI,DALY)
Oops, did we forget?
Goal?
• Find the single most important factor in increasing the EPI in a given region
• Preceding table gives a nested conceptual model
• Examine distributions down to the leaf nodes and build up an EPI “model”
> boxplot(ENVHEALTH,ECOSYSTEM)
> qqplot(ENVHEALTH,ECOSYSTEM)
ENVHEALTH/ECOSYSTEM
> shapiro.test(ENVHEALTH)
Shapiro-Wilk normality test
data: ENVHEALTH
W = 0.9161, p-value = 1.083e-08 (reject)
> shapiro.test(ECOSYSTEM)
Shapiro-Wilk normality test
data: ECOSYSTEM
W = 0.9813, p-value = 0.02654 (~reject)
Kolmogorov–Smirnov (KS) test
> ks.test(EPI,DALY)
Two-sample Kolmogorov-Smirnov test
data: EPI and DALY
D = 0.2331, p-value = 0.0001382
alternative hypothesis: two-sided
Warning message:
In ks.test(EPI, DALY) : p-value will be approximate in the presence of ties
How are the software installs going?
• R/Scipy (et al)/Matlab – getting comfortable?
• Data infrastructure …
• http://hyperpolyglot.org/numerical-analysis (Matlab, R, scipy/numpy) table comparison
Tentative assignments
• Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7). 10% (lab; individual);
• Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual);
• Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual);
• Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual);
• Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);
• Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).
Admin info (keep/print this slide)
• Class: ITWS-4963/ITWS-6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.