INTRODUCTION TO DATA SCIENCE JOHN P DICKERSON Lecture #19 – 4/6/2017 CMSC320 Tuesdays & Thursdays 3:30pm – 4:45pm
Transcript
Page 1:

INTRODUCTION TO DATA SCIENCE
JOHN P DICKERSON

Lecture #19 – 4/6/2017

CMSC320
Tuesdays & Thursdays
3:30pm – 4:45pm

Page 2:

ANNOUNCEMENTS

Mini-Project #3 is out!
• It is now due April 21st (was April 18th).

• ELMS link: https://myelms.umd.edu/courses/1218364/assignments/4389209

This is a two-part mini-project (regression & classification)!

Midterm next Thursday (4/13); good ways to study:
• Lecture slides; trying stuff out in iPython; Piazza; external links on the course website; asking TAs about gaps in knowledge

Page 3:

TODAY’S LECTURE

• Data collection
• Data processing
• Exploratory analysis & data viz
• Analysis, hypothesis testing, & ML
• Insight & policy decision

BIG THANKS: Zico Kolter (CMU)

Page 4:

SUMMARY

Hypothesis testing allows us to formulate beliefs about investment attributes and subject those beliefs to rigorous testing following the scientific method.

• For parametric hypothesis testing, we formulate our beliefs (hypotheses), collect data, and calculate a value of the investment attribute in which we are interested (the test statistic) for that set of data (the sample), and then we compare that with a value determined under assumptions that describe the underlying population (the critical value). We can then assess the likelihood that our beliefs are true given the relationship between the test statistic and the critical value.

• Commonly tested beliefs associated with the expected return and variance of returns for a given investment or investments can be formulated in this way.


Outline

Motivation

Background: sample statistics and central limit theorem

Basic hypothesis testing

Experimental design


Page 5:

Motivating setting

For a data science course, there has been very little “science” thus far…

“Science” as I’m using it roughly refers to “determining truth about the real world”


Page 6:

Asking scientific questions

Suppose you work for a company that is considering a redesign of its website; does the new design (design B) offer any statistical advantage over the current design (design A)?

In linear regression, does a certain variable impact the response? (E.g., does energy consumption depend on whether a day is a weekday or a weekend?)

In both settings, we are concerned with making actual statements about the nature of the world


Page 7:

Outline

Motivation

Background: sample statistics and central limit theorem

Basic hypothesis testing

Experimental design


Page 8:

Sample statistics

To be a bit more consistent with standard statistics notation, we’ll introduce the notion of a population and a sample

Population: mean $\mu = \mathrm{E}[X]$, variance $\sigma^2 = \mathrm{E}[(X - \mu)^2]$

Sample: mean $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$, variance $s^2 = \frac{1}{m-1}\sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)^2$
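As a quick (non-lecture) illustration, these sample quantities map directly onto NumPy; `ddof=1` is what gives the $m - 1$ denominator for the sample variance:

import numpy as np

x = np.array([2.1, 1.8, 2.5, 2.0, 1.6])   # hypothetical sample of m = 5 observations

xbar = np.mean(x)          # sample mean: (1/m) * sum of x^(i)
s2 = np.var(x, ddof=1)     # sample variance with the m - 1 denominator
print(xbar, s2)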

Page 9:

Sample mean as random variable

The sample mean is an empirical average over $m$ independent samples from the distribution; it can also be considered as a random variable itself

This new random variable has mean and variance

$$\mathrm{E}[\bar{x}] = \mathrm{E}\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right] = \frac{1}{m}\sum_{i=1}^{m} \mathrm{E}[X] = \mathrm{E}[X] = \mu$$

$$\mathrm{Var}[\bar{x}] = \mathrm{Var}\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right] = \frac{1}{m^2}\sum_{i=1}^{m} \mathrm{Var}[X] = \frac{\sigma^2}{m}$$

where we used the fact that for independent random variables $X_1, X_2$: $\mathrm{Var}[X_1 + X_2] = \mathrm{Var}[X_1] + \mathrm{Var}[X_2]$

When estimating the variance of the sample mean, we use $s^2/m$ (the square root of this term is called the standard error)
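A small simulation (my own sketch, not from the slides) makes the $\sigma^2/m$ claim concrete: draw many samples of size $m$, take each one's mean, and compare the empirical variance of those means to the theoretical value:

import numpy as np

rng = np.random.default_rng(0)
m, trials = 50, 20000
sigma2 = 4.0                                   # population variance of N(0, 2^2)

# each row is one sample of size m; take the mean of every row
xbars = rng.normal(0, np.sqrt(sigma2), size=(trials, m)).mean(axis=1)

print(xbars.var())        # empirical variance of the sample mean
print(sigma2 / m)         # theoretical value sigma^2 / m = 0.08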

Page 10:

Central limit theorem

Central limit theorem states further that $\bar{x}$ (for "reasonably sized" samples, in practice $m \geq 30$) actually has a Gaussian distribution regardless of the distribution of $X$:

$$\bar{x} \to \mathcal{N}\!\left(\mu, \frac{\sigma^2}{m}\right) \quad \text{or equivalently} \quad \frac{\bar{x} - \mu}{\sigma/m^{1/2}} \to \mathcal{N}(0, 1)$$

In practice, for $m < 30$ and when estimating $\sigma^2$ using the sample variance, we use a Student's t-distribution with $m - 1$ degrees of freedom:

$$\frac{\bar{x} - \mu}{s/m^{1/2}} \to t_{m-1}, \qquad p(x; \nu) \propto \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$$
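To see the theorem in action, here is a rough sketch (mine; the exponential population is an arbitrary choice) showing that the standardized sample mean behaves like $\mathcal{N}(0, 1)$ even though the underlying distribution is skewed:

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
m, trials = 40, 10000
mu, sigma = 1.0, 1.0                               # mean and std of an Exponential(1) population

x = rng.exponential(scale=1.0, size=(trials, m))
z = (x.mean(axis=1) - mu) / (sigma / np.sqrt(m))   # standardized sample means

# compare a few empirical quantiles with the standard normal ones
print(np.quantile(z, [0.05, 0.5, 0.95]))
print(st.norm.ppf([0.05, 0.5, 0.95]))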

Page 11:

Aside: why the $m - 1$ scaling?

We scale the sample variance by $\frac{1}{m-1}$ (rather than $\frac{1}{m}$) so that it is an unbiased estimate of the population variance:

$$\mathrm{E}\left[\sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)^2\right] = \mathrm{E}\left[\sum_{i=1}^{m} \left((x^{(i)} - \mu) - (\bar{x} - \mu)\right)^2\right]$$

$$= \mathrm{E}\left[\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2 - 2(\bar{x} - \mu)\sum_{i=1}^{m} \left(x^{(i)} - \mu\right) + m(\bar{x} - \mu)^2\right]$$

$$= \mathrm{E}\left[\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2\right] - m\,\mathrm{E}\left[(\bar{x} - \mu)^2\right]$$

$$= m\,\mathrm{Var}[X] - \frac{m\,\mathrm{Var}[X]}{m} = (m - 1)\,\sigma^2$$
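A quick simulation (my own check, not from the slides) confirms this: dividing by $m$ systematically underestimates $\sigma^2$, while dividing by $m - 1$ does not:

import numpy as np

rng = np.random.default_rng(0)
m, trials, sigma2 = 10, 50000, 1.0

x = rng.normal(0, 1, size=(trials, m))
print(np.var(x, axis=1, ddof=0).mean())   # biased: about (m-1)/m * sigma^2 = 0.9
print(np.var(x, axis=1, ddof=1).mean())   # unbiased: about sigma^2 = 1.0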

Page 12:

Outline

Motivation

Background: sample statistics and central limit theorem

Basic hypothesis testing

Experimental design


Page 13:

Hypothesis testing

Using these basic statistical techniques, we can devise some tests to determine whether certain data gives evidence that some effect “really” occurs in the real world

Fundamentally, this is evaluating whether things are (likely to be) true about the population (all the data) given a sample

Lots of caveats about the precise meaning of these terms, to the point that many people debate the usefulness of hypothesis testing at all

But, still incredibly common in practice, and important to understand


Page 14:

Hypothesis testing basics

Posit a null hypothesis $H_0$ and an alternative hypothesis $H_1$ (usually just that "$H_0$ is not true")

Given some data $x$, we want to accept or reject the null hypothesis in favor of the alternative hypothesis

Accept $H_0$: Correct (if $H_0$ is true); Type II error / false negative (if $H_1$ is true)
Reject $H_0$: Type I error / false positive (if $H_0$ is true); Correct (if $H_1$ is true)

$P(\text{reject } H_0 \mid H_0 \text{ true})$ = "significance of the test"
$P(\text{reject } H_0 \mid H_1 \text{ true})$ = "power of the test"
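The "significance" line can be checked empirically. The sketch below (my own, using the one-sample t-test that appears later in this lecture) repeatedly generates data for which $H_0$ is true and counts how often $H_0$ is incorrectly rejected at the 0.05 level:

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
m, trials, alpha = 30, 5000, 0.05

false_positives = 0
for _ in range(trials):
    x = rng.normal(0, 1, size=m)      # data generated with true mean 0, so H0 holds
    _, p = st.ttest_1samp(x, 0)
    false_positives += (p < alpha)

print(false_positives / trials)       # should be close to alpha = 0.05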

Page 15:

Basic approach to hypothesis testing

Basic approach: compute the probability of observing the data under the null hypothesis (this is the p-value of the statistical test)

$$p = P(\text{data} \mid H_0 \text{ is true})$$

Reject the null hypothesis if the p-value is below the desired significance level (alternatively, just report the p-value itself, which is the lowest significance level at which we could reject the hypothesis)

Important: the p-value is $P(\text{data} \mid H_0 \text{ true})$, not $P(H_0 \text{ not true} \mid \text{data})$

Page 16:

Canonical example: t-test

Given a sample $x^{(1)}, \ldots, x^{(m)} \in \mathbb{R}$:

$H_0$: $\mu = 0$ (for the population)
$H_1$: $\mu \neq 0$

By the central limit theorem, we know that $(\bar{x} - \mu)/(s/m^{1/2}) \sim t_{m-1}$ (Student's t-distribution with $m - 1$ degrees of freedom)

So we just compute $t = \bar{x}/(s/m^{1/2})$ (called the test statistic), then compute

$$p = P(x > |t|) + P(x < -|t|) = \Phi(-|t|) + 1 - \Phi(|t|) = 2\,\Phi(-|t|)$$

(where $\Phi$ is the cumulative distribution function of the Student's t-distribution)

Page 17:

Visual example

What we are doing fundamentally is modeling the distribution $p(\bar{x} \mid H_0)$ and then determining the probability of the observed $\bar{x}$ or a more extreme value

[Figure: the p-value is the shaded area in the tail(s) of the distribution of $\bar{x}$ under $H_0$]

Page 18:

Code in Python

Compute the $t$ statistic and $p$-value from data

import numpy as np
import scipy.stats as st

m = 100                    # sample size (example value; not specified on the slide)
x = np.random.randn(m)     # m samples from a standard normal

# compute t statistic and p value
xbar = np.mean(x)
s2 = np.sum((x - xbar)**2)/(m-1)
std_err = np.sqrt(s2/m)
t = xbar/std_err

t_dist = st.t(m-1)
p = 2*t_dist.cdf(-np.abs(t))

# with scipy alone
t, p = st.ttest_1samp(x, 0)

Page 19:

Two-sided vs. one-sided tests

The previous test considered deviation from the null hypothesis in both directions (a two-sided test); it is also possible to consider a one-sided test:

$H_0$: $\mu \geq 0$ (for the population)
$H_1$: $\mu < 0$

Same $t$ statistic as before, but we only compute the area under the left side of the curve:

$$p = P(x < t) = \Phi(t)$$
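In code, the only change from the two-sided test is which tail we integrate. A minimal sketch (the commented line assumes a scipy version recent enough to support the `alternative` argument):

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
m = 40
x = rng.normal(-0.2, 1, size=m)             # hypothetical sample

t = np.mean(x) / (np.std(x, ddof=1) / np.sqrt(m))
p_one_sided = st.t(m - 1).cdf(t)            # H0: mu >= 0, H1: mu < 0 (left tail only)

# equivalent, if your scipy supports the `alternative` keyword:
# t, p_one_sided = st.ttest_1samp(x, 0, alternative='less')
print(t, p_one_sided)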

Page 20:

Confidence intervals

We can also use the $t$ statistic to create confidence intervals for the mean

Because $\bar{x}$ has mean $\mu$ and variance $s^2/m$, we know that $1 - \alpha$ of its probability mass must lie within the range

$$\bar{x} = \mu \pm \frac{s}{m^{1/2}} \cdot \Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right) \equiv \mu \pm \epsilon(s, m, \alpha) \iff \mu = \bar{x} \pm \epsilon(s, m, \alpha)$$

where $\Phi^{-1}$ denotes the inverse CDF of the t-distribution with $m - 1$ degrees of freedom

# simple confidence interval computation
CI = lambda s, m, a: s / np.sqrt(m) * st.t(m-1).ppf(1-a/2)
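For example, the interval around an observed sample mean could then be formed as follows (a usage sketch using the same notation as the slide):

import numpy as np
import scipy.stats as st

CI = lambda s, m, a: s / np.sqrt(m) * st.t(m-1).ppf(1-a/2)

x = np.random.randn(100)               # hypothetical sample
xbar, s = np.mean(x), np.std(x, ddof=1)
eps = CI(s, len(x), 0.05)              # half-width of a 95% confidence interval
print(xbar - eps, xbar + eps)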

Page 21:

Outline

Motivation

Background: sample statistics and central limit theorem

Basic hypothesis testing

Experimental design


Page 22:

Experimental design: A/B testing

Up until now, we have assumed that the null hypothesis is given by some known mean, but in reality, we may not know the mean that we want to compare to

Example: we want to tell if some additional feature on our website makes users stay longer, so we need to estimate both how long users stay on the current site and how long they stay on the redesigned site

Standard approach is A/B testing: create a control group (mean $\mu_1$) and a treatment group (mean $\mu_2$):

$H_0$: $\mu_1 = \mu_2$ (or, e.g., $\mu_1 \geq \mu_2$)
$H_1$: $\mu_1 \neq \mu_2$ (or, e.g., $\mu_1 < \mu_2$)

Page 23:

Independent t-test (Welch's t-test)

Collect samples (possibly different numbers) from both populations:

$$x_1^{(1)}, \ldots, x_1^{(m_1)}, \qquad x_2^{(1)}, \ldots, x_2^{(m_2)}$$

Compute the sample means $\bar{x}_1, \bar{x}_2$ and sample variances $s_1^2, s_2^2$ for each group

Compute the test statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\left(s_1^2/m_1 + s_2^2/m_2\right)^{1/2}}$$

and evaluate it using a t-distribution with degrees of freedom given by

$$\frac{\left(s_1^2/m_1 + s_2^2/m_2\right)^2}{\dfrac{\left(s_1^2/m_1\right)^2}{m_1 - 1} + \dfrac{\left(s_2^2/m_2\right)^2}{m_2 - 1}}$$
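In practice you rarely compute this by hand; scipy implements Welch's test directly. A sketch with made-up group data (`equal_var=False` is what selects the unequal-variance Welch version):

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
x1 = rng.normal(5.0, 1.0, size=200)     # control group (e.g., time on current site)
x2 = rng.normal(5.2, 1.5, size=180)     # treatment group (e.g., time on redesigned site)

t, p = st.ttest_ind(x1, x2, equal_var=False)   # Welch's t-test (two-sided)
print(t, p)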

Page 24:

Starting to seem a bit ad hoc?

There are a huge number of different tests for different situations

You probably won’t need to remember these, and can just look up whatever test is most appropriate for your given situation

But the basic idea in all cases is the same: you're trying to find the distribution of your test statistic under the null hypothesis, and then you are computing the probability of the observed test statistic or something more extreme

All the different tests are really just about different distributions based upon your problem setup


Page 25:

Hypothesis testing in linear regression

One last example (because it's useful in practice): consider the linear regression $y \approx \theta^T x$, and suppose we want to perform a hypothesis test on the coefficients of $\theta$

Example: suppose that instead of just two websites, you have a website with multiple features that can be turned on/off, and your sample data includes a wide variety of different samples

We would like to ask the question: is the $j$th variable relevant for predicting the output?

We've already seen ways we can approach this (e.g., evaluating cross-validation error), but it's a bit difficult to understand what such results mean

Page 26:

Formula for sample variance in linear regression

There is an analogous formula for the sample variance of the errors that a linear regression model makes:

$$s^2 = \frac{1}{m - n}\sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

Use this to determine the sample covariance of the coefficients:

$$\mathrm{Cov}[\theta] = s^2 \left(X^T X\right)^{-1}$$

Can then evaluate the null hypothesis $H_0$: $\theta_i = 0$, using the t statistic

$$t = \theta_i / \mathrm{Cov}[\theta]_{ii}^{1/2}$$

Similar procedure to get confidence intervals of the coefficients
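A minimal sketch of these formulas in NumPy (my own illustration with a synthetic design matrix; the variable names and data are assumptions, not from the slides):

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
m, n = 200, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n - 1))])   # design matrix with intercept
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=m)          # third coefficient truly zero

theta = np.linalg.solve(X.T @ X, X.T @ y)        # least-squares fit
s2 = np.sum((y - X @ theta)**2) / (m - n)        # sample variance of the errors
cov_theta = s2 * np.linalg.inv(X.T @ X)          # covariance of the coefficients
t_stats = theta / np.sqrt(np.diag(cov_theta))    # t statistic for H0: theta_i = 0
p_vals = 2 * st.t(m - n).cdf(-np.abs(t_stats))   # two-sided p-values
print(t_stats, p_vals)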

Page 27:

P-values considered harmful

A basic problem is that $P(\text{data} \mid H_0) \neq P(H_0 \mid \text{data})$ (despite frequently being interpreted as such)

People treat $p < 0.05$ with way too much importance

[Figure: Histogram of p-values from ~3,500 published journal papers (from E. J. Masicampo and Daniel Lalande, "A peculiar prevalence of p values just below .05," 2012)]

Page 28:

QUICK ASIDE FOR HW #3

Precision P:
#correct positive results / #positive results returned

Recall R:
#correct positive results / #all possible positive results

Page 29:

QUICK ASIDE FOR HW #3

F-Score F:
weighted average of the precision and recall of a test

F1: (harmonic) mean of precision and recall:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

Can be parameterized to attach higher importance to recall:

$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$
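A small sketch of these metrics in plain NumPy (my own example with made-up labels; scikit-learn's precision_score, recall_score, and fbeta_score compute the same quantities):

import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])    # hypothetical ground-truth labels
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])    # hypothetical classifier output

tp = np.sum((y_pred == 1) & (y_true == 1))
precision = tp / np.sum(y_pred == 1)            # correct positives / returned positives
recall = tp / np.sum(y_true == 1)               # correct positives / all actual positives

f1 = 2 * precision * recall / (precision + recall)
beta = 2                                        # beta > 1 weights recall more heavily
fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(precision, recall, f1, fbeta)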

Page 30:

NEXT CLASS: MIDTERM PREP & REGULARIZATION
