Unit 1: Introduction to data

Lecture 1: Data collection, observational studies, and experiments

Statistics 101

Thomas Leininger

May 16, 2013

Thought for the day

“We are drowning in information but starved for knowledge... Uncontrolled and unorganized information is no longer a resource in an information society, instead it becomes the enemy.”

–John Naisbitt, Megatrends (1982)

Statistics 101 (Thomas Leininger) U1 - L1: Data coll., obs. studies, experiments May 16, 2013 2 / 33

Introduction to Data Some terminology

Dr. Arbuthnot’s baptismal records

    year  boys  girls  B4500
 1  1629  5218  4683   TRUE
 2  1630  4858  4457   TRUE
 3  1631  4422  4102   FALSE
 4  1632  4994  4590   TRUE
 5  1633  5158  4839   TRUE
 6  1634  5035  4820   TRUE
 7  1635  5106  4928   TRUE
 8  1636  4917  4605   TRUE
 9  1637  4703  4457   TRUE
10  1638  5359  4952   TRUE

Terms to know:

case

variable

numerical variable

discrete variable

continuous variable

categorical variable (levels)

ordinal variable
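The terms can be tied back to the baptismal records with a small sketch. Each row is a case; year, boys, and girls are discrete numerical variables; B4500 is a categorical variable with two levels. The rule used for B4500 (more than 4500 boys baptized that year) is an assumption that is consistent with the ten rows shown, not something the slide states, and the function name `add_flag` is made up for illustration.

```python
# Rows copied from the baptismal records table: (year, boys, girls).
records = [
    (1629, 5218, 4683),
    (1630, 4858, 4457),
    (1631, 4422, 4102),
    (1632, 4994, 4590),
    (1633, 5158, 4839),
    (1634, 5035, 4820),
    (1635, 5106, 4928),
    (1636, 4917, 4605),
    (1637, 4703, 4457),
    (1638, 5359, 4952),
]

def add_flag(rows, threshold=4500):
    """Append a TRUE/FALSE level to each case (assumed rule: boys > threshold)."""
    return [(year, boys, girls, boys > threshold) for year, boys, girls in rows]

table = add_flag(records)
for year, boys, girls, b4500 in table:
    print(year, boys, girls, b4500)
```

Under this assumed rule, only 1631 is flagged FALSE, which matches the table.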



Control vs. treatment groups

A pharmaceutical company has created a wonder drug to cure bone loss. In order to sell this drug to consumers, the FDA requires this company to perform several highly regulated experiments to prove the efficacy (and safety) of this new drug.

In this experiment, some patients will be randomly assigned to the control group, where they will receive a standard bone loss treatment. The other patients are all assigned to the treatment group, where they receive the new wonder drug.

If the treatment group experiences significantly better outcomes, the FDA will allow this company to sell their new drug.



Association and Independence

http://biojournalism.com/2012/08/correlation-vs-causation/

Overview of data collection principles Anecdotal evidence

Anecdotal evidence and early smoking research

Anti-smoking research started in the 1930s and 1940s when cigarette smoking became increasingly popular. While some smokers seemed to be sensitive to cigarette smoke, others were completely unaffected.

Anti-smoking research was faced with resistance based on anecdotal evidence such as “My uncle smokes three packs a day and he’s in perfectly good health”, evidence based on a limited sample size that might not be representative of the population.

It was concluded that “smoking is a complex human behavior, by its nature difficult to study, confounded by human variability.”

In time researchers were able to examine larger samples of cases (smokers) and trends showing that smoking has negative health impacts became much clearer.

Brandt, The Cigarette Century (2009), Basic Books.


Overview of data collection principles Populations and samples

Populations and samples

http://well.blogs.nytimes.com/2012/08/29/finding-your-ideal-running-form

Research question: Can people become better, more efficient runners on their own, merely by running?

Population of interest:

Sample: Group of adult women who recently joined a running group

Population to which results can be generalized:

Overview of data collection principles Sampling methods

Census

Wouldn’t it be better to just include everyone and “sample” the entire population?

This is called a census.

There are problems with taking a census:

It can be difficult to complete a census: there always seem to be some individuals who are hard to locate or hard to measure. And there may be certain characteristics about those individuals who are hard to locate.

Populations rarely stand still. Even if you could take a census, the population changes constantly, so it’s never possible to get a perfect measure.

Taking a census may be more complex than sampling.



http://www.npr.org/templates/story/story.php?storyId=125380052


Exploratory analysis to inference

Sampling is natural...

Think about sampling something you are cooking - you taste (examine) a small part of what you’re cooking to get an idea about the dish as a whole.

When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s exploratory analysis.

If you generalize and conclude that your entire soup needs salt, that’s an inference.

For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).

If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.

If you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.



Simple random sample

Randomly select cases from the population, so that each case is equally likely to be selected.
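A minimal sketch of a simple random sample, assuming the population can be listed as case IDs. The population of 100 numbered cases, the sample size of 10, and the function name `simple_random_sample` are all made up for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct cases; every case has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: cases numbered 1 to 100.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=1)
print(sorted(sample))
```

Fixing the seed only makes the draw repeatable for teaching purposes; a real study would not choose the seed after looking at the result.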

[Figure: dot plots of cases illustrating the sampling methods. Axis label: Index. Region labels: Stratum 1 to Stratum 6; Cluster 1 to Cluster 9.]



Stratified sample

Strata are homogeneous; take a simple random sample from each stratum.
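A minimal sketch of stratified sampling, assuming the cases have already been grouped into strata. The six strata of 20 made-up cases each, the choice of 3 cases per stratum, and the function name `stratified_sample` are illustrative only.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of the same size from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(cases, n_per_stratum) for name, cases in strata.items()}

# Hypothetical strata: 6 groups of 20 cases each.
strata = {
    f"Stratum {i}": [f"s{i}-case{j}" for j in range(20)]
    for i in range(1, 7)
}
picked = stratified_sample(strata, 3, seed=1)
for name, cases in picked.items():
    print(name, cases)
```

Every stratum contributes cases, which is what distinguishes this from a cluster sample.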

[Figure: dot plots of cases illustrating the sampling methods. Axis label: Index. Region labels: Stratum 1 to Stratum 6; Cluster 1 to Cluster 9.]



Cluster sample

Clusters are not necessarily homogeneous; take a simple random sample from a random sample of clusters. Usually preferred for economic reasons.
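A minimal sketch of the two-stage procedure, assuming the cases are already grouped into clusters. The nine clusters of 15 made-up cases, the choice of 3 clusters and 5 cases per cluster, and the function name `cluster_sample` are illustrative only.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly pick clusters. Stage 2: random sample within each pick."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Hypothetical clusters: 9 groups of 15 cases each.
clusters = {
    f"Cluster {i}": [f"c{i}-case{j}" for j in range(15)]
    for i in range(1, 10)
}
picked = cluster_sample(clusters, n_clusters=3, n_per_cluster=5, seed=1)
for name, cases in picked.items():
    print(name, cases)
```

Only the chosen clusters need to be visited, which is where the cost saving comes from.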

[Figure: dot plots of cases illustrating the sampling methods. Axis label: Index. Region labels: Stratum 1 to Stratum 6; Cluster 1 to Cluster 9.]



Question

A city council has requested a household survey be conducted in a suburban area of their city. The area is broken into many distinct and unique neighborhoods, some including large homes, some with only apartments, and others a diverse mixture of housing structures. Which approach would likely be the least effective?

(a) Simple random sampling

(b) Cluster sampling

(c) Stratified sampling

(d) Blocked sampling

(e) Anecdotal sampling


Overview of data collection principles Sampling bias

A few sources of bias

Non-response: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.

Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the population.

cnn.com, Jan 14, 2012

Convenience sample: Individuals who are easily accessible are more likely to be included in the sample.



Landon vs. FDR

A historical example of a biased sample yielding misleading results:

In 1936, Landon sought the Republican presidential nomination opposing the re-election of FDR.



The Literary Digest Poll

The Literary Digest polled about 10 million Americans, and got responses from about 2.4 million.

The poll showed that Landon would likely be the overwhelming winner and FDR would get only 43% of the votes.

Election result: FDR won, with 62% of the votes.

The magazine was completely discredited because of the poll, and was soon discontinued.



The Literary Digest Poll - what went wrong?

The magazine had surveyed

its own readers,

registered automobile owners, and

registered telephone users.

These groups had incomes well above the national average of the day (remember, this is Great Depression era) which resulted in lists of voters far more likely to support Republicans than a truly typical voter of the time, i.e. the sample was not representative of the American population at the time.



Large samples are preferable, but...

The Literary Digest election poll was based on a sample size of 2.4 million, which is huge, but since the sample was biased, the sample did not yield an accurate prediction.

Back to the soup analogy: If the soup is not well stirred, it doesn’t matter how large a spoon you have, it will still not taste right. If the soup is well stirred, it doesn’t matter whether you have a large or small spoon, it will taste fine either way.
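The point about sample size versus bias can be illustrated with a small simulation. Everything here is a made-up sketch, not the actual 1936 data: an electorate of 100,000 voters in two hypothetical groups, with overall support set to 62% to echo the election result on the slide, a sampling frame that over-represents the "wealthy" group, and sample sizes of 20,000 and 500.

```python
import random

rng = random.Random(0)

# Hypothetical electorate of 100,000 voters, stored as (group, supports).
# Overall support is 62%, but it differs sharply between the two groups.
wealthy = [("wealthy", i < 6_000) for i in range(20_000)]   # 30% support
others = [("other", i < 56_000) for i in range(80_000)]     # 70% support
population = wealthy + others

def share(sample):
    """Fraction of sampled voters who support the candidate."""
    return sum(supports for _, supports in sample) / len(sample)

# Biased frame: every wealthy voter, but only 2,000 of the other voters.
frame = wealthy + rng.sample(others, 2_000)
large_biased = rng.sample(frame, 20_000)

# Small simple random sample from the whole population.
small_random = rng.sample(population, 500)

true_share = share(population)
print(f"truth        {true_share:.3f}")
print(f"large biased {share(large_biased):.3f}")
print(f"small random {share(small_random):.3f}")
```

In this sketch the large sample stays far from the true share no matter how many voters it includes, because the frame itself is unrepresentative, while the small random sample is off only by ordinary sampling variability.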



Question

A school district is considering whether it will no longer allow high school students to park at school after two recent accidents where students were severely injured. As a first step, they survey parents by mail, asking them whether or not the parents would object to this policy change. Of 6,000 surveys that go out, 1,200 are returned. Of these 1,200 surveys that were completed, 960 agreed with the policy change and 240 disagreed. Which of the following statements are true?

I. Some of the mailings may have never reached the parents.

II. The school district has strong support from parents to move forward with the policy approval.

III. It is possible that a majority of the parents of high school students disagree with the policy change.

IV. The survey results are unlikely to be biased because all parents were mailed a survey.

(a) Only I (b) I and II (c) I and III (d) III and IV (e) Only IV


Overview of data collection principles Observational studies and experiments

Observational studies and experiments

Observational study: Researchers collect data in a way that does not directly interfere with how the data arise, i.e. they merely “observe”, and can only establish an association between the explanatory and response variables.

Experiment: Researchers randomly assign subjects to various treatments in order to be able to establish causal connections between the explanatory and response variables.

If you’re going to walk away with one thing from this class, let it be “correlation does not imply causation”.

http://xkcd.com/552/

Observational studies



What type of study is this, an observational study or an experiment?

“Girls who regularly ate breakfast, particularly one that includes cereal, were slimmer than those who skipped the morning meal, according to a study that tracked nearly 2,400 girls for 10 years. [...] As part of the survey, the girls were asked once a year what they had eaten during the previous three days.”

What is the conclusion of the study?

Who sponsored the study?



3 possible explanations:

1 Eating breakfast causes girls to be thinner.

2 Being thin causes girls to eat breakfast.

3 A third variable is responsible for both. What could it be?

Extraneous variables that affect both the explanatory and the response variable, and that make it seem like there is a relationship between the two, are called confounding variables.
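A small simulation can show how a confounding variable produces an association on its own. This is a sketch under made-up assumptions: the variables `z`, `x`, and `y` are synthetic standard-normal quantities, not the breakfast study's data, and `x` has no effect on `y` by construction.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)
n = 5_000
z = [rng.gauss(0, 1) for _ in range(n)]    # hypothetical confounder
x = [zi + rng.gauss(0, 1) for zi in z]     # explanatory variable, driven by z
y = [zi + rng.gauss(0, 1) for zi in z]     # response, driven by z but not by x

overall = pearson(x, y)

# Hold the confounder roughly fixed and look again.
keep = [i for i in range(n) if abs(z[i]) < 0.1]
within = pearson([x[i] for i in keep], [y[i] for i in keep])

print(f"correlation overall:            {overall:.2f}")
print(f"correlation with z nearly fixed: {within:.2f}")
```

The overall correlation is clearly positive even though neither variable influences the other; among cases with nearly the same value of the confounder, the association largely disappears.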



Project ideas - observational studies

1 numerical: Is the average number of hours Americans spend relaxing after work different than the European average of 3 hours/day?
[Data: Number of hours relaxing after work]

1 categorical: Estimate the percentage of North Carolina residents who live below the poverty line and are planning to vote Republican in the most recent presidential election.
[Data: Vote Republican - yes, no]

1 numerical and 1 categorical: Is there a relationship between mom’s working status during the first 5 years of the child’s life and the child’s education?
[Data: Number of years of education of child; Mom’s working status - yes, no]

2 categorical: Do racial minority groups in North Carolina have less access to health care coverage?
[Data: Ethnicity - white, minority; Health coverage - yes, no]


Experiments Principles of experimental design

Principles of experimental design

1 Control: Compare treatment of interest to a control group.

2 Randomize: Randomly assign subjects to treatments.

3 Replicate: Within a study, replicate by collecting a sufficiently large sample. Or replicate the entire study.

4 Block: If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups.



More on blocking

We would like to design an experiment to investigate if energy gels make you run faster:

Treatment: energy gel

Control: no energy gel

It is suspected that energy gels might affect pro and amateur athletes differently, therefore we block for pro status:

Divide the sample into pro and amateur

Randomly assign pro athletes to treatment and control groups

Randomly assign amateur athletes to treatment and control groups

Pro/amateur status is equally represented in the resulting treatment and control groups

Why is this important? Can you think of other variables to block for?
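The blocking steps for the energy gel experiment can be sketched as follows. The roster of 6 pro and 10 amateur athletes, the even split within each block, and the function name `blocked_assignment` are assumptions made for illustration.

```python
import random

def blocked_assignment(block_of, seed=None):
    """Randomize to treatment/control separately within each block."""
    rng = random.Random(seed)
    blocks = {}
    for subject, block in block_of.items():
        blocks.setdefault(block, []).append(subject)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for subject in shuffled[:half]:
            assignment[subject] = "treatment"   # energy gel
        for subject in shuffled[half:]:
            assignment[subject] = "control"     # no energy gel
    return assignment

# Hypothetical roster: 6 pro and 10 amateur athletes.
block_of = {f"pro{i}": "pro" for i in range(6)}
block_of.update({f"amateur{i}": "amateur" for i in range(10)})

assignment = blocked_assignment(block_of, seed=1)
for status in ("pro", "amateur"):
    counts = {"treatment": 0, "control": 0}
    for subject, group in assignment.items():
        if block_of[subject] == status:
            counts[group] += 1
    print(status, counts)
```

Because the shuffle happens inside each block, pros and amateurs each split evenly between the two groups, so pro status cannot differ between treatment and control.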



Question

A study is designed to test the effect of light level and noise level on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so wants to make sure both genders are represented equally under different conditions. Which of the below is correct?

(a) There are 3 explanatory variables (light, noise, gender) and 1 response variable (exam performance)

(b) There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance)

(c) There is 1 explanatory variable (gender) and 3 response variables (light, noise, exam performance)

(d) There are 2 blocking variables (light and noise), 1 explanatory variable (gender), and 1 response variable (exam performance)



Difference between blocking and explanatory variables

Factors are conditions we can impose on the experimental units.

Blocking variables are characteristics that the experimental units come with, that we would like to control for.

Blocking is like stratifying, except used in experimental settings when randomly assigning, as opposed to when sampling.



More experimental design terminology...

Placebo: fake treatment, often used as the control group for medical studies

Placebo effect: experimental units showing improvement simply because they believe they are receiving a special treatment

Blinding: when experimental units do not know whether they are in the control or treatment group

Double-blind: when both the experimental units and the researchers do not know who is in the control and who is in the treatment group



Project ideas - experiments

1 numerical and 1 categorical: Is there a relationship between memory and distraction? Randomly assign 20 students to two groups: one group memorizes a list of words while also listening to music, another group memorizes the same words in silence. Compare average number of words memorized in the two groups.
[Data: Number of words memorized; Group - treatment, control]

2 categorical: Is there a relationship between learning and distraction? Randomly assign a group of students to two groups: one group studies a concept while also listening to music, the other group studies in silence using the same materials. Then test whether or not they learned the concept.
[Data: Whether or not the students learned the concept - yes, no; Group - treatment, control]


Recap

Question

What is the main difference between observational studies and experiments?

(a) Experiments take place in a lab while observational studies do not need to.

(b) In an observational study we only look at what happened in the past.

(c) Most experiments use random assignment while observational studies do not.

(d) Observational studies are completely useless since no causal inference can be made based on their findings.


Recap

Random assignment vs. random sampling

                     Random assignment              No random assignment
Random sampling      Causal conclusion,             No causal conclusion,           Generalizability
                     generalized to the whole       correlation statement
                     population.                    generalized to the whole
                     (ideal experiment)             population.
                                                    (most observational studies)
No random sampling   Causal conclusion,             No causal conclusion,           No generalizability
                     only for the sample.           correlation statement
                     (most experiments)             only for the sample.
                                                    (bad observational studies)
                     Causation                      Correlation

