Innovation in Intro: Case Studies and Data Science
Kari Lock Morgan
Department of Statistics
Penn State University
Southern Stat Chat
April 15th, 2016
Goal #2: Focus on real biological questions
Topics of the day:
• Two-way tables
• Graphical displays for two categorical variables
• Difference in proportions
• Odds ratio
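As a minimal sketch of the last two topics, the difference in proportions and the odds ratio can both be computed directly from a two-way table. The counts below are hypothetical, invented only for illustration, and the function name `two_way_summary` is an assumption, not something from the course materials.

```python
def two_way_summary(a, b, c, d):
    """Summarize a 2x2 table of counts.

    Group 1: a successes, b failures.
    Group 2: c successes, d failures.
    """
    p1 = a / (a + b)                # proportion of successes in group 1
    p2 = c / (c + d)                # proportion of successes in group 2
    diff = p1 - p2                  # difference in proportions
    odds_ratio = (a * d) / (b * c)  # odds in group 1 divided by odds in group 2
    return p1, p2, diff, odds_ratio

# Hypothetical counts (not from any study mentioned in this talk)
p1, p2, diff, odds_ratio = two_way_summary(30, 20, 15, 35)
print(f"p1 = {p1:.2f}, p2 = {p2:.2f}, difference = {diff:.2f}, odds ratio = {odds_ratio:.2f}")
```

With these made-up counts the two proportions are 0.60 and 0.30, so the difference in proportions is 0.30 while the odds ratio is 3.5.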
Statistics: Unlocking the Power of Data, Lock5
Tea and the Immune System
• L-theanine is an amino acid found in tea
– Black tea: about 20 mg per cup
– Green tea (standard): varies, as low as 5 mg per cup
– Green tea (shade grown): varies, up to 46 mg per cup (shade grown green tea examples: Gyokuro, Matcha)
• γδ T cells are important for helping the immune system fend off infection
• It is thought that L-theanine primes these T cells, activating them to a state of readiness and making them better able to respond to future antigens.
• Does drinking tea actually boost your immunity?
Tea and Immune Response
Antigens in tea-Beverage Prime Human Vγ2Vδ2 T Cells in vitro and in vivo for Memory and Non-memory Antibacterial Cytokine Responses, Kamath et al., Proceedings of the National Academy of Sciences, May 13, 2003.
• Participants were randomized to drink five or six cups of either tea (black) or coffee every day for two weeks (both drinks have caffeine but only tea has L-theanine)
• After two weeks, blood samples were exposed to an antigen, and production of interferon gamma (immune system response) was measured
• Explanatory variable: tea or coffee
• Response variable: measure of interferon gamma
Tea and the Immune System
In the study comparing tea and coffee and levels of interferon gamma, if tea drinkers have significantly higher levels of interferon gamma, can we conclude that drinking tea rather than coffee caused an increase in this aspect of the immune response?
a) Yes
b) No
Randomized experiment allows conclusions about causality
Tea and Immune System
The explanatory variable is tea or coffee, and the response variable is immune system response measured in amount of interferon gamma produced. How could we visualize this data?
a) Bar chart
b) Histogram
c) Side-by-side boxplots
d) Scatterplot
One categorical and one quantitative
Tea and Immune System
$\bar{x}_T - \bar{x}_C = 34.82 - 17.70 = 17.12$
Two plausible explanations:
• Tea boosts immunity
• Random chance
What might happen just by random chance?
p-value: The chance of obtaining a statistic as extreme as that observed, just by random chance, if the null hypothesis is true
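One way to see what "just by random chance" means is a randomization test: reshuffle the tea and coffee labels many times and count how often the reshuffled difference in means is at least as large as the one observed. The sketch below is illustrative only. The data values are stand-ins chosen so that the group means match the 34.82 and 17.70 reported above; they are an assumption, not data taken from this talk, and the function name `randomization_test` is hypothetical.

```python
import random
from statistics import mean

def randomization_test(group_a, group_b, n_sims=10_000, seed=0):
    """One-sided randomization test for a difference in means (a minus b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)  # reassign group labels at random
        simulated = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if simulated >= observed:
            as_extreme += 1
    # Proportion of shuffles at least as extreme as the observed statistic
    return observed, as_extreme / n_sims

# Stand-in values (assumed for illustration; means match the slide)
tea = [5, 11, 13, 18, 20, 47, 48, 52, 55, 56, 58]
coffee = [0, 0, 3, 11, 15, 16, 21, 21, 38, 52]

observed, p_value = randomization_test(tea, coffee)
print(f"observed difference = {observed:.2f}, simulated p-value = {p_value:.3f}")
```

The returned proportion is a simulation-based estimate of the p-value defined above, under the null hypothesis that the drink has no effect on interferon gamma.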
Scores on GOALS
[Figure: post-test scores on GOALS by semester; curricula labeled Old, Simulation, and Simulation + Case studies]
Averages: Fall '14: 43%, Spring '15: 60%, Fall '15: 63%
National: 47%, Our majors: 66%
p-value: 0.00002
GOALS Results
Most improvement on p-value questions!
Before After Nat’l Majors
Able to reason that a smaller p-value provides stronger evidence against the null hypothesis than a larger p-value.
29%
Able to reason about a conclusion based on a statistically significant p-value in the context of a research study that compares two groups.
43%
45%
48%
52%
65%
68%
80%
Focus on Real Questions
• Each class or lab opens with a (scientific) question of interest
• The stat topics of the day are covered en route to answering this question
• More time on motivation and data
• Usually a few connected examples:
– One question, multiple datasets
– One dataset, multiple questions
Questions
• Does exercise improve cognitive ability?
• Are children diagnosed with ADHD younger than their peers?
• Which aspect(s) of sunlight help protect against MS?
• What happens when you switch to organic?
• How obese are Americans?
• Is brain size associated with social network size?
• Does exposure to light at night increase body mass and/or glucose intolerance in mice?
Data from Penn State!
• Are moderate or aggressive doses better for drug resistance?
• Do lizards learn to avoid invasive fire ants?
• Does choice of mate improve offspring fitness?
• How does genetic diversity relate to distance from Africa?
• How do breastfed babies and mothers differ from bottle fed?
• Do malaria parasites impact mosquito behavior to their advantage?
Course Materials
You choose an example!
www.personal.psu.edu/klm47/Courses/STAT250/Fall2015/schedule.htm
Teacher Shift in Perspective
Here’s a dataset to illustrate this statistical topic…
Here’s an interesting question, let’s use data to answer it!
Student Shift in Perspective?
I should pay attention so I know this for the test…
I wonder what the answer is! What can these data tell me?
(what I hope happens!)
Data to answer questions
• More time on each study; allows depth
• More on the “science” and background
• More discussion on data collection, choices, limitations, etc.
• Link to original paper, show excerpts
• Teaches how to look at studies
• Raises points not typically covered
• Statistics is interesting, relevant, powerful!
• Fun!
Challenges
• Finding interesting data and fully understanding it is time consuming
• Materials had to be recreated
• Discussion of studies is hard to assess
• Real data analysis is hard to assess
• Many students care only about exams
Teaching evaluations: n = 99 respondents (out of 225)
17 positive comments about real data/examples/interesting/relevant
17 negative comments about lectures not aligning with exams
Not a Case Study Approach?
• Case study approach: Find an interesting study -> Teach the stats needed
• My approach: Decide on the stats needed -> Find an interesting study
• Sequence chosen by statistical topics
• Multiple scenarios per statistical topic
• Why? Generalizability of statistical ideas
Real Case Studies Approach
• Real Life Statistics: Your Chance for Happiness (or Misery) (at Harvard)
• Five real-life modules: romance, finance, medical, legal, wine and chocolate
• Statistical content driven by real topic
• Not “the intro course”: freedom!
• No exams; homework & team projects
• More info on this “Happy team” course
[Venn diagram: Data Science and Intro Stat. Labeled items: Inference; Design; Visualization; Summaries; Modeling; Scope of inference; Communication; Scraping data; Scientific method; Data wrangling; Big data; Computer science]
Data Collection
• Classical statistics: Ask a question -> Collect (small) data to answer it
• Data science: Obtain available (big) data -> See what it tells you
BOTH ARE IMPORTANT!!!
Inference!
Inference?
Design, randomness
Data Quality vs Quantity
Which provides a better (MSE) estimate?
a) A simple random sample of n = 100
b) A non-random sample of n = 50 million (!) (say from the US population of 320 million)… with correlation of 0.05 between x and probability of inclusion (relatively small)
The small random sample!!!
Meng, X.L. (2016). Discussion of “Perils and potentials of self-selected entry to epidemiological studies and surveys,” Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(2), 319-376.
Confidence intervals from large biased samples: worthless!
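A rough check of why the small random sample wins, sketched under stated assumptions: the error of the non-random sample mean is taken to be ρ · sqrt((1 − f)/f) · σ, the data-defect decomposition associated with Meng's work, where ρ is the correlation between x and inclusion and f is the sampled fraction. Here ρ is treated as fixed at 0.05, the finite population correction for the random sample is ignored, and σ² is set to 1 since only the ratio matters. The function names are hypothetical.

```python
def mse_random_sample(sigma2, n):
    """MSE of the mean from a simple random sample (finite population correction ignored)."""
    return sigma2 / n

def mse_biased_sample(sigma2, rho, f):
    """Squared error of the mean from a non-random sample.

    Assumes error = rho * sqrt((1 - f) / f) * sigma, with rho treated as fixed.
    """
    return rho ** 2 * (1 - f) / f * sigma2

N = 320_000_000        # US population, as on the slide
n_big = 50_000_000     # size of the non-random sample
f = n_big / N          # sampled fraction
sigma2 = 1.0           # variance of x; arbitrary, cancels in the comparison

mse_small = mse_random_sample(sigma2, 100)
mse_big = mse_biased_sample(sigma2, 0.05, f)
n_effective = sigma2 / mse_big  # random-sample size with the same MSE

print(f"MSE, random sample of 100:      {mse_small:.4f}")
print(f"MSE, non-random sample of 50M:  {mse_big:.4f}")
print(f"Effective sample size of the 50M: {n_effective:.0f}")
```

Under these assumptions the 50 million non-random observations are worth roughly 74 randomly sampled ones, fewer than the 100 in option (a).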
Data Quality over Quantity
• For population inference, small random sample beats large biased sample
• For causality, small randomized experiment beats large observational study
• (Statistics beats data science? :))
• Design (randomness) remains important ⇒ inference remains important!
• But, LOTS of non-random data available!
p-values?
A.S.S. Response to ASA
“We think the ASA did not go far enough. It is time to admit that the era of p-values is over. … In place of p-values, the ASS advocates the STOP (SeaT-Of-Pants procedure). … The STOP is simple, direct, data-driven, and authoritative. To carry it out, an authority figure (an older male, by preference) reviews the data and decides whether they agree with his opinion. When he decides they do, the result is ‘significant.’”
Emphasize/Add?
• Scope of inference / data limitations: sampling bias, confounding, randomness
• More on visualization
• Experience with more than two variables
• Choice of graph/stat/parameter/method
• Interpreting and communicating results
• p-value cautions: multiple testing, sample size
• More examples without design or inference
• More on insights from available data
• Extendable technology (R) (??)
Technology
• Data analysis requires technology!
However:
• Concepts...
– apply regardless of choice of technology
– pertain to data analyzed by someone else
– are needed by everyone
• Technology…
– changes quickly
– has lots of different forms
– will not be needed by everyone
Technology Independent Concepts
• Scope of inference / data limitations: sampling bias, confounding, randomness
• More on visualization (interpreting graphs)
• Experience with more than two variables
• Choice of graph/stat/parameter/method
• Interpreting and communicating results
• p-value cautions: multiple testing, sample size
• More examples without design or inference
• More on insights from available data (some)
• Extendable technology (R) (??)
Not in Intro?
• One intro course cannot do everything!!!
• Scraping data
• Data wrangling
• Big data
• Consider eliminating…
– Anything not directly connected to data
– Hand calculations not helping understanding
– Basic descriptive statistics?
– t-distribution?
– Distribution-based inference?
Data Science Course
Summary: Data Science in Intro
• Emphasize the overlap with data science
• Acknowledge that not all data analysis is question -> purposeful design -> inference
• Use technology, but emphasize concepts
• Eliminate material not connected to data
• Eliminate material that technology can do and that does not help understanding