Statistical Models: The Rest of the Story
Scott L. Zeger
Hurley-Dorrier Professor and Chair
Department of Biostatistics
The Johns Hopkins University Bloomberg School of Public Health
What is a model?
What is a statistical model?
Tool for those empirical sciences where signals come embedded in noise
Lens through which to view data to better understand the signal
Tool for quantifying the evidence in data about a particular truth we seek
Empirical science: search for “truth”
Truth for Population
Observed Value for a Representative
Sample
Probability – statistical model
Statistical inference
How to Choose the Best Model
• Minimize the mean squared error
• Minimize the Akaike Information Criterion (AIC)
• Minimize the Bayesian Information Criterion (BIC)
• Maximize the likelihood function
• Cross-validate
• Jackknife
• Bootstrap
• Boost, then bag
• etc.
You cannot choose the best model because there isn’t one
You can choose a useful model based upon prior scientific knowledge
You can explore and report how your scientific findings vary over a set of other useful models
You can average your results across useful models
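As a rough illustration of the last two points, the sketch below fits three candidate regression models to simulated data, reports how the estimated exposure effect changes from model to model, and then averages it using Akaike weights. Everything here is an assumption made for illustration: the data are simulated, the variable names (`exposure`, `age_c`) are hypothetical, and Akaike weighting is only one of several ways to average across models.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns coefficients plus Gaussian AIC and BIC."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    # Gaussian log-likelihood evaluated at the MLE of the error variance (rss / n)
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    n_params = k + 1  # regression coefficients plus the error variance
    aic = 2 * n_params - 2 * loglik
    bic = n_params * np.log(n) - 2 * loglik
    return beta, aic, bic

# Simulated data (hypothetical): outcome depends on an exposure and on age.
rng = np.random.default_rng(0)
n = 200
age_c = rng.normal(0, 10, n)                      # age, centered
exposure = 0.03 * age_c + rng.normal(0, 1, n)     # exposure correlated with age
y = 1.0 + 0.5 * exposure + 0.05 * age_c + rng.normal(0, 1, n)

ones = np.ones(n)
models = {
    "exposure only": np.column_stack([ones, exposure]),
    "exposure + age": np.column_stack([ones, exposure, age_c]),
    "exposure + age + age^2": np.column_stack([ones, exposure, age_c, age_c ** 2]),
}

results = {name: fit_ols(X, y) for name, X in models.items()}

# Report how the finding (the exposure coefficient) varies over the models.
for name, (beta, aic, bic) in results.items():
    print(f"{name:24s} effect={beta[1]:.3f}  AIC={aic:.1f}  BIC={bic:.1f}")

# Average the exposure effect across models using Akaike weights.
effects = np.array([results[name][0][1] for name in models])
aics = np.array([results[name][1] for name in models])
weights = np.exp(-0.5 * (aics - aics.min()))
weights /= weights.sum()
averaged_effect = float(np.sum(weights * effects))
print(f"model-averaged effect = {averaged_effect:.3f}")
```

The per-model table is the "explore and report" step; the weighted average is one possible summary, and it is only as good as the set of models that prior scientific knowledge judged useful in the first place.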
Causal Model
Smoking → Disease → Dollars
Disease → Death
Causal Model
Iraq invasion → Violence → Death
What Do We Know about Smoking and Medical Expenditures
• WHO, U.S. Surgeon General and IARC say smoking causes 13 major diseases:
– Lung cancer; COPD; atherosclerosis; MI; stroke; …
• In the U.S., most people receive treatment for major chronic diseases (e.g. lung cancer)
• It costs money to treat your disease
What we know LITTLE about
• Whether smoking causes people to use more or less medical services to treat smoking-caused diseases
• Whether smoking causes people without a major disease to seek more or less medical services
– “I hate my doctor, she tries to take my cigs away”
– “I go as often as I can afford; got to watch out for those diseases that can kill me”
Competing Causal Models
Smoking → Disease → Dollars
Smoking → Dollars
Odds Ratios of Lung Cancer/COPD by Pack-years for Current and Former Smokers
Medical Expenditures for Persons with vs without Lung Cancer/COPD
Difference in Average Expenditures by Propensity to Have Disease
Smoking Attributable Burden for Cohort of 60 Million Who Started Under 21 Years Old, 1954-2000
• Disease, LC/COPD: 43.7 million case-years
• Disease, CHD group: 80.8 million case-years
• Dollars: 1,087 billion
• Deaths: 128.0 million years lost (13 million persons)
Smoking → Disease → Dollars
“Know this”:
$1 trillion ± 0.2 trillion for 10% of the population
???
Estimate well what you can; estimate poorly what you must.
Don’t dilute decent causal estimates with causal speculates (unless you intend to make everything uncertain)
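A small arithmetic sketch of why dilution hurts: when a precise estimate is added to a speculative one, the variances add, so the imprecise piece dominates the uncertainty of the total. The first pair of numbers comes from the slide above, with the 0.2 treated as a standard error purely for illustration. The second pair is invented, and the two estimates are assumed to be independent.

```python
import math

# From the slide: roughly $1 trillion, give or take 0.2 trillion, via disease.
# Treating 0.2 as a standard error is an assumption made for illustration.
known_estimate, known_se = 1.0, 0.2

# Hypothetical speculative pathway: these numbers are invented.
spec_estimate, spec_se = 0.5, 1.5

def combine(est_a, se_a, est_b, se_b):
    """Sum two independent estimates; their variances add."""
    return est_a + est_b, math.sqrt(se_a ** 2 + se_b ** 2)

total, total_se = combine(known_estimate, known_se, spec_estimate, spec_se)

# Relative uncertainty (standard error divided by estimate) before and after.
relative_before = known_se / known_estimate
relative_after = total_se / total

print(f"well-estimated part: {known_estimate:.2f} +/- {known_se:.2f} trillion")
print(f"combined total:      {total:.2f} +/- {total_se:.2f} trillion")
print(f"relative uncertainty: {relative_before:.0%} -> {relative_after:.0%}")
```

Under these assumptions the standard error of the combined total is about as large as the total itself, so a number that was known to within about 20% becomes one that is hardly known at all.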
Causal Model
Iraq invasion → Violence → Death
What We Know Well
• 2,237 U.S. soldiers (DoD)
• 99 British soldiers (British Govt)
• 4,027 Iraqi police (News reports compiled by iCasualties.org)
• 28,198 - 31,800 Iraqi civilians (IBC web-site)
The count includes civilian deaths caused by coalition military action and by military or paramilitary responses to the coalition presence (e.g. insurgent and terrorist attacks). It also includes excess civilian deaths caused by criminal action resulting from the breakdown in law and order which followed the coalition invasion. Compiled from eye-witness reports and news articles.
What We Know Less Well
Iraq invasion → Violence → Death
~30,000 Iraqi deaths
Lack of sanitation
Lack of clean water
Poor nutrition
Limited access to medical care
Extreme stress and grief
98,000 (95% CI: 8,000 - 194,000) without Falluja
~ 20 - 50% violent
Summary
• A model defines the boundaries of an analysis and can determine what will be learned from data
– Like a lens determines what you will see
• Same model for two problems
– Separate what can be estimated precisely from what cannot
– Prior knowledge about pathway
• Too much uncertainty invalidates, whether it should or not
Timing of Proceeds Relative to Smoking Attributable Expenditures for Major Diseases Only
Smoking Attributable Fraction of Disease (SAF) and Dollars (SAFE)