Introduction to Bayesian inference and computation for social science
data analysis
Nicky Best
Imperial College, London
www.bias-project.org.uk
Outline
• Overview of Bayesian methods
– Illustration of conjugate Bayesian inference
– MCMC methods
• Examples illustrating:
– Analysis using informative priors
– Hierarchical priors, meta-analysis and evidence synthesis
– Adjusting for data quality
– Model uncertainty
• Discussion
Overview of Bayesian inference and computation
Overview of Bayesian methods
• Bayesian methods have been widely applied in many areas
– medicine / epidemiology / genetics
– ecology / environmental sciences
– finance
– archaeology
– political and social sciences, …
• Motivations for adopting Bayesian approach vary
– natural and coherent way of thinking about science and learning
– pragmatic choice that is suitable for the problem in hand
Overview of Bayesian methods
• Medical context: FDA draft guidance www.fda.gov/cdrh/meetings/072706-bayesian.html:
“Bayesian statistics…provides a coherent method for learning from evidence as it accumulates”
• Evidence can accumulate in various ways:
– Sequentially
– Measurement of many ‘similar’ units (individuals, centres, sub-groups, areas, periods…..)
– Measurement of different aspects of a problem
• Evidence can take different forms:
– Data
– Expert judgement
Overview of Bayesian methods
• Bayesian approach also provides formal framework for propagating uncertainty
– Well suited to building complex models by linking together multiple sub-models
– Can obtain estimates and uncertainty intervals for any parameter, function of parameters or predictive quantity of interest
• Bayesian inference doesn’t rely on asymptotics or analytic approximations
– Arbitrarily wide range of models can be handled using same inferential framework
– Focus on specifying realistic models, not on choosing analytically tractable approximation
Bayesian inference
• Distinguish between
x : known quantities (data)
θ : unknown quantities (e.g. regression coefficients, future outcomes, missing observations)
• Fundamental idea: use probability distributions to represent uncertainty about unknowns
• Likelihood – model for the data: p(x | θ)
• Prior distribution – representing current uncertainty about unknowns: p(θ)
• Applying Bayes theorem gives posterior distribution
p(θ | x) = p(θ) p(x | θ) / ∫ p(θ) p(x | θ) dθ
∝ p(θ) p(x | θ)
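Bayes theorem above can be checked numerically with a simple grid approximation: evaluate prior × likelihood over a grid of θ values and normalise by the (discretised) integral. A minimal sketch, with an invented binomial example (x = 6 successes in n = 10 trials, flat prior) purely for illustration:

```python
import numpy as np

# Grid approximation of Bayes theorem: posterior ∝ prior × likelihood.
theta = np.linspace(0.001, 0.999, 999)        # grid over the unknown θ
prior = np.ones_like(theta)                   # flat prior p(θ)
x, n = 6, 10                                  # invented example data
likelihood = theta**x * (1 - theta)**(n - x)  # binomial kernel p(x | θ)
unnorm = prior * likelihood
posterior = unnorm / unnorm.sum()             # normalise (discrete integral)

post_mean = (theta * posterior).sum()         # ≈ E(θ | x); exact value is 7/12
print(round(post_mean, 3))
```

With a flat prior this reproduces the Beta(7, 5) posterior mean of 7/12 ≈ 0.583.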
Conjugate Bayesian inference
• Example: election poll (from Franklin, 2004*)
• Imagine an election campaign where (for simplicity) we have just a Government/Opposition vote choice.
• We enter the campaign with a prior distribution for the proportion θ supporting Government. This is p(θ)
• As the campaign begins, we get polling data. How should we change our estimate of Government’s support?
*Adapted from Charles Franklin’s Essex Summer School course slides: http://www.polisci.wisc.edu/users/franklin/Content/Essex/Lecs/BayesLec01p6up.pdf
Conjugate Bayesian inference
Data and likelihood
• Each poll consists of n voters, x of whom say they will vote for Government and n - x will vote for the opposition.
• If we assume we have no information to distinguish voters in their probability of supporting government then we have a binomial distribution for x
• This binomial distribution is the likelihood p(x | θ)
p(x | θ) = (n choose x) θ^x (1 − θ)^(n−x) ∝ θ^x (1 − θ)^(n−x)
Conjugate Bayesian inference
Prior
• We need to specify a prior that
– expresses our uncertainty about the election (before it begins)
– conforms to the nature of the parameter, i.e. is continuous but bounded between 0 and 1
• A convenient choice is the Beta distribution
p(θ) = Beta(a, b) = [Γ(a+b) / (Γ(a)Γ(b))] θ^(a−1) (1 − θ)^(b−1)
∝ θ^(a−1) (1 − θ)^(b−1)
Conjugate Bayesian inference
• Beta(a,b) distribution can take a variety of shapes depending on its two parameters a and b
[Figure: Beta density shapes for Beta(0.5,0.5), Beta(1,1), Beta(5,1), Beta(5,5), Beta(5,20) and Beta(50,200)]
Mean of Beta(a, b) distribution = a/(a+b)
Variance of Beta(a, b) distribution = ab/[(a+b)²(a+b+1)]
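These moment formulas are easy to verify numerically against scipy's Beta implementation; a small sketch (the parameter values a = 5, b = 20, one of the shapes pictured above, are arbitrary):

```python
from scipy.stats import beta

# Check the Beta(a, b) mean and variance formulas against scipy.
a, b = 5, 20
mean_formula = a / (a + b)                        # a/(a+b)
var_formula = a * b / ((a + b)**2 * (a + b + 1))  # ab/[(a+b)²(a+b+1)]
mean_scipy, var_scipy = beta.stats(a, b, moments='mv')
print(mean_formula, round(var_formula, 5))  # -> 0.2 0.00615
```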
Conjugate Bayesian inference
Posterior
• Combining a beta prior with the binomial likelihood gives a posterior distribution
• When prior and posterior come from same family, the prior is said to be conjugate to the likelihood
– Occurs when prior and likelihood have the same ‘kernel’
p(θ | x, n) ∝ p(x | θ, n) p(θ)
∝ θ^x (1 − θ)^(n−x) θ^(a−1) (1 − θ)^(b−1)
= θ^(x+a−1) (1 − θ)^(n−x+b−1)
∝ Beta(x + a, n − x + b)
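The conjugate update therefore reduces to adding observed successes and failures to the prior parameters. A one-line sketch, checked against the poll numbers from the election example that follows:

```python
def beta_binomial_update(a, b, x, n):
    """Posterior Beta parameters after observing x successes in n trials,
    starting from a Beta(a, b) prior (the conjugate update derived above)."""
    return a + x, b + n - x

# Election example: Beta(50, 50) prior, 120 of 200 respondents for Government.
print(beta_binomial_update(50, 50, 120, 200))  # -> (170, 130)
```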
Conjugate Bayesian inference
• Suppose I believe that Government only has the support of half the population, and I think that estimate has a standard deviation of about 0.05
– This is approximately a Beta(50, 50) distribution
• We observe a poll with 200 respondents, 120 of whom (60%) say they will vote for Government
• This produces a posterior which is a
Beta(120+50, 80+50) = Beta(170, 130) distribution
Conjugate Bayesian inference
• Prior mean, E(θ) = 50/100 = 0.5
• Posterior mean, E(θ | x, n) = 170/300 = 0.57
• Posterior SD, √Var(θ | x, n) = 0.029
• Frequentist estimate is based only on the data:
θ̂ = 120/200 = 0.6;  SE(θ̂) = √(0.6 × 0.4 / 200) = 0.035
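The posterior and frequentist summaries quoted above can be reproduced directly from the Beta(170, 130) parameters and the raw poll counts:

```python
import math

# Posterior summaries from the Beta(170, 130) posterior on the slide.
a, b = 170, 130
post_mean = a / (a + b)                                   # 170/300
post_sd = math.sqrt(a * b / ((a + b)**2 * (a + b + 1)))   # Beta SD formula

# Frequentist estimate from the data alone: 120 of 200.
theta_hat = 120 / 200
se = math.sqrt(theta_hat * (1 - theta_hat) / 200)

print(round(post_mean, 3), round(post_sd, 3), theta_hat, round(se, 3))
# -> 0.567 0.029 0.6 0.035
```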
[Figure: prior, likelihood and posterior densities for θ, the probability of voting for Government]
Conjugate Bayesian inference
A harder problem
• What is the probability that Government wins?
– It is not 0.57 or 0.60. Those are expected vote shares, not the probability of winning. How to answer this?
• Frequentists have a hard time with this one. They can obtain a p-value for testing H0: θ > 0.5, but this isn’t the same as the probability that Government wins
– (it’s actually the probability of observing data more extreme than 120 out of 200 if H0 is true)
• Easy from Bayesian perspective – calculate Pr(θ > 0.5 | x, n), the posterior probability that θ > 0.5
[Figure: posterior density of θ, shaded above 0.5; Pr(θ > 0.5 | x, n) = 0.98]
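The winning probability is one line under the Beta posterior, using the upper tail of its CDF (scipy's survival function):

```python
from scipy.stats import beta

# Posterior probability that Government wins: Pr(θ > 0.5 | x, n)
# under the Beta(170, 130) posterior from the example above.
p_win = beta.sf(0.5, 170, 130)   # survival function = 1 - CDF
print(round(p_win, 2))
```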
Bayesian computation
• All Bayesian inference is based on the posterior distribution
• Summarising posterior distributions involves integration
• Except for conjugate models, integrals are usually analytically intractable
• Use Monte Carlo (simulation) integration (MCMC)
Mean: E(θ | x) = ∫ θ p(θ | x) dθ
Prediction: p(x_new | x) = ∫ p(x_new | θ) p(θ | x) dθ
Probabilities: Pr(θ ∈ (c1, c2) | x) = ∫ from c1 to c2 of p(θ | x) dθ
Bayesian computation
• Suppose we didn’t know how to analytically integrate the Beta(170, 130) posterior…
• ….but we do know how to simulate from a Beta
[Figure: histograms of 100 samples (mean = 0.569) and 10,000 samples (mean = 0.566) drawn from the Beta(170, 130) posterior, against the true posterior density (mean = 0.567)]
Bayesian computation
• Monte Carlo integration
– Suppose we have samples θ^(1), θ^(2), …, θ^(n) from p(θ | x)
– Then E(θ | x) = ∫ θ p(θ | x) dθ ≈ (1/n) Σ_{i=1}^{n} θ^(i)
• Can also use samples to estimate posterior tail area probabilities, percentiles, variances etc.
• Difficult to generate independent samples when posterior is complex and high dimensional
• Instead, generate dependent samples from a Markov chain having p( | x ) as its stationary distribution → Markov chain Monte Carlo (MCMC)
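A quick sketch of Monte Carlo integration for the election example: draw from the Beta(170, 130) posterior and summarise the samples (the sample size and random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Monte Carlo integration: estimate posterior summaries of the
# Beta(170, 130) posterior from simulated draws, as on the slides.
samples = rng.beta(170, 130, size=100_000)
mc_mean = samples.mean()            # approximates E(θ | x) = 170/300
mc_tail = (samples > 0.5).mean()    # approximates Pr(θ > 0.5 | x)
print(round(mc_mean, 3), round(mc_tail, 2))
```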
Illustrative Examples
Borrowing strength
• Bayesian learning → borrowing “strength” (precision) from other sources of information
• Informative prior is one such source
– “today’s posterior is tomorrow’s prior”
– relevance of prior information to current study must be justified
Informative priors
Example 1: Western and Jackman (1994)*
• Example of regression analysis in comparative research
• What explains cross-national variation in union density?
– Union density is defined as the percentage of the work force belonging to a labour union
• Two issues
– Philosophical: data represent all available observations from a population → conventional (frequentist) analysis based on long-run behaviour of repeatable data mechanism not appropriate
– Practical: small, collinear dataset yields imprecise estimates of regression effects
* Slides adapted from Jeff Grynaviski: http://home.uchicago.edu/~grynav/bayes/abs03.htm
Informative priors
• Competing theories
– Wallerstein: union density depends on the size of the civilian labour force (LabF)
– Stephens: union density depends on industrial concentration (IndC)
– Note: these two predictors correlate at 0.92
• Control variable: presence of a left-wing government (LeftG)
• Sample: n = 20 countries with a continuous history of democracy since World War II
• Fit linear regression model to compare theories
union density_i ~ N(μ_i, σ²)
μ_i = β0 + β1 LeftG_i + β2 LabF_i + β3 IndC_i
Informative priors
• Results with non-informative priors on regression coefficients (numerically equivalent to OLS analysis)
[Figure: point estimates and 95% CIs for the Left Govt, Labour Force and Ind Conc. regression coefficients]
Informative priors
Motivation for Bayesian approach with informative priors
• Because of small sample size and multicollinear variables, not able to adjudicate between theories
• Data tend to favour Wallerstein (union density depends on labour force size), but neither coefficient estimated very precisely
• Other historical data are available that could provide further relevant information
• Incorporation of prior information provides additional structure to the data, which helps to uniquely identify the two coefficients
Informative priors
Prior distributions for regression coefficients
Wallerstein
• Believes in negative labour force effect
• Comparison of Sweden and Norway in 1950:
→ doubling of labour force corresponds to 3.5-4% drop in union density
→ on log scale, labour force effect size ≈ -3.5/log(2) ≈ -5
• Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0
β2 ~ N(−5, 2.5²)
Informative priors
Prior distributions for regression coefficients
Stephens
• Believes in positive industrial concentration effect
• Decline in industrial concentration in UK in 1980s:
→ drop of 0.3 in industrial concentration corresponds to about 3% drop in union density
→ industrial concentration effect size ≈ 3/0.3 = 10
• Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0
β3 ~ N(10, 5²)
Informative priors
Prior distributions for regression coefficients
Wallerstein and Stephens
• Both believe left-wing gov’ts assist union growth
• Assuming 1 year of left-wing gov’t increases union density by about 1% translates to effect size of 0.3
• Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0
β1 ~ N(0.3, 0.15²)
• Vague prior β0 ~ N(0, 100²) assumed for intercept
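With Normal priors and a Normal likelihood, the posterior for the regression coefficients has a closed form when the error variance is treated as known. The sketch below illustrates this precision-weighted update using the prior means and SDs from the slides; the 20-country dataset is not reproduced here, so the design matrix, responses, true coefficients and error variance are synthetic stand-ins, not the Western and Jackman data:

```python
import numpy as np

# Conjugate Bayesian linear regression sketch with informative priors.
# Synthetic data standing in for the (unavailable) 20-country dataset.
rng = np.random.default_rng(0)
n, sigma2 = 20, 4.0                     # assumed sample size / error variance
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept, LeftG, LabF, IndC stand-ins
beta_true = np.array([50.0, 0.3, -5.0, 10.0])               # hypothetical truth
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Priors from the slides: N(0, 100^2), N(0.3, 0.15^2), N(-5, 2.5^2), N(10, 5^2)
m0 = np.array([0.0, 0.3, -5.0, 10.0])
V0 = np.diag([100.0, 0.15, 2.5, 5.0]) ** 2

# Posterior (known sigma^2): precision-weighted combination of prior and data
V0_inv = np.linalg.inv(V0)
Vn = np.linalg.inv(V0_inv + X.T @ X / sigma2)   # posterior covariance
mn = Vn @ (V0_inv @ m0 + X.T @ y / sigma2)      # posterior mean
print(np.round(mn, 2))
```

The posterior mean shrinks each coefficient toward its prior mean in proportion to the prior's precision relative to the data's.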
Informative priors
[Figure: posterior estimates of the Left Govt, Lab Force and Ind Conc coefficients under vague, Wallerstein and Stephens priors]
Informative priors
• Effects of LabF and IndC estimated more precisely
• Both sets of prior beliefs support inference that labour-force size decreases union density
• Only Stephens’ prior supports conclusion that industrial concentration increases union density
• Choice of prior is subjective – if no consensus, can we be satisfied that data have been interpreted “fairly”?
• Sensitivity analysis
– Sensitivity to priors (e.g. repeat analysis using priors with increasing variance)
– Sensitivity to data (e.g. residuals, influence diagnostics)
Hierarchical priors
• Hierarchical priors are another widely used approach for borrowing strength
• Useful when data available on many “similar” units (individuals, areas, studies, subgroups,…)
• Data x_i and parameters θ_i for each unit i = 1,…,N
• Three different assumptions:
– Independent parameters: units are unrelated, and each θ_i is estimated separately using data x_i alone
– Identical parameters: observations treated as coming from same unit, with common parameter θ
– Exchangeable parameters: units are “similar” (labels convey no information) → mathematically equivalent to assuming θ_i’s are drawn from common probability distribution with unknown parameters
Meta-analysis
Example 2: Meta-analysis (Spiegelhalter et al 2004)
• 8 small RCTs of IV magnesium sulphate following acute myocardial infarction
• Data: x_ig = deaths, n_ig = patients in trial i, treatment group g (0 = control, 1 = magnesium)
• Model (likelihood): x_ig ~ Binomial(p_ig, n_ig)
logit(p_ig) = φ_i + θ_i·g
• θ_i is the log odds ratio for the treatment effect in trial i
• If not willing to believe trials are identical, but no reason to believe they are systematically different → assume θ_i’s are exchangeable with hierarchical prior
θ_i ~ Normal(μ, σ²), with μ and σ² also treated as unknown, given (vague) priors
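A minimal Gibbs-sampler sketch of such an exchangeable model. For simplicity it uses a normal approximation y_i ~ N(θ_i, s_i²) to each trial's estimated log odds ratio rather than the binomial likelihood above, and the y_i and s_i² values are invented for illustration, not the magnesium-trial data:

```python
import numpy as np

# Gibbs sampler for y_i ~ N(theta_i, s_i^2), theta_i ~ N(mu, tau^2),
# flat prior on mu, vague inverse-gamma prior on tau^2 (the slide's σ²).
rng = np.random.default_rng(1)
y = np.array([-0.8, -0.3, -1.1, -0.5, 0.1, -0.6, -0.9, -0.2])   # invented
s2 = np.array([0.5, 0.2, 0.6, 0.3, 0.2, 0.4, 0.5, 0.05])        # invented

n_iter, N = 5000, len(y)
mu, tau2 = 0.0, 1.0
mu_draws = np.empty(n_iter)
for t in range(n_iter):
    # theta_i | rest: precision-weighted average of data and prior mean
    prec = 1 / s2 + 1 / tau2
    theta = rng.normal((y / s2 + mu / tau2) / prec, np.sqrt(1 / prec))
    # mu | rest (flat prior): normal around the mean of the theta_i
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / N))
    # tau^2 | rest: inverse-gamma with vague IG(0.001, 0.001) prior
    a = 0.001 + N / 2
    b = 0.001 + 0.5 * np.sum((theta - mu) ** 2)
    tau2 = b / rng.gamma(a)
    mu_draws[t] = mu

pooled_mean = mu_draws[1000:].mean()   # posterior mean of mu after burn-in
print(round(pooled_mean, 2))
```

Each θ_i is shrunk toward the pooled mean μ, by more when its trial variance s_i² is large relative to τ².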
Meta-analysis
[Figure: mortality ORs (scale 0.10–1.50) with 95% intervals for the 8 trials (Morton, Rasmussen, Smith, Abraham, Feldstedt, Shecter, Ceremuz., LIMIT-2) and overall, comparing fixed-effects and hierarchical Bayes estimates]
Estimates and 95% intervals for treatment effect from independent MLE and hierarchical Bayesian analysis
Meta-analysis
• Effective sample size
n = sample size of trial
V1 = variance of θ_i without borrowing (variance of MLE)
V2 = variance of θ_i with borrowing (posterior variance of θ_i)
ESS = n × V1/V2
Trial   n      Var(MLE)   Var(θi | x)   V1/V2   ESS
1       76     1.12       0.15          7.3     555
2       270    0.17       0.10          1.7     459
3       400    0.55       0.15          3.7     1480
4       94     1.37       0.16          8.5     799
5       298    0.23       0.10          2.2     656
6       115    0.81       0.22          3.6     414
7       48     1.04       0.16          6.4     307
8       2316   0.02       0.02          1.1     2350
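The ESS column can be recomputed directly as n × V1/V2; small discrepancies against the table come from rounding in the quoted variances and variance ratios:

```python
# Recompute ESS = n * V1/V2 from the table rows (n, Var(MLE), posterior var).
rows = [(76, 1.12, 0.15), (270, 0.17, 0.10), (400, 0.55, 0.15),
        (94, 1.37, 0.16), (298, 0.23, 0.10), (115, 0.81, 0.22),
        (48, 1.04, 0.16), (2316, 0.02, 0.02)]
ess = [n * v1 / v2 for n, v1, v2 in rows]
print([round(e) for e in ess])
```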
Meta-analysis
[Figure: standardised test score (−0.3 to 0.3) plotted against class size (15–35) for the 8 studies]
• Example 3: Meta-analysis of effect of class size on educational achievement (Goldstein et al, 2000)
8 studies:
• 1 RCT
• 3 matched
• 2 experimental
• 2 observational
Meta-analysis
• Goldstein et al use maximum likelihood, with bootstrap CI due to small sample size
• The ML/bootstrap approach under-estimates uncertainty relative to the Bayesian intervals
• Note that 95% CI for Bayesian estimate of effect of class size includes 0
[Figure: study-specific (1–8) and overall estimates of the effect of class size (change in score per extra pupil, roughly −0.10 to 0.05), Bayesian intervals vs Goldstein et al]
Accounting for data quality
• Bayesian approach also provides formal framework for propagating uncertainty about different quantities in a model
• Natural tool for explicitly modelling different aspects of data quality
– Measurement error
– Missing data
Accounting for data quality
Example 4: Accounting for population errors in small-area estimation and disease mapping (Best and Wakefield, 1999)
• Context: Mapping geographical variations in risk of breast cancer by electoral ward in SE England, 1981-1991
• Typical model: y_i ~ Poisson(λ_i N_i)
y_i = number of breast cancer cases in area i
λ_i = area-specific rate of breast cancer: parameter of interest
N_i = Σ_t N_it = population-years at risk in area i
Accounting for data quality
• N_i usually assumed to be known
• Ignores uncertainty in small-area age/sex population counts in inter-census years
• B&W make use of additional data on Registrar General’s mid-year district age/sex population totals N_dt
• Model A: N_it = N_dt p_it, where p_it is the proportion of the annual district population in the age group of interest living in ward i; p_it estimated by interpolating 1981 and 1991 census counts
• Model B: Allow for sampling variability in N_it
N_it ~ Multinomial(N_dt, [p_1t, …, p_Kt])
• Model C: Allow for uncertainty in proportions p_it
p_it ~ informative Dirichlet prior distribution
Accounting for data quality
[Figure: graphical model for ward i – priors on α_i, β and σ², covariate X_i, population N_i and observed counts y_i]
Random effects Poisson regression: log λ_i = α_i + β X_i
X_i = deprivation score for ward i
Accounting for data quality
[Figure: extended graphical model in which N_i is built from year-specific ward proportions p_it (with their own prior) and district totals N_dt]
Accounting for data quality
[Figure: area-specific RR estimates – ward RR assuming E_i known plotted against ward RR with modelled E_i, for N_i known and models A, B and C]
[Figure: RR of breast cancer for affluent vs deprived wards under each model]
Model uncertainty
• Model uncertainty can be large for observational data studies
• In regression models:
– What is the ‘best’ set of predictors for response of interest?
– Which confounders to control for?
– Which interactions to include?
– What functional form to use (linear, non-linear,….)?
Model uncertainty
• Example 5: Predictors of crime rates in US States (adapted from Raftery et al, 1997)
• Ehrlich (1973) – developed and tested theory that decision to commit crime is rational choice based on costs and benefits
• Costs of crime related to probability of imprisonment and average length of time served in prison
• Benefits of crime related to income inequalities and aggregate wealth of community
• Net benefits of other (legitimate) activities related to employment rate and education levels in community
• Ehrlich analysed data from 47 US states in 1960, focusing on relationship between crime rate and the 2 prison variables
• Up to 13 candidate control variables also considered
Model uncertainty
• y = log crime rate in 1960 in each of 47 US states
• Z1, Z2 = log prob. of prison, log av. time in prison
• X1,…, X13 = candidate control variables
• Fit Normal linear regression model
• Results sensitive to choice of control variables
Table adapted from Table 2 in Raftery et al (1997)
Model            Control variables        Estimated effect (SE)
                                          Prob Prison     Time Prison
Full model       All                      -0.30 (0.10)    -0.27 (0.17)
Stepwise         1,3,4,9,11,13            -0.19 (0.07)    –
Adjusted R2      1,2,4,7,8,9,11,12,13     -0.30 (0.09)    -0.25 (0.16)
Ehrlich model 1  9,12,13                  -0.45 (0.12)    -0.55 (0.20)
Ehrlich model 2  1,6,9,10,12,13           -0.43 (0.11)    -0.53 (0.20)
Model uncertainty
• Using Bayesian approach, can let the set of control variables be an unknown parameter of the model
• Don’t know (a priori) the number of covariates in the ‘best’ model → model has unknown dimension → assign it a prior distribution
• Can handle such “trans-dimensional” (TD) models using “reversible jump” MCMC algorithms
• Normal linear regression model
y_i ~ Normal(μ_i, σ²), i = 1,...,47
• Variable selection model:
μ_i = Z_i γ + W_i β,  where W_i = (X_iθ1, X_iθ2, …, X_iθk)
k = number of currently selected predictors
θ = vector of k selected columns of X
β = vector of k regression coefficients
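Reversible-jump MCMC is beyond a short sketch, but the model-averaging idea can be illustrated more simply by enumerating predictor subsets and weighting each by exp(−BIC/2), an approximation to posterior model probabilities discussed by Raftery et al. The data below are synthetic (only predictors 0 and 2 truly matter), not the crime dataset:

```python
import itertools
import numpy as np

# BIC-weighted model averaging over all subsets of p candidate predictors.
rng = np.random.default_rng(7)
n, p = 47, 4
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)   # true model: {0, 2}

def bic(subset):
    """BIC of the linear model with intercept plus the given predictors."""
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta_hat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta_hat) ** 2)
    return n * np.log(rss / n) + Xs.shape[1] * np.log(n)

models = [s for r in range(p + 1) for s in itertools.combinations(range(p), r)]
bics = np.array([bic(m) for m in models])
w = np.exp(-0.5 * (bics - bics.min()))   # shift by min for numerical safety
w /= w.sum()                             # approximate model probabilities

# Marginal posterior inclusion probability for each predictor
incl = {j: sum(wi for wi, m in zip(w, models) if j in m) for j in range(p)}
print({j: round(v, 2) for j, v in incl.items()})
```

The truly active predictors should receive inclusion probabilities near 1, the spurious ones much lower; posterior effect estimates can then be averaged over models with these weights.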
[Figure: graphical model for state i – nodes k, θ, β, γ, σ² and μ_i, covariates X_i and Z_i, outcome y_i]
Model uncertainty
[Figure: posterior distribution of k, the number of selected control variables (20,000 MCMC samples)]
[Figure: posterior mean and 95% CI for the effects of prob. of prison and time in prison on crime rate, conditional on being in the model]
[Figure: posterior probability that each control variable (1–13) is in the model]
Model uncertainty
• Most likely (40%) set of control variables contains X4 (police expenditure in 1960) and X13 (income inequality)
• 2nd most likely (28%) set of control variables contains X5 (police expenditure in 1959) and X13 (income inequality)
• Control variables with >10% marginal probability of inclusion– X3 : average years of schooling (18%)
– X4 : police expenditure in 1960 (56%)
– X5 : police expenditure in 1959 (40%)
– X13 : income inequality (94%)
• Posterior estimates of prison variables, averaged over models
log prob. of prison: -0.29 (-0.55, -0.05)
log av. time in prison: -0.27 (-0.69, 0.14)
Discussion
• Bayesian approach provides coherent framework for combining many sources of evidence in a statistical model
• Formal approach to “borrowing strength”
– Improved precision/effective sample size
– Fully accounts for uncertainty
• Relevance of different pieces of evidence is a judgement – must be justifiable
• Bayesian approach forces us to be explicit about model assumptions
• Sensitivity analysis to assumptions is crucial
Discussion
• Bayesian calculations are computationally intensive, but:
– Provide exact inference; no asymptotics
– MCMC offers huge flexibility to model complex problems
• All examples discussed here were fitted using free WinBUGS software: www.mrc-bsu.cam.ac.uk
• Want to learn more about using Bayesian methods for social science data analysis?
– Short course: Introduction to Bayesian inference and WinBUGS, Sept 19-20, Imperial College
See www.bias-project.org.uk for details
Thank you!
References
• Best, N. and Wakefield, J. (1999). Accounting for inaccuracies in population counts and case registration in cancer mapping studies. J Roy Statist Soc, Series A, 162: 363-382
• Goldstein, H., Yang, M., Omar, R., Turner, R. and Thompson, S. (2000). Meta-analysis using multilevel models with an application to the study of class size effects. Applied Statistics, 49: 399-412.
• Raftery, A., Madigan, D. and Hoeting, J. (1997). Bayesian model averaging for linear regression models. J Am Statist Assoc, 92: 179-191.
• Spiegelhalter, D., Abrams, K. and Myles, J. (2004). Bayesian Approaches to Clinical Trials and Health Care Evaluation, Wiley, Chichester.
• Western, B. and Jackman, S. (1994). Bayesian inference for comparative research. The American Political Science Review, 88: 412-423.