Political Science Association, Methodology Initiative

British Academy, November 25, 2015

Statistical Modeling to Understand Terrorism: An Overview of New Tools

JEFF GILL

Washington University

Motivation

◮ The safety of millions of people depends on the understanding of the workings of covert networks,

especially of terrorist networks.

◮ To protect people, governments and nongovernmental organizations invest enormous amounts of

time and energy to detect covert networks and to thwart terrorist events and other kinds of attacks.

◮ Terrorism is an important political and public health problem because it affects:

⊲ government stability,

⊲ personal safety,

⊲ immediate epidemiological concerns,

⊲ internal government policies,

⊲ public perception and panic,

⊲ and possibly widespread health effects.

◮ Academic work on terrorism has increased dramatically in recent

decades for obvious reasons, but remains under-developed.

Representative Historical and Descriptive Approaches

◮ Terrorists play to the media. Wilkinson, “The Media and Terror: A Reassessment.” Terrorism and Political Violence 9(2), 1997, 51-64.

◮ Terrorists become more militant following concessions. Ethan BdM, “Conciliation, Counterterrorism, and Patterns of Terrorist Violence: A Comparative Study of Four Cases.” IO 59(1), 2003, 145-176.

◮ Terrorism works better against democracies than tyrannies. Dershowitz, Why Terrorism Works: Understanding the Threat, Responding to the Challenge, 2002, Yale University Press.

◮ Predicting terrorism is hard but sociological understanding is easier. Boyns and Ballard, “Developing a Sociological Theory for the Empirical Understanding of Terrorism.” The American Sociologist 35(2), 2008, 5-25.

◮ Planes are special kinds of weapons. Einav, “Understanding Aviation Terrorism.” Interavia: Business & Technology 58(670), 2003, 34-37.

Representative Formal or Game Theoretic Approaches

◮ Normal-form games can distinguish proactive from defensive policies. Sandler & Arce, “Terrorism: A Game-Theoretic Approach.” In Handbook of Defense Economics, Sandler & Hartley (eds.). Volume 2, 775-813, Elsevier.

◮ Most terrorists are non-suicidal rational actors attacking soft targets. Atkinson, Sandler & Tschirhart, “Terrorism in a Bargaining Framework.” JLEO 30, 1987, 1-21.

◮ Probabilistic risk analysis shows vulnerabilities. Harris, “Mathematical Methods in Combatting Terrorism.” Risk Analysis 24(2), 2004, 985-988.

◮ Government should provide incentives for former terrorists to exert counterterrorism efforts. Ethan BdM, “The Terrorist Endgame: A Model with Moral Hazard and Learning.” JCR 49(2), 2005, 237-258.

◮ Those with low ability or little education are most likely to join. Ethan BdM, “The Quality of Terror.” AJPS 49(3), 2005, 515-530.

Representative Economic Approaches

◮ Terrorism constitutes transnational externalities and market failures. Todd Sandler and Walter Enders, “An Economic Perspective on Transnational Terrorism.” In The Economic Analysis of Terrorism, Tilman Bruck (ed.). 11-28, 2007, Routledge.

◮ Trade and FDI reduce terrorism. Li & Schaub, “Economic Globalization and Transnational Terrorism: A Pooled Time-Series Analysis.” JCR 48(2), 2004, 230-258.

◮ Economic centers are at risk. Rosoff & von Winterfeldt, “A Risk and Economic Analysis of Dirty Bomb Attacks on the Ports of Los Angeles and Long Beach.” Risk Analysis 27(3), 2007, 1539-6924.

◮ Terrorism is bad for tourism. Sloboda, “Assessing the Effects of Terrorism on Tourism by Use of Time Series Methods.” Tourism Economics 9(2), 2003, 179-190.

◮ There exist links between the national economy and homegrown terrorism. Blomberg, Hess & Weerapana, “Economic Conditions and Terrorism.” EJPE 20(2), 2004, 463-478.

Data Problems with Individual/Events Level Approaches

◮ Micro-level empirical work in this area has not produced many revealing insights.

◮ There are some major deficiencies in direct data-analytic micro-studies of terrorism:

⊲ the data consist of either publicly observed events or classified data at government agencies,

⊲ government actions are typically censored to scholars,

⊲ targets are strategic, actions are dynamic: the subjects are deliberately trying to deny observers

information,

⊲ existing tools for filling in missing information are inappropriate,

⊲ qualitative and technical experts have not traditionally coordinated,

⊲ and it can even be physically dangerous.

◮ Can we use standard data-analytic regression techniques despite these problems?

An Example of Basic Data Analysis for Terrorism Data

◮ Violent events within the state of Israel.

◮ Subsetted to give 103 suicide attacks with explosives over a three-year period from November 6, 2000 to November 3, 2003, after which there was a steep drop (the early period of the second “Intifada”).

◮ Information provided: date and place of the attack, attack type,

the type of target and device employed, organizational affiliation

of the attacker, and the number of casualties, along with a written

description of the attack.

◮ Casualties are given personal attributes such as name, age, sex,

nationality, and religion.

◮ These data are subsetted by Mark Harrison (2006).

Terrorism Data

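# Read the Harrison suicide-attack data (whitespace-delimited, with a header row).
harr <- read.table("http://jgill.wustl.edu/data/harrison4.txt",header=TRUE)

# Tabulate the values of every column except the first.
apply(harr[,-1],2,table)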

$NumberKilled

0 1 2 3 5 6 7 8 9 11 15 17 19 21 23 24 30

44 13 9 8 3 2 3 2 2 3 4 3 1 3 1 1 1

$NumberInjured

0 1 2 3 4 5 6 8 9 11 13 14 16 17 20 21 22 26 27 30

28 1 5 4 4 3 1 2 2 2 1 1 1 1 3 1 1 1 1 5

40 42 47 50 52 57 58 59 60 65 69 86 90 100 102 120 130 150 188

3 1 1 7 1 1 1 2 5 1 1 1 1 3 1 1 1 2 1

$TotalCasualties

0 1 2 3 4 5 6 8 9 10 12 13 15 17 20 21 26 27 29 30

22 5 6 4 3 3 2 2 1 1 2 2 1 1 2 1 1 2 1 1

31 32 35 38 45 49 50 51 52 53 57 58 59 61 62 63 65 67 71 75

1 1 1 1 1 2 1 1 1 1 2 1 1 2 1 1 2 2 3 2

81 91 93 105 106 123 126 141 145 151 180 199

1 1 1 1 1 1 1 1 1 1 1 1

Terrorism Data

$ResponsibleHamas $ResponsibleisMartyrs

0 1 0 1

59 44 78 25

$ResponsibleisPIJ $ResponsibleisOther

0 1 0 1

79 24 99 4

$TargetisMilitary $TargetisCivilian

0 1 0 1

76 10 10 76

$TargetisBus $TargetisCafe

0 1 0 1

89 14 89 14

$TargetisCheckpoint $TargetisResidence

0 1 0 1

87 16 102 1

Terrorism Data

$TargetisOffshore $TargetisStore

0 1 0 1

101 2 96 7

$TargetisStreet $TargetisTravelstop

0 1 0 1

71 32 88 15

$DeviceisCar $DeviceisBoat

0 1 0 1

89 14 101 2

$AttackisPrevented $AttackerisChallenged

0 1 0 1

101 2 63 40

$FirstAttackerisMale $FirstAttackerisFemale

0 1 0 1

7 92 92 7

Terrorism Data

$AgeofFirstAttacker

16 17 18 19 20 21 22 23 24 25 26 27 29 31 43 45 48

1 8 7 10 15 11 10 12 2 3 2 1 3 1 1 1 1

◮ Data Notes:

⊲ measurement here is very “nongranular,”

⊲ some dichotomous variables are also very lopsided,

⊲ information filtered through a government reporting source,

⊲ and the real data generating process is never observed: motivations, planning, and training.

◮ An additional challenge is grouping or clustering in the data.

Terrorism Data Analysis

[Figure: four conditioning plots of the number killed (vertical axis label truncated, apparently NumberKilled) against log(AgeofFirstAttacker) (horizontal axis from 2.8 to 3.8), given as.factor(AttackerisChallenged), as.factor(DeviceisCar), as.factor(TargetisMilitary), and as.factor(ResponsibleHamas). Panel titles: Attacker is Challenged, Device is Car, Target is Military, Hamas Responsible.]

Terrorism Data Analysis

◮ One useful approach is to fit a log-linear form (generalized additive model) where the outcome

variable is the number killed, mixing estimated and smoothed fits simultaneously:

Parametric coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.198 0.121 9.89 < 2e-16

AttackerisChallenged -1.406 0.154 -9.14 < 2e-16

FirstAttackerisFemale 0.217 0.231 0.94 0.35

DeviceisCar 0.332 0.251 1.33 0.18

TargetisCafe 0.466 0.118 3.96 7.5e-05

TargetisMilitary -3.286 0.505 -6.50 7.9e-11

ResponsibleHamas 0.877 0.125 7.02 2.2e-12

Approximate significance of smooth terms:

edf Ref.df Chi.sq p-value

te(log(AgeofFirstAttacker),log(Date)) 4.81 4.97 94 <2e-16
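A call of roughly the following shape yields a summary in this format. This is only a sketch: the mgcv package and the Poisson family are assumptions (the z-value and Chi.sq columns suggest them, but the slide does not say), only three of the six parametric terms are included, and the data are simulated stand-ins with invented values, not the Harrison data.

```r
library(mgcv)  # assumption: gam() from mgcv; the slide does not name the package

set.seed(1)
n <- 200
# Simulated stand-in data; column names mirror the slide, values are invented.
sim <- data.frame(
  AttackerisChallenged = rbinom(n, 1, 0.4),
  TargetisMilitary     = rbinom(n, 1, 0.1),
  ResponsibleHamas     = rbinom(n, 1, 0.4),
  AgeofFirstAttacker   = sample(16:48, n, replace = TRUE),
  Date                 = sort(sample(1:1100, n))
)
sim$NumberKilled <- rpois(n, exp(1 - sim$AttackerisChallenged +
                                   0.5 * sim$ResponsibleHamas))

# Parametric terms enter linearly on the log scale; te() is a
# tensor-product smooth over the two continuous covariates.
fit <- gam(NumberKilled ~ AttackerisChallenged + TargetisMilitary +
             ResponsibleHamas + te(log(AgeofFirstAttacker), log(Date)),
           family = poisson, data = sim)
summary(fit)
```

The summary reports the parametric coefficients and the smooth term separately, which is the "mixing" of estimated and smoothed fits described above.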

Viewing the Nonparametric Results

[Figure: two views of the fitted smooth surface te(log(AgeofFirstAttacker),log(Date),5.6), plotted over log(AgeofFirstAttacker) and log(Date).]

Hidden Effects

◮ This past example is considered an “easy case” since it is confined to a single nation, with a

well-identified problem.

◮ Most collections of terrorism data contain heterogeneous hidden, possibly clustered, effects from:

⊲ actors who are trying to hide important information,

⊲ who are also trying to purposely mislead observers,

⊲ the presence of strong network effects, even if the whole network is not observable,

⊲ measurement on groups that are highly imitative,

all suggesting latent structures in the data that are not directly measured by the explanatory

variables.

◮ So how do we account for such unobserved heterogeneity?

A New Modeling Enhancement

◮ Joint work with George Casella (JASA 2009, Annals of Stats 2010, etc.).

◮ Let’s add a “random effect” term that accounts for heterogeneity:

$$Y = \beta_0 + X_1\beta_1 + \cdots + X_k\beta_k + \Psi + \epsilon$$

where the new term adds some differences by group to each case: Ψ = [ψ1, ψ2, . . . , ψ103] (with mean zero, and not unique), just so that the model fits better.

◮ The problem with this is that it does not account for any information yet, and we have to know

grouping information.

◮ A more useful version is the Dirichlet Process Random Effects Model, which pulls out subtle information in the X matrix “non-parametrically” so that these ψi values are assigned accounting for latent information:

$$Y = \beta_0 + X_1\beta_1 + \cdots + X_k\beta_k + DP(m, G_0) + \epsilon,$$

which is a computationally intensive process that iteratively fits many different binning assignments as a Gibbs Sampler runs, and summarizes the results in the final model.

Dirichlet Process Priors, Some Background Definitions

◮ Y is a random variable taking values on the measurable space (Y ,B), defined by the support of

Y and an arbitrary (for now) abstract space B.

◮ The “parameter” of interest here is P , the associated, but unknown, probability measure taking

values in P , the collection of all probability measures on (Y ,B).

◮ Define S as the smallest σ-field (closed under countable unions) generated by sets of the form:

{P : P (A) < r}, where: A ∈ B, r ∈ [0 : 1]

◮ Now define ν as a probability measure on (P ,S), which can be used as a prior distribution for the

unknown P .

◮ We are interested in computing ν∗, the posterior distribution of P |Y .

◮ ν is called a Dirichlet Measure if for every measurable partition {B1, . . . , BK} (and finite K) of

the parameter space B, the distribution of P (B1), . . . , P (BK) under ν is Dirichlet:

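$$f(y \mid \alpha_1, \ldots, \alpha_K) \propto y_1^{\alpha_1 - 1} \cdots y_K^{\alpha_K - 1}, \quad 0 \le y_i \le 1, \quad \sum_{i=1}^{K} y_i = 1, \quad 0 < \alpha_i, \; \forall i \in [1, 2, \ldots, K].$$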

The Distributional Structure

◮ Ferguson (1973, 1974, 1983) and Antoniak (1974) introduced the Dirichlet process prior for nonparametric G, which is this random probability measure on the space of all measures.

◮ We notate this distribution conventionally over the space of distributions by:

⊲ G0, a base distribution (finite non-null measure) which is analogous to an “expected value” of

the distributions,

⊲ λ > 0, a concentration/precision parameter (finite and positive scalar) giving the spread of distributions around G0,

⊲ therefore φ0 = λG0 is a base measure,

⊲ leading to the prior specification G ∼ DP(λ,G0) ∈ P .

◮ For any finite partition of the parameter space, {B1, . . . , BK}, the joint distribution of these

probabilities has the Dirichlet distribution, now according to:

$$\left(G(B_1), \ldots, G(B_K)\right) \sim \mathcal{D}\left(\lambda G_0(B_1), \ldots, \lambda G_0(B_K)\right),$$

where for some observed partition, these are just multinomial probabilities.
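The finite-partition property can be checked numerically. The sketch below assumes a standard normal base distribution G0, an arbitrary precision λ = 5, and a four-interval partition of the real line; none of these choices come from the slides. Base R has no Dirichlet sampler, so draws are built by normalizing independent gamma variates.

```r
set.seed(2)
lambda <- 5                       # illustrative precision value (assumption)
breaks <- c(-Inf, -1, 0, 1, Inf)  # partition B_1, ..., B_4 of the real line
g0 <- diff(pnorm(breaks))         # G0(B_k) under an assumed N(0, 1) base

# One Dirichlet(lambda * G0(B_1), ..., lambda * G0(B_K)) draw,
# generated by normalizing independent gamma variates.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}
draws <- t(replicate(5000, rdirichlet1(lambda * g0)))

colMeans(draws)  # close to g0: G0 acts as the "expected value"
g0
```

Raising λ tightens the draws around G0(B_k); lowering it spreads them out, which is the sense in which λ controls the spread of distributions around G0.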

Setting Up the Estimation Process

◮ Since realizations of the DP select a discrete distribution with probability one (even though the

generating mechanism is continuous), the model for the random effect ψ is a countably infinite

mixture (some key papers: Ferguson 1973, Antoniak 1974, Berry & Christensen 1979, Lo 1984,

Escobar & West 1995, MacEachern & Muller 1998).

◮ Blackwell and MacQueen (1973) noted the following (generally, not random effects):

⊲ If G is a DP , where ψ1, . . . , ψn iid from G,

⊲ then the marginal distribution of ψ1, . . . , ψn (marginalized over any prior parameters) is equal

in distribution to the first n steps of a Polya process.

◮ Blackwell and MacQueen then proved that the joint distribution of ψ is a product of successive

conditional distributions of the form:

$$\psi_i \mid \psi_1, \ldots, \psi_{i-1} \sim \frac{\lambda}{i - 1 + \lambda}\,\phi_0(\psi_i) + \frac{1}{i - 1 + \lambda}\sum_{l=1}^{i-1}\delta(\psi_i = \psi_l),$$

where δ denotes the Dirac delta function.

◮ Therefore reference can be made to finite rather than infinite dimensions, and Dirichlet process

posterior calculations involve a single parameter over this space (Ferguson’s Theorem 1, 1973).

Review of the Polya Process

◮ The Polya Process for sampling ψ is equivalent to the following permutation scheme:

⊲ a restaurant has many large circular tables.

⊲ n diners enter one-at-a-time to be seated, where the first person sits at the first table.

⊲ For a given weight, λ, the ith person sits at the unoccupied ith table with probability

λ/(i− 1 + λ).

⊲ Otherwise this diner selects the jth (j < i) previously occupied table with probability

nj/(i− 1 + λ), where nj is the number seated at that table already.

◮ Now the table locations of the seated diners, ξ1, . . . , ξn, is a dependent exchangeable sequence.

◮ ξ∗ = (ξ1, . . . , ξk) with k ≤ n, the set of non-empty tables, is a sample from G.

◮ This process can be iterated many times to numerically integrate over this space.
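The seating scheme is short enough to simulate directly. This is a minimal sketch, with λ = 2 and n = 103 chosen only for illustration; a new table is given the next unused label rather than label i, which leaves the resulting partition unchanged.

```r
# Seat n diners by the scheme above; returns each diner's table label.
polya_seating <- function(n, lambda) {
  tables <- integer(n)
  tables[1] <- 1L                  # the first diner sits at the first table
  for (i in seq_len(n)[-1]) {
    counts <- tabulate(tables[seq_len(i - 1)])   # n_j for occupied tables
    # Occupied table j has weight n_j and a new table has weight lambda;
    # sample() normalizes by their sum, i - 1 + lambda.
    tables[i] <- sample(length(counts) + 1L, 1L, prob = c(counts, lambda))
  }
  tables
}

set.seed(3)
xi <- polya_seating(n = 103, lambda = 2)  # lambda = 2 is an arbitrary choice
table(xi)                                  # subcluster sizes n_j
```

Repeating the call many times gives a Monte Carlo sample of partitions; larger λ tends to open more tables.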

Models and Likelihood

◮ A general random effects Dirichlet Process model can now be written definitionally as:

$$(Y_1, \ldots, Y_n) \sim f(y_1, \ldots, y_n \mid \theta, \psi_1, \ldots, \psi_n) = \prod_{i=1}^{n} f(y_i \mid \theta, \psi_i), \quad \psi_i \sim DP(\lambda, \phi_0), \quad i = 1, \ldots, n$$

(the vector θ here is a placeholder for all of the other estimated parameters, X assumed).

◮ Applying the successive conditional distributions of Blackwell and MacQueen, we integrate over the

random effects to get the likelihood function:

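$$L(\theta \mid y) = \int f(y_1, \ldots, y_n \mid \theta, \psi_1, \ldots, \psi_n)\,\pi(\psi_1, \ldots, \psi_n)\,d\psi_1 \cdots d\psi_n$$

$$= \frac{\Gamma(\lambda)}{\Gamma(\lambda + n)} \sum_{k} \lambda^{k} \sum_{C:|C|=k} \prod_{j=1}^{k} \Gamma(n_j) \int f(y_{(j)} \mid \theta, \psi_j)\,\phi_0(\psi_j)\,d\psi_j$$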

where the second form is derived in Lo (1984 Annals) Lemma 2 and Liu (1996 Annals), and:

⊲ C is a partition of the sample of size n into k groups, k = 1, . . . n− 1

⊲ y(j) is the vector of yis in subcluster j

⊲ ψj is the common random effects parameter applied to that subcluster.

Matrix Representation of Partitions

◮ Since every “diner” at a given table gets the same random effects value, we want an efficient way

to keep track of assignments on each cycle of the sampler.

◮ Associate a binary matrix An×k with a given partition C, for example:

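$$C = \{S_1, S_2, S_3\} = \{\{1, 2\}, \{3, 4, 6\}, \{5\}\} \leftrightarrow A = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$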

◮ Rows: ai is a 1× k vector of all zeros except for a 1 in its subcluster

◮ Columns: The column sums of A are the number of observations in the groups

◮ Variables: thus ψi ∈ Sj ⇒ ψi = ηj (constant in subclusters)

◮ This is similar to (but different from) the matrix approach in McCullagh and Yang (2006).

Mapping Partitions to the Underlying Random Effects

◮ Continuing with the contrived example:

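$$C = \{S_1, S_2, S_3\} = \{\{1, 2\}, \{3, 4, 6\}, \{5\}\} \leftrightarrow A = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

◮ This leads to the matrix representation:

$$\psi = A\eta, \quad \text{where } A = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}, \quad \psi = \begin{pmatrix} \psi_1 \\ \psi_2 \\ \vdots \\ \psi_n \end{pmatrix}, \quad \eta = \begin{pmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{pmatrix}$$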

◮ So we only need to generate three random variables in the sampler.
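For the contrived partition, the bookkeeping looks as follows in base R. This is a sketch: the label vector and τ = 1 are illustrative choices, and η is drawn from a N(0, τ²) base purely for demonstration.

```r
# Subcluster label of each of the n = 6 observations for
# C = {{1, 2}, {3, 4, 6}, {5}}.
labels <- c(1, 1, 2, 2, 3, 2)
k <- max(labels)

# Row i of A is all zeros except for a 1 in column labels[i].
A <- diag(k)[labels, , drop = FALSE]

colSums(A)               # subcluster sizes: 2, 3, 1

set.seed(4)
tau <- 1                 # illustrative value (assumption)
eta <- rnorm(k, 0, tau)  # only k = 3 draws are needed
psi <- as.vector(A %*% eta)
psi                      # observations in the same subcluster share a value
```

Updating a partition inside a sampler then only means changing `labels` and rebuilding A; the length of η follows the number of occupied subclusters.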

Incorporating the A Matrix

◮ Return to:

$$Y \mid \psi \sim N(X\beta + \psi, \sigma^2 I), \quad \text{where } \psi_i \sim DP(\lambda, N(0, \tau^2)), \quad i = 1, \ldots, n$$

where we are explicitly averaging over all normals with mean zero as our DPP choice.

◮ Introduce the A matrices to get

$$Y \mid A, \eta \sim N(X\beta + A\eta, \sigma^2 I), \quad \eta \sim N_k(0, \tau^2 I),$$

meaning that η is now the focus of the Bayesian nonparametric process.

◮ Now marginalizing over these η, we find that:

$$Y \mid A \sim N(X\beta, \Sigma^*), \quad \Sigma^* = \sigma^2\left(I + \frac{\tau^2}{\sigma^2} AA'\right)$$

since the DPP is applied to the random effects only.
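The structure of the marginal covariance is easy to inspect for the contrived six-observation partition. The variance values below are arbitrary illustrations, not estimates from any of the models in these slides.

```r
labels <- c(1, 1, 2, 2, 3, 2)          # C = {{1, 2}, {3, 4, 6}, {5}}
A <- diag(max(labels))[labels, , drop = FALSE]
sigma2 <- 1.5                          # illustrative sigma^2 (assumption)
tau2   <- 0.8                          # illustrative tau^2 (assumption)

# Sigma* = sigma^2 (I + (tau^2 / sigma^2) A A')
Sigma_star <- sigma2 * (diag(nrow(A)) + (tau2 / sigma2) * A %*% t(A))

# A A' is 1 where two observations share a subcluster, so those pairs
# have covariance tau^2 and all other pairs have covariance 0.
Sigma_star[1, 2]   # same subcluster
Sigma_star[1, 3]   # different subclusters
diag(Sigma_star)   # each variance is sigma^2 + tau^2
```

The partition therefore enters the likelihood only through the block pattern of AA′.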

Does Democracy Invite Terrorism?

◮ Looking at terrorist activity in 22 Asian countries over 8 years (1990-1997).

◮ Is there a relationship between levels of democracy and the number of terrorist attacks?

◮ Data problems restrict the number of cases to 150, and require us to use the Dirichlet Process

Random Effects Model to handle latent heterogeneity.

◮ The outcome of interest is dichotomous indicating whether or not there was at least one major

violent terrorist act in a country/year pair:

◮ We also include 4 explanatory variables in the model. . .

◮ DEM: measures democracy from the Polity IV 21-point democracy scale ranging from -10 indicating

a hereditary monarchy to +10 indicating a fully consolidated democracy:

-8 -7 -2 -1 0 1 3 4 5 6 7 8 9 10

1 31 7 2 4 4 3 4 19 7 5 18 10 35

◮ FED: assigned 0 if sub-national governments do not have substantial taxing, spending, and regulatory authority, and 1 otherwise:

0 1

122 28

◮ SYS: coded as 0 for direct presidential elections, 1 for strong president elected by assembly (including sham assemblies), and 2 for dominant parliamentary government:

0 1 2

37 27 86

◮ AUT is a dichotomous variable indicating whether or not there are autonomous regions not directly

controlled by central government:

◮ So now our model looks like this:

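$$\text{logit}(Y) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \text{DEM}\,\beta_1 + \text{FED}\,\beta_2 + \text{SYS}\,\beta_3 + \text{AUT}\,\beta_4 + DP(m, G_0) + \epsilon,$$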

where p = p(Y = 1) is the probability of a “success,” given levels of the X variables.

◮ Expressed in this way, the specification features the “log-odds” model interpretation since log()

denotes the natural log function and p/(1− p) transforms probability to odds.

◮ Notice the use of the Dirichlet Process Random Effect here.

Dirichlet Process Model

Explanatory Variable    COEF     SE       95% CI Lower   95% CI Upper   Odds-Ratio
Intercept               0.127    0.188    -0.241          0.495         1.135
DEM (-10:10)            0.058    0.019     0.020          0.095         1.060
FED (0,1)               0.258    0.254    -0.241          0.756         1.294
SYS (0,1,2)            -0.420    0.137    -0.690         -0.151         0.657
AUT (0,1)               0.450    0.371    -0.277          1.176         1.568
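The Odds-Ratio column is the exponentiated coefficient, which can be reproduced from the reported values. The interval calculation below assumes a normal approximation (COEF ± 1.96 SE); the slides do not say how the intervals were computed, and this approximation matches the table only up to rounding of the inputs.

```r
coef <- c(Intercept = 0.127, DEM = 0.058, FED = 0.258,
          SYS = -0.420, AUT = 0.450)
se   <- c(0.188, 0.019, 0.254, 0.137, 0.371)

odds_ratio <- exp(coef)                       # log-odds scale to odds scale
ci <- cbind(lower = coef - 1.96 * se,         # assumed normal approximation
            upper = coef + 1.96 * se)
round(cbind(odds_ratio, ci), 3)
```

Read this way, each one-point increase on the democracy scale multiplies the odds of at least one major attack by about 1.06, holding the other variables fixed.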

What Causes Terrorist Groups To Use Suicide Attacks, Background

◮ Suicide attacks pose a substantially greater challenge for governments since the assailant has great control over placement and timing and also does not need to plan his or her escape.

◮ The data we use here come from the Global Terrorism Database II (LaFree & Dugan 2008),

restricted here to events in the Middle East and Northern Africa from 1998 to 2004.

◮ There were 273 terrorist attacks worldwide in 1998 with a (then) recorded high of 741 killed along

with 5952 injured.

◮ This starting year was also notable for the incredibly destructive simultaneous August truck bombings of U.S. Embassies in Nairobi, Kenya (212 killed and roughly 5000 injured), and Dar es Salaam, Tanzania (11 killed and roughly 85 injured).

◮ After removing almost totally incomplete cases, this provides 1041 violent attacks by terrorist

groups, 154 (15%) of which were suicide attacks where at least one of the individual assailants was

killed by design.

◮ Our outcome variable of interest is therefore the dichotomous observation of a suicide attack or not.

◮ Again use the Dirichlet Process Random Effects Model.

What Causes Terrorist Groups To Use Suicide Attacks, Explanatory Variables

◮ MULT.INCIDENT: whether the attack is part of a coordinated multi-site event (13.1%).

◮ MULT.PARTY: multiple groups claiming credit, 136 out of 1041 cases coded as one.

◮ SUSP.UNCONFIRM is coded as one (209/1041) if government officials express notable doubt about

attributing responsibility.

◮ SUCCESSFUL (some damage in 966 of the events): asks, given that it is a successful attack, how likely is it that a suicide assailant was used?

◮ WEAPON.TYPE: coded one for the use of: explosives, dynamite, or general bombs (558/1041).

◮ NUM.FATAL (3424 total).

◮ NUM.INJUR (8123 total).

◮ PSYCHOSOCIAL with ascending levels: none (18), minor (946), moderate (66), and major (11).

◮ PROPERTY.DAMAGE: no (480), minor (1), yes (560).

What Causes Terrorist Groups To Use Suicide Attacks, Model Results

Dirichlet Process Model

Explanatory Variable COEF SE 95% CI

Intercept -4.105 0.559 -5.276 -3.079

YEAR - 1998 0.195 0.039 0.121 0.273

MULT.INCIDENT -0.585 0.221 -1.028 -0.162

MULTI.PARTY -0.626 0.229 -1.088 -0.189

SUSP.UNCONFIRM -0.061 0.198 -0.455 0.331

SUCCESSFUL -0.695 0.245 -1.172 -0.210

WEAPON.TYPE 1.725 0.320 1.162 2.422

TARGET.TYPE -0.038 0.185 -0.434 0.323

NUM.FATAL -0.013 0.012 -0.036 0.009

NUM.INJUR 0.017 0.004 0.008 0.025

PSYCHOSOCIAL 0.555 0.192 0.188 0.944

PROPERTY.DAMAGE 0.297 0.094 0.114 0.483

Substantive Clustering Strategy

◮ In addition to the DPP component for random effects, we search for partitions of Y into clusters

Cℓ, ℓ = 1, . . . ,m, where m (the number of clusters) is an unknown parameter.

◮ Let Yℓ be a vector of length nℓ containing the Yi in cluster Cℓ, then:

$$Y_\ell = X_\ell\beta_\ell + A_\ell\eta_\ell + \epsilon_\ell$$

where:

⊲ Xℓ and Aℓ are composed of the rows corresponding to the Yi in cluster Cℓ,

⊲ unknown βℓ and σ²ℓ (where εℓ ∼ N(0, σ²ℓ I_{nℓ})) are specific to cluster Cℓ.

◮ Given a partition C of Nn := {1, 2, . . . , n} that has m < n clusters denoted by C1, . . . , Cm, the data are a realization from a density of the form:

$$f(y \mid \beta_C, \sigma^2_C, C) = \prod_{\ell=1}^{m} \prod_{i \in C_\ell} f(y_i \mid \beta_\ell, \sigma^2_\ell).$$

◮ So unlike the mixture model, this model recognizes a parameter, C, that is directly connected to

the basic clustering problem.

◮ This model incorporates clustering in the data in two distinct ways:

⊲ it utilizes DP random effects to model unobserved heterogeneity in the data via subclusters,

⊲ the product partition model, using C, provides substantive clusters to the data that serve to

provide insights into how that data can be broken into groups that have different behavior.

◮ Note that these groupings do not nest, and so observations in the same cluster Cℓ can belong to different subclusters defined by the columns of A (unlike Hartigan and Barry 1992).

BAAD Data

◮ Big Allied and Dangerous (BAAD) Database 1 (Asal, Rethemeyer & Anderson 2008).

◮ Assembled from several established databases: Memorial Institute for the Prevention of Terrorism’s

(MIPT) Terrorism Knowledge Base (TKB), Correlates of War (COW), Polity, and Polity2.

◮ This aggregates 395 worldwide lethal attacks from 1998-2005 by terrorist organizations.

◮ We use the version of their dataset that excludes Al Qaeda since its scope, profile, and effectiveness

place it in a unique category during this period.

◮ The variable fatalities (total number) is used as the outcome variable to focus on the primary

purpose of these attacks.

BAAD Explanatory Variables Used

◮ statespond indicates whether the group is financially or logistically supported by one or more

recognized governments (coded 1, n1 = 32), or not (coded 0, n0 = 363).

◮ masterccode denotes the COW CCODE value: where (country/region) attack took place.

◮ ordsize is size, coded 0 for fewer than 100 members (n0 = 261), 1 for 101-1,000 members (n1 = 77), 2 for 1,001-10,000 members (n2 = 45), and 3 for more than 10,000 members (n3 = 12).

◮ terrStrong is coded 1 (n1 = 43) if they possess territory and 0 if they do not (n0 = 352).

◮ degree gives a count of alliance connections in the network sense.

More BAAD Explanatory Variables Used

◮ LeftNoReligEthno, where a 1 indicates that the group’s ideology is leftist and it is not compounded with another ideological orientation (n1 = 94), and a 0 indicates that the group’s ideology is either not leftist or is a mix of leftist and other ideological dimensions (n0 = 301).

◮ PureRelig indicates with a 1 whether the group’s ideology is purely religious and not associated

with other political or social factors (n1 = 50), and 0 otherwise (n0 = 345).

◮ PureEthno indicates with a 1 whether the group is ethnonationalist (nationalist causes tied to

ethnic identity) and not associated with other ideological factors (n1 = 26), and 0 otherwise

(n0 = 369).

◮ Islam where a 1 is assigned to groups inspired by some form of Islam (n1 = 287) and 0 otherwise

(n0 = 108).

BAAD Model Results

◮ We estimate the DPP/Product Partition model using the sampler described.

◮ The Gibbs Sampler is run for 10,000 iterations disposing of the first 5,000 as burn-in.

◮ Convergence is assessed with superdiag, a diagnostic suite provided by an R package (Tsai and Gill 2012) that calls all of the conventional convergence diagnostics typically used (Gelman & Rubin, Geweke, Heidelberger & Welch, Raftery & Lewis).

◮ We also found no evidence of non-convergence with standard graphical tools (traceplots, cumsum

diagrams, etc.).

◮ The highest posterior probability cluster arrangement across these iterations (0.65191):

1 2 3 4

272 7 52 64

◮ Now we run a regular linear model (diffuse proper priors) with a single shared random effect and a true multilevel linear model (diffuse proper priors) with the estimated clusters as group definitions.

                    Standard Linear Model                 Multilevel Linear Model
                    Mean     Std.Err.  95% HPD            Mean     Std.Err.  95% HPD
α                  -0.290    1.287    [-2.811: 2.232]
α1                                                        -3.835    0.843    [-5.486:-2.184]
α2                                                         0.383    1.480    [-2.517: 3.283]
α3                                                        -1.905    1.040    [-3.942: 0.133]
α4                                                        19.235    1.139    [17.002:21.468]
statespond          0.514    1.193    [-1.824: 2.851]      3.590    0.840    [ 1.945: 5.235]
masterccode         0.006    0.032    [-0.057: 0.069]     -0.054    0.019    [-0.092:-0.016]
ordsize             4.749    0.719    [ 3.339: 6.159]      3.163    0.452    [ 2.277: 4.049]
terrStrong          3.849    1.355    [ 1.193: 6.504]      1.886    0.974    [-0.022: 3.795]
degree              2.307    0.298    [ 1.723: 2.890]      1.169    0.179    [ 0.818: 1.520]
LeftNoreligEthno    0.290    1.070    [-1.808: 2.388]      0.838    0.707    [-0.548: 2.224]
PureRelig           1.131    1.307    [-1.431: 3.694]      1.669    0.955    [-0.202: 3.540]
PureEthno          -0.948    1.410    [-3.713: 1.816]     -1.378    1.045    [-3.427: 0.670]
Islam               2.851    1.203    [ 0.492: 5.210]      3.179    0.857    [ 1.499: 4.858]
τ                   0.009    0.001    [ 0.007: 0.020]      0.027    0.002    [ 0.023: 0.031]
Summed Deviance     3002                                   2553

                    Variance  Std.Dev.
σα                  113.44    10.65
σy                    1.31     1.15

Statistical Social Network Analysis Approaches to Understanding Terrorist Groups

◮ Mapping the social network around the 19 9/11 hijackers revealed some of the outer organization. Krebs, “Mapping Networks of Terrorist Cells.” Connections 24(3), 2002, 43-52.

◮ Cohesive subgroups and the number of hubs (central points) in a network have an influence on the network’s effectiveness. Pedahzur & Perliger, “The Changing Nature of Suicide Attacks.” Social Forces 84(4), 2006, 1987-2008.

◮ Self-learning network analyses are better describers with covert targets. Carley & Breiger (eds.), Dynamic Network Analysis in the Summary of the NRC workshop on Social Network Modeling and Analysis. National Research Council.

◮ Data mining combined with SNA can reveal hidden structural patterns in large networks. Xu & Chen, “Criminal Network Analysis and Visualization.” Communications of the ACM 48(6), 2005, 100-107.

[Figure: network graph with labeled nodes, including ROMUL, BONAVEN, AMBROSE, VICTOR, JOHN, GREG, and ALBERT.]

Covert/Terror Network Analysis

◮ This is classic Social Network Analysis, except with unwilling and surreptitious targets.

◮ The central goal is to determine which actors, nodes, are important and how they communicate with other actors, edges.

◮ Governments also want to understand the effects of removing nodes or edges.

◮ However, the defining characteristic of these networks is that a large amount of data is missing.

◮ And missing data is known to be deleterious in network analysis.

So What Are Elicited Prior Distributions?

◮ So one idea is to draw (elicit) qualitative information that helps fill in missingness.

◮ Joint work with John Freeman (Network Science 2013, etc.).

◮ A form of prior information produced by previous knowledge from structured interviews with

subjective area experts who have little or no concern for the statistical aspects of the project.

◮ Some potential targets for elicitation:

⊲ Policy-makers/elites

⊲ diplomats

⊲ military or intelligence experts

⊲ political professionals

⊲ previous study participants

⊲ theoretical economists

⊲ historians

⊲ jurists

⊲ regulators

⊲ community leaders

◮ The actual elicitation target in this application is a set of qualitative intelligence analysts.

A New Statistical Approach To Dealing with Network Missingness

◮ For missingness:

⊲ Elicit from analysts prior distributions for attributes that describe certainty and uncertainty.

⊲ Update these prior densities regularly to account for covert network dynamics.

⊲ Aggregate elicited prior densities to obtain still better information about edge attributes.

⊲ Incorporate elicited, aggregated information about attributes into network estimation algorithms to increase their power to predict covert network links.

◮ Byproduct:

⊲ Use of attribute prior densities is a new way to evaluate source validity, if attribute priors

are elicited from different units within a single research group and, eventually, from other

government agencies.

Analyst Elicitation Stage, General

◮ Suppose elicitations are on attribute strength: xij ∈ [0 : 1] between actor i and actor j, or just information on either individually.

◮ Example from a real data set: xij = 0 indicates certainty that actor i and actor j are not from the same country, and xij = 1 indicates certainty that actor i and actor j are from the same country.

◮ In the absence of certainty we will replace 0 and 1 with a beta distribution, which is conveniently bounded [0 : 1] and can take on a wide variety of shapes.

Analyst Elicitation Stage, General

◮ Challenges that we deal with here:

⊲ obtaining elicited prior distributions must be done without technical jargon,

⊲ many elicitees should be involved,

⊲ the quality of elicitations will differ across analysts,

⊲ elicitations should be at the convenience of the elicitees.

◮ These challenges are met by providing qualitative experts with an intuitive elicitation engine, and

keeping the detailed statistical analysis away from the elicitation process.

Analyst Elicitation Stage, Query Steps

1. The analyst at a supported location logs onto the system and picks a network edge, i, j.

2. The analyst then picks an attribute, xij to assess.

3. For the selected attribute the analyst is asked for a mean value:

“On a scale of zero to one-hundred, what is your best estimate of the strength of this

attribute?”

which gives a beta distribution mean.

4. For the variance, we could follow the PERT (Program Evaluation and Review Technique) approach and use σx ≈ 1/6, but instead we use a more conservative σx ≈ 1/4 as our starting point (from asymptotic normal distribution theory).

5. Thus we have 25% of the maximum unimodality-preserving variance as just a starting point for our software “slide.”

6. The analyst is then shown graphically on the terminal the beta distribution that results from these

statements and is allowed to modify it in terms of central location and width.

Elicited Prior Specification: One Elicitation

Elicited Prior Specification: Another Elicitation

Elicited Prior Specification: And Another Elicitation

Elicited Prior Specification: Confirmation Screen

Analyst Elicitation Stage, Parametric Principles

◮ The aggregated multi-step elicited prior actually uses the general beta distribution:

$$f(y) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \, \frac{(y - a)^{\alpha - 1}(b - y)^{\beta - 1}}{(b - a)^{\alpha + \beta - 1}},$$

where: a < y < b, α, β > 0.

◮ Here b = 100 and a = 0 for operator convenience.

◮ The general form easily reduces to the standard form with the change of variable:

$$x = \frac{y - a}{b - a}, \quad f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \, x^{\alpha - 1}(1 - x)^{\beta - 1}$$

so that 0 < x < 1, but α and β are unchanged.

◮ So in this way our mean and variance are related directly to beta distribution parameters:

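$$\mu_y = a + \mu_x(b - a), \quad \mu_x = \frac{\alpha}{\alpha + \beta}$$

$$\sigma^2_y = (b - a)^2\sigma^2_x, \quad \sigma^2_x = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$$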

Analyst Elicitation Stage, Parametric Principles

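◮ Solving these equations gives:

$$\alpha = \left[\frac{\mu_x(1-\mu_x)}{\sigma_x^2} - 1\right]\mu_x, \qquad \beta = \left[\frac{\mu_x(1-\mu_x)}{\sigma_x^2} - 1\right](1-\mu_x).$$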

◮ So if an elicitee provides estimates of both the mean and the variance, we can easily produce α

and β and thus fully describe the beta distribution of interest.

◮ Finally, if we restrict α ≥ 1 and β ≥ 1, then the beta distribution is guaranteed to be unimodal,

which is more intuitive and more supportable from a psychological point of view.

◮ Actually using α = 1 and β = 1 as an initial state before any elicitations is useful.
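The moment-matching step can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `beta_from_moments` and the example inputs are hypothetical, and the analyst is assumed to supply a mean and a standard deviation on the 0 to 100 scale.

```python
def beta_from_moments(mean_y, sd_y, a=0.0, b=100.0):
    """Method-of-moments beta parameters from a mean and sd given on [a, b]."""
    mu = (mean_y - a) / (b - a)      # mean rescaled to the unit interval
    var = (sd_y / (b - a)) ** 2      # variance rescaled to the unit interval
    if not 0.0 < mu < 1.0:
        raise ValueError("mean must lie strictly inside (a, b)")
    if var >= mu * (1.0 - mu):
        raise ValueError("variance too large for a beta distribution with this mean")
    common = mu * (1.0 - mu) / var - 1.0
    return common * mu, common * (1.0 - mu)


# Hypothetical elicitation: mean 30 and standard deviation 15 on the 0-100 scale.
alpha, beta = beta_from_moments(mean_y=30.0, sd_y=15.0)

# The unimodality restriction from the text.
unimodal = alpha >= 1.0 and beta >= 1.0
```

With these inputs the result is approximately BE(2.5, 5.83). The starting standard deviation of 1/4 (25 on this scale) with a mean of 50 gives BE(1.5, 1.5). At that width a mean of 30 gives α ≈ 0.71, so the restriction α ≥ 1 would require the analyst to narrow the distribution.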

Aggregation Stage, Data Structures

◮ After a set of these elicitations we have:

α = [α1, α2, . . . , αn] β = [β1, β2, . . . , βn]

for n elicitees for each attribute of each edge.

◮ These can be organized as [αijk, βijk] for i = 1:n elicitees, j = 1:J possible relationships, and k = 1:K attributes.

◮ Here K contains both individual attribute information and information on relationship attributes for both targets designated by edge j.

Aggregation Stage, Bayesian Updating

◮ The system is designed to be dynamic in that any authorized analyst can contribute at any time.

◮ Start with the original “day zero” assessment:

$$p_1(x) \propto x^{\alpha_1-1}(1-x)^{\beta_1-1},$$

which can be left as deliberately vague as desired.

◮ The distribution from the first analyst’s update is:

$$\pi_1(x) \propto p_1(x)\,p_2(x) = x^{\alpha_1+\alpha_2-2}(1-x)^{\beta_1+\beta_2-2}.$$

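◮ So the nth update is given by:

$$\pi_n(x) \propto x^{\sum_{i=1}^{n}\alpha_i - n}\,(1-x)^{\sum_{i=1}^{n}\beta_i - n},$$

which is to say that x after update n is distributed as:

$$x \mid \boldsymbol{\alpha}, \boldsymbol{\beta} \sim \mathcal{BE}\left(\sum_{i=1}^{n}\alpha_i - n + 1,\; \sum_{i=1}^{n}\beta_i - n + 1\right).$$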
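A minimal sketch of this updating rule, assuming each assessment is stored as an (α, β) pair. The function name `pool_beta_priors` and the example values are hypothetical.

```python
def pool_beta_priors(alphas, betas):
    """Multiply n beta kernels: the result is BE(sum(a) - n + 1, sum(b) - n + 1)."""
    if len(alphas) != len(betas) or not alphas:
        raise ValueError("need equal-length, non-empty parameter lists")
    n = len(alphas)
    return sum(alphas) - n + 1, sum(betas) - n + 1


# Day-zero BE(1, 1) followed by two hypothetical analyst assessments.
alphas = [1.0, 2.5, 4.0]
betas = [1.0, 5.0, 3.0]
alpha_n, beta_n = pool_beta_priors(alphas, betas)
pooled_mean = alpha_n / (alpha_n + beta_n)
```

Here the pooled distribution is BE(5.5, 7.0). The vague BE(1, 1) starting state contributes nothing to either parameter, and pooling one assessment at a time gives the same answer as pooling them all at once, which is what lets analysts contribute at any time.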

Link Elicitation Experiment

◮ Subjects: 63 university student participants at the University of Minnesota, who are given a tutorial first.

◮ Edge elicitation: a social network in EastEnders.

◮ Procedure: show a short clip from EastEnders with interacting characters.

◮ Use DVD technology to present this on the same screen with headphones.

◮ First Elicitation: a question about the likelihood that two actors in the clip will take a certain action.

◮ Second Elicitation: show an additional clip (in sequence from the original) with the same interacting characters.

◮ Elicit an assessment again of the likelihood that the two characters will engage in social activity; the phrasing falsely implies sisterhood.

Data Structures

◮ Define first the n×n symmetric matrix Y giving a mapping of links between n named (terrorist)

individuals.

◮ Here, yij = 1 indicates a known link between node i and node j, yij = 0 indicates the absence of evidence for a link, and numbers in between come from network predictions.

◮ Now define the n × n × K array X where for each n × n relationship between individual i

and individual j, there is a K-length vector of covariate information containing: attributes for

i, attributes for j, and natural relationship attributes (CoO, training camp, sect, skills, joint

operations, relatives, etc.) between i and j.

◮ These X values could be known, or they could be unknown but possess elicited priors, in which case the array value is a placeholder for the distribution.

Exponential Random Graph Model

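◮ An appealing model that relates X and Y is the random effects logistic regression specification:

$$p(\mathbf{Y} \mid \theta_{ij}) = \prod_{i,j} \frac{\exp(\theta_{ij})}{1+\exp(\theta_{ij})}, \qquad \theta_{ij} = \boldsymbol{\beta}'\mathbf{X}_{ij} + z_{ij}, \qquad z_{ij} = \mathbf{u}_i'\boldsymbol{\gamma}\mathbf{v}_j + \epsilon_{ij},$$

where β is a K-length vector of coefficients to estimate, and zij is a random effects term to account for dependencies between attribute relationships.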

◮ The random effects term is broken up into components: a u′i vector of sender-specific latent or known factors, a vj vector of receiver-specific latent or known factors, a γ diagonal matrix of unknown coefficients, plus an εij scalar error specific to the edge.

◮ This last component allows for asymmetric relationships, for example Abu-Mohammed al-Maqdisi

was the mentor to Abu Musab al-Zarqawi.

◮ So we have in this model log-odds(yij = 1) = θij where the parameters of interest are β and γ,

giving the relative importance of covariates or latent factors respectively.
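The linear predictor and link probability for a single edge can be sketched as follows. All inputs here are simulated placeholders, the latent vectors are given length K only for convenience, and the helper name `link_probability` is hypothetical. Because γ is diagonal, u′i γ vj reduces to a weighted sum over the latent dimensions.

```python
import math
import random

rng = random.Random(0)
n, K = 4, 3  # nodes and covariate/latent dimension (illustrative sizes)

# Simulated stand-ins for the quantities in the model.
X = [[[rng.gauss(0, 1) for _ in range(K)] for _ in range(n)] for _ in range(n)]
beta = [0.5, -1.0, 0.25]                                      # covariate coefficients
U = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(n)]   # sender factors u_i
V = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(n)]   # receiver factors v_j
gamma_diag = [1.0, 0.5, 0.25]                                 # diagonal of gamma
eps = [[rng.gauss(0, 0.1) for _ in range(n)] for _ in range(n)]


def link_probability(i, j):
    """Return (theta_ij, p_ij) with p_ij = exp(theta_ij) / (1 + exp(theta_ij))."""
    fixed = sum(b * x for b, x in zip(beta, X[i][j]))                     # beta' X_ij
    bilinear = sum(u * g * v for u, g, v in zip(U[i], gamma_diag, V[j]))  # u_i' gamma v_j
    theta = fixed + bilinear + eps[i][j]
    return theta, 1.0 / (1.0 + math.exp(-theta))


theta_01, p_01 = link_probability(0, 1)
theta_10, p_10 = link_probability(1, 0)
```

The two directions of the same pair generally get different values, since the covariates, the sender and receiver roles, and the edge error all depend on the order of i and j.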

Full Model Specification

◮ X∗ represents the X values that are not known with certainty and given (weighted) beta priors

from our elicitation procedure,

◮ U and V are both n×K matrices that collect the u′i and vj terms.

◮ Then, given prior distributions on the model parameters and our elicited priors, we obtain their

posterior distribution with:

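$$\underbrace{p(\Theta,\boldsymbol{\beta},\mathbf{X}^*,\mathbf{U},\boldsymbol{\gamma},\mathbf{V} \mid \mathbf{X},\mathbf{Y})}_{\text{posterior distribution}} \propto \underbrace{p(\mathbf{Y},\mathbf{X} \mid \Theta,\boldsymbol{\beta},\mathbf{X}^*,\mathbf{U},\boldsymbol{\gamma},\mathbf{V})}_{\text{joint data distribution}} \times \underbrace{p(\Theta,\boldsymbol{\beta},\mathbf{X}^*,\mathbf{U},\boldsymbol{\gamma},\mathbf{V})}_{\text{prior distributions}}$$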

◮ And this model is also estimated with Gibbs Sampling.

◮ Taking the estimated parameters we can get predictions, Y, and graph the model. . .

Updating EastEnders Network with Experimental Priors

Estimated Edge Changes (“going out later”) Between Kat Slater and Mo Harris

Another Elicitation Example: the Northern Irish “Troubles”

◮ Consider 60 well-known figures of the Provisional Irish Republican Army:

henderson   app.bricklayer     macbrdaigh  <NA>               campbell    breadserver        kelly       van-driver
Mcdermott   electrician        black-dnnly various.jobs       mccrudden   barman             fox         appr.wlder.unem
forsythe    wkd.at.foundry     ryan        sell.ap.rn.etc     clarke      <NA>               bailey      <NA>
jordan      various.jobs       mcparland   cabinet.maker      quigley     student            mcgrillen   self-emp.lrydrvr
finucane    at.flower.mill     steele      bakers.roundsmn    mcareavey   chef               tolan       <NA>
hall        steel.erector      blake       van.driver         donaghy     <NA>               carson      docl.laborer
fennell     appr.engineer      mcgoldrick  app.plumber        mckinney    swyrls.cstle.st    delaney     <NA>
rooney      various.jobs       hughes      <NA>               mcguire     barman             olneil      <NA>
mcdermott   none.listed        simpson     fitter.Omackies    carberry    heatng.enginr      hannaway    <NA>
kane        wk.scrap.merch     olneil      insurance.clerk    liggett     <NA>               burns       <NA>
lennon      none.listed        kavanagh    app.compositor     olrawe      docker             campbell    <NA>
o’callaghan lorry.driver       johnston    Market.Short.H     mulvenna    appr.jointer       dempsey     <NA>
Turley      <NA>               crossan     bricklayer         bryson      appr.bricklyer     mckenna     <NA>
mckernan    <NA>               mccann      txtle.scn.prnter0l skillen     bricklayer         kane        furniture.bus.
mccracken   <NA>               lewis       <NA>               stone       car.sprayer        saunders    time-mtion.clrk

◮ Other covariates include: first initial, year born, year died, year joined, age died, where from, battalion, how died, where died, career trajectory, rank at death, married, children, partner pregnant, Republican family, been in jail, sex.

◮ Starting with basic knowledge, obtain elicitations from journalism students at City University

London (thanks to Prof. Richard Collins).

Updating the PIRA Network from Journalist Elicitations

THANK YOU!

Likelihood and Estimation

◮ A general random effects Dirichlet Process model can be written

$$(Y_1,\ldots,Y_n) \sim f(y_1,\ldots,y_n \mid \theta,\psi_1,\ldots,\psi_n) = \prod_{i} f(y_i \mid \theta,\psi_i), \qquad \psi_i \sim \mathcal{DP}(m,\phi_0), \quad i = 1,\ldots,n-1$$

(the vector θ here is a placeholder for all of the other estimated parameters, including the β).

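◮ Applying the successive conditional distributions, we can integrate over the random effects to get the joint distribution of the data:

$$L(\theta \mid \mathbf{y}) = \int f(y_1,\ldots,y_n \mid \theta,\psi_1,\ldots,\psi_n)\,\pi(\psi_1,\ldots,\psi_n)\,d\psi_1\cdots d\psi_n = \frac{\Gamma(m)}{\Gamma(m+n)} \sum_{k} m^{k} \sum_{C:|C|=k} \prod_{j=1}^{k} \Gamma(n_j) \int f(\mathbf{y}_{(j)} \mid \theta,\psi_j)\,\phi_0(\psi_j)\,d\psi_j,$$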

which gives estimates of all of the desired regression parameters, and

⊲ C is a partition of the sample of size n into k groups, k = 1, . . . n− 1

⊲ y(j) is the vector of yis in subcluster j

⊲ ψj is the common parameter applied to that subcluster.
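The combinatorial part of this likelihood can be checked by brute force for a small sample. The sketch below assumes the standard Dirichlet process partition weight, Γ(m)/Γ(m+n) · m^k · ∏ Γ(nj) for a partition into k subclusters of sizes nj, and leaves out the data integrals entirely. The function names and the values n = 5, m = 1.5 are illustrative, and the enumeration runs over every partition size.

```python
from math import gamma


def set_partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn ...
        for idx in range(len(partition)):
            yield partition[:idx] + [[first] + partition[idx]] + partition[idx + 1:]
        # ... or open a new block for it.
        yield [[first]] + partition


def dp_partition_weight(partition, m):
    """Gamma(m)/Gamma(m+n) * m^k * prod_j Gamma(n_j) for one partition."""
    n = sum(len(block) for block in partition)
    weight = gamma(m) / gamma(m + n) * m ** len(partition)
    for block in partition:
        weight *= gamma(len(block))
    return weight


n, m = 5, 1.5
partitions = list(set_partitions(list(range(n))))
total = sum(dp_partition_weight(C, m) for C in partitions)
```

For five observations there are 52 partitions and the weights sum to one, so they act as a prior distribution over subcluster assignments. The number of partitions grows very quickly with n, which is why enumeration is only feasible for toy cases.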

Aggregation Stage, Single Node Updates

◮ In cases where xi and xj distributions (attributes on individuals only) are given, the prior on xij is calculated by “differencing” beta distributions according to:

$$\alpha_{x_{ij}} = k_\alpha \min(\alpha_i, \alpha_j), \qquad \beta_{x_{ij}} = k_\beta \max(\beta_i, \beta_j),$$

where the individual parameter values come from the most updated priors for individuals i and j.

◮ If the α or the β parameter pairs differ by a large amount, then the relationship attribute tends

towards a beta distribution that reflects a low relationship probability.

◮ Conversely, if there is substantial agreement in parameter values, the minimum and the maximum

will be very close together and aggregation will change the prior little.

◮ Here kα and kβ are tuning parameters that reflect management uncertainty in the node to rela-

tionship process just described.

◮ Thus we always provide a relationship assessment, xij, as input to the network model.

Reconciling Divergent Views

◮ Suppose we have beta priors for some attribute of nodes xi and xj according to xi ∼ BE(1.2, 6) and xj ∼ BE(6, 1.2), giving obviously divergent assessments.

◮ Reflecting some uncertainty, management assigns kα = 0.8 and kβ = 1.2, which is symmetric

around 1.0 but need not be.

◮ The relationship prior reflects significant skepticism about a relationship based on this resulting beta specification, as shown below.

[Figure: three beta densities on (0, 1). Node Assessment for xi: BE(α = 1.2, β = 6). Node Assessment for xj: BE(α = 6, β = 1.2). Edge Assessment for xixj: BE(α = 0.96, β = 7.2).]
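The edge assessment above follows directly from the min/max rule. A short sketch, with a hypothetical function name `edge_prior`:

```python
def edge_prior(alpha_i, beta_i, alpha_j, beta_j, k_alpha=1.0, k_beta=1.0):
    """Relationship prior from two node priors via the min/max 'differencing' rule."""
    alpha_ij = k_alpha * min(alpha_i, alpha_j)
    beta_ij = k_beta * max(beta_i, beta_j)
    return alpha_ij, beta_ij


# Node priors BE(1.2, 6) and BE(6, 1.2) with tuning parameters 0.8 and 1.2.
alpha_ij, beta_ij = edge_prior(1.2, 6.0, 6.0, 1.2, k_alpha=0.8, k_beta=1.2)
edge_mean = alpha_ij / (alpha_ij + beta_ij)
```

This reproduces BE(0.96, 7.2), whose mean is about 0.12, so the two divergent node assessments translate into a low prior probability for the relationship.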

Dirichlet Process Prior Clusters Are Not Clusters

◮ A typical strategy is to use DPP models to generate a very large number of candidate “clusters,”

which are actually subclusters, then choose the best of these by a post-hoc scheme that processes

the MCMC output through some objective function to find the best grouping.

◮ This is wrong.

◮ The supposed-clusters produced by the MCMC process in repeated realizations of the Dirichlet

process are:

⊲ not substantive in any way,

⊲ not able to reflect any real cluster structure driven by the covariates,

⊲ temporary random effect assignments to make the model fit better in the context of the sampler.

◮ Since there is no over-fitting penalty in the Dirichlet process, we can expect there to always be

more subclusters than actual substantive clusters in the data.

◮ Therefore we seek to complement the modeling approach just described with a feature that leads

to the simultaneous estimation of real clustering in the data with a product partition model.

Mixture and Product Partition Models

◮ The standard mixture model begins with the assumption that Y1, . . . , Yn are realizations of n random variables that are independent and identically distributed (iid) within their m-component mixtures, giving the density:

$$f(y \mid \boldsymbol{\beta}, \boldsymbol{\omega}) = \sum_{\ell=1}^{m} \omega_\ell \, f(y_\ell \mid \beta_\ell),$$

where m < n is a fixed positive integer, 0 ≤ ωℓ ≤ 1, and $\sum_{\ell=1}^{m} \omega_\ell = 1$.

◮ An alternative, the product partition model, starts by conditioning on a given partition, and then

determines the posterior probabilities of these.

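◮ Given a partition C of Nn := {1, 2, . . . , n} that has m < n clusters denoted by C1, . . . , Cm, the data are a realization from a density of the form:

$$f(\mathbf{y} \mid \boldsymbol{\beta}_C, \boldsymbol{\sigma}^2_C, C) = \prod_{\ell=1}^{m} \prod_{i \in C_\ell} f(y_i \mid \beta_\ell, \sigma^2_\ell).$$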

◮ So unlike the mixture model, this model recognizes a parameter, C, that is directly connected to

the basic clustering problem and is part of the estimation process.

◮ This model was developed by Hartigan (1990) (see also Barry & Hartigan 1992, Crowley 1997).
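The contrast between the two likelihoods can be made concrete with a toy example. The sketch assumes f is a normal density with mean βℓ and standard deviation σℓ; the data, parameter values, and function names are all hypothetical.

```python
import math
from statistics import NormalDist

y = [-1.1, -0.9, 0.1, 4.8, 5.2]  # toy data with two visible groups


def mixture_density(y_i, weights, means, sds):
    """Mixture density: a weighted sum over components, with no partition parameter."""
    return sum(w * NormalDist(mu, sd).pdf(y_i)
               for w, mu, sd in zip(weights, means, sds))


def product_partition_loglik(y, partition, means, sds):
    """Log of the product over clusters and over the cases inside each cluster."""
    total = 0.0
    for block, mu, sd in zip(partition, means, sds):
        dist = NormalDist(mu, sd)
        total += sum(math.log(dist.pdf(y[i])) for i in block)
    return total


means, sds = [0.0, 5.0], [1.0, 1.0]
mix_loglik = sum(math.log(mixture_density(v, [0.6, 0.4], means, sds)) for v in y)

# The partition C is an explicit argument, so candidate partitions can be compared.
good = product_partition_loglik(y, [[0, 1, 2], [3, 4]], means, sds)
bad = product_partition_loglik(y, [[0, 1, 3], [2, 4]], means, sds)
```

The mixture likelihood never refers to which case belongs where, while the product partition likelihood changes with C: here the partition that groups the three small values together scores far higher than one that swaps a case across groups.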

Reasons Not to Prefer the Mixture Model for Clustering

◮ Parameterization: the mixture model lacks a model parameter that defines the clusters, which

can confound standard estimation processes (McCullagh & Yang 2008, Booth, Casella and Hobert

2008).

◮ Cluster Identification: even if the mixture model parameters are known, there needs to be some way of generating a latent variable to identify clusters (McLachlan & Peel 2004).

◮ Ad Hoc Selection: the final model needs to be run with a fixed m, with the typical strategy

running a user-defined selection of m values and choosing the one with the best BIC, or similar

criteria (Si and Reiter 2013).

◮ Applications: in applied settings the data “seldom contain much information about parameters

such as the number of clusters in the population” (McCullagh & Yang 2008).

◮ Label Switching: the mixture model is prone to the label switching problem (invariance of the

likelihood under relabeling of the mixture components), particularly in Bayesian settings (Jasra,

Holmes & Stephens 2005, Stephens 2000, Celeux 1998).

Reasons To Prefer the Product Partition Model for Clustering

◮ Computation: the product partition model partition process can be predictor-dependent and

computationally efficient (Park and Dunson 2010).

◮ Model Dimensionality: a stochastic search algorithm can be set up to move between different size

partitions at each iteration of a sampler (Booth, Casella and Hobert 2008).

◮ Cluster Identification: contrary to the mixture model, the product partition model clearly identifies the parameter that determines the cluster, and has no restriction on m, the number of clusters, other than m < n (Crowley 1997).

◮ Label Switching: since the product partition model is label-free (the clusters are all defined by

unique partitions of Nn = {1, 2, . . . , n}), we can easily identify mappings of cases to clusters

(Hartigan 1990, Barry & Hartigan 1992).

◮ In addition to the DPP component for random effects we search for partitions of Y into clusters

Cℓ, ℓ = 1, . . . ,m, where m (the number of clusters) is an unknown parameter.

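◮ Let Yℓ be a vector of length nℓ containing the Yi in cluster Cℓ, then:

$$\mathbf{Y}_\ell = \mathbf{X}_\ell\boldsymbol{\beta}_\ell + \mathbf{A}_\ell\boldsymbol{\eta}_\ell + \boldsymbol{\epsilon}_\ell$$

where:

⊲ Xℓ and Aℓ are composed of the rows corresponding to the Yi in cluster Cℓ,

⊲ unknown βℓ and σ²ℓ (where $\boldsymbol{\epsilon}_\ell \sim \mathcal{N}(\mathbf{0}, \sigma^2_\ell \mathbf{I}_{n_\ell})$) are specific to cluster Cℓ.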

◮ This model incorporates clustering in the data in two distinct ways:

⊲ it utilizes DP random effects to model unobserved heterogeneity in the data via subclusters,

⊲ the product partition model, using C, provides substantive clusters to the data that serve to

provide insights into how that data can be broken into groups that have different behavior.

◮ Note that these groupings do not nest, and so observations in the same cluster Cℓ can belong to different subclusters defined by the columns of A (unlike Barry & Hartigan 1992).

◮ Our goal is to find the best partition C = (C1, . . . , Cm), but the A matrix defining k subclusters

cannot be ignored.

◮ Using the DPP we want to find the posterior probability of C, marginalized over the coefficients

and random effects, which requires both integration over η and summation over the A matrices.

◮ Note that use of the DP random effects produces a correlation between individuals both within

the same cluster and in different clusters, a non-nested hierarchical specification.
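A small simulation illustrates how the two groupings coexist. Everything below is made up for illustration: the cluster labels, the subcluster labels, and the parameter values. The sketch also assumes a single η vector shared across clusters, with Aℓ picking out the relevant rows.

```python
import random

rng = random.Random(1)
n = 8

X = [[1.0, rng.gauss(0, 1)] for _ in range(n)]   # intercept plus one covariate
cluster = [0, 0, 0, 0, 1, 1, 1, 1]               # substantive partition C (m = 2)
subcluster = [0, 1, 2, 0, 1, 2, 0, 1]            # DP subclusters (k = 3), not nested in C

beta = [[1.0, 2.0], [-1.0, 0.5]]                 # cluster-specific coefficients
sigma = [0.5, 1.5]                               # cluster-specific error sd
eta = [0.3, -0.2, 0.0]                           # one random effect per subcluster

# A is the n x k indicator matrix assigning each case to one subcluster.
A = [[1.0 if subcluster[i] == s else 0.0 for s in range(len(eta))] for i in range(n)]


def simulate(i):
    """Draw Y_i = x_i' beta_l + a_i' eta + eps_i for the cluster l containing case i."""
    l = cluster[i]
    mean = sum(b * x for b, x in zip(beta[l], X[i]))
    mean += sum(a * e for a, e in zip(A[i], eta))
    return mean + rng.gauss(0, sigma[l])


Y = [simulate(i) for i in range(n)]
Y_by_cluster = {l: [Y[i] for i in range(n) if cluster[i] == l] for l in set(cluster)}
```

Cases 0 and 6 sit in different clusters but share subcluster 0, so they share a random effect. That shared effect is the source of correlation across clusters described above.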

Cluster Prior Probabilities

◮ Each βℓ is given a multilevel model structure with common underlying mean β0 and locally scaled

precision matrix S:

$$\boldsymbol{\beta}_\ell \sim \mathcal{N}(\boldsymbol{\beta}_0, \sigma^2_\ell S^{-1}).$$

◮ Each cluster-specific variance parameter σ2ℓ is assigned an inverse-gamma prior with common

assigned hyperparameters:

$$\sigma^2_\ell \sim \mathcal{IG}\left(\frac{a_{\sigma^2}}{2}, \frac{b_{\sigma^2}}{2}\right)$$

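◮ The remaining assigned priors have the forms:

DP: $\phi_0 \sim \mathcal{N}(0, \tau^2)$, $\tau^2 \sim \mathcal{IG}\left(\frac{a_{\tau^2}}{2}, \frac{b_{\tau^2}}{2}\right)$, $\lambda \sim \mathcal{G}\left(\frac{a_\lambda}{2}, \frac{b_\lambda}{2}\right)$

PP: $\boldsymbol{\beta}_0 \sim \mathcal{N}(\mathbf{0}, \sigma^2_\beta S^{-1})$, $\sigma^2_\beta \sim \mathcal{IG}\left(\frac{a_{\sigma^2_\beta}}{2}, \frac{b_{\sigma^2_\beta}}{2}\right)$, $S \sim \mathcal{W}(V^{-1}, a_S)$, $V = \mathrm{Diag}(v_1, \ldots, v_p)$, $v_i \sim \mathcal{G}\left(\frac{a_v}{2}, \frac{b_v}{2}\right)$, $C \sim\ ???$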
