Page 1: A Review of Multivariate Distributions for Count Data ...pradeepr/paperz/wics1398.pdf · variate count-valued data that exhibit rich depen-dencies: text analysis, genomics, and crime

Advanced Review

A review of multivariate distributions for count data derived from the Poisson distribution

David I. Inouye,1 Eunho Yang,2 Genevera I. Allen3,4 and Pradeep Ravikumar5*

The Poisson distribution has been widely studied and used for modeling univariate count-valued data. However, multivariate generalizations of the Poisson distribution that permit dependencies have been far less popular. Yet, real-world, high-dimensional, count-valued data found in word counts, genomics, and crime statistics, for example, exhibit rich dependencies and motivate the need for multivariate distributions that can appropriately model this data. We review multivariate distributions derived from the univariate Poisson, categorizing these models into three main classes: (1) where the marginal distributions are Poisson, (2) where the joint distribution is a mixture of independent multivariate Poisson distributions, and (3) where the node-conditional distributions are derived from the Poisson. We discuss the development of multiple instances of these classes and compare the models in terms of interpretability and theory. Then, we empirically compare multiple models from each class on three real-world datasets that have varying data characteristics from different domains, namely traffic accident data, biological next generation sequencing data, and text data. These empirical experiments develop intuition about the comparative advantages and disadvantages of each class of multivariate distribution that was derived from the Poisson. Finally, we suggest new research directions as explored in the subsequent Discussion section. © 2017 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2017, 9:e1398. doi: 10.1002/wics.1398

Keywords: Poisson, Multivariate, Graphical models, Copulas, High dimensional

INTRODUCTION

Multivariate count-valued data has become increasingly prevalent in modern big data settings. Variables in such data are rarely independent and instead exhibit complex positive and negative dependencies. We highlight three examples of multivariate count-valued data that exhibit rich dependencies: text analysis, genomics, and crime statistics. In text analysis, a standard way to represent documents is to merely count the number of occurrences of each word in the vocabulary and create a word-count vector for each document. This representation is often known as the bag-of-words representation, in which the word order and syntax are ignored. The vocabulary size—i.e., the number of variables

Additional Supporting Information may be found in the online version of this article.

*Correspondence to: [email protected]
1Department of Computer Science, The University of Texas at Austin, Austin, TX, USA
2School of Computing, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
3Department of Statistics, Rice University, Houston, TX, USA
4Department of Pediatrics-Neurology, Baylor College of Medicine, Houston, TX, USA
5School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Conflict of interest: The authors have declared no conflicts of interest for this article.

Volume 9, May/June 2017 © 2017 Wiley Periodicals, Inc. 1 of 25


in the data—is usually much greater than 1000 unique words, and thus, a high-dimensional multivariate distribution is required. Also, words are clearly not independent. For example, if the word ‘Poisson’ appears in a document, then the word ‘probability’ is more likely to also appear, signifying a positive dependency. Similarly, if the word ‘art’ appears, then the word ‘probability’ is less likely to also appear, signifying a negative dependency. In genomics, RNA-sequencing technologies are used to measure gene and isoform expression levels. These technologies yield counts of reads mapped back to DNA locations, which even after normalization yield non-negative data that is highly skewed with many exact zeros. These genomics data are both high dimensional, with the number of genes measuring in the tens of thousands, and strongly dependent as genes work together in pathways and complex systems to produce particular phenotypes. In crime analysis, counts of crimes in different counties are clearly multidimensional, with dependencies between crime counts. For example, the counts of crime in adjacent counties are likely to be correlated with one another, indicating a positive dependency. While positive dependencies are probably more prevalent in crime statistics, negative dependencies might be very interesting. For example, a negative dependency between adjacent counties may suggest that a criminal gang has moved from one county to the other.

These examples motivate the need for a high-dimensional, count-valued distribution that permits rich dependencies between variables. In general, a good class of probabilistic models is a fundamental building block for many tasks in data analysis. Estimating such models from data could help answer exploratory questions such as: Which genomic pathways are altered in a disease, e.g., by analyzing genomic networks? Or which county seems to have the strongest effect, with respect to crime, on other counties? A probabilistic model could also be used in Bayesian classification to determine questions such as: Does this Twitter post display positive or negative sentiment about a particular product (fitting one model on positive posts and one model on negative posts)?

The classical model for a count-valued random variable is the univariate Poisson distribution, whose probability mass function for x ∈ {0, 1, 2, …} is:

\mathbb{P}_{\text{Poiss}}(x \mid \lambda) = \lambda^{x} \exp(-\lambda) / x! \qquad (1)

where λ is the standard mean parameter for the Poisson distribution. A trivial extension of this to a multivariate distribution would be to assume independence between variables and take the product of node-wise univariate Poisson distributions, but such a model would be ill suited for many examples of multivariate count-valued data that require rich dependence structures. We review multivariate probability models that are derived from the univariate Poisson distribution and permit nontrivial dependencies between variables. We categorize these models into three main classes based on their primary modeling assumption. The first class assumes that the univariate marginal distributions are derived from the Poisson. The second class is derived as a mixture of independent multivariate Poisson distributions. The third class assumes that the univariate conditional distributions are derived from the Poisson distribution—this last class of models can also be studied in the context of probabilistic graphical models. An illustration of each of these three main model classes can be seen in Figure 1. While these models might have been classified by primary application area or performance on a particular task, a classification based on modeling assumptions helps emphasize the core abstractions for each model class. In addition, this categorization may help practitioners from different disciplines learn from the models that have worked well in different areas. We discuss multiple instances of these classes in the later sections and highlight the strengths and weaknesses of each class. We then provide a short discussion on the differences between classes in terms of interpretability and theory. Using two different empirical measures, we empirically compare multiple models from each class on three real-world datasets that have varying data characteristics from different domains, namely traffic accident data, biological next generation sequencing data, and text data. These experiments develop intuition about the comparative advantages and disadvantages of the models and suggest new research directions as explored in the subsequent Discussion section.
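The univariate PMF in Eq. (1) is simple to evaluate numerically; a minimal sketch in Python (ours, not from the paper), checking that the probabilities sum to 1 and that the mean equals λ:

```python
import math

def poisson_pmf(x, lam):
    """P_Poiss(x | lambda) = lambda^x * exp(-lambda) / x!  (Eq. 1)."""
    return lam**x * math.exp(-lam) / math.factorial(x)

# The PMF sums to 1 over the support {0, 1, 2, ...} (truncated here)
# and has mean lambda.
total = sum(poisson_pmf(x, 3.0) for x in range(100))
mean = sum(x * poisson_pmf(x, 3.0) for x in range(100))
```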

Notation

ℝ denotes the set of real numbers, ℝ+ denotes the non-negative real numbers, and ℝ++ denotes the positive real numbers. Similarly, ℤ denotes the set of integers. Matrices are denoted as capital letters (e.g., X, Φ), vectors are denoted as boldface lowercase letters (e.g., x, ϕ), and scalar values are nonbold lowercase letters (e.g., x, ϕ).




MARGINAL POISSON GENERALIZATIONS

The models in this section generalize the univariate Poisson to a multivariate distribution with the property that the marginal distributions of each variable are Poisson. This is analogous to the fact that the marginal distributions of the multivariate Gaussian distribution are univariate Gaussian distributions and thus seems like a natural constraint when extending the univariate Poisson to the multivariate case. Several historical attempts at achieving this marginal property have incidentally developed the same class of models, with different derivations.1–4 This marginal Poisson property can also be achieved via the more general framework of copulas.5–7

Multivariate Poisson Distribution

The formulation of the multivariate Poisson distribution goes back to M’Kendrick,2 where the author used differential equations to derive the bivariate Poisson process. An equivalent but more readable interpretation to arrive at the bivariate Poisson distribution would be to use the summation of independent Poisson variables as follows1: let y1, y2, and z be univariate Poisson variables with parameters λ1, λ2, and λ0, respectively. Then, by setting x1 = y1 + z and x2 = y2 + z, (x1, x2) follows the bivariate Poisson distribution, and its joint probability mass is defined as:

\mathbb{P}_{\text{BiPoi}}(x_1, x_2 \mid \lambda_1, \lambda_2, \lambda_0) = \exp(-\lambda_1 - \lambda_2 - \lambda_0) \, \frac{\lambda_1^{x_1}}{x_1!} \, \frac{\lambda_2^{x_2}}{x_2!} \, \sum_{z=0}^{\min(x_1, x_2)} \binom{x_1}{z} \binom{x_2}{z} \, z! \left( \frac{\lambda_0}{\lambda_1 \lambda_2} \right)^{z} \qquad (2)

As the sum of independent Poissons is also Poisson (whose parameter is the sum of those of the two components), the marginal distribution of x1 (similarly x2) is still a Poisson with rate λ1 + λ0. It can be easily seen that the covariance of x1 and x2 is λ0, and as a result, the correlation coefficient lies between 0 and \min\left\{ \sqrt{\frac{\lambda_1 + \lambda_0}{\lambda_2 + \lambda_0}}, \sqrt{\frac{\lambda_2 + \lambda_0}{\lambda_1 + \lambda_0}} \right\}.8 Independently, Wicksell4 derived the bivariate Poisson as the limit of a bivariate binomial distribution. Campbell1 shows that the models in M’Kendrick2 and Wicksell4 can identically be derived from the sums of three independent Poisson variables.
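For intuition, Eq. (2) can be evaluated by summing over the shared latent count z, and the Poisson marginal property above can be checked numerically; a sketch (function name ours, not from the paper):

```python
import math

def bipoisson_pmf(x1, x2, lam1, lam2, lam0):
    """Evaluate the bivariate Poisson PMF of Eq. (2) by summing over
    the shared component z = 0, ..., min(x1, x2)."""
    base = (math.exp(-(lam1 + lam2 + lam0))
            * lam1**x1 / math.factorial(x1)
            * lam2**x2 / math.factorial(x2))
    s = sum(math.comb(x1, z) * math.comb(x2, z) * math.factorial(z)
            * (lam0 / (lam1 * lam2))**z
            for z in range(min(x1, x2) + 1))
    return base * s

# The marginal of x1 should be Poisson with rate lam1 + lam0.
lam1, lam2, lam0 = 1.0, 2.0, 0.5
marginal = sum(bipoisson_pmf(3, x2, lam1, lam2, lam0) for x2 in range(60))
```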

This approach to directly extend the Poisson distribution can be generalized further to handle the multivariate case x ∈ ℤ+^d, in which each variable xi is the sum of an individual Poisson yi and a common Poisson variable with rate λ0, as before. The joint probability for a multivariate Poisson is developed in Teicher3 and further considered by other works9–12:

\mathbb{P}_{\text{MulPoi}}(x; \lambda) = \exp\!\left( -\sum_{i=0}^{d} \lambda_i \right) \left( \prod_{i=1}^{d} \frac{\lambda_i^{x_i}}{x_i!} \right) \sum_{z=0}^{\min_i x_i} \left( \prod_{i=1}^{d} \binom{x_i}{z} \right) z! \left( \frac{\lambda_0}{\prod_{i=1}^{d} \lambda_i} \right)^{z} \qquad (3)

Several studies have shown that this formulation of the multivariate Poisson can also be derived as a limiting distribution of a multivariate binomial distribution when the success probabilities are small and the number of trials is large.13–15 As in the bivariate case, the marginal distribution of xi is Poisson with parameter λi + λ0. As λ0 controls the covariance between all variables, an extremely limited set of correlations between variables is permitted.

FIGURE 1 | (Left) The first class of Poisson generalizations is based on the assumption that the univariate marginals are derived from the Poisson. (Middle) The second class is based on the idea of mixing independent multivariate Poissons into a joint multivariate distribution. (Right) The third class is based on the assumption that the univariate conditional distributions are derived from the Poisson.

Mahamunulu16 first proposed a more general extension of the multivariate Poisson distribution that permits a full covariance structure. This distribution has been studied further by many studies.13,17–20

While the form of this general multivariate Poisson distribution is too complicated to spell out for d > 3, its distribution can be specified by a multivariate reduction scheme. Specifically, let yi for i = 1, …, (2^d − 1) be independently Poisson distributed with parameter λi. Now, define A = [A1, A2, …, Ad], where Ai is a d × (d choose i) matrix consisting of ones and zeros, where each column of Ai has exactly i ones with no duplicate columns. Hence, A1 is the d × d identity matrix, and Ad is a column vector of all ones. Then, x = Ay is a d-dimensional multivariate Poisson-distributed random vector with a full covariance structure. Note that the simpler multivariate Poisson distribution with constant covariance in Eq. (3) is a special case of this general form, where A = [A1, Ad].
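The multivariate reduction scheme above is easy to implement for small d; a sketch using NumPy (function names ours), building A column-by-column from the nonempty subsets of {1, …, d}:

```python
import numpy as np
from itertools import combinations

def reduction_matrix(d):
    """Build A = [A_1, ..., A_d]: each column is the 0/1 indicator of a
    nonempty subset of {1, ..., d}, grouped by subset size i, so that A_1
    is the identity and A_d is a column of all ones."""
    cols = []
    for i in range(1, d + 1):
        for subset in combinations(range(d), i):
            col = np.zeros(d, dtype=int)
            col[list(subset)] = 1
            cols.append(col)
    return np.column_stack(cols)  # shape (d, 2**d - 1)

def sample_general_mv_poisson(lam, d, size=1, rng=None):
    """Sample x = A y with independent y_j ~ Poisson(lam_j)."""
    rng = np.random.default_rng(rng)
    A = reduction_matrix(d)
    y = rng.poisson(lam, size=(size, A.shape[1]))  # (size, 2**d - 1)
    return y @ A.T                                 # (size, d)
```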

The multivariate Poisson distribution has not been widely used for real data applications. This is likely due to two major limitations of this distribution. First, the multivariate Poisson distribution only permits positive dependencies; this can easily be seen because the distribution arises as the sum of independent Poisson random variables, and hence, covariances are governed by the positive rate parameters λi. The assumption of positive dependencies is likely unrealistic for most real count-valued data examples. Second, the computation of probabilities and inference of parameters is especially cumbersome for the multivariate Poisson distribution; these are only computationally tractable for small d and hence not readily applicable in high-dimensional settings. Kano and Kawamura17 proposed multivariate recursion schemes for computing probabilities, but these schemes are only stable and computationally feasible for small d, thus complicating likelihood-based inference procedures. Karlis18 more recently proposed a latent variable-based EM algorithm for parameter inference of the general multivariate Poisson distribution. This approach treats every pairwise interaction as a latent variable and conducts inference over both the observed and hidden parameters. While this method is more tractable than recursion schemes, it still requires inference over d(d − 1)/2 latent variables and is hence not feasible in high-dimensional settings. Overall, the multivariate Poisson distribution introduced above is appealing in that its marginal distributions are Poisson; yet, there are many modeling drawbacks, including severe restriction on the types of dependencies permitted (e.g., only positive relationships), a complicated and intractable form in high dimensions, and challenging inference procedures.

Copula Approaches

A much more general way to construct valid multivariate Poisson distributions with Poisson marginals is by pairing a copula distribution with Poisson marginal distributions. For continuous multivariate distributions, the use of copula distributions is founded on the celebrated Sklar’s theorem: any continuous joint distribution can be decomposed into a copula and the marginal distributions, and conversely, any combination of a copula and marginal distributions gives a valid continuous joint distribution.21 The key advantage of such models for continuous distributions is that copulas fully specify the dependence structure, hence separating the modeling of marginal distributions from the modeling of dependencies. While copula distributions paired with continuous marginal distributions enjoy wide popularity (see e.g., Ref 22 in finance applications), copula models paired with discrete marginal distributions, such as the Poisson, are more challenging for both theoretical and computational reasons.23–25 However, several simplifications and recent advances have attempted to overcome these challenges.24–26

Copula Definition and Examples

A copula is defined by a joint cumulative distribution function (CDF), C(u) : [0, 1]^d → [0, 1], with uniform marginal distributions. As a concrete example, the Gaussian copula (see the left subfigure of Figure 2 for an example) is derived from the multivariate normal distribution and is one of the most popular multivariate copulas because of its flexibility in the multidimensional case; the Gaussian copula is defined simply as:

C^{\text{Gauss}}_{R}(u_1, u_2, \ldots, u_d) = H_R\left( H^{-1}(u_1), \ldots, H^{-1}(u_d) \right);

where H^{-1}(·) denotes the standard normal inverse CDF, and H_R(·) denotes the joint CDF of a N(0, R) random vector, where R is a correlation matrix. A similar multivariate copula can be derived from the multivariate Student’s t distribution if extreme values are important to model.27




The Archimedean copulas are another family of copulas that have a single parameter that defines the global dependence between all variables.28 One property of Archimedean copulas is that they admit an explicit form, unlike the Gaussian copula. Unfortunately, the Archimedean copulas do not directly allow for a rich dependence structure like the Gaussian because they only have one dependence parameter rather than a parameter for each pair of variables.
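As a concrete instance of this explicit form (the choice of the Clayton family here is ours; the paper does not single out a member), the bivariate Clayton copula CDF for θ > 0 is C_θ(u, v) = (u^{−θ} + v^{−θ} − 1)^{−1/θ}, with the single parameter θ governing dependence:

```python
def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula CDF for theta > 0; the single parameter
    theta governs the (lower-tail) dependence between the variables."""
    return (u**-theta + v**-theta - 1.0)**(-1.0 / theta)
```

As θ → 0 this approaches the independence copula C(u, v) = uv, and C(u, 1) = u recovers the uniform marginal.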

Pair copula constructions (PCCs)29 for copulas, or vine copulas, allow combinations of different bivariate copulas to form a joint multivariate copula. PCCs define multivariate copulas that have an expressive dependency structure, like the Gaussian copula, but may also model asymmetric or tail dependencies available in Archimedean and t copulas. Pair copulas only use univariate CDFs, conditional CDFs, and bivariate copulas to construct a multivariate copula distribution and hence can use combinations of the Archimedean copulas described previously. The multivariate distributions can be factorized in a variety of ways using bivariate copulas to flexibly model dependencies. Vines, or graphical tree-like structures, denote the possible factorizations that are feasible for PCCs.30

Copula Models for Discrete Data

As per Sklar’s theorem, any copula distribution can be combined with marginal distribution CDFs, \{F_i(x_i)\}_{i=1}^{d}, to create a joint distribution:

G(x_1, x_2, \ldots, x_d \mid \theta, F_1, \ldots, F_d) = C_\theta(u_1 = F_1(x_1), \ldots, u_d = F_d(x_d)).

If sampling from the given copula is possible, this form admits simple direct sampling from the joint distribution (defined by the CDF G(·)) by first sampling from the copula, u ~ Copula(θ), and then transforming u to the target space using the inverse CDFs of the marginal distributions: x = (F_1^{-1}(u_1), \ldots, F_d^{-1}(u_d)).

A valid multivariate discrete joint distribution can be derived by pairing a copula distribution with Poisson marginal distributions. For example, a valid joint CDF with Poisson marginals is given by

G(x_1, x_2, \ldots, x_d \mid \theta) = C_\theta(F_1(x_1 \mid \lambda_1), \ldots, F_d(x_d \mid \lambda_d));

where Fi(xi | λi) is the Poisson CDF with mean parameter λi, and θ denotes the copula parameters. If we pair a Gaussian copula with Poisson marginal distributions, we create a valid joint distribution that has been widely used for generating samples of multivariate count data7,31,32; an example of the Gaussian copula paired with Poisson marginals to form a discrete joint distribution can be seen in Figure 2.
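The two-step sampling recipe above (draw u from the copula, then apply the Poisson inverse CDFs) takes only a few lines for the Gaussian copula; a sketch using NumPy and SciPy (function name ours, not from the paper):

```python
import numpy as np
from scipy.stats import norm, poisson

def sample_gaussian_copula_poisson(R, lams, n, seed=None):
    """Sample n count vectors with Poisson(lams[i]) marginals and a
    Gaussian copula with correlation matrix R:
    z ~ N(0, R); u = Phi(z); x_i = F_i^{-1}(u_i)."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(lams)), R, size=n)
    u = norm.cdf(z)                        # uniform marginals on (0, 1)
    return poisson.ppf(u, np.asarray(lams)).astype(int)

x = sample_gaussian_copula_poisson([[1.0, 0.8], [0.8, 1.0]], [4.0, 4.0],
                                   2000, seed=0)
```

The counts inherit a (somewhat attenuated) positive correlation from the latent normal correlation of 0.8.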

Nikoloulopoulos5 presents an excellent survey of copulas to be paired with discrete marginals by defining several desired properties of a copula (quoted from Ref 5):

1. Wide range of dependence, allowing both positive and negative dependence.

2. Flexible dependence, meaning that the number of bivariate marginals is (approximately) equal to the number of dependence parameters.

3. Computationally feasible CDF for likelihood estimation.

4. Closure property under marginalization, meaning that lower-order marginals belong to the same parametric family.


FIGURE 2 | A copula distribution (left)—which is defined over the unit hypercube and has uniform marginal distributions—paired with univariate Poisson marginal distributions for each variable (middle) defines a valid discrete joint distribution with Poisson marginals (right).




5. No joint constraints for the dependence parameters, meaning that the use of covariate functions for the dependence parameters is straightforward.

Each copula model satisfies some of these properties but not all of them. For example, Gaussian copulas satisfy properties (1), (2), and (4) but not (3), because the Gaussian CDF is not known in closed form, nor (5), because the correlation parameter matrix is constrained to be in the set of positive definite matrices. Nikoloulopoulos5 recommends Gaussian copulas for general models and vine copulas if modeling dependence in the tails or asymmetry is needed.

Theoretical Properties of Copulas Derived from Discrete Distributions

From a theoretical perspective, a multivariate discrete distribution can be viewed as a continuous copula distribution paired with discrete marginals, but the derived copula distributions are not unique and, hence, are unidentifiable.23 Note that this is in contrast to continuous multivariate distributions, where the derived copulas are uniquely defined.33 Because of this nonuniqueness property, Genest and Nešlehová23 caution against performing inference on and interpreting dependencies of copulas derived from discrete distributions. A further consequence of nonuniqueness is that when copula distributions are paired with discrete marginal distributions, the copulas no longer fully specify the dependence structure, as with continuous marginals.23 In other words, the dependencies of the joint distribution will depend in part on which marginal distributions are employed. In practice, this often means that the range of dependencies permitted with certain copula and discrete marginal distribution pairs is much more limited than the copula distribution would otherwise model. However, several studies have suggested that this nonuniqueness property does not have major practical ramifications.5,34

We discuss a few common approaches used for the estimation of continuous copulas with discrete marginals.

Continuous Extension for Parameter Estimation

For the estimation of continuous copulas from data, a two-stage procedure called Inference Functions for Margins (IFM)35 is commonly used, in which the marginal distributions are estimated first and then used to map the data onto the unit hypercube using the CDFs of the inferred marginal distributions. While this is straightforward for continuous marginals, this procedure is less obvious for discrete marginal distributions when using a continuous copula. One idea is to use the continuous extension (CE) of integer variables to the continuous domain36 by forming a new ‘jittered’ continuous random variable \tilde{x}:

\tilde{x} = x + (u - 1);

where u is a random variable defined on the unit interval. It is clear that this new random variable is continuous, and \tilde{x} \le x. An obvious choice for the distribution of u is the uniform distribution. With this idea, inference can be performed using a surrogate likelihood by randomly projecting each discrete data point into the continuous domain and averaging over the random projections, as performed in Refs 37,38. Madsen39 and Madsen and Fang40 use the CE idea as well but generate multiple jittered samples \{\tilde{x}^{(1)}, \ldots, \tilde{x}^{(m)}\} for each original observation x to estimate the discrete likelihood, rather than merely generating one jittered sample \tilde{x} for each original observation x as in Refs 37,38. Nikoloulopoulos24 finds that CE-based methods significantly underestimate the correlation structure because the CE jitter transform operates independently for each variable instead of considering the correlation structure between the variables.
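The CE jitter itself is a one-line transform; a sketch (naming ours):

```python
import numpy as np

def continuous_extension(x, seed=None):
    """Jitter integer counts into the continuous domain:
    x_tilde = x + (u - 1) with u ~ Uniform(0, 1), so x - 1 <= x_tilde < x."""
    rng = np.random.default_rng(seed)
    return np.asarray(x) + rng.uniform(size=np.shape(x)) - 1.0
```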

Distributional Transform for Parameter Estimation

In a somewhat different direction, Rüschendorf26 proposed the use of a generalization of the CDF F(·) for the case with discrete variables, which they term a distributional transform (DT), denoted by \tilde{F}(\cdot):

\tilde{F}(x, v) \equiv F(x^{-}) + v \, \mathbb{P}(x) = \mathbb{P}(X < x) + v \, \mathbb{P}(X = x);

where v ~ Uniform(0, 1). Note that in the continuous case, ℙ(X = x) = 0, and thus, this reduces to the standard CDF for continuous distributions. One way of thinking of this modified CDF is that the random variable v adds a random jump when there are discontinuities in the original CDF. If the distribution is discrete (or, more generally, if there are discontinuities in the original CDF), this transformation enables the simple proof of a theorem akin to Sklar’s theorem for discrete distributions.26

Kazianka41 and Kazianka and Pilz42 propose using the DT from Ref 26 to develop a simple and intuitive approximation for the likelihood.




Essentially, they simply take the expected jump value E(v) = 0.5 (where v ~ Uniform(0, 1)) and thus transform the discrete data to the continuous domain through the following:

u_i \equiv F_i(x_i - 1) + 0.5 \, \mathbb{P}(x_i) = 0.5 \, (F_i(x_i - 1) + F_i(x_i));

which can be seen as simply taking the average of the CDF values at xi − 1 and xi. Then, they use a continuous copula such as the Gaussian copula. Note that this is much simpler to compute than the simulated likelihood (SL) method in Ref 24 or the CE methods in Refs 37–40, which require averaging over many different random initializations.
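With Poisson marginals, this DT approximation is also a one-line transform; a sketch using SciPy (function name ours):

```python
import numpy as np
from scipy.stats import poisson

def dt_transform(x, lam):
    """Map counts x to (0, 1) via the distributional transform with the
    expected jump v = 0.5:  u_i = 0.5 * (F(x_i - 1) + F(x_i))."""
    x = np.asarray(x)
    return 0.5 * (poisson.cdf(x - 1, lam) + poisson.cdf(x, lam))
```

The resulting u ∈ (0, 1) can then be passed to a continuous copula likelihood such as the Gaussian copula.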

SL for Parameter Estimation

Finally, Nikoloulopoulos24 proposes a method to directly approximate the maximum likelihood estimate (MLE) by estimating a discretized Gaussian copula. Essentially, unlike the CE and DT methods, which attempt to transform discrete variables to continuous variables, the MLE for a Gaussian copula with discrete marginal distributions F_1, F_2, \ldots, F_d can be formulated as estimating multivariate normal rectangular probabilities:

\mathbb{P}(x \mid \gamma, R) = \int_{\phi^{-1}[F_1(x_1 - 1 \mid \gamma_1)]}^{\phi^{-1}[F_1(x_1 \mid \gamma_1)]} \cdots \int_{\phi^{-1}[F_d(x_d - 1 \mid \gamma_d)]}^{\phi^{-1}[F_d(x_d \mid \gamma_d)]} \Phi_R(z_1, \ldots, z_d) \, dz_1 \cdots dz_d; \qquad (4)

where γ denotes the marginal distribution parameters, ϕ^{-1}(·) is the univariate standard normal inverse CDF, and Φ_R(···) is the multivariate normal density with correlation matrix R. Nikoloulopoulos24 proposes to approximate the multivariate normal rectangular probabilities via the fast simulation algorithms discussed in Ref 43. Because this method directly approximates the MLE via simulation algorithms, it is called SL. Nikoloulopoulos25 compares the DT and SL methods for small sample sizes and finds that the DT method tends to overestimate the correlation structure. However, because of its computational simplicity, Nikoloulopoulos25 provides some heuristics of when the DT method might work well compared to the more accurate but more computationally expensive SL method.
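For a bivariate illustration, the rectangle probability in Eq. (4) can be computed by inclusion-exclusion over the multivariate normal CDF; the sketch below (naming ours) uses SciPy's simulation-based `multivariate_normal.cdf` rather than the specific algorithms of Ref 43, and clamps the infinite integration limits at x_i = 0 for numerical convenience:

```python
import numpy as np
from scipy.stats import norm, poisson, multivariate_normal

def _z(p):
    """Normal quantile with clamping to keep integration limits finite."""
    return norm.ppf(min(max(p, 1e-12), 1.0 - 1e-12))

def gaussian_copula_poisson_pmf(x1, x2, lam1, lam2, rho):
    """P(X1 = x1, X2 = x2) for a Gaussian copula with Poisson marginals,
    via the normal rectangle probability of Eq. (4)."""
    a1, b1 = _z(poisson.cdf(x1 - 1, lam1)), _z(poisson.cdf(x1, lam1))
    a2, b2 = _z(poisson.cdf(x2 - 1, lam2)), _z(poisson.cdf(x2, lam2))
    mvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
    return (mvn.cdf([b1, b2]) - mvn.cdf([a1, b2])
            - mvn.cdf([b1, a2]) + mvn.cdf([a1, a2]))
```

With rho = 0 this should recover the product of the two Poisson PMFs, up to the integration tolerance of the normal CDF.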

Vine Copulas for Discrete Distributions

Panagiotelis et al.44 provide conditions under which a multivariate discrete distribution can be decomposed as a vine PCC copula paired with discrete marginals. In addition, Panagiotelis et al.44 show that the likelihood computation for vine PCCs with discrete marginals is quadratic as opposed to exponential, as would be the case for general multivariate copulas such as the Gaussian copula with discrete marginals. However, computation in truly high-dimensional settings remains a challenge, as 2d(d − 1) bivariate copula evaluations are required to calculate the probability mass function (PMF) or likelihood of a d-variate PCC using the algorithm proposed by Panagiotelis et al.44. These bivariate copula evaluations, however, can be coupled with some of the previously discussed computational techniques, such as CE, DT, and SL, for further computational improvements. Finally, while vine PCCs offer a very flexible modeling approach, this comes with the added challenge of selecting the vine construction and bivariate copulas,45 which has not been well studied for discrete distributions. Overall, Nikoloulopoulos5 recommends using vine PCCs for the complex modeling of discrete data with tail dependencies and asymmetric dependencies.

Summary of Marginal Poisson Generalizations

We have reviewed the historical development of the multivariate Poisson, which has Poisson marginals, and then reviewed many of the recent developments of using the much more general copula framework to derive Poisson generalizations with Poisson marginals. The original multivariate Poisson models based on latent Poisson variables are limited to positive dependencies and require computationally expensive algorithms to fit. However, the estimation of copula distributions paired with Poisson marginals—although it has some theoretical caveats—can be performed efficiently in practice. Simple approximations, such as the expectation under the DT, provide nearly trivial transformations that move the discrete variables to the continuous domain, in which all the tools of continuous copulas can be exploited. More complex methods, such as the SL method,24 can be used if the sample size is small or high accuracy is needed.

POISSON MIXTURE GENERALIZATIONS

Instead of directly extending univariate Poissons to the multivariate case, a separate line of work proposes to indirectly extend the Poisson based on the mixture of independent Poissons. Mixture models are often considered to provide more flexibility by allowing the parameter to vary according to a mixing distribution. One important property of mixture models is that they can model overdispersion. Overdispersion occurs when the variance of the data is larger than the mean of the data—unlike in a Poisson distribution, in which the mean and variance are equal. One way of quantifying dispersion is the dispersion index:

$$\delta = \frac{\sigma^2}{\mu}. \quad (5)$$

If δ > 1, then the distribution is overdispersed, whereas if δ < 1, then the distribution is underdispersed. In real-world data, as will be seen in the experimental section, overdispersion is more common than underdispersion. Mixture models also enable dependencies between the variables, as will be described in the following paragraphs.
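A minimal numeric sketch of the dispersion index in Eq. (5), with made-up rates: a pure Poisson sample sits near δ = 1, while a gamma-mixed Poisson sample (the rate varies per draw) is overdispersed, as Eq. (9) below predicts.

```python
import numpy as np

def dispersion_index(x):
    """delta = sample variance / sample mean (Eq. (5)); > 1 means overdispersed."""
    return x.var() / x.mean()

rng = np.random.default_rng(0)
# A pure Poisson sample has delta near 1 ...
pois = rng.poisson(5.0, size=100000)
# ... while a gamma-mixed Poisson (same mean of 5) is overdispersed:
# Var(x) = E(lambda) + Var(lambda) = 5 + 12.5, so delta is about 3.5.
mixed = rng.poisson(rng.gamma(shape=2.0, scale=2.5, size=100000))
print(round(dispersion_index(pois), 2), round(dispersion_index(mixed), 2))
```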

Suppose we are modeling a univariate random variable x with a density of f(x | θ). Rather than assuming θ is fixed, we let θ itself be a random variable following some mixing distribution. More formally, a general mixture distribution can be defined as46:

$$\mathbb{P}(x \mid g(\cdot)) = \int_{\Theta} f(x \mid \theta)\, g(\theta)\, d\theta; \quad (6)$$

where the parameter θ is assumed to come from the mixing distribution g(θ), and Θ is the domain of θ.

For the Poisson case, let λ ∈ ℝ^d_{++} be a d-dimensional vector whose i-th element λ_i is the parameter of the Poisson distribution for x_i. Now, given some mixing distribution g(λ), the family of Poisson mixture distributions is defined as

$$\mathbb{P}_{\text{MixedPoi}}(x) = \int_{\mathbb{R}^d_{++}} g(\lambda) \prod_{i=1}^{d} \mathbb{P}_{\text{Poiss}}(x_i \mid \lambda_i)\, d\lambda; \quad (7)$$

where the domain of the joint distribution is any count-valued assignment (i.e., x_i ∈ ℤ_+, ∀ i). While the probability density function (Eq. (7)) has a complicated form involving a multidimensional integral (a complex, high-dimensional integral when d is large), the mean and variance are known to be expressed succinctly as

$$\mathbb{E}(x) = \mathbb{E}(\lambda); \quad (8)$$

$$\operatorname{Var}(x) = \mathbb{E}(\lambda) + \operatorname{Var}(\lambda). \quad (9)$$

Note that Eq. (9) implies that the variance of a mixture is always larger than the variance of a single distribution. The higher-order moments of x are also easily represented by those of λ. Besides the moments, other interesting properties (convolutions, identifiability, etc.) of Poisson mixture distributions are extensively reviewed and studied in Ref 46.

One key benefit of Poisson mixtures is that they permit both positive as well as negative dependencies simply by properly defining g(λ). The intuition behind these dependencies can be more clearly understood when we consider the sample generation process. Suppose we have the distribution g(λ) in two dimensions (i.e., d = 2) with a strong positive dependency between λ_1 and λ_2. Then, given a sample (λ_1, λ_2) from g(λ), x_1 and x_2 are likely to also be positively correlated.

In an early application of the model, Arbous and Kerrich47 constrain the Poisson parameters to be different scalings of a common gamma variable λ: for i = 1, …, d, the time interval t_i is given, and λ_i is set to t_i λ. Hence, g(λ) is a univariate gamma distribution specified by λ ∈ ℝ_{++}, which only allows a simple dependency structure. As another early attempt, Steyn48 chooses the multivariate normal distribution for the mixing distribution g(λ) to provide more flexibility in the correlation structure. However, the normal distribution poses problems because λ must reside in ℝ_{++}, while the normal distribution is defined on ℝ.

One of the most popular choices for g(λ) is the log-normal distribution because of its rich covariance structure and natural positivity constraint^b:

$$\mathcal{N}_{\log}(\lambda \mid \mu, \Sigma) = \frac{1}{\prod_{i=1}^{d} \lambda_i \sqrt{(2\pi)^d |\Sigma|}} \exp\!\left( -\frac{1}{2} (\log\lambda - \mu)^T \Sigma^{-1} (\log\lambda - \mu) \right). \quad (10)$$

The log-normal distribution above is parameterized by μ and Σ, which are the mean and the covariance of (log λ_1, log λ_2, …, log λ_d), respectively. Setting the random variable x_i to follow the Poisson distribution with parameter λ_i, we have the multivariate Poisson log-normal distribution49 from Eq. (7):

$$\mathbb{P}_{\text{PoiLogN}}(x \mid \mu, \Sigma) = \int_{\mathbb{R}^d_{+}} \mathcal{N}_{\log}(\lambda \mid \mu, \Sigma) \prod_{i=1}^{d} \mathbb{P}_{\text{Poiss}}(x_i \mid \lambda_i)\, d\lambda. \quad (11)$$

While the joint distribution (Eq. (11)) does not have a closed-form expression, and hence becomes computationally cumbersome to work with as d increases, its moments are available in closed form as a special case of Eq. (9):

$$\alpha_i \equiv \mathbb{E}(x_i) = \exp\!\left( \mu_i + \tfrac{1}{2}\sigma_{ii} \right);$$

$$\operatorname{Var}(x_i) = \alpha_i + \alpha_i^2 \left( \exp(\sigma_{ii}) - 1 \right); \quad \operatorname{Cov}(x_i, x_j) = \alpha_i \alpha_j \left( \exp(\sigma_{ij}) - 1 \right). \quad (12)$$

The correlation and the degree of overdispersion (defined as the variance divided by the mean) of the marginal distributions are strictly coupled by α and σ. Also, the possible Spearman's ρ correlation values for this distribution are limited if the mean value α_i is small. To briefly explore this phenomenon, we simulated a two-dimensional Poisson log-normal model with log-normal parameters μ = [0, 0] and

$$\Sigma = 2\log(\gamma) \begin{pmatrix} 1 & \pm 0.999 \\ \pm 0.999 & 1 \end{pmatrix},$$

which corresponds to a mean vector of [γ, γ] per Eq. (12) and the strongest positive and negative correlation possible between the two variables. We simulated one million samples from this distribution and found that when fixing γ = 2, Spearman's ρ values are between −0.53 and 0.58. When fixing γ = 10, Spearman's ρ values are between −0.73 and 0.81. Thus, for small mean values, the log-normal mixture is limited in modeling strong dependencies, but for large mean values, the log-normal mixture can model stronger dependencies. Besides the examples provided here, various Poisson mixture models from different mixing distributions are available, although limited in the applied statistical literature due to their complexities. See Ref 46 and the references therein for more examples of Poisson mixtures. Karlis and Xekalaki46 also provide the general properties of mixtures as well as the specific properties of Poisson mixtures, such as moments, convolutions, and the posterior.
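The simulation described above is easy to reproduce in outline (we use a smaller sample size than the one million draws in the text, so the estimated bounds are only approximate):

```python
import numpy as np
from scipy.stats import spearmanr

def simulate_spearman(gamma, sign, n=200000, seed=0):
    """Poisson log-normal with mean [gamma, gamma] and near-extreme latent correlation."""
    rng = np.random.default_rng(seed)
    Sigma = 2 * np.log(gamma) * np.array([[1.0, sign * 0.999],
                                          [sign * 0.999, 1.0]])
    lam = np.exp(rng.multivariate_normal([0.0, 0.0], Sigma, size=n))
    x = rng.poisson(lam)
    return spearmanr(x[:, 0], x[:, 1]).correlation

# For gamma = 2, the attainable Spearman's rho is noticeably narrower than (-1, 1).
print(round(simulate_spearman(2, +1), 2), round(simulate_spearman(2, -1), 2))
```

The positive estimate lands near 0.58 and the negative one near −0.53, matching the limits reported in the text; repeating with gamma = 10 widens the range.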

While this review focuses on modeling multivariate, count-valued responses without any extra information, several extensions of multivariate Poisson log-normal models have been proposed to provide more general correlation structures when covariates are available.50–55 These works formulate the mean parameter of the log-normal mixing distribution, log μ_i, as a linear model on the given covariates in the Bayesian framework.

In order to alleviate the computational burden of using log-normal distributions as an infinite mixing density as above, Karlis and Meligkotsidou56 proposed an EM-type estimation for a finite mixture of k > 1 Poisson distributions, which still preserves similar properties, such as both positive and negative dependencies as well as closed-form moments. While Karlis and Meligkotsidou56 consider mixing multivariate Poissons with positive dependencies, the simplified form—where the component distributions are independent Poisson distributions—is much simpler to implement using an expectation-maximization (EM) algorithm. This simple finite mixture distribution can be viewed as a middle ground between a single Poisson and a nonparametric estimation method where a Poisson is located at every training point—i.e., the number of mixtures is equal to the number of training data points (k = n).
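A minimal EM sketch for the simplified finite mixture of independent Poissons (this is not the Karlis and Meligkotsidou implementation; the two-component synthetic data and the initialization heuristic are our own):

```python
import numpy as np
from scipy.special import gammaln

def em_poisson_mixture(X, k, iters=200):
    """EM for a k-component mixture of INDEPENDENT Poissons:
    each component has its own rate vector lam[j] and weight pi[j]."""
    n, d = X.shape
    pi = np.full(k, 1.0 / k)
    # Initialize rates at k points spread along the first dimension (simple heuristic).
    idx = np.argsort(X[:, 0])[np.linspace(0, n - 1, k).astype(int)]
    lam = X[idx].astype(float) + 0.5
    for _ in range(iters):
        # E-step: log p(x | component) under independent Poissons, plus log prior.
        logp = (X @ np.log(lam).T - lam.sum(1)
                - gammaln(X + 1).sum(1, keepdims=True) + np.log(pi))
        logp -= logp.max(1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(1, keepdims=True)            # responsibilities
        # M-step: weighted means and mixing proportions.
        pi = r.mean(0)
        lam = (r.T @ X) / r.sum(0)[:, None]
    return pi, lam

rng = np.random.default_rng(1)
X = np.vstack([rng.poisson([2.0, 10.0], size=(500, 2)),
               rng.poisson([10.0, 2.0], size=(500, 2))])
pi, lam = em_poisson_mixture(X, k=2)
print(np.round(pi, 2), np.round(lam[np.argsort(lam[:, 0])], 1))
```

Even though each component is independent, the mixture induces a strong negative dependence between the two dimensions, illustrating how mixing creates dependencies.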

The gamma distribution is another common mixing distribution for the Poisson because it is the conjugate distribution for the Poisson mean parameter λ. For the univariate case, if the mixing distribution is gamma, then the resulting univariate distribution is the well-known negative binomial distribution. The negative binomial distribution can handle overdispersion in count-valued data when the variance is larger than the mean. Unlike the Poisson log-normal mixture, the univariate gamma-Poisson mixture density—i.e., the negative binomial density—is known in closed form:

$$\mathbb{P}(x \mid r, p) = \frac{\Gamma(r + x)}{\Gamma(r)\,\Gamma(x + 1)}\, p^r (1 - p)^x.$$

As r → ∞, the negative binomial distribution approaches the Poisson distribution; thus, it can be seen as a generalization of the Poisson distribution. Note that the variance of this distribution is always larger than that of the Poisson distribution with the same mean value.
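The gamma-Poisson construction can be checked numerically (the parameter values are arbitrary; note that the gamma scale (1 − p)/p is chosen to match NumPy's negative binomial parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
r, p = 5.0, 0.4
n = 500000
# Gamma-Poisson mixture: lambda ~ Gamma(shape=r, scale=(1-p)/p), then x ~ Poisson(lambda).
# Marginally this is negative binomial with mean r*(1-p)/p = 7.5.
lam = rng.gamma(shape=r, scale=(1 - p) / p, size=n)
x = rng.poisson(lam)
y = rng.negative_binomial(r, p, size=n)   # direct NB draws for comparison
# Both samples share the NB mean, and the mixture is overdispersed (variance > mean).
print(round(x.mean(), 1), round(y.mean(), 1), bool(x.var() > x.mean()))
```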

In a similar vein to placing a gamma prior distribution on the mean parameter λ, a log-gamma prior distribution can be placed on the log mean parameter θ = log(λ). Bradley et al.57 recently leveraged the log-gamma conjugacy to the Poisson log-mean parameter θ by introducing the Poisson log-gamma hierarchical mixture distribution. They particularly discuss the multivariate log-gamma distribution, which can have a flexible dependency structure similar to the multivariate log-normal distribution, and illustrate some modeling advantages over the log-normal mixture model.

Summary of Mixture Model Generalizations

Overall, mixture models are particularly helpful if there is overdispersion in the data—which is often the case for real-world data, as seen in the experiments section—while also allowing for variable dependencies to be modeled implicitly through the mixing distribution. If the data exhibits overdispersion, then the log-normal or log-gamma distributions57 give somewhat flexible dependency structures. The principal caveat with the complex mixture of Poisson distributions is computational; exact inference of the parameters is typically computationally difficult due to the presence of latent mixing variables. However, simpler models, such as the finite mixture using simple EM, may provide good results in practice (see Comparison section).

CONDITIONAL POISSON GENERALIZATIONS

While the multivariate Poisson formulation in Eq. (3), as well as the distribution formed by pairing a copula with Poisson marginals, assumes that the univariate marginal distributions are derived from the Poisson, a different line of work generalizes the univariate Poisson by assuming that the univariate node-conditional distributions are derived from the Poisson.58–63 Like the assumption of Poisson marginals in the previous sections, this conditional Poisson assumption is a different yet natural extension of the univariate Poisson distribution. The multivariate Gaussian can be seen to satisfy such a conditional property, as the node-conditional distributions of a multivariate Gaussian are univariate Gaussian. One benefit of these conditional models is that they can be seen as undirected graphical models, or Markov random fields, and they have a simple parametric form. In addition, estimating these models generally reduces to estimating simple node-wise regressions, and some of these estimators have theoretical guarantees on estimating the global graphical model structure even under high-dimensional sampling regimes, where the number of variables (d) is potentially even larger than the number of samples (n).

Background on Exponential Family Distributions

We briefly describe exponential family distributions and graphical models, which form the basis for the conditional Poisson models. Many commonly used distributions fall into this family, including the Gaussian, Bernoulli, exponential, gamma, and Poisson, among others. The exponential family is specified by a vector of sufficient statistics, denoted by T(x) ≡ [T_1(x), T_2(x), …, T_m(x)], the log base measure B(x), and the domain of the random variable D. With this notation, the generic exponential family is defined as:

$$\mathbb{P}_{\text{ExpFam}}(x \mid \eta) = \exp\!\left( \sum_{i=1}^{m} \eta_i T_i(x) + B(x) - A(\eta) \right)$$

$$A(\eta) = \log \int_{D} \exp\!\left( \sum_{i=1}^{m} \eta_i T_i(x) + B(x) \right) d\mu(x);$$

where η is called the natural or canonical parameter of the distribution, μ is the Lebesgue or counting measure depending on whether D is continuous or discrete, respectively, and A(η) is called the log partition function or log normalization constant because it normalizes the distribution over the domain D. Note that the sufficient statistics {T_i(x)}_{i=1}^m can be arbitrary functions of x; for example, T_i(x) = x_1 x_2 could be used to model an interaction between x_1 and x_2. The log partition function A(η) will be a key quantity when discussing the following models: A(η) must be finite for the distribution to be valid, so the realizable domain of parameters is given by {η : A(η) < ∞}. Thus, for instance, if the realizable domain only allows positive or negative interaction terms, then the set of allowed dependencies would be severely restricted.

Let us now consider the exponential family form of the univariate Poisson:

$$\mathbb{P}_{\text{Poiss}}(x \mid \lambda) = \lambda^x \exp(-\lambda)/x! = \exp\!\big( \log(\lambda^x) - \log(x!) - \lambda \big) = \exp\!\Big( \underbrace{\log(\lambda)}_{\eta}\, \underbrace{x}_{T(x)} + \underbrace{(-\log(x!))}_{B(x)} - \lambda \Big),$$

and therefore

$$\mathbb{P}_{\text{Poiss}}(x \mid \eta) = \exp\!\big( \eta x - \log(x!) - \exp(\eta) \big); \quad (13)$$

where η ≡ log(λ) is the natural parameter of the Poisson, T(x) = x is the Poisson sufficient statistic, −log(x!) is the Poisson log base measure, and A(η) = exp(η) is the Poisson log partition function. Note that for a general exponential family distribution, the log partition function may not have a closed form.
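A quick numeric check of Eq. (13), confirming that the exponential family form with η = log(λ) reproduces the classic Poisson pmf:

```python
import math

def poisson_pmf_expfam(x, eta):
    """Poisson pmf in exponential family form, Eq. (13):
    exp(eta*x - log(x!) - exp(eta)), with natural parameter eta = log(lambda)."""
    return math.exp(eta * x - math.lgamma(x + 1) - math.exp(eta))

lam = 3.5
for x in range(5):
    classic = lam**x * math.exp(-lam) / math.factorial(x)
    assert abs(poisson_pmf_expfam(x, math.log(lam)) - classic) < 1e-12
print("exponential family form matches the classic pmf")
```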

Background on Graphical Models

The graphical model over x, given some graph G—a set of nodes and edges—is a set of distributions on x that satisfy the Markov independence assumptions with respect to G.64 In particular, an undirected graphical model provides a compact way to represent conditional independence among random variables—the Markov properties of the graph. Conditional independence relaxes the notion of full independence by defining which variables are independent given that the other variables are fixed or known.


More formally, let V be a set of d nodes corresponding to the d random variables in x, let E be the set of undirected edges connecting nodes in V, and let G = (V, E) be the corresponding undirected graph. By the Hammersley-Clifford theorem,65 any such distribution has the following form:

$$\mathbb{P}(x \mid \eta) = \exp\!\left( \sum_{C \in \mathcal{C}} \eta_C T_C(x_C) - A(\eta) \right) \quad (14)$$

where 𝒞 is a set of cliques (fully connected subgraphs) of G, and T_C(x_C) are the clique-wise sufficient statistics. For example, if C = {1, 2, 3} ∈ 𝒞, then there would be a term η_{1,2,3} T_{1,2,3}(x_1, x_2, x_3) that involves the first, second, and third random variables in x. Hence, a graphical model can be understood as an exponential family distribution with the form given in Eq. (14). An important special case—which will be the focus in this paper—is a pairwise graphical model, where 𝒞 consists of merely V and ℰ, i.e., |C| ∈ {1, 2}, ∀ C ∈ 𝒞, so that we have

$$\mathbb{P}(x \mid \eta) = \exp\!\left( \sum_{i \in V} \eta_i T_i(x_i) + \sum_{(i,j) \in \mathcal{E}} \eta_{ij} T_{ij}(x_i, x_j) - A(\eta) \right).$$

As graphical models provide direct interpretations of the Markov independence assumptions, for the Poisson-based graphical models in this section, we can easily investigate the conditional independence relationships between random variables rather than marginal correlations.

As an example, we will consider the Gaussian graphical model formulation of the standard multivariate normal distribution (for simplicity, we will assume the mean vector is zero, i.e., μ = 0):

Standard Form ⇔ Graphical Model Form

$$\Sigma = -\tfrac{1}{2}\Theta^{-1} \;\Leftrightarrow\; \Theta = -\tfrac{1}{2}\Sigma^{-1} \quad (15)$$

$$\mathbb{P}(x \mid \Sigma) \propto \exp\!\left( -\tfrac{1}{2} x^T \Sigma^{-1} x \right) \;\Leftrightarrow\; \mathbb{P}(x \mid \Theta) \propto \exp\!\left( x^T \Theta x \right) = \exp\!\left( \sum_i \theta_{ii} x_i^2 + \sum_{i \neq j} \theta_{ij} x_i x_j \right). \quad (16)$$

Note how Eq. (16) is related to Eq. (14) by setting η_i = θ_ii, η_ij = θ_ij, T_i(x_i) = x_i^2, T_ij(x_i, x_j) = x_i x_j, and ℰ = {(i, j) : i ≠ j, θ_ij ≠ 0}—i.e., the edges in the graph correspond to the non-zeros in Θ. In addition, this example shows that the marginal moments—i.e., the covariance matrix Σ—are quite different from the graphical model parameters—i.e., the negative of the inverse covariance matrix Θ = −½Σ^{−1}. In general, for graphical models, such as the Poisson graphical models (PGMs) defined in the next section, the transformation from the covariance to the graphical model parameter (Eq. (15)) is not known in closed form; in fact, this transformation is often very difficult to compute for non-Gaussian models.66 For more information about graphical models and exponential families, see Refs 66,67.
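The mapping in Eq. (15) can be illustrated numerically with a hypothetical three-variable chain graph (the numbers are arbitrary): a zero entry in Θ encodes conditional independence, yet the corresponding entry of Σ is nonzero, since the variables are still marginally dependent.

```python
import numpy as np

# Graphical-model parameters for a chain 1 - 2 - 3 (per Eqs. (15)-(16)):
# theta_13 = 0 encodes that x1 and x3 are conditionally independent given x2.
Theta = -0.5 * np.array([[ 2.0, -0.8,  0.0],
                         [-0.8,  2.0, -0.8],
                         [ 0.0, -0.8,  2.0]])
Sigma = -0.5 * np.linalg.inv(Theta)      # Sigma = -(1/2) Theta^{-1}, Eq. (15)
# Marginally, x1 and x3 ARE correlated (no zero appears in Sigma) ...
print(round(float(Sigma[0, 2]), 3))      # nonzero marginal covariance
# ... but mapping back recovers the missing edge exactly.
Theta_back = -0.5 * np.linalg.inv(Sigma)
print(abs(Theta_back[0, 2]) < 1e-9)      # the (1,3) entry is zero again
```

This is exactly why graphical model parameters, rather than covariances, are read off for conditional independence structure.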

Poisson Graphical Model

The first to consider multivariate extensions constructed by assuming that the conditional distributions are univariate exponential family distributions, including the Poisson distribution, was Besag58. In particular, suppose all node-conditional distributions—the conditional distribution of a node conditioned on the rest of the nodes—are univariate Poisson. Then, there is a unique joint distribution consistent with these node-conditional distributions under some conditions, and moreover, this joint distribution is a graphical model distribution that factors according to a graph specified by the node-conditional distributions. In fact, this approach is uniformly applicable to any exponential family beyond the Poisson distribution and can be extended to more general graphical model settings61,63 beyond the pairwise setting in Ref 58. The particular instance with the univariate Poisson as the exponential family underlying the node-conditional distributions is called a PGM.^c

Specifically, suppose that for every i ∈ {1, …, d}, the node-conditional distribution is specified by a univariate Poisson distribution in the exponential family form of Eq. (13):

$$\mathbb{P}(x_i \mid x_{-i}) = \exp\!\big\{ \psi(x_{-i})\, x_i - \log(x_i!) - \exp(\psi(x_{-i})) \big\}; \quad (17)$$

where x_{−i} is the set of all x_j except x_i, and the function ψ(x_{−i}) is any function that depends on all the random variables except x_i. Furthermore, suppose that the corresponding joint distribution on x factors according to the set of cliques 𝒞 of a graph G. Yang et al.63 then show that such a joint distribution consistent with the above node-conditional distributions exists and, moreover, necessarily has the form

$$\mathbb{P}(x \mid \eta) = \exp\!\left\{ \sum_{C \in \mathcal{C}} \eta_C \prod_{i \in C} x_i - \sum_{i=1}^{d} \log(x_i!) - A(\eta) \right\}; \quad (18)$$


where the function A(η) is the log-partition function on all parameters η = {η_C}_{C ∈ 𝒞}. The pairwise PGM, as a special case, is defined as follows:

$$\mathbb{P}_{\text{PGM}}(x \mid \eta) = \exp\!\left\{ \sum_{i=1}^{d} \eta_i x_i + \sum_{(i,j) \in \mathcal{E}} \eta_{ij} x_i x_j - \sum_{i=1}^{d} \log(x_i!) - A_{\text{PGM}}(\eta) \right\}; \quad (19)$$

where ℰ is the set of edges of the graphical model, and η = {η_1, η_2, …, η_d} ∪ {η_ij, ∀ (i, j) ∈ ℰ}. For notational simplicity and for the development of extensions to PGM, we will gather the node parameters η_i into a vector θ = [η_1, η_2, …, η_d] ∈ ℝ^d and gather the edge parameters into a symmetric matrix Φ ∈ ℝ^{d × d} such that ϕ_ij = ϕ_ji = η_ij/2, ∀ (i, j) ∈ ℰ and ϕ_ij = 0, ∀ (i, j) ∉ ℰ. Note that for PGM, Φ has zeros along the diagonal. With this notation, the pairwise PGM can be equivalently represented in a compact vectorized form as:

$$\mathbb{P}_{\text{PGM}}(x \mid \theta, \Phi) = \exp\!\left\{ \theta^T x + x^T \Phi x - \sum_{i=1}^{d} \log(x_i!) - A_{\text{PGM}}(\theta, \Phi) \right\}; \quad (20)$$

Parameter estimation in a PGM is naturally suggested by its construction: all of the PGM parameters in Eq. (20) can be estimated by considering the node-conditional distributions for each node separately and solving an ℓ1-regularized Poisson regression for each variable. In contrast to the previous approaches in the sections above, this parameter estimation approach is not only simple but is also guaranteed to be consistent even under high-dimensional sampling regimes and under some other mild conditions, including a sparse graph structural assumption (see Yang et al.61,63 for more details on the analysis). As in Poisson log-normal models, the parameters of PGM can be made to depend on covariates to allow for more flexible correlations.68
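A bare-bones sketch of one such node-wise ℓ1-regularized Poisson regression, solved by proximal gradient descent (ISTA). This is not the estimator implementation from Yang et al.; the data-generating parameters, step size, and penalty level are illustrative only.

```python
import numpy as np

def l1_poisson_regression(Z, y, alpha, iters=3000, lr=0.01):
    """ISTA sketch for one node-wise l1-regularized Poisson regression:
    minimize (1/n) * sum_t [exp(b0 + Z_t w) - y_t * (b0 + Z_t w)] + alpha * ||w||_1."""
    n, p = Z.shape
    w, b0 = np.zeros(p), 0.0
    for _ in range(iters):
        eta = np.clip(b0 + Z @ w, -20, 20)
        resid = np.exp(eta) - y                    # gradient factor of the Poisson loss
        gw, gb = Z.T @ resid / n, resid.mean()
        w, b0 = w - lr * gw, b0 - lr * gb
        w = np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0.0)  # soft threshold
    return b0, w

# Hypothetical 3-node example: node 3 depends (negatively) on node 1 only.
rng = np.random.default_rng(0)
x1 = rng.poisson(3.0, 2000)
x2 = rng.poisson(3.0, 2000)
x3 = rng.poisson(np.exp(1.0 - 0.25 * x1))
b0, w = l1_poisson_regression(np.column_stack([x1, x2]), x3, alpha=0.01)
print(np.round(w, 2))   # weight on x1 clearly negative, weight on x2 near zero
```

Running this regression once per node and symmetrizing the recovered supports yields the estimated edge set, which is the essence of the node-wise estimation strategy.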

In spite of its simple parameter estimation method, the major drawback of this vanilla PGM distribution is that it only permits negative conditional dependencies between variables:

Proposition 1. (Ref 58) Consider the PGM distribution in Eq. (20). Then, for any parameters θ and Φ, A_PGM(θ, Φ) < +∞ only if the pairwise parameters are nonpositive: ϕ_ij ≤ 0, ∀ (i, j) ∈ ℰ.

Intuitively, if any entry of Φ, say ϕ_ij, is positive, the term ϕ_ij x_i x_j in Eq. (20) grows quadratically, whereas the log base measure terms −log(x_i!) − log(x_j!) only decrease as O(x_i log x_i + x_j log x_j), so A(θ, Φ) → ∞ as x_i, x_j → ∞. Thus, even though the PGM is a natural extension of the univariate Poisson distribution (from the node-conditional viewpoint), it entails a highly restrictive parameter space, with severely limited applicability. Thus, multiple PGM extensions attempt to relax this negativity restriction to permit positive dependencies, as described next.
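The divergence argument above can be seen numerically: truncated sums of the unnormalized pairwise PGM mass blow up as the truncation grows when the pairwise parameter is positive, but stabilize when it is negative (the value 0.1 is arbitrary).

```python
import numpy as np
from scipy.special import gammaln

def partial_sum(phi, N):
    """Log of sum_{x1,x2=0..N} exp(phi*x1*x2 - log(x1!) - log(x2!)),
    computed stably via the log-sum-exp trick."""
    x = np.arange(N + 1)
    X1, X2 = np.meshgrid(x, x)
    logterms = phi * X1 * X2 - gammaln(X1 + 1) - gammaln(X2 + 1)
    m = logterms.max()
    return m + np.log(np.exp(logterms - m).sum())

# phi < 0: the partial (log-)sums settle down; phi > 0: they keep growing with N.
for phi in (-0.1, 0.1):
    print(phi, [round(partial_sum(phi, N), 1) for N in (20, 100, 150)])
```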

Extensions of PGMs

To circumvent the severe limitations of the PGM distribution, which in particular only permits negative conditional dependencies, several extensions to PGM that permit a richer dependence structure have been proposed.

Truncated PGM

Because the negativity constraint is due in part to the infinite domain of count variables, a natural solution would be to truncate the domain of the variables. Kaiser and Cressie69 first introduced an approach to truncate the Poisson distribution in the context of graphical models. Their idea was simply to use a Winsorized Poisson distribution for the node-conditional distributions: z is a Winsorized Poisson if z = I(z' < R) z' + I(z' ≥ R) R, where z' is Poisson, I(·) is an indicator function, and R is a fixed positive constant denoting the truncation level. However, Yang et al.62 showed that Winsorized node-conditional distributions actually do not lead to a consistent joint distribution.

As an alternative way of truncation, Yang et al.62 instead keep the same parametric form as PGM (Eq. (20)) but merely truncate the domain to non-negative integers less than or equal to R—i.e., D_TPGM = {0, 1, …, R}—so that the joint distribution takes the form63:

$$\mathbb{P}_{\text{TPGM}}(x) = \exp\!\left\{ \theta^T x + x^T \Phi x - \sum_i \log(x_i!) - A_{\text{TPGM}}(\theta, \Phi) \right\}. \quad (21)$$

As Yang et al.62 show, the node-conditional distributions of this graphical model distribution belong to an exponential family that is Poisson-like but with the domain bounded by R. Thus, the key difference from the vanilla PGM (Eq. (20)) is that the domain is finite, and hence, the log partition function A_TPGM(·) only involves a finite number of summations. Thus, no restrictions are imposed on the parameters for the normalizability of the distribution.
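Since the TPGM domain is finite, its log partition function can be computed exactly by enumeration for a toy example (the parameters below are chosen arbitrarily; note the positive pairwise parameter, which the vanilla PGM would forbid):

```python
import numpy as np
from itertools import product
from scipy.special import gammaln

def tpgm_log_partition(theta, Phi, R):
    """A_TPGM via brute-force enumeration of {0,...,R}^d (tractable only for tiny d)."""
    d = len(theta)
    logs = []
    for xt in product(range(R + 1), repeat=d):
        x = np.array(xt, dtype=float)
        logs.append(theta @ x + x @ Phi @ x - gammaln(x + 1).sum())
    logs = np.array(logs)
    m = logs.max()
    return m + np.log(np.exp(logs - m).sum())

# A POSITIVE pairwise parameter is fine here: truncation keeps A finite,
# whereas the corresponding infinite-domain PGM sum would diverge.
theta = np.array([0.5, 0.5])
Phi = np.array([[0.0, 0.2], [0.2, 0.0]])
A = tpgm_log_partition(theta, Phi, R=10)
print(bool(np.isfinite(A)))
```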

Yang et al.62 discuss several major drawbacks of TPGM. First, the domain needs to be bounded a priori, so R should ideally be set larger than any unseen observation. Second, the effective range of the parameter space for a nondegenerate distribution is still limited: as the truncation value R increases, the effective values of the pairwise parameters must become increasingly negative or close to zero; otherwise, the distribution can be degenerate, placing most of its probability mass at 0 or R.

Quadratic PGM and Sublinear PGM

Yang et al.62 also investigate the possibility of PGMs that (1) allow both positive and negative dependencies and (2) allow the domain to range over all non-negative integers. As described previously, a key reason for the negativity constraint on the pairwise parameters ϕ_ij in Eq. (20) is that the log base measure Σ_i log(x_i!) scales more slowly than the quadratic pairwise term x^T Φ x, where x ∈ ℤ^d_+. Yang et al.62 thus propose two possible solutions: increase the base measure or decrease the quadratic pairwise term.

First, if we modify the base measure of the Poisson distribution with 'Gaussian-esque' quadratic functions (note that for linear sufficient statistics with positive dependencies, the base measure should be quadratic at the very least62), then the joint distribution, which they call a quadratic PGM (QPGM), is normalizable while allowing both positive and negative dependencies62:

$$\mathbb{P}_{\text{QPGM}}(x) = \exp\!\left\{ \theta^T x + x^T \Phi x - A_{\text{QPGM}}(\theta, \Phi) \right\}. \quad (22)$$

Essentially, QPGM has the same form as the Gaussian distribution but where its domain is the set of non-negative integers. The key differences from PGM are that Φ can have negative values along the diagonal, and the Poisson base measure Σ_i −log(x_i!) is replaced by the quadratic term Σ_i ϕ_ii x_i^2. Note that a sufficient condition for the distribution to be normalizable is given by:

$$x^T \Phi x < -c \|x\|_2^2 \quad \forall x \in \mathbb{Z}^d_+; \quad (23)$$

for some constant c > 0, which in turn can be satisfied if Φ is negative definite. One significant drawback of QPGM is that the tail is Gaussian-esque and thin rather than Poisson-esque and thicker as in PGM.

Another possible modification is to use sublinear sufficient statistics in order to preserve the Poisson base measure and possibly heavier tails. Consider the following univariate distribution over count-valued variables:

$$\mathbb{P}(z) \propto \exp\!\left\{ \theta\, T(z; R_0, R) - \log(z!) \right\}; \quad (24)$$

which has the same base measure −log(z!) as the Poisson but with the following sublinear sufficient statistic:

$$T(z; R_0, R) = \begin{cases} z & \text{if } z \le R_0 \\[4pt] -\dfrac{z^2}{2(R - R_0)} + \dfrac{R\, z}{R - R_0} - \dfrac{R_0^2}{2(R - R_0)} & \text{if } R_0 < z \le R \\[4pt] \dfrac{R + R_0}{2} & \text{if } z \ge R. \end{cases} \quad (25)$$

For values of z up to R_0, T(z) increases linearly; after R_0, its slope decreases linearly; and finally, after R, T(z) becomes constant. The joint graphical model, which they call a sublinear PGM (SPGM), specified by node-conditional distributions belonging to the family in Eq. (24), has the following form:

$$\mathbb{P}_{\text{SPGM}}(x) = \exp\!\left\{ \theta^T T(x) + T(x)^T \Phi\, T(x) - \sum_i \log(x_i!) - A_{\text{SPGM}}(\theta, \Phi \mid R_0, R) \right\}; \quad (26)$$

where

$$A_{\text{SPGM}}(\theta, \Phi \mid R_0, R) = \log \sum_{x \in \mathbb{Z}^d_+} \exp\!\left\{ \theta^T T(x) + T(x)^T \Phi\, T(x) - \sum_i \log(x_i!) \right\}; \quad (27)$$

and T(x) is the entry-wise application of the function in Eq. (25). SPGM is always normalizable for ϕ_ij ∈ ℝ, ∀ i ≠ j.62
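The piecewise statistic in Eq. (25) is easy to implement and sanity-check; the knot values R_0 = 5 and R = 10 below are arbitrary.

```python
import numpy as np

def T_sublinear(z, R0, R):
    """Sublinear sufficient statistic of the SPGM, Eq. (25): linear up to R0,
    quadratically flattening on (R0, R], constant at (R + R0)/2 beyond R."""
    z = np.asarray(z, dtype=float)
    mid = -z**2 / (2 * (R - R0)) + R * z / (R - R0) - R0**2 / (2 * (R - R0))
    return np.where(z <= R0, z, np.where(z <= R, mid, (R + R0) / 2.0))

R0, R = 5, 10
z = np.arange(0, 16)
t = T_sublinear(z, R0, R)
# The three pieces agree at the knots: T(R0) = R0 and T(R) = (R + R0)/2,
# and T is nondecreasing, so the statistic is sublinear as claimed.
print(t[R0] == R0, t[R] == (R + R0) / 2)
```

Because T is bounded, the interaction term in Eq. (26) is eventually constant in each coordinate, which is why the Poisson base measure can dominate it for any sign of ϕ_ij.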

The main difficulty in estimating the PGM variants above with an infinite domain is the lack of closed-form expressions for the log partition function, even just for the node-conditional distributions that are needed for parameter estimation. Yang et al.62 propose an approximate estimation procedure that uses the univariate Poisson and Gaussian log-partition functions as upper bounds for the node-conditional log-partition functions for the QPGM and SPGM models, respectively.

Poisson Square Root Graphical Model

In a similar vein to the SPGM in the earlier section, Inouye et al.60 consider the use of exponential families with square-root sufficient statistics. While they consider general graphical model families, their PGM variant, called the square root graphical model (SQR), can be written as:

$$\mathbb{P}_{\text{SQR}}(x \mid \theta, \Phi) = \exp\!\left\{ \theta^T \sqrt{x} + \sqrt{x}^{\,T} \Phi \sqrt{x} - \sum_i \log(x_i!) - A_{\text{SQR}}(\theta, \Phi) \right\}; \quad (28)$$

where ϕ_ii can be non-zero, in contrast to the zero diagonal of the parameter matrix in Eq. (20). As with PGM, when there are no edges (i.e., ϕ_ij = 0 ∀ i ≠ j) and θ = 0, this reduces to the independent Poisson model. The node conditionals of this distribution have the form:

$$\mathbb{P}(x_i \mid x_{-i}) \propto \exp\!\left\{ \phi_{ii} x_i + \left( \theta_i + 2\phi_{i,-i}^T \sqrt{x_{-i}} \right) \sqrt{x_i} - \log(x_i!) \right\}; \quad (29)$$

where ϕ_{i,−i} is the i-th column of Φ with the i-th entry removed. This can be rewritten in the form of a two-parameter exponential family:

$$\mathbb{P}(x_i \mid \eta_1, \eta_2) = \exp\!\left\{ \eta_1 x_i + \eta_2 \sqrt{x_i} - \log(x_i!) - A(\eta_1, \eta_2) \right\}; \quad (30)$$

where η_1 = ϕ_ii, η_2 = θ_i + 2ϕ_{i,−i}^T √x_{−i}, and A(η_1, η_2) is the log partition function. Note that a key difference from the PGM variants in the previous section is that the diagonal of Φ_SQR can be non-zero, whereas the diagonal of Φ_PGM must be zero. Because the interaction term √x^T Φ √x is asymptotically linear rather than quadratic, the Poisson SQR graphical model does not suffer from the degenerate distributions of TPGM or of the fixed-length PGM (FLPGM) discussed in the next section, while still allowing both positive and negative dependencies.
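The normalizability of the SQR node-conditional family in Eq. (30) can be checked numerically by truncated summation (the η values are arbitrary):

```python
import numpy as np
from scipy.special import gammaln

def sqr_node_log_partition(eta1, eta2, xmax=500):
    """Truncated evaluation of A(eta1, eta2) = log sum_x exp(eta1*x + eta2*sqrt(x) - log x!).
    The -log(x!) base measure grows like -x*log(x), so the sum converges for ANY eta1, eta2."""
    x = np.arange(xmax + 1, dtype=float)
    logterms = eta1 * x + eta2 * np.sqrt(x) - gammaln(x + 1)
    m = logterms.max()
    return m + np.log(np.exp(logterms - m).sum())

# Even a positive eta1 (impossible for a pairwise parameter in vanilla PGM) is fine:
A = sqr_node_log_partition(1.5, 0.8)
x = np.arange(501.0)
p = np.exp(1.5 * x + 0.8 * np.sqrt(x) - gammaln(x + 1) - A)
print(bool(np.isfinite(A)), round(float(p.sum()), 6))  # True 1.0
```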

To show that SQR graphical models can easily be normalized, Inouye et al.60 first define radial-conditional distributions. The radial-conditional distribution assumes that the unit direction is fixed, but the length of the vector is unknown. The difference between the standard 1D node-conditional distributions and the 1D radial-conditional distributions is illustrated in Figure 3. Suppose we condition on the unit direction v = x/‖x‖₁ of the sufficient statistics, but the scaling of this unit direction z = ‖x‖₁ is unknown. With this notation, Inouye et al.60 define the radial-conditional distribution as:

$$\mathbb{P}(x = zv \mid v, \theta, \Phi) \propto \exp\Big\{ \theta^T \sqrt{zv} + \sqrt{zv}^{\,T} \Phi \sqrt{zv} - \sum_i \log((zv_i)!) \Big\}$$
$$\propto \exp\Big\{ \big(\theta^T \sqrt{v}\big) \sqrt{z} + \big(\sqrt{v}^{\,T} \Phi \sqrt{v}\big) z - \sum_i \log((zv_i)!) \Big\}.$$

Similar to the node-conditional distribution, the radial-conditional distribution can be rewritten as a two-parameter exponential family:

$$\mathbb{P}(z \mid v, \theta, \Phi) = \exp\Big\{ \underbrace{\eta_1 z + \eta_2 \sqrt{z}}_{O(z)} + \underbrace{\tilde{B}_v(z)}_{O(-z \log z)} - A_{\mathrm{rad}}(\eta_1, \eta_2) \Big\}; \qquad (31)$$

FIGURE 3 | Node-conditional distributions (left) are univariate probability distributions of one variable conditioned on the other variables, while radial-conditional distributions (right) are univariate probability distributions of the vector scaling conditioned on the vector direction. Both conditional distributions are helpful in understanding square root (SQR) graphical models. (Illustration from Ref 60)

Advanced Review wires.wiley.com/compstats

14 of 25 © 2017 Wiley Periodicals, Inc. Volume 9, May/June 2017


where η₁ = √vᵀΦ√v, η₂ = θᵀ√v, and B̃_v(z) = −Σ_{i=1}^d log((zv_i)!). The only difference between this exponential family and the node-conditional distribution is the base measure, i.e., −Σ_{i=1}^d log((zv_i)!) ≠ −log(z!). However, note that the log base measure is still O(−z log(z)), and thus the log base measure will overcome the linear term as z → ∞. Therefore, the radial-conditional distribution is normalizable for any η₁, η₂ ∈ ℝ.

With the radial-conditional distribution notation, Inouye et al.60 show that the log partition function for Poisson SQR graphical models is finite by separating the summation into a nested radial direction and scalar summation. Let V = {v : ‖v‖₁ = 1, v ∈ ℝ₊ᵈ} be the set of unit vectors in the positive orthant. The SQR log partition function A_SQR(θ, Φ) can be decomposed into a nested summation over the unit direction and the one-dimensional radial conditional:

$$A_{\mathrm{SQR}}(\theta, \Phi) = \log \int_{v \in V} \sum_{z \in \hat{\mathbb{Z}}} \exp\Big\{ \eta_1(v \mid \Phi)\, z + \eta_2(v \mid \theta) \sqrt{z} - \sum_i \log((z v_i)!) \Big\}\, dv; \qquad (32)$$

where η₁(v|Φ) and η₂(v|θ) are the radial-conditional parameters as defined above, and ℤ̂ = {z : zv ∈ ℤ₊ᵈ}. Note that ℤ̂ ⊆ ℤ, and thus the inner summation can be replaced by the radial-conditional log partition function. Therefore, because V is a bounded set and the radial-conditional log partition function is finite for any η₁(v|Φ) and η₂(v|θ), A_SQR < ∞ and the Poisson SQR joint distribution is normalizable.

The main drawback of the Poisson SQR is that, for parameter estimation, the log partition function A(η₁, η₂) of the node conditionals in Eq. (30) is not known in closed form in general. Inouye et al.60 provide a closed-form estimate for the exponential SQR, but a closed-form solution for the Poisson SQR model seems unlikely to exist. Inouye et al.60 suggest numerically approximating A(η₁, η₂), as it only requires a one-dimensional summation.
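Because the base measure −log(x!) eventually dominates the linear term, the one-dimensional summation defining A(η₁, η₂) converges for any real η₁, η₂ and can be truncated and computed stably with a log-sum-exp. A minimal sketch (not the authors' code; the truncation point `z_max` is an illustrative choice):

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_partition(eta1, eta2, z_max=1000):
    """Truncated 1D approximation of A(eta1, eta2) = log sum_{x>=0}
    exp(eta1*x + eta2*sqrt(x) - log(x!)).

    The base measure -log(x!) ~ -x*log(x) dominates the linear term, so the
    series converges for any real eta1, eta2 and the tail beyond a moderate
    z_max is negligible; logsumexp keeps the computation stable."""
    x = np.arange(z_max + 1)
    return logsumexp(eta1 * x + eta2 * np.sqrt(x) - gammaln(x + 1))
```

As a sanity check, with η₁ = log λ and η₂ = 0 the node conditional is Poisson(λ), and the summation recovers A = λ; the sum also remains finite for positive η₁, unlike in the original PGM.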

Local PGM

Inspired by the neighborhood selection technique of Meinshausen and Bühlmann70, Allen and Liu71,72 propose to estimate the network structure of count-valued data by fitting a series of ℓ₁-regularized Poisson regressions to estimate the node neighborhoods. Such an estimation method may yield interesting network estimates, but as Allen and Liu72 note, these estimates do not correspond to a consistent joint density. Instead, the underlying model is defined in terms of a series of local models where each variable is conditionally Poisson given its node neighbors; this approach is thus termed the local PGM (LPGM). Note that the LPGM does not impose any restrictions on the parameter space or types of dependencies; if the parameter space of each local model were constrained to be non-positive, then the LPGM would reduce to the vanilla PGM as previously discussed. Hence, the LPGM is less interesting as a candidate multivariate model for count-valued data, but many may still find its simple and interpretable network estimates appealing. Recently, several studies have proposed to adopt this estimation strategy for alternative network types.73,74
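The LPGM estimation strategy is easy to sketch: each node is regressed on all other variables with an ℓ₁-penalized Poisson regression, and the nonzero coefficients define that node's estimated neighborhood. The following is a simplified illustration (not the code of Refs 71,72); the proximal-gradient solver, step size, and zero threshold are illustrative choices:

```python
import numpy as np

def l1_poisson_regression(X, y, lam, step=1e-3, iters=2000):
    """ISTA (proximal gradient) for l1-regularized Poisson regression:
    minimize (1/n) * sum_i [exp(x_i'w) - y_i * x_i'w] + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        mu = np.exp(X @ w)                 # conditional means exp(x'w)
        w = w - step * (X.T @ (mu - y)) / n  # gradient step on the Poisson NLL
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w

def lpgm_neighborhoods(data, lam):
    """One l1-penalized Poisson regression per node; the nonzero coefficients
    define the estimated neighborhood of that node."""
    n, d = data.shape
    nbrs = {}
    for s in range(d):
        X = np.delete(data, s, axis=1).astype(float)
        w = l1_poisson_regression(X, data[:, s].astype(float), lam)
        others = [t for t in range(d) if t != s]
        nbrs[s] = [others[j] for j in np.flatnonzero(np.abs(w) > 1e-6)]
    return nbrs
```

Note that nothing forces the recovered neighborhoods to be symmetric; in practice the node-wise estimates are typically symmetrized by taking the union or intersection of the edge sets.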

Fixed-Length Poisson MRFs

In a somewhat different direction, Inouye et al.59 propose a distribution that has the same parametric form as the original PGM but allows positive dependencies by decomposing the joint distribution into two distributions. The first distribution is the marginal distribution over the length of the vector, denoted ℙ(L), i.e., the distribution of the ℓ₁-norm of the vector or the total sum of counts. The second distribution, the FLPGM, is the conditional distribution of PGM given that the vector length L is known or fixed, denoted ℙ_FLPGM(x | ‖x‖₁ = L). Note that this allows the marginal distribution on length and the distribution given the length to be specified independently.d The restriction to negative dependencies is removed because the second distribution, given the vector length, ℙ_FLPGM(x | ‖x‖₁ = L), has a finite domain D_FLPGM = {x : x ∈ ℤ₊ᵈ, ‖x‖₁ = L} and is thus trivially normalizable, similar to the normalizability of the finite-domain TPGM. More formally, Ref 59 defined the FLPGM as:

$$\mathbb{P}(x \mid \theta, \Phi, \lambda) = \mathbb{P}(L \mid \lambda)\, \mathbb{P}_{\mathrm{FLPGM}}(x \mid \|x\|_1 = L, \theta, \Phi); \qquad (33)$$

$$\mathbb{P}_{\mathrm{FLPGM}}(x \mid \|x\|_1 = L, \theta, \Phi) = \exp\Big\{ \theta^T x + x^T \Phi x - \sum_i \log(x_i!) - A_L(\theta, \Phi) \Big\}; \qquad (34)$$

where λ is the parameter for the marginal length distribution, which could be Poisson, negative binomial, or any other distribution on non-negative integers. In addition, the FLPGM could be used as a replacement for the multinomial distribution because it has the same domain as the multinomial and actually reduces to the multinomial if there are no dependencies. Earlier, Altham and Hankin75 developed an identical model by generalizing an earlier bivariate generalization of


the binomial.76 However, in Refs 75,76, the model assumed that L was constant over all samples, whereas Ref 59 allowed L to vary for each sample according to some distribution ℙ(L).
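The reduction to the multinomial when Φ = 0 can be verified directly: with no interactions, exp(θᵀx − Σᵢ log(xᵢ!)) restricted to {x : ‖x‖₁ = L} is proportional to the multinomial pmf with pᵢ ∝ e^{θᵢ}. A small brute-force check (illustrative code, feasible only for tiny d and L):

```python
import itertools
from math import factorial

import numpy as np

def flpgm_pmf_no_edges(theta, L):
    """FLPGM with Phi = 0: normalize exp(theta'x - sum_i log(x_i!)) over the
    finite domain {x in Z_+^d : ||x||_1 = L} by brute-force enumeration."""
    d = len(theta)
    support = [x for x in itertools.product(range(L + 1), repeat=d) if sum(x) == L]
    logw = np.array([np.dot(theta, x) - sum(np.log(float(factorial(xi))) for xi in x)
                     for x in support])
    w = np.exp(logw)
    return support, w / w.sum()

def multinomial_pmf(x, p, L):
    """Standard multinomial pmf: L! / prod(x_i!) * prod(p_i^x_i)."""
    coef = factorial(L) / np.prod([float(factorial(xi)) for xi in x])
    return coef * np.prod([pi ** xi for pi, xi in zip(p, x)])
```

With pᵢ = e^{θᵢ}/Σⱼ e^{θⱼ}, the two pmfs agree entrywise over the fixed-length support.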

One significant drawback is that the FLPGM is not amenable to the simple node-wise parameter estimation method of the previous PGM models. Nonetheless, in Inouye et al.59, the parameters are heuristically estimated with Poisson regressions similar to PGM, although the theoretical properties of this heuristic estimate are unknown. Another drawback is that while the FLPGM allows for positive dependencies, the distribution can still yield a degenerate distribution for large values of L, similar to the problem of TPGM where the mass is concentrated near 0 or R. Thus, Inouye et al.59 introduce a decreasing weighting function ω(L) that scales the interaction term:

$$\mathbb{P}_{\mathrm{FLPGM}}(x \mid \|x\|_1 = L, \theta, \Phi, \omega(\cdot)) = \exp\Big\{ \theta^T x + \omega(L)\, x^T \Phi x - \sum_i \log(x_i!) - A_L(\theta, \omega(L)\Phi) \Big\}. \qquad (35)$$

While the log likelihood is not available in tractable form, Inouye et al.59 approximate the log likelihood using annealed importance sampling,77 which might be applicable to the extensions covered previously as well.

Summary of Conditional Poisson Generalizations

The conditional Poisson models benefit from the rich literature on exponential families and undirected graphical models, or Markov random fields. In addition, the conditional Poisson models have a simple parametric form. The historical PGM, or the auto-Poisson model58, only allowed negative dependencies between variables. Multiple extensions have sought to overcome this severe limitation by altering the PGM so that the log partition function is finite even with positive dependencies. One major drawback of the graphical model approach is that computing the likelihood requires the approximation of the joint log partition function A(θ, Φ); a related problem is that the distribution moments and marginals are not known in closed form. Despite these drawbacks, parameter estimation using composite likelihood methods via ℓ₁-penalized node-wise regressions (in which the joint likelihood is not computed) has solid theoretical properties under certain conditions.

MODEL COMPARISON

We compare models by first discussing two structural aspects of the models: (a) interpretability and (b) the relative stringency and ease of verifying theoretical assumptions and guarantees. We then present and discuss an empirical comparison of the models on three real-world datasets.

Comparison of Model Interpretation

Marginal models can be interpreted as weakly decoupling the univariate marginal distributions from the dependency structure between the variables. However, in the discrete case, specifically for distributions based on pairing copulas with Poisson marginals, the dependency structure estimation is also dependent on the marginal estimation, unlike for copulas paired with continuous marginals.23 Conditional models, or graphical models, on the other hand, can be interpreted as specifying generative models for each variable given the variable's neighborhood (i.e., the conditional distribution). In addition, dependencies in graphical models can be visualized and interpreted via networks. Here, each variable is a node, and the weighted edges in the network structure depict the pairwise conditional dependencies between variables. The simple network depiction for graphical models may enable domain experts to interpret complex dependency structures more easily compared to other models. Overall, marginal models may be preferred if modeling the statistics of the data, particularly the marginal statistics over individual variables, is of primary importance, while conditional models may be preferred if the prediction of some variables, given others, is of primary importance. Mixture models may be more or less difficult to interpret depending on whether there is an application-specific interpretation of the latent mixing variable. For example, a finite mixture of two Poisson distributions may model the crime statistics of a city that contains downtown and suburban areas. On the other hand, a finite mixture of Poisson distributions or a log-normal Poisson mixture when modeling crash severity counts (as seen in the empirical comparison section) seems more difficult to interpret; even if the model empirically fits the data well, the hidden mixture variable might not have an obvious application-specific interpretation.

Comparison of Theoretical Considerations

The estimation of marginal models from data has various theoretical problems, as evidenced by the analysis of copulas paired with discrete marginals in


Ref 23. The extent to which these theoretical problems cause any significant practical issues remains unclear. In particular, the estimators of the marginal distributions themselves typically have easily checked assumptions, as the empirical marginal distributions can be inspected directly. On the other hand, the estimation of conditional models is both computationally tractable and comes with strong theoretical guarantees even under high-dimensional regimes where n < d.63 However, the assumptions under which the guarantees of the estimators hold are difficult to check in practice and could cause problems if they are violated (e.g., outliers caused by unobserved factors). The estimation of mixture models tends to have limited theoretical guarantees. In particular, finite Poisson mixture models have very weak assumptions on the underlying distribution, eventually becoming a nonparametric distribution if k = O(n), but the estimation problems are in theory extremely difficult, with very few theoretical guarantees for practical estimators. Yet, as seen in the next section, empirically estimating a finite mixture model using EM iterations performs well in practice.
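As a concrete reference point for the EM estimation just mentioned, the iterations for a finite mixture of independent Poissons are only a few lines. A minimal sketch (not the code used in the experiments; the sorted-total-count initialization is an illustrative choice):

```python
import numpy as np
from scipy.stats import poisson

def poisson_mixture_em(X, k, iters=200):
    """EM for a finite mixture of k independent multivariate Poissons.
    X: (n, d) count matrix. Returns mixing weights pi (k,) and rates lam (k, d).
    Initialization spreads components along the sorted total counts."""
    n, d = X.shape
    pi = np.full(k, 1.0 / k)
    order = np.argsort(X.sum(axis=1))
    lam = X[order[np.linspace(0, n - 1, k).astype(int)]].astype(float) + 0.5
    for _ in range(iters):
        # E-step: responsibilities from per-component log-likelihoods
        logp = np.log(pi)[None, :] + poisson.logpmf(X[:, None, :],
                                                    lam[None, :, :]).sum(axis=2)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted component proportions and means
        nk = r.sum(axis=0)
        pi = np.maximum(nk, 1e-12) / n
        lam = (r.T @ X) / nk[:, None] + 1e-10
    return pi, lam
```

On well-separated data, a handful of iterations already recovers the component rates; the closed-form M-step (weighted means) is what makes the finite Poisson mixture so cheap to fit.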

Empirical Comparison

In this section, we seek to empirically compare models from the three classes presented to assess how well they fit real-world count data.

Comparison Experimental Setup

We empirically compare models on selected datasets from three diverse domains, which have different data characteristics in terms of their mean count values and dispersion indices (Eq. 5), as can be seen in Table 1. The crash severity dataset is a small accident dataset from Ref 78 with three different count variables corresponding to crash severity classes: 'Property-only', 'Possible Injury', and 'Injury'. The crash severity data exhibit high count values and high overdispersion. We retrieve raw next generation sequencing data for breast cancer (BRCA) using the software TCGA2STAT79 and compute a simple log-count transformation of the raw counts, ⌊log(x+1)⌋, a common preprocessing technique for RNA-Seq data. The BRCA data exhibit medium counts and medium overdispersion. We collect the word count vectors from the Classic3 text corpus, which contains abstracts from aerospace engineering, medical, and information sciences journals.e The Classic3 dataset exhibits low counts, including many zeros, and medium overdispersion. In the Supporting information, we also give results for a crime statistics dataset and the 20 Newsgroups dataset, but they have similar characteristics and perform similarly to the BRCA and Classic3 datasets, respectively; thus, we omit them for simplicity. We select variables (e.g., for d = 10 or d = 100) by sorting the variables by mean count value, or by variance in the case of the BRCA dataset, as highly variable genes are of more interest in biology.
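The two dataset characteristics used throughout this section are simple to compute. A minimal sketch of the ⌊log(x+1)⌋ preprocessing and the per-variable dispersion index (variance divided by mean) summarized in Table 1:

```python
import numpy as np

def log_count_transform(X):
    """floor(log(x + 1)), the preprocessing applied to the raw RNA-Seq counts."""
    return np.floor(np.log(np.asarray(X) + 1.0)).astype(int)

def per_variable_stats(X):
    """Per-variable mean and dispersion index (variance / mean), the
    quantities summarized as min / median / max in Table 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    dispersion = X.var(axis=0) / mean
    return mean, dispersion
```

A dispersion index above one indicates overdispersion relative to the Poisson, for which the variance equals the mean.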

In order to understand how each model might perform under varying data characteristics, we consider the following two questions: (1) How well does the model (i.e., the joint distribution) fit the underlying data distribution? and (2) How well does the model capture the dependency structure between variables? To help answer these questions, we evaluate the empirical fit of models using two metrics, which only require samples from the model. The first metric is based on a statistic called maximum mean discrepancy (MMD),80 which estimates the maximum moment difference over all possible moments. The empirical MMD can be approximated as follows from two sets of samples X ∈ ℝ^{n₁×d} and Y ∈ ℝ^{n₂×d}:

$$\widehat{\mathrm{MMD}}(\mathcal{G}, X, Y) = \sup_{f \in \mathcal{G}} \left| \frac{1}{n_1} \sum_{i=1}^{n_1} f(x_i) - \frac{1}{n_2} \sum_{j=1}^{n_2} f(y_j) \right|; \qquad (36)$$

TABLE 1 | Dataset Statistics

                             |      Means       | Dispersion Indices |   Spearman's ρ
Dataset          d     n     |  Min   Med   Max |  Min    Med   Max  |  Min     Med    Max
Crash Severity   3     275   |  3.4   3.8   9.7 |  6      9.3   16   |  0.61    0.73   0.79
BRCA             10    878   |  3.2   5     7.7 |  1.5    2.2   3.8  | −0.2     0.25   0.95
BRCA             100   878   |  1.1   4     9   |  0.63   1.7   4.6  | −0.5     0.08   0.95
BRCA             1000  878   |  0.51  3.5   11  |  0.26   1     4.6  | −0.64    0.06   0.97
Classic3         10    3893  |  0.26  0.33  0.51|  1.4    3.4   3.8  | −0.17    0.12   0.82
Classic3         100   3893  |  0.09  0.14  0.51|  1.1    2.1   8.3  | −0.17    0.02   0.82
Classic3         1000  3893  |  0.02  0.03  0.51|  0.98   1.7   8.5  | −0.17   −0.00   0.82

Means, dispersion indices, and Spearman's ρ values are computed per variable.


where G is the union of the RKHS spaces based on the Gaussian kernel using 21 σ values log-spaced between 0.01 and 100. In our experiments, we estimate the MMD between the pairwise marginals of model samples and the pairwise marginals of the original observations:

$$D^{\mathrm{MMD}}_{st} = \begin{cases} \widehat{\mathrm{MMD}}\big(\mathcal{G}, \big[x^{(s)}\big], \big[\hat{x}^{(s)}\big]\big), & s = t, \\ \widehat{\mathrm{MMD}}\big(\mathcal{G}, \big[x^{(s)}, x^{(t)}\big], \big[\hat{x}^{(s)}, \hat{x}^{(t)}\big]\big), & \text{otherwise}, \end{cases} \qquad (37)$$

where x^{(s)} is the vector of data for the s-th variable of the true data, and x̂^{(s)} is the vector of data for the s-th variable of samples from the estimated model, i.e., x^{(s)} are observations from the true underlying distribution, and x̂^{(s)} are samples from the estimated model distribution. In our experiments, we use the fast approximation code for MMD from Ref 81

with 26 basis vectors for the FastMMD approximation algorithm. The second metric merely computes the absolute difference between the pairwise Spearman's ρ values of model samples and the Spearman's ρ values of the original observations:

$$D^{\rho}_{st} = \big| \rho\big(x^{(s)}, x^{(t)}\big) - \rho\big(\hat{x}^{(s)}, \hat{x}^{(t)}\big) \big|, \quad \forall\, s, t. \qquad (38)$$

The MMD metric is of more general interest because it evaluates whether the models actually fit the empirical data distribution, while the Spearman metric may be more interesting for practitioners who primarily care about the dependency structure, such as biologists who specifically want to study gene dependencies rather than gene distributions.
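Both metrics can be sketched directly from their definitions. The illustration below uses exact Gaussian kernels maximized over the same 21-point σ grid (the paper instead uses the FastMMD approximation of Ref 81) and SciPy's Spearman correlation:

```python
import numpy as np
from scipy.stats import spearmanr

def gaussian_mmd(X, Y, sigmas=np.logspace(-2, 2, 21)):
    """Biased empirical MMD with Gaussian kernels, maximized over 21
    log-spaced bandwidths in [0.01, 100]. X: (n1, d), Y: (n2, d)."""
    def sq_dists(A, B):
        return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    dxx, dyy, dxy = sq_dists(X, X), sq_dists(Y, Y), sq_dists(X, Y)
    best = 0.0
    for s in sigmas:
        g = 1.0 / (2.0 * s ** 2)
        # MMD^2 = mean K(X,X) + mean K(Y,Y) - 2 * mean K(X,Y)
        mmd2 = (np.exp(-g * dxx).mean() + np.exp(-g * dyy).mean()
                - 2.0 * np.exp(-g * dxy).mean())
        best = max(best, np.sqrt(max(mmd2, 0.0)))
    return best

def spearman_difference(X, Xhat):
    """Entrywise |rho(x_s, x_t) - rho(xhat_s, xhat_t)| between the pairwise
    Spearman correlation matrices of the data and the model samples (Eq. 38)."""
    return np.abs(spearmanr(X).correlation - spearmanr(Xhat).correlation)
```

The exact kernel version scales quadratically in the number of samples, which is precisely why the experiments rely on the FastMMD approximation.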

We empirically compare the model fits on these real-world datasets for several types of models from the three general classes presented. As a baseline, we estimate an independent Poisson model ('Ind Poisson'). We include Gaussian copulas and vine copulas, both paired with Poisson marginals ('Copula Poisson' and 'Vine Poisson'), to represent the marginal model class. We estimate the copula-based models via the two-stage IFM method35 via the DT.26 For the mixture class, we include both a simple finite mixture of independent Poissons ('Mixture Poiss') and a log-normal mixture of Poissons ('Log-Normal'). The finite mixture was estimated using a simple EM algorithm; the log-normal mixture model was estimated via MCMC sampling using the code from Ref 55. For the conditional model class, we estimate the simple PGM ('PGM'), which only allows negative dependencies, and three variants that allow for positive dependencies: the truncated PGM ('Truncated PGM'), the FLPGM with a Poisson distribution on the vector length L = ‖x‖₁ ('FLPGM Poisson'), and the Poisson square root graphical model ('Poisson SQR'). Using composite likelihood methods of ℓ₁-penalized node-wise regressions, we estimate these models via code from Refs 63,82,83 and the XMRFf R package. After parameter estimation, we generate 1000 samples for each method using different types of sampling for each of the model classes.
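As an example of the class-specific sampling, drawing from a Gaussian copula paired with Poisson marginals only requires a latent Gaussian draw followed by inversion of the Poisson CDFs. A minimal sketch (illustrative code, not the experiment implementation):

```python
import numpy as np
from scipy.stats import norm, poisson

def sample_copula_poisson(n, corr, rates, seed=0):
    """Sample a Gaussian copula paired with Poisson marginals: draw latent
    z ~ N(0, corr), map to uniforms u = Phi(z), then invert each Poisson CDF,
    x_s = F_s^{-1}(u_s). corr: (d, d) latent correlation matrix; rates: (d,)
    Poisson means."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(rates)), corr, size=n)
    u = norm.cdf(z)                     # uniform marginals with Gaussian dependence
    return poisson.ppf(u, np.asarray(rates)[None, :]).astype(int)
```

The latent correlation controls the rank dependence of the counts while each marginal remains exactly Poisson, which is the weak decoupling that makes this model class attractive.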

To avoid overfitting to the data, we employ threefold cross-validation and report the average over the three folds. Because the conditional models (PGM, TPGM, FLPGM, and Poisson SQR) can be significantly different depending on the regularization parameter, i.e., the weight of the ℓ₁ regularization term in the objective function for these models, we select the regularization parameter of these models by computing the metrics on a tuning split of the training data. For the mixture model, we similarly tune the number of components k by testing k = {10, 20, 30, …, 100}. For the very high-dimensional datasets where d = 1000, we use a regularization parameter near the tuning parameters found when d = 100 and fix k = 50 in order to avoid the extra computation of selecting a parameter. More sampling and implementation details for each model are available in the Supporting information.

Empirical Comparison Results

The full results for both the MMD and Spearman's ρ metrics for the crash severity, breast cancer RNA-Seq, and Classic3 text datasets can be seen in Figures 4, 5, and 6, respectively. The low-dimensional results (d ≤ 10) give evidence across all the datasets that three models outperform the others in their classesg: the Gaussian copula paired with Poisson marginals model ('Copula Poisson') for the marginal model class, the mixture of Poissons distribution ('Mixture Poiss') for the mixture model class, and the Poisson SQR distribution ('Poisson SQR') for the conditional model class. Thus, we only include these representative models, along with an independent Poisson baseline, in the high-dimensional experiments when d > 10. We discuss the results for specific data characteristics as represented by each dataset.h

For the crash severity dataset with high counts and high overdispersion (Figure 4), mixture models (i.e., 'Log-Normal' and 'Mixture Poiss') perform the best, as expected, because they can model overdispersion well. However, if the dependency structure is the only object of interest, the Gaussian copula paired with Poisson marginals ('Copula Poisson') performs well. For the BRCA dataset with medium counts and medium overdispersion (Figure 5), we note similar


trends with two notable exceptions: (1) the Poisson SQR model actually performs reasonably in low dimensions, suggesting that it can model moderate overdispersion, and (2) the high-dimensional (d ≥ 100) Spearman's ρ difference results show that the Gaussian copula paired with Poisson marginals ('Copula Poisson') performs significantly better than the mixture model; this result suggests that copulas paired with Poisson marginals are likely better for modeling dependencies than mixture models. Finally, for the Classic3 dataset with low counts and medium overdispersion (Figure 6), the Poisson SQR model seems to perform well in this low-count setting, especially in low dimensions, unlike in previous data settings. While the simple independent mixture of Poisson distributions still performs well, the Poisson log-normal mixture distribution ('Log-Normal') performs quite poorly in this setting with small counts and many zeros. This poor performance of the Poisson log-normal mixture is somewhat surprising, as the dispersion indices are almost all greater than one, as seen in Table 1. The differing results between low counts and medium counts with similar overdispersion demonstrate the importance of considering both the overdispersion and the mean count values when characterizing a dataset.

In summary, we note several overall trends. Mixture models are important for overdispersion when counts are medium or high. The Gaussian copula with Poisson marginals joint distribution can estimate dependency structure (per the Spearman metric) for a wide range of data characteristics, even when the distribution does not fit the underlying data (per the MMD metric). The Poisson SQR model performs well for low count values with many zeros (i.e., sparse data) and may be able to handle moderate overdispersion.

DISCUSSION

While this review analyzes each model class separately, it would be quite interesting to consider combinations of or synergies between the model classes. Because negative binomial distributions can be viewed as a gamma–Poisson mixture model, one simple idea is to consider pairing a copula with negative binomial marginals or developing a negative binomial SQR graphical model. As another example, we could form a finite mixture of copula-based or graphical-model-based models. This might combine the strengths of a mixture model in handling multiple modes and overdispersion with the strengths of the copula-based models and graphical models, which can explicitly model dependencies.
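The gamma–Poisson view of the negative binomial is easy to check by simulation: mixing a Gamma(r, scale = (1−p)/p) rate into a Poisson yields an NB(r, p) marginal with mean r(1−p)/p and variance r(1−p)/p², i.e., built-in overdispersion (an illustrative simulation, not results from the paper):

```python
import numpy as np

# Simulation check of the gamma-Poisson view of the negative binomial:
# mixing lambda ~ Gamma(shape=r, scale=(1-p)/p) into a Poisson gives an
# NB(r, p) marginal with mean r(1-p)/p and variance r(1-p)/p^2 > mean.
rng = np.random.default_rng(0)
r, p = 5.0, 0.4
lam = rng.gamma(shape=r, scale=(1 - p) / p, size=200_000)
x = rng.poisson(lam)            # gamma-Poisson mixture draws
mean_nb = r * (1 - p) / p       # NB mean: 7.5
var_nb = r * (1 - p) / p ** 2   # NB variance: 18.75 (overdispersed)
```

The sample mean and variance of the mixture draws match the negative binomial moments, which motivates swapping negative binomial marginals into the copula and SQR constructions above.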

We may also consider how one type of model informs the other. For example, by the generalized Sklar's theorem,26 each conditional PGM actually induces a copula, just as the Gaussian graphical

FIGURE 4 | Crash severity dataset (high counts and high overdispersion): maximum mean discrepancy (left) and Spearman's ρ difference (right). As expected, for high overdispersion, mixture models ('Log-Normal' and 'Mixture Poiss') seem to perform the best.


model induces the Gaussian copula. Studying the copulas induced by graphical models seems to be a relatively unexplored area. On the other hand, it may be useful to consider fitting a Gaussian copula paired with discrete marginals using the theoretically grounded techniques from graphical models for sparse dependency structure estimation, especially in the small sample regimes where d > n; this has been studied for the case of continuous marginals in Ref 84. Overall, bringing together and comparing these diverse paradigms for probability models opens up the door for many combinations and synergies.

CONCLUSION

We have reviewed three main approaches to constructing multivariate distributions derived from the Poisson using three different assumptions: (1) the marginal distributions are derived from the Poisson, (2) the joint distribution is a mixture of independent Poisson distributions, and (3) the node-conditional distributions are derived from the Poisson. The first class, based on Poisson marginals, and particularly the general approach of pairing copulas with Poisson marginals, provides an elegant way to partiallyi decouple the marginals from the dependency structure and gives strong empirical results despite some theoretical issues related to nonuniqueness. While advanced methods to estimate the joint distribution of copulas paired with discrete marginals, such as SL25 or vine copula constructions, provide more accurate or more flexible copula models, respectively, our empirical results suggest that a simple Gaussian copula paired with Poisson marginals with the trivial DT can perform quite well in practice. The second class, based on mixture models, can be particularly helpful for handling overdispersion that often occurs

FIGURE 5 | BRCA RNA-Seq dataset (medium counts and medium overdispersion): maximum mean discrepancy (MMD) (top) and Spearman's ρ difference (bottom) with different numbers of variables: 10 (left), 100 (middle), 1000 (right). While mixtures ('Log-Normal' and 'Mixture Poiss') perform well in terms of MMD, the Gaussian copula paired with Poisson marginals ('Copula Poisson') can model dependency structure well, as evidenced by the Spearman metric.


in real count data, with the log-normal Poisson mixture and a finite mixture of independent Poisson distributions being prime examples. In addition, mixture models have closed-form moments and, in the case of a finite mixture, closed-form likelihood calculations, something not generally true for the other classes. The third class, based on Poisson conditionals, can be represented as graphical models, thus providing both compact and visually appealing representations of joint distributions. Conditional models benefit from strong theoretical guarantees about model recovery given certain modeling assumptions. However, checking conditional modeling assumptions may be impossible, and the assumptions may not always be satisfied for real-world count data. From our empirical experiments, we found that (1) mixture models are important for overdispersion when counts are medium or high, (2) the Gaussian copula with Poisson marginals joint distribution can estimate dependency structure for a wide range of data characteristics even when the distribution does not fit the underlying data, and (3) Poisson SQR models perform well for low count values with many zeros (i.e., sparse data) and can handle moderate overdispersion. Overall, in practice, we would recommend comparing the three best-performing methods from each class, namely the Gaussian copula paired with Poisson marginals, the finite mixture of independent Poisson distributions, and the Poisson SQR model. This initial comparison will likely highlight some interesting properties of a given dataset and suggest which class to pursue in more detail.

This review has highlighted several key strengths and weaknesses of the main approaches to constructing multivariate Poisson distributions. Yet, there remain many open questions. For example, what are the marginal distributions of the PGMs that are defined in terms of their conditional distributions? Or conversely, what are the conditional distributions of the copula models that are defined in

FIGURE 6 | Classic3 text dataset (low counts and medium overdispersion): maximum mean discrepancy (top) and Spearman's ρ difference (bottom) with different numbers of variables: 10 (left), 100 (middle), 1000 (right). The Poisson SQR model performs better on this low-count dataset than in previous settings.

WIREs Computational Statistics A review of multivariate distributions for count data

Volume 9, May/June 2017 © 2017 Wiley Per iodica ls , Inc. 21 of 25


terms of their marginal distributions? Can novel models be created at the intersection of these model classes that could combine the strengths of different classes, as suggested in the Discussion section? Could certain model classes be developed in an application area that has been largely dominated by another model class? For example, graphical models are well known in the machine learning literature, while copula models are well known in the financial modeling literature. Overall, multivariate Poisson models are poised to increase in popularity given the wide potential applications for real-world, high-dimensional, count-valued data in text analysis, genomics, spatial statistics, economics, and epidemiology.

NOTES

a The label 'multivariate Poisson' was introduced in the statistics community to refer to the particular model introduced in this section, but other generalizations could also be considered multivariate Poisson distributions.

b This is because if y ∈ ℝ ∼ Normal, then exp(y) ∈ ℝ₊ ∼ LogNormal.

c Besag58 originally named these Poisson auto models, focusing on pairwise graphical models, but here, we consider the general graphical model setting.

d If the marginal distribution on the length is set to be the same as the marginal distribution on length for the PGM, i.e., if ℙ(L) = Σ_{x : ‖x‖₁ = L} ℙ_PGM(x), then the PGM distribution is recovered.

e http://ir.dcs.gla.ac.uk/resources/test_collections/

f https://cran.r-project.org/web/packages/XMRF/index.html

g For the crash-severity dataset, the truncated Poisson graphical model ('Truncated PGM') outperforms the Poisson SQR model under the pairwise MMD metric. After inspection, however, we realized that the Truncated PGM model performed better merely because outlier values were truncated to the 99th percentile as described in the supplementary material. This reduced the overfitting of outlier values caused by the crash-severity dataset's high overdispersion.

h These basic trends are also corroborated by the two datasets in the supplementary material.

i In the discrete case, the dependency structure cannot be perfectly decoupled from the marginal distributions, unlike in the continuous case where the dependency structure and marginals can be perfectly decoupled.

ACKNOWLEDGMENTS

D.I. and P.R. acknowledge the support of ARO via W911NF-12-1-0390, NSF via IIS-1149803, IIS-1447574, DMS-1264033, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences. E.Y. acknowledges the support of MSIP/IITP via ICT R&D program 2016-0-00563. G.A. acknowledges support from NSF DMS-1264058 and NSF DMS-1554821.

REFERENCES

1. Campbell J. The Poisson correlation function. Proc Edinburgh Math Soc 1934, 4:18–26.

2. M'Kendrick A. Applications of mathematics to medical problems. Proc Edinburgh Math Soc 1925, 44:98–130.

3. Teicher H. On the multivariate Poisson distribution. Scand Actuar J 1954, 1954:1–9.

4. Wicksell SD. Some theorems in the theory of probability, with special reference to their importance in the theory of homograde correlation. Svenska Aktuarieföreningen Tidskrift 1916:165–213.

5. Nikoloulopoulos AK. Copula-based models for multivariate discrete response data. In: Jaworski P, Durante F, Härdle W, eds. Copulae in Mathematical and Quantitative Finance. Berlin Heidelberg: Springer-Verlag; 2013.

6. Nikoloulopoulos AK, Karlis D. Modeling multivariate count data using copulas. Commun Stat Simul Comput 2009, 39:172–187.

7. Xue-Kun Song P. Multivariate dispersion models generated from Gaussian copula. Scand J Stat 2000, 27:305–320.

8. Holgate P. Estimation for the bivariate Poisson distribution. Biometrika 1964, 51:241–287.

9. Dwass M, Teicher H. On infinitely divisible random vectors. Ann Math Stat 1957, 28:461–470.

10. Kawamura K. The structure of multivariate Poisson distribution. Kodai Math J 1979, 2:337–345.

11. Srivastava R, Srivastava A. On a characterization of Poisson distribution. J Appl Probab 1970, 7:497–501.

12. Wang Y. Characterizations of certain multivariate distributions. Math Proc Camb Philos Soc 1974, 75:219–234.


13. Johnson NL, Kotz S, Balakrishnan N. Discrete Multivariate Distributions, vol. 165. New York: John Wiley & Sons; 1997.

14. Krishnamoorthy A. Multivariate binomial and Poisson distributions. Sankhyā: Indian J Stat 1951, 11:117–124.

15. Krummenauer F. Limit theorems for multivariate discrete distributions. Metrika 1998, 47:47–69.

16. Mahamunulu D. A note on regression in the multivariate Poisson distribution. J Am Stat Assoc 1967, 62:251–258.

17. Kano K, Kawamura K. On recurrence relations for the probability function of multivariate generalized Poisson distribution. Commun Stat Theory Methods 1991, 20:165–178.

18. Karlis D. An EM algorithm for multivariate Poisson distribution and related models. J Appl Stat 2003, 30:63–77.

19. Loukas S, Kemp C. On computer sampling from trivariate and multivariate discrete distributions. J Stat Comput Simul 1983, 17:113–123.

20. Tsiamyrtzis P, Karlis D. Strategies for efficient computation of multivariate Poisson probabilities. Commun Stat Simul Comput 2004, 33:271–292.

21. Sklar A. Fonctions de répartition à n dimensions et leurs marges. Publ Inst Statist Univ Paris 1959, 8:229–231.

22. Cherubini U, Luciano E, Vecchiato W. Copula Methods in Finance. Hoboken, NJ: John Wiley & Sons; 2004.

23. Genest C, Nešlehová J. A primer on copulas for count data. Astin Bull 2007, 37:475–515.

24. Nikoloulopoulos AK. On the estimation of normal copula discrete regression models using the continuous extension and simulated likelihood. J Stat Plann Inference 2013, 143:1923–1937.

25. Nikoloulopoulos AK. Efficient estimation of high-dimensional multivariate normal copula models with discrete spatial responses. Stoch Environ Res Risk Assess 2016, 30:493–505.

26. Rüschendorf L. Copulas, Sklar's theorem, and distributional transform. In: Mathematical Risk Analysis. Berlin, Heidelberg: Springer-Verlag; 2013, 3–34 (Chapter 1).

27. Demarta S, McNeil AJ. The t copula and related copulas. Int Stat Rev 2005, 73:111–129.

28. Trivedi PK, Zimmer DM. Copula Modeling: An Introduction for Practitioners. Foundations and Trends® in Econometrics, vol. 1. Boston-Delft: now publishers; 2007, 1–111.

29. Aas K, Czado C, Frigessi A, Bakken H. Pair-copula constructions of multiple dependence. Insur: Math Econ 2009, 44:182–198.

30. Bedford T, Cooke RM. Vines: a new graphical model for dependent random variables. Ann Stat 2002, 30:1031–1068.

31. Cook RJ, Lawless JF, Lee K-A. A copula-based mixed Poisson model for bivariate recurrent events under event-dependent censoring. Stat Med 2010, 29:694–707.

32. Yahav I, Shmueli G. On generating multivariate Poisson data in management science applications. Appl Stoch Models Bus Ind 2012, 28:91–102.

33. Sklar A. Random variables, joint distribution functions, and copulas. Kybernetika 1973, 9:449–460.

34. Karlis D. Models for multivariate count time series. In: Davis RA, Holan SH, Lund R, Ravishanker N, eds. Handbook of Discrete-Valued Time Series. Boca Raton, FL: CRC Press; 2016, 407–424 (Chapter 19).

35. Joe H, Xu JJ. The estimation method of inference functions for margins for multivariate models. Technical Report 166, The University of British Columbia, Vancouver, Canada, 1996.

36. Denuit M, Lambert P. Constraints on concordance measures in bivariate discrete data. J Multivariate Anal 2005, 93:40–57.

37. Heinen A, Rengifo E. Multivariate autoregressive modeling of time series count data using copulas. J Empir Finance 2007, 14:564–583.

38. Heinen A, Rengifo E. Multivariate reduced rank regression in non-Gaussian contexts, using copulas. Comput Stat Data Anal 2008, 52:2931–2944.

39. Madsen L. Maximum likelihood estimation of regression parameters with spatially dependent discrete data. J Agric Biol Environ Stat 2009, 14:375–391.

40. Madsen L, Fang Y. Joint regression analysis for discrete longitudinal data. Biometrics 2011, 67:1171–1175.

41. Kazianka H. Approximate copula-based estimation and prediction of discrete spatial data. Stoch Environ Res Risk Assess 2013, 27:2015–2026.

42. Kazianka H, Pilz J. Copula-based geostatistical modeling of continuous and discrete data including covariates. Stoch Environ Res Risk Assess 2010, 24:661–673.

43. Genz A, Bretz F. Computation of Multivariate Normal and t Probabilities, vol. 195. Dordrecht Heidelberg London New York: Springer; 2009.

44. Panagiotelis A, Czado C, Joe H. Pair copula constructions for multivariate discrete data. J Am Stat Assoc 2012, 107:1063–1072.

45. Czado C, Brechmann EC, Gruber L. Selection of vine copulas. In: Copulae in Mathematical and Quantitative Finance. Berlin Heidelberg: Springer; 2013, 17–37.

46. Karlis D, Xekalaki E. Mixed Poisson distributions. Int Stat Rev 2005, 73:35–58.


47. Arbous AG, Kerrich J. Accident statistics and the concept of accident-proneness. Biometrics 1951, 7:340–432.

48. Steyn H. On the multivariate Poisson normal distribution. J Am Stat Assoc 1976, 71:233–236.

49. Aitchison J, Ho C. The multivariate Poisson-log normal distribution. Biometrika 1989, 76:643–653.

50. Aguero-Valverde J, Jovanis PP. Bayesian multivariate Poisson lognormal models for crash severity modeling and site ranking. Transp Res Rec: J Transp Res Board 2009, 2136:82–91.

51. Chib S, Winkelmann R. Markov chain Monte Carlo analysis of correlated count data. J Bus Econ Stat 2001, 19:428–435.

52. El-Basyouny K, Sayed T. Collision prediction models using multivariate Poisson-lognormal regression. Accid Anal Prev 2009, 41:820–828.

53. Ma J, Kockelman KM, Damien P. A multivariate Poisson-lognormal regression model for prediction of crash counts by severity, using Bayesian methods. Accid Anal Prev 2008, 40:964–975.

54. Park E, Lord D. Multivariate Poisson-lognormal models for jointly modeling crash frequency by severity. Transp Res Rec: J Transp Res Board 2007, 2019:1–6.

55. Zhan X, Abdul Aziz HM, Ukkusuri SV. An efficient parallel sampling technique for multivariate Poisson-lognormal model: analysis with two crash count datasets. Anal Methods Accid Res 2015, 8:45–60.

56. Karlis D, Meligkotsidou L. Finite mixtures of multivariate Poisson distributions with application. J Stat Plann Inference 2007, 137:1942–1960.

57. Bradley JR, Holan SH, Wikle CK. Computationally efficient distribution theory for Bayesian inference of high-dimensional dependent count-valued data. arXiv preprint arXiv:1512.07273, 2015.

58. Besag J. Spatial interaction and the statistical analysis of lattice systems. J R Stat Soc Ser B Stat Methodol 1974, 36:192–236.

59. Inouye DI, Ravikumar P, Dhillon IS. Fixed-length Poisson MRF: adding dependencies to the multinomial. In: Neural Information Processing Systems, 28, 2015.

60. Inouye DI, Ravikumar P, Dhillon IS. Square root graphical models: multivariate generalizations of univariate exponential families that permit positive dependencies. In: International Conference on Machine Learning, 2016.

61. Yang E, Ravikumar P, Allen G, Liu Z. Graphical models via generalized linear models. In: Neural Information Processing Systems, 25, 2012.

62. Yang E, Ravikumar P, Allen G, Liu Z. On Poisson graphical models. In: Neural Information Processing Systems, 26, 2013.

63. Yang E, Ravikumar P, Allen G, Liu Z. Graphical models via univariate exponential family distributions. J Mach Learn Res 2015, 16:3813–3847.

64. Lauritzen SL. Graphical Models. Oxford: Oxford University Press; 1996.

65. Clifford P. Markov random fields in statistics. In: Disorder in Physical Systems. Oxford: Oxford Science Publications; 1990.

66. Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. Found Trends Mach Learn 2008, 1:1–305.

67. Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: MIT Press; 2009.

68. Yang E, Ravikumar P, Allen G, Liu Z. Conditional random fields via univariate exponential families. In: Neural Information Processing Systems, 26, 2013.

69. Kaiser MS, Cressie N. Modeling Poisson variables with positive spatial dependence. Stat Probab Lett 1997, 35:423–432.

70. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Ann Stat 2006, 34:1436–1462.

71. Allen GI, Liu Z. A log-linear graphical model for inferring genetic networks from high-throughput sequencing data. In: 2012 IEEE International Conference on Bioinformatics and Biomedicine, IEEE, 2012, 1–6.

72. Allen GI, Liu Z. A local Poisson graphical model for inferring networks from sequencing data. IEEE Trans NanoBiosci 2013, 12:189–198.

73. Hadiji F, Molina A, Natarajan S, Kersting K. Poisson dependency networks: gradient boosted models for multivariate count data. Mach Learn 2015, 100:477–507.

74. Han SW, Zhong H. Estimation of sparse directed acyclic graphs for multivariate counts data. Biometrics 2016, 72:791–803.

75. Altham PME, Hankin RKS. Multivariate generalizations of the multiplicative binomial distribution: introducing the MM package. J Stat Softw 2012, 46:1–23. doi:10.18637/jss.v046.i12.

76. Altham PME. Two generalizations of the binomial distribution. J R Stat Soc Ser C Appl Stat 1978, 27:162–167.

77. Neal RM. Annealed importance sampling. Stat Comput 2001, 11:125–139.

78. Milton JC, Shankar VN, Mannering FL. Highway accident severities and the mixed logit model: an exploratory empirical analysis. Accid Anal Prev 2008, 40:260–266.

79. Wan Y-W, Allen GI, Liu Z. TCGA2STAT: simple TCGA data access for integrated statistical analysis in R. Bioinformatics 2016, 32:952–954.

80. Gretton A. A kernel two-sample test. J Mach Learn Res 2012, 13:723–773.

81. Zhao J, Meng D. FastMMD: ensemble of circular discrepancy for efficient two-sample test. Neural Comput 2015, 27:1345–1372.


82. Inouye DI, Ravikumar P, Dhillon IS. Admixtures of Poisson MRFs: a topic model with word dependencies. In: International Conference on Machine Learning, vol 31, 2014.

83. Inouye DI, Ravikumar P, Dhillon IS. Generalized root models: beyond pairwise graphical models for univariate exponential families. arXiv preprint arXiv:1606.00813, 2016.

84. Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High dimensional semiparametric Gaussian copula graphical models. Ann Stat 2012, 40:34. doi:10.1214/12-AOS1037.
