PHD PROGRAMME IN MATHEMATICS AND APPLICATIONS

LOG-LINEAR MODELS
AND TORIC IDEALS

Fabio Rapallo

Thesis advisor: Prof. GIOVANNI PISTONE (Politecnico di Torino)
PhD Coordinator: Prof. CLAUDIO PEDRINI (Università di Genova)

XV CYCLE
Administrative seat: Università di Genova
Partner institutions: Università di Torino, Politecnico di Torino

Thesis submitted for the degree of
Doctor of Research in Mathematics and Applications, April 2003
Contents

Introduction

1 Statistical and algebraic background
  1.1 Log-linear models
    1.1.1 Definitions and basic properties
    1.1.2 Maximum likelihood estimates
    1.1.3 Goodness of fit tests
    1.1.4 An important remark
    1.1.5 Exact methods
  1.2 Algebraic theory of toric ideals
    1.2.1 Definitions and first properties
    1.2.2 Computation of toric ideals. First method
    1.2.3 Navigating inside the set of solutions of an integer linear system
    1.2.4 Computation of toric ideals. Second method

2 The Diaconis-Sturmfels algorithm
  2.1 Hypergeometric distribution
  2.2 The algorithm
  2.3 Rates of convergence
  2.4 Two examples

3 Two-way contingency tables
  3.1 Two-way contingency tables: Independence
  3.2 Some models for incomplete and square tables
    3.2.1 Incomplete tables
    3.2.2 The quasi-independence model
    3.2.3 The symmetry model
    3.2.4 The quasi-symmetry model
    3.2.5 Computational notes
  3.3 Examples
  3.4 An application. The h-sample problem

4 Rater agreement models
  4.1 A medical problem
  4.2 Multidimensional rater agreement models
  4.3 Exact inference
  4.4 Conditioning
  4.5 Computations
  4.6 Example

5 Sufficiency and estimation
  5.1 Introduction
  5.2 Sufficiency and sampling
  5.3 Geometry of toric models
  5.4 Algebraic representation of the sufficient statistics
  5.5 Minimum-variance estimator
  5.6 Algebraic method

A Programs

Bibliography
Introduction
It is useful to start this work with two passages from a recent survey paper
by Agresti (2001), which appeared in “Statistics in Medicine”.
“For the null hypothesis of independence and conditional independence,
the relevant conditional distribution is simple, which is why ordinary Monte
Carlo applies so easily. This is not true of more complex hypotheses, such
as quasi-independence, quasi-symmetry, and others for square and triangular
contingency tables that were until recently given little attention in the exact
literature.”
“If statisticians cannot agree how to analyze even a 2 × 2 table, with no
one approach being obviously best, what hope is there for a consensus on more
complex analyses? The probability is probably decreasing to 0 with time, . . .”
These two passages concisely convey how much statisticians need simple
and comprehensive procedures for analyzing complex problems involving
contingency tables.
In this thesis we use techniques from Computational Commutative Algebra
to solve complex problems in estimation and hypothesis testing for discrete
data.
The application of Computational Commutative Algebra in statistics and
probability is a recent topic and a first book in the field is Pistone et al. (2001a).
In this thesis, starting from the work by Diaconis & Sturmfels (1998), we
study the log-linear models through polynomial algebra, with important ap-
plications to statistical problems, such as goodness-of-fit tests for contingency
tables. In particular, we apply the algebraic theory of toric ideals to study
a Markov Chain Monte Carlo method for sampling from conditional distribu-
tions, through the notion of Markov basis. Throughout the thesis we will con-
sider a wide range of statistical problems: Models for two-way and multi-way
contingency tables; rater agreement models; contingency tables with structural
zeros. In some cases we will characterize the relevant Markov basis, while in
other cases we will give results which lead to a fundamental simplification of
the symbolic computations. A number of examples will show the relevance,
the efficiency and the versatility of these algebraic procedures.
The relevance of the topics considered here is confirmed by the increasing
number of workshops, conferences and lectures devoted to this subject. More-
over, the number of people interested in the field is growing rapidly, and
includes, among others, Professor Bernd Sturmfels and Professor Stephen
Fienberg.
In Chapter 1, we introduce some background material about the log-linear
models and the toric ideals, which have a key role in the analysis of contin-
gency tables. In Chapter 2, we present a slight generalization of the Diaconis-
Sturmfels algorithm, while in Chapter 3 we analyze the case of two-way con-
tingency tables, with some results for the characterization of the relevant toric
ideals and some other results useful to simplify their computation. In Chapter
4 we study an application of the previous theory to rater agreement models,
which are widely used in biomedical and pharmaceutical research. We also
give results for the characterization of toric ideals for multi-way contingency
tables. Finally, in Chapter 5 we study the algebraic properties of the notions of
sufficiency and maximum likelihood estimation in statistical models on finite
sample space, with results about the computation of the sufficient statistic
of the model and its geometric representation. The programs, both symbolic
and numerical, presented in Appendix A show the practical applicability of
the algorithms, together with the examples on real data sets presented
throughout the chapters.
The results presented in Chapters 3, 4 and 5 represent original work, as
well as the generalization of the algorithm presented in Chapter 2 and the
programs in Appendix A.
Parts of the thesis have been published in preliminary versions. In particu-
lar, part of Chapter 2 is in press in the “Scandinavian Journal of Statistics”,
while parts of Chapters 4 and 5 are the subject of a preprint of the
Department of Mathematics of Genova. Two papers are currently submitted
to international journals.
The topics presented here have been the subject of several presentations
at conferences and international workshops. Among others, we mention
the “Grostat” series and the Meetings of the Italian Statistical Society.
Moreover, the tutorial for the course “Statistical Toric Models”, at the “First
International School on Algebraic Statistics”, has been based on the work
presented in this thesis.
Finally, the theory and the applications developed in Chapter 4 have been
the starting point for an ongoing collaboration with the pharmaceutical
company “Schering” in Berlin.
Acknowledgements
During the writing of this work, over the last two years, I have benefited
from the help of many people, both scientifically and morally.
In alphabetical order, a minimal list is formed by Daniele De Martini (Uni-
versity of “Piemonte Orientale”, Novara), Franco Fagnola (University of Gen-
ova), Mauro Gasparini (Politecnico di Torino), my advisor Giovanni Pistone
(Politecnico di Torino), Eva Riccomagno (University of Warwick), Lorenzo
Robbiano (University of Genova), Maria Piera Rogantin (University of Gen-
ova), Ivano Repetto (University of Genova), Richard Vonk (Schering, Berlin).
Outside academia, I want to acknowledge the constant support of Matteo
Cella, my cousin, and Andrea Giacobbe. Their support has been essential
in the most critical moments.
Chapter 1
Statistical and algebraic background
In this chapter we introduce the main background theory we need, both from
the statistical and the algebraic point of view. In Section 1.1 we recall the
properties of log-linear models for the analysis of qualitative data, with a
particular attention on the maximum likelihood estimation and the goodness
of fit tests, using the classical asymptotic approach. Moreover, we emphasize
how simple algebraic objects, such as power products and binomials, naturally
appear in the classical analysis of log-linear models. Finally, in Section 1.1.5,
we review the main exact methods for hypothesis testing for contingency tables.
Exact methods date back to the work of Fisher in the forties, when an exact
test for independence in 2×2 contingency tables was introduced. A wide range
of exact procedures for inference on contingency tables is now available, and
in most cases the increased speed and power of modern computers has played
a fundamental role in making such procedures practically applicable.
We describe here the main methods for exact analysis of contingency tables.
In Section 1.2, we introduce the algebraic theory of toric ideals, with spe-
cial attention to the statistical meaning of the algebraic procedures. A brief
review of some relevant concepts from Computational Commutative Algebra
is presented in order to make the exposition as self-contained as possible.
In particular, we deal with the computation of toric ideals, which will
represent a key step in the algorithms presented in the next chapters.
In this chapter, the proofs of the results are omitted, with references to
the literature, in particular Haberman (1974), Bishop et al. (1975) and Agresti
(2002) for the theory of log-linear models, and Kreuzer & Robbiano (2000)
and Bigatti & Robbiano (2001) for the theory of toric ideals.
1.1 Log-linear models
1.1.1 Definitions and basic properties
The material presented in this section is mainly taken from classical works in
the field of the log-linear models. The aim of this section is to show that in the
classical representation of the log-linear models, i.e. using vector space theory,
polynomial equations play a fundamental role. Some simple examples of this
section will be used many times throughout the thesis. The main references
are Haberman (1974), Bishop et al. (1975), Agresti (1996), Fienberg (1980)
and Dobson (1990). In particular we follow the notation in Haberman (1974),
as it allows us to give a unique presentation of the theory of log-linear models
in a general setting. Some examples and useful remarks are taken from the
other references.
Log-linear models are statistical models for the analysis of qualitative data;
hence the sample space is finite and the data are usually presented in a
contingency table.
Definition 1.1 A contingency table is a collection f of frequencies (f(x) ∈ N : x ∈ X), where X is the sample space.
A contingency table contains the joint observations of d random variables
on N subjects. A first classification of contingency tables can be made looking
at the sample space X . In particular, the contingency table is said to be a d-way
contingency table if f is the cross-classification of the subjects with respect to
d random variables. Consequently, the sample space X is a finite subset of Nd,
possibly with a suitable coding of the levels of the random variables. Moreover,
the contingency table f is complete if the sample space is a Cartesian product
of one-dimensional sample spaces and is incomplete otherwise.
Log-linear models are used to describe the stochastic structure of the joint
probability distribution underlying the sample, i.e. to determine stochastic re-
lations between the observed variables, such as independence, non-interaction,
symmetry and others more complex relations.
If we denote the cardinality of the sample space X by q, the table f is
an element of the q-dimensional space Rq, with the standard inner product
defined by
\[
\langle f_1, f_2 \rangle = \sum_{x \in \mathcal{X}} f_1(x) \, f_2(x)
\]
and the canonical basis (e_x : x ∈ X).
In a log-linear model, f is assumed to be the realization of a random vector
F with mean m = (m(x) : x ∈ X ), where m(x) > 0 for all x ∈ X . Here
and throughout we denote the observed tables with lowercase letters and the
random vectors with the corresponding uppercase letters. Thus for each x ∈ X ,
log m(x) is well defined. Here we assume that the probabilities of the sample
points are positive, and that the means m(x) > 0 are a parameterization of
the model. Therefore, if µ = (log m(x) : x ∈ X ), then µ ∈ Rq. In a log-linear
model it is assumed that µ ∈ M , where M is a p-dimensional linear manifold
contained in Rq. When q = p, we say that the corresponding log-linear model
is saturated.
Example 1.2 Consider a 2-way contingency table with two levels for any
random variable, i.e. a 2 × 2 contingency table. The space R4 is spanned by
these four orthogonal vectors:
\[
v_1 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad
v_2 = \begin{pmatrix} 1 & 1 \\ -1 & -1 \end{pmatrix}, \quad
v_3 = \begin{pmatrix} 1 & -1 \\ 1 & -1 \end{pmatrix}, \quad
v_4 = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}.
\]
We can consider a model M such that M = span(v1, v2, v3). In this case,
Bishop et al. (1975) showed that the model can be written in the form
\[
\log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j \tag{1.1}
\]
for i, j = 1, 2, with the constraints
\[
\lambda^X_1 + \lambda^X_2 = \lambda^Y_1 + \lambda^Y_2 = 0 . \tag{1.2}
\]
If we denote by mi+ and m+j the marginal totals, one can write
\[
\begin{aligned}
m_{i+} &= e^{\lambda} e^{\lambda^X_i} \left( e^{\lambda^Y_1} + e^{\lambda^Y_2} \right) , \\
m_{+j} &= e^{\lambda} e^{\lambda^Y_j} \left( e^{\lambda^X_1} + e^{\lambda^X_2} \right) , \\
N &= e^{\lambda} \left( e^{\lambda^X_1} + e^{\lambda^X_2} \right) \left( e^{\lambda^Y_1} + e^{\lambda^Y_2} \right) .
\end{aligned}
\]
Hence we obtain
\[
m(i,j) = \frac{m_{i+} \, m_{+j}}{N} \tag{1.3}
\]
where m_{i+} and m_{+j} are the marginal totals. Moreover, since v4 is orthogonal
to M, we obtain the relation
\[
\log \frac{m(1,1) \, m(2,2)}{m(1,2) \, m(2,1)} = 0 \tag{1.4}
\]
from which we obtain
\[
m(1,1) \, m(2,2) - m(1,2) \, m(2,1) = 0 . \tag{1.5}
\]
Equation (1.5) is an algebraic relation between the cell means and, in
particular, its left-hand side is a binomial. We will see later the relevance of
such binomials in our framework.
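As a numerical check of Example 1.2, the following sketch computes the fitted means m_{i+} m_{+j}/N of Equation (1.3) and verifies that they satisfy the binomial relation (1.5). The table is illustrative, not taken from the thesis.

```python
# Illustrative 2x2 table (not from the thesis)
f = [[10.0, 20.0],
     [30.0, 40.0]]

N = sum(sum(row) for row in f)
row_tot = [sum(row) for row in f]                  # m_{i+}
col_tot = [f[0][j] + f[1][j] for j in range(2)]    # m_{+j}

# Fitted means under the independence model, Equation (1.3)
m_hat = [[row_tot[i] * col_tot[j] / N for j in range(2)] for i in range(2)]

# Binomial relation (1.5): m(1,1) m(2,2) - m(1,2) m(2,1) = 0
residual = m_hat[0][0] * m_hat[1][1] - m_hat[0][1] * m_hat[1][0]
print(m_hat)      # [[12.0, 18.0], [28.0, 42.0]]
print(residual)   # 0.0
```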
A detailed representation of the vector space bases of log-linear models is
presented in Collombier (1980), with special attention to the meaning of such
representation in terms of hypothesis testing.
For the statistical inference, it is now essential to specify the underlying
probability distribution of the table f . The main models in the literature are
the Poisson model and the multinomial model.
The Poisson model
In the Poisson model, the elements of F are independent Poisson random
variables with E[F (x)] = m(x) for all x ∈ X and m(x) > 0 for all x ∈ X . The
log-likelihood is
\[
\log L(f, \mu) = \sum_{x \in \mathcal{X}} \left( f(x) \log m(x) - m(x) \right)
= \langle f, \mu \rangle - \sum_{x \in \mathcal{X}} \exp(\mu(x)) .
\]
If PM is the orthogonal projection from Rq to M , then since µ ∈ M and PM
is a symmetric operator, we have
\[
\log L(f, \mu) = \langle P_M f, \mu \rangle - \sum_{x \in \mathcal{X}} \exp(\mu(x)) . \tag{1.6}
\]
Therefore, the family of Poisson models such that µ ∈ M is an exponential fam-
ily. Following classical results about exponential families, the projection PMf
is a complete minimal sufficient statistic, see for example Lehmann (1983).
Moreover, any nonsingular linear transformation of PMf is again a complete
minimal sufficient statistic. Traditionally, Equation (1.6) is presented as the
following theorem.
Theorem 1.3 Under the Poisson model, the sufficient statistic is expressible
as linear combination of the cell counts (f(x) : x ∈ X ).
For the proof, see Agresti (2002), page 136.
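Theorem 1.3 can be made concrete for the 2 × 2 independence model: a standard nonsingular transformation of P_M f is the vector of row and column totals, a linear function A f of the cell counts. The following minimal sketch uses an illustrative table; the specific matrix A is our choice, for illustration only.

```python
# Cells ordered (1,1), (1,2), (2,1), (2,2); illustrative counts
f = [10, 20, 30, 40]

# Rows of A: indicators of "row 1", "row 2", "column 1", "column 2"
A = [
    [1, 1, 0, 0],   # f(1,1) + f(1,2) = f_{1+}
    [0, 0, 1, 1],   # f_{2+}
    [1, 0, 1, 0],   # f_{+1}
    [0, 1, 0, 1],   # f_{+2}
]

# The sufficient statistic as a linear combination of the cell counts
t = [sum(a * x for a, x in zip(row, f)) for row in A]
print(t)   # [30, 70, 40, 60] -- the margins
```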
The random vector F has mean m and variance-covariance matrix D(µ),
where D(µ) is a diagonal matrix with the variances m(x) on the main diagonal
and zeros elsewhere.
We denote convergence in probability and convergence in distribution by
the symbols →p and →d, respectively. Moreover, the symbol N denotes the
Gaussian distribution. The first asymptotic results are summarized in the
following proposition.
Proposition 1.4 If (F^(N))_{N>0} is a sequence of random vectors from a Poisson
model with respective means (m^(N))_{N>0} and if N^{-1} m^(N) → m, then
\[
N^{-1} F^{(N)} \xrightarrow{p} m \tag{1.7}
\]
and
\[
N^{-1/2} \left( F^{(N)} - m^{(N)} \right) \xrightarrow{d} \mathcal{N}\left( 0, D(\mu) \right) . \tag{1.8}
\]
For the proof, see Haberman (1974), page 7.
The multinomial model
In the multinomial model, the vector F consists of one or more independent
multinomial random vectors (F(x) : x ∈ Xk) with means (m(x) : x ∈ Xk). The
sets Xk, k = 1, . . . , r, are disjoint, their union is X, and m(x) > 0 for all x ∈ X.
We suppose that the model M is such that, for each k = 1, . . . , r,
\[
\nu^{(k)} = \left( I_{\mathcal{X}_k}(x) : x \in \mathcal{X} \right) \in M ,
\]
where I_{Xk} is the indicator function of Xk.
For the determination of the complete minimal sufficient statistic, we consider
the decomposition of M into M1 and M2, where M1 is the linear manifold
generated by (ν(k) : k = 1, . . . , r) and M2 is its orthogonal complement in M.
We denote the sample size for (f(x) : x ∈ Xk) by Nk. The log-likelihood is
\[
\log L(f, \mu) = \langle P_{M_2} f, P_{M_2} \mu \rangle
- \sum_{k=1}^{r} N_k \log \left( \langle m(P_{M_2} \mu), \nu^{(k)} \rangle \right)
+ \sum_{k=1}^{r} \log N_k ! . \tag{1.9}
\]
Thus, PM2f is a complete minimal sufficient statistic. As any nonsingular
linear transformation of PM2f is again a complete minimal sufficient statistic,
we can use the statistic PMf as in the Poisson model. The random vector F
has mean m and variance-covariance matrix
\[
\Sigma(\mu) = D(\mu) \left( I - P^*_{M_1}(\mu) \right)
\]
where P*_{M1} is the orthogonal projection from Rq onto M1 and I is the identity
matrix. We summarize in the following proposition the basic convergence
results.
Proposition 1.5 If (F^(N))_{N>0} is a sequence of random vectors from a multi-
nomial model with respective means (m^(N))_{N>0} and if N^{-1} m^(N) → m, then
\[
N^{-1} F^{(N)} \xrightarrow{p} m \tag{1.10}
\]
and
\[
N^{-1/2} \left( F^{(N)} - m^{(N)} \right) \xrightarrow{d}
\mathcal{N}\left( 0, D(\mu) \left( I - P^*_{M_1}(\mu) \right) \right) . \tag{1.11}
\]
For the proof, see Haberman (1974), page 13.
The close connection between the Poisson model and the multinomial model
can be found in Haberman (1974), where the author proves that many results
about maximum likelihood estimation and hypothesis testing are the same
under both the Poisson model and the multinomial model.
Other models can be considered, such as the product-multinomial model
and the conditional Poisson model. See Haberman (1974) and Agresti (2002)
for details. Moreover, the meanings of the different models from an applied
point of view are widely discussed in Chapters 2 and 3 of Fienberg (1980).
1.1.2 Maximum likelihood estimates
We start by considering the problem of maximum likelihood estimation for the
Poisson model. The maximum likelihood estimate µ̂ of µ is the value such that
\[
\log L(f, \hat{\mu}) = \sup_{\mu \in M} \log L(f, \mu) . \tag{1.12}
\]
We briefly recall the basic theorems about the existence and the uniqueness
of the maximum likelihood estimate.
Theorem 1.6 If a maximum likelihood estimate µ̂ exists, then it is unique
and satisfies the equation
\[
P_M \hat{m} = P_M f . \tag{1.13}
\]
Conversely, if for some µ̂ ∈ M and m̂ = exp(µ̂) Equation (1.13) is satisfied,
then µ̂ is the maximum likelihood estimate of µ.
Theorem 1.7 A necessary and sufficient condition for the maximum likelihood
estimate µ̂ of µ to exist is that there exists δ ∈ M⊥ such that f(x) + δ(x) > 0
for all x ∈ X.

Corollary 1.8 If f(x) > 0 for all x ∈ X, then the maximum likelihood esti-
mate µ̂ exists.

Theorem 1.9 A necessary and sufficient condition for the maximum likelihood
estimate µ̂ to exist is that there is no µ ∈ M such that µ ≠ 0, µ(x) ≤ 0 for
all x ∈ X and ⟨f, µ⟩ = 0.
For the multinomial model, the relevant theorem is the following, which
states that the maximum likelihood estimate is the same under the Poisson
model and the multinomial model.
Theorem 1.10 If µ̂(m) is the maximum likelihood estimate for a multinomial
model with µ ∈ M and if µ̂ is the corresponding maximum likelihood estimate
for a Poisson model with µ ∈ M, then µ̂(m) = µ̂, in the sense that when one
side of the equation exists, then the other side exists and the two sides are
equal.
For the proof of all the above theorems, see Haberman (1974), pages 35-41.
In view of this theorem, the necessary and sufficient conditions stated for
the Poisson model also apply to the multinomial model.
In many cases, the maximum likelihood estimate cannot be computed
directly and we resort to numerical approximations. Among the numerical
methods, the most important ones in this field are the Newton-Raphson al-
gorithm and the Iterative Proportional Fitting algorithm. Details on such
methods can be found in Haberman (1974) and Agresti (2002). We will come
back later to the numerical computation of the maximum likelihood estimate.
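As an illustration, here is a minimal sketch of Iterative Proportional Fitting for the two-way independence model: starting from a table of ones, the algorithm alternately rescales the current table to match the observed row and column totals. For this particular model IPF converges after one full cycle to the closed-form estimate m̂(i,j) = f_{i+} f_{+j}/N; the data and the fixed number of cycles are illustrative choices, not taken from the thesis.

```python
def ipf_independence(f, n_iter=10):
    """Fit the two-way independence model by IPF (illustrative sketch)."""
    I, J = len(f), len(f[0])
    row_tot = [sum(f[i]) for i in range(I)]
    col_tot = [sum(f[i][j] for i in range(I)) for j in range(J)]
    m = [[1.0] * J for _ in range(I)]          # starting table of ones
    for _ in range(n_iter):
        # rescale to match the row margins
        for i in range(I):
            s = sum(m[i])
            m[i] = [m[i][j] * row_tot[i] / s for j in range(J)]
        # rescale to match the column margins
        for j in range(J):
            s = sum(m[i][j] for i in range(I))
            for i in range(I):
                m[i][j] *= col_tot[j] / s
    return m

f = [[3, 7, 10], [5, 5, 10]]       # illustrative 2x3 table
m_hat = ipf_independence(f)
print(m_hat)                        # equals row_tot[i] * col_tot[j] / N
```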
Now, we analyze the asymptotic behavior of the maximum likelihood esti-
mate.
Suppose that (F^(N))_{N>0} is a sequence of random vectors with means
(m(N))N>0. We denote by (µ(N))N>0 the sequence of the logarithms of the
means. Using the same notation as above, the main asymptotic results are
summarized in the following theorem.
Theorem 1.11 Let m* = lim_{N→+∞} N^{-1} m^(N), assume m* > 0, and let
µ* = log m*. The following relations hold as N → +∞:
\[
\hat{\mu}^{(N)} - \mu^{(N)} \xrightarrow{p} 0 \tag{1.14}
\]
\[
N^{-1} \hat{m}^{(N)} \xrightarrow{p} m^* \tag{1.15}
\]
\[
N^{1/2} \left( \hat{\mu}^{(N)} - \mu^{(N)} \right) \xrightarrow{d}
\mathcal{N}\left( 0, \left( P^*_M(\mu^*) - P^*_{M_1}(\mu^*) \right) D^{-1}(\mu^*) \right) \tag{1.16}
\]
\[
N^{1/2} \left( \hat{m}^{(N)} - m^{(N)} \right) \xrightarrow{d}
\mathcal{N}\left( 0, D(\mu^*) \left( P^*_M(\mu^*) - P^*_{M_1}(\mu^*) \right) \right) \tag{1.17}
\]
For the proof, see Agresti (2002), page 584.
The previous result, and in particular Equations (1.16) and (1.17), allows
us to compute asymptotic confidence intervals for any linear combination of
the components of µ and m, and simple hypothesis tests for such quantities,
applying the theory described in Lehmann (1983) and Lehmann (1987).
Remark 1.12 For historical reasons, log-linear models are usually expressed
in terms of the expected cell counts rather than the cell probabilities. However,
the representation of our models in terms of the cell probabilities follows by
noting that in our notation f is the vector of the observed frequencies, while
N^{-1} f is the vector of empirical probabilities. Thus, m^(N) represents the
vector of the expected frequencies, and N^{-1} m^(N) represents the vector of
cell probabilities, as does m*. At times in the thesis, we will refer to the
vector of the cell probabilities as p = (p(x) : x ∈ X).
1.1.3 Goodness of fit tests
Hypothesis tests for log-linear models are generally based on test statistics
which have an asymptotic chi-squared distribution under the null hypothesis.
We consider here tests of the form H0 : µ ∈ M0 ⊂ M1 against the alternative
hypothesis H1 : µ ∈ M1. Such tests are commonly called goodness of fit tests, in
the sense that they test if a smaller model M0 can replace a bigger model
M1 without a significant loss of information. Moreover, in many situations
these tests give some information about the stochastic relations between the
variables. The goodness of fit tests are based on the Pearson test statistic
\[
C(\hat{\mu}^{(1)}, \hat{\mu}^{(0)}) = \sum_{x \in \mathcal{X}}
\frac{\left( \hat{m}^{(1)}(x) - \hat{m}^{(0)}(x) \right)^2}{\hat{m}^{(0)}(x)} \tag{1.18}
\]
or on the log-likelihood ratio test statistic
\[
\Delta(\hat{\mu}^{(1)}, \hat{\mu}^{(0)}) = \log L(f, \hat{\mu}^{(0)}) - \log L(f, \hat{\mu}^{(1)}) . \tag{1.19}
\]
Suppose that M0 has dimension p0 and M1 has dimension p1. The relevant
result about hypothesis testing is the following.
Theorem 1.13 Consider a sequence of null hypotheses H0 : µ^(N) ∈ M0 and
alternatives H1 : µ^(N) ∈ M1. Suppose that µ̂^(N,0) and µ̂^(N,1) are the maximum
likelihood estimates of µ^(N) under H0 and H1, respectively. Then under H0,
the statistics −2∆ and C are asymptotically equivalent and
\[
\lim_{N \to +\infty} P\left[ -2 \Delta(\hat{\mu}^{(N,1)}, \hat{\mu}^{(N,0)}) > c^*(\alpha) \right]
= \lim_{N \to +\infty} P\left[ C(\hat{\mu}^{(N,1)}, \hat{\mu}^{(N,0)}) > c^*(\alpha) \right]
= \alpha , \tag{1.20}
\]
where c*(α) is the upper α-quantile of the chi-square distribution with p1 − p0
degrees of freedom.
For the proof, see Bishop et al. (1975), page 514.
If M1 is the saturated model, then m̂^(N,1) = f and the test statistics in
Equations (1.18) and (1.19) take the familiar forms
\[
C(f, \hat{\mu}^{(0)}) = \sum_{x \in \mathcal{X}}
\frac{\left( f(x) - \hat{m}^{(0)}(x) \right)^2}{\hat{m}^{(0)}(x)} \tag{1.21}
\]
and
\[
\Delta(f, \hat{\mu}^{(0)}) = \sum_{x \in \mathcal{X}} f(x) \log \frac{\hat{m}^{(0)}(x)}{f(x)} . \tag{1.22}
\]
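These statistics are straightforward to compute. The sketch below evaluates the Pearson statistic C of (1.21) and the familiar likelihood-ratio statistic −2∆ = 2 Σ f(x) log(f(x)/m̂(x)) for the independence model on an illustrative 2 × 3 table (all counts positive, so the logarithms are well defined).

```python
import math

f = [[10, 20, 30], [20, 20, 20]]    # illustrative 2x3 table
I, J = 2, 3
N = sum(map(sum, f))
row = [sum(f[i]) for i in range(I)]
col = [sum(f[i][j] for i in range(I)) for j in range(J)]

# Fitted means under the independence null model
m0 = [[row[i] * col[j] / N for j in range(J)] for i in range(I)]

# Pearson statistic (1.21)
C = sum((f[i][j] - m0[i][j]) ** 2 / m0[i][j]
        for i in range(I) for j in range(J))

# Likelihood-ratio statistic -2*Delta, often denoted G^2
G2 = 2 * sum(f[i][j] * math.log(f[i][j] / m0[i][j])
             for i in range(I) for j in range(J))

df = (I - 1) * (J - 1)   # p1 - p0 degrees of freedom for Theorem 1.13
print(C, G2, df)
```

Both statistics are then compared with the chi-square distribution with df degrees of freedom, as stated in Theorem 1.13.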
A detailed discussion on these test statistics can be found in Dobson (1990)
or Haberman (1978). The small sample properties of the asymptotic tests are
presented in Fienberg (1980), Appendix IV. We will consider the small sample
properties in detail later in the thesis.
Remark 1.14 In applications, the goodness of fit tests are performed also
in those cases where the maximum likelihood estimate does not exist. In such
cases one can use the notion of extended maximum likelihood estimate; see
Appendix B of Haberman (1974) for details.
1.1.4 An important remark
As we have seen in Example 1.2, the vector space theory leads to a canonical
representation of the log-linear models in the form
\[
\log m = A \Lambda , \tag{1.23}
\]
where A is the model matrix and Λ is the vector of the model parameters.
In many cases, see for example Haberman (1978), the matrix A has integer
non-negative entries and the model can be written in the form
\[
\log m(i) = \sum_{h=1}^{s} A(h, i) \, \lambda(h) . \tag{1.24}
\]
This is the classical representation of the log-linear models and it allows a
simple interpretation of the statistical relations among the variables, see Fien-
berg (1980), Fingleton (1984) and Agresti (1996). We present here a detailed
analysis of the independence model for two categorical variables. This model
is simple enough to be analyzed in full detail in a few pages, and it contains
all the relevant algebraic objects pointed out in the previous discussion. More
complex models, involving a greater number of categorical variables and
different stochastic relations among the variables, will be presented in the
next chapters.
Example 1.15 Consider the cross-classification of two categorical random
variables X and Y with supports {1, . . . , I} and {1, . . . , J}, respectively.
The saturated model is written in the form
\[
\log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^{XY}_{ij}
\]
with the constraints
\[
\sum_{i=1}^{I} \lambda^X_i = \sum_{j=1}^{J} \lambda^Y_j =
\sum_{i=1}^{I} \lambda^{XY}_{ij} = \sum_{j=1}^{J} \lambda^{XY}_{ij} = 0 ,
\]
while the independence model, where we impose no interaction between the
variables, assumes the form
\[
\log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j
\]
with the constraints
\[
\sum_{i=1}^{I} \lambda^X_i = \sum_{j=1}^{J} \lambda^Y_j = 0 .
\]
This representation follows immediately from the vector space representation
presented in Example 1.2. Note that a vector space basis is easy to find in
the two-way case, while in the multi-way case some difficulties arise.
From our point of view, the multiplicative form of the log-linear model,
obtained by exponentiating the log-linear form, is of great interest. For the
saturated model we obtain the expression
\[
m(i,j) = \zeta_0 \, \zeta^X_i \, \zeta^Y_j \, \zeta^{XY}_{ij} \tag{1.25}
\]
and for the independence model we can write
\[
m(i,j) = \zeta_0 \, \zeta^X_i \, \zeta^Y_j \tag{1.26}
\]
where the ζ parameters satisfy appropriate constraints. Note that our notation
is coherent with the notation in Chapter 6 of Pistone et al. (2001a).
Manipulating Equation (1.26), it is easy to check that under the indepen-
dence model the means m(i, j) must satisfy the relations
\[
m(i_1, j_1) \, m(i_2, j_2) - m(i_1, j_2) \, m(i_2, j_1) = 0 \tag{1.27}
\]
for all i1, i2 ∈ {1, . . . , I} and j1, j2 ∈ {1, . . . , J}. Of course, the same relations
hold for the cell probabilities p(i, j). Thus, the independence model is repre-
sented by the zero set of the polynomial system formed by the binomials in
Equation (1.27). We will see in the next section the correspondence between
algebraic restrictions of this form and the geometric restrictions on the space
of the mean parameters, or equivalently on the space of the cell probabilities.
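The relations (1.27) are the 2 × 2 minors of the I × J matrix of cell means. As a small sketch (the function name is ours, for illustration), one can enumerate them for any table size; these are the binomials whose role in the algebraic theory is discussed in the next section.

```python
from itertools import combinations

def independence_binomials(I, J):
    """Return the 2x2-minor relations (1.27) as pairs of monomials,
    each monomial being the pair of cell indices it multiplies:
    m(i1,j1) m(i2,j2) - m(i1,j2) m(i2,j1)."""
    rels = []
    for i1, i2 in combinations(range(1, I + 1), 2):
        for j1, j2 in combinations(range(1, J + 1), 2):
            rels.append((((i1, j1), (i2, j2)), ((i1, j2), (i2, j1))))
    return rels

rels = independence_binomials(3, 3)
print(len(rels))   # C(3,2) * C(3,2) = 9 binomials for a 3x3 table
```

For a 2 × 2 table the list reduces to the single binomial of Equation (1.5).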
The multiplicative form of the log-linear models has been widely studied
in Goodman (1979a), Goodman (1979b) and Goodman (1985). We will see in
the next sections the relevance of this representation.
1.1.5 Exact methods
Exact statistics can be useful in situations where the asymptotic assumptions
are not met, so that the asymptotic p-values are not close approximations of
the true p-values. Standard asymptotic methods rely on the assumption that
the test statistic follows a particular distribution when the sample size is suf-
ficiently large. When the sample size is not large, asymptotic results may not
be valid, with the asymptotic p-values differing substantially from the exact
p-values. Asymptotic results may also be unreliable when the distribution of
the data is sparse. Examples can be found in Agresti (1996) and Bishop et al.
(1975). We will also see some examples in our framework later in the thesis.
Exact computations are based on the statistical theory of exact conditional
inference for contingency tables, reviewed by Agresti (2001).
Following such exact conditional methods, one computes the exact distri-
bution of the test statistic given the sufficient statistic. These procedures often
arise for independence and goodness of fit tests, as well as in the construction of
uniformly most powerful tests and accurate confidence interval computations
through the Rao-Blackwell theorem, see Lehmann (1987). Thus, the analysis
is restricted to the set of tables with the same value t of the sufficient statistic
TN as the observed table
\[
\mathcal{F}_t = \left\{ f : \mathcal{X} \longrightarrow \mathbb{N} \; : \; T_N(f) = t \right\} . \tag{1.28}
\]
This set is often called the reference set. Throughout the thesis, we will
assume that Ft is a finite set.
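For a concrete picture of the reference set, consider the independence model on a 2 × 2 table, where the sufficient statistic is the pair of margins: fixing them, a single free cell determines the whole table. A brute-force sketch with illustrative margins (the helper name is ours):

```python
def reference_set_2x2(row, col):
    """All nonnegative integer 2x2 tables with the given row and
    column sums -- the reference set F_t for the independence model."""
    tables = []
    for a in range(min(row[0], col[0]) + 1):   # a = cell (1,1)
        b, c = row[0] - a, col[0] - a          # remaining first row/column
        d = row[1] - c                         # cell (2,2) is then forced
        if b >= 0 and c >= 0 and d >= 0:
            tables.append([[a, b], [c, d]])
    return tables

Ft = reference_set_2x2([3, 4], [2, 5])
print(len(Ft))   # 3 tables share these margins
for table in Ft:
    print(table)
```

For larger tables and more complex models the set grows very quickly, which is exactly why direct enumeration becomes infeasible.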
In the literature there exist many exact algorithms for a wide range of exact
tests, such as the Pearson chi-square, the likelihood-ratio chi-square, the
Mantel-Haenszel chi-square, Fisher's exact test and McNemar's test, but in
general such algorithms are limited to two-way contingency tables. A number
of exact tests for contingency tables are presented for example in Agresti (2001).
These algorithms are also implemented in statistical software packages, see
for example SAS/STAT User's Guide (2000). These exact algorithms compute
exact p-values for general I × J tables using the network algorithm developed
by Mehta & Patel (1983). This algorithm provides a substantial advantage over
direct enumeration of the reference set Ft, which can be very time-consuming
and is feasible only for small problems. Refer to Agresti (1992) for a review of
algorithms for the computation of exact p-values, and to Mehta et al. (1991)
for information on the performance of the network algorithm.
Limiting ourselves to its basic steps, the network algorithm proceeds as
follows. Corresponding to the reference set Ft, the algorithm
forms a directed acyclic network consisting of nodes in a number of stages. A
path through the network corresponds to a distinct table in the reference set.
The distances between nodes are defined so that the total distance of a path
through the network is the corresponding value of the test statistic. At each
node, the algorithm computes the shortest and longest path distances for all
the paths that pass through that node. The exact computation of these quan-
tities for statistics which are linear combinations of the cell counts is presented
in Agresti et al. (1990). For statistics of other forms, the computation of an
upper bound for the longest path and a lower bound for the shortest path
are needed. The longest and shortest path distances or bounds for a node
are compared to the value of the test statistic to determine whether all paths
through the node contribute to the p−value, none of the paths through the
node contribute to the p−value, or neither of these situations occurs. If all
paths through the node contribute, the p−value is incremented accordingly,
and these paths are eliminated from further analysis. If no paths contribute,
these paths are eliminated from the analysis. Otherwise, the algorithm contin-
ues, still processing this node and the associated paths. The algorithm
terminates when every node has either contributed to the p−value or been
eliminated.
For each possible table, the algorithm compares its test statistic value with
the corresponding value for the observed table. If the value for the table is
greater than or equal to the observed test statistic value, it increments the
exact p−value by the probability of that table, which is calculated under the
null hypothesis using the multinomial frequency distribution.
For several tests encountered in the analysis of categorical data, the test
statistic is nonnegative, and large values of the test statistic indicate a de-
parture from the null hypothesis. Such tests include Pearson’s chi-square, the
likelihood-ratio chi-square, the Mantel-Haenszel chi-square, Fisher’s exact test
for tables larger than 2 × 2, and McNemar’s test.
The exact p−value for these tests is the sum of probabilities for those tables
having a test statistic greater than or equal to the value of the observed test
statistic.
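For a 2 × 2 table, this whole computation can be carried out by direct enumeration of the reference set. The following is a minimal sketch in Python; the function name is our illustrative choice, Pearson's chi-square is used as the test statistic, and all margins are assumed positive.

```python
from math import comb

def exact_pvalue_2x2(a, b, c, d):
    """Exact p-value for independence in the 2x2 table [[a, b], [c, d]]:
    enumerate the reference set (all tables with the observed margins),
    and sum the hypergeometric probabilities of the tables whose Pearson
    chi-square statistic is at least the observed one."""
    r1, r2 = a + b, c + d          # row totals
    c1, c2 = a + c, b + d          # column totals
    n = r1 + r2

    def chi2(x):
        # Pearson statistic of the table whose (1,1)-cell equals x
        cells = [(x, r1 * c1), (r1 - x, r1 * c2),
                 (c1 - x, r2 * c1), (r2 - c1 + x, r2 * c2)]
        return sum((obs - e / n) ** 2 / (e / n) for obs, e in cells)

    observed = chi2(a)
    p = 0.0
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):  # admissible tables
        prob = comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
        if chi2(x) >= observed - 1e-12:  # tolerance for float comparison
            p += prob
    return p

print(exact_pvalue_2x2(3, 1, 1, 3))  # 34/70, about 0.4857
```

Even this toy version makes the cost of enumeration visible: the loop runs over the whole reference set, which grows rapidly with the table size and the sample size.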
These recently developed algorithms for two-way tables, together with im-
provements in computer power, make it feasible now to perform exact com-
putations for data sets where previously only asymptotic methods could be
applied. Nevertheless, there are still large problems that may require a pro-
hibitive amount of time and memory for exact computations.
As noted by Diaconis & Sturmfels (1998), the reference set Ft is very
large and thus difficult to enumerate even for very small data sets, for which
the asymptotic approximation is severely inadequate.
Recently, a software package has been developed to compute efficiently
the cardinality of the reference set Ft even in non-trivial cases. This software,
called LattE, an acronym for “Lattice point Enumeration”, is presented in
De Loera et al. (2003) and uses non-standard algebraic results about convex
polytopes. For example, using this software, the authors are able to com-
pute the cardinality of the reference set Ft for Table 1.1 with two-dimensional
marginal totals as components of the sufficient statistic, concluding that this
cardinality is equal to 441, while for a table whose two-dimensional marginal
total are displayed in Table 1.2, the cardinality of the reference set is about
2.2498 × 10^40.
X3 = 1:
        X2
X1      1    2    3
1      96   72   16
2      10    7    6
3       1    1    2

X3 = 2:
        X2
X1      1    2    3
1     186  127   51
2      11    7    3
3       0    1    0

Table 1.1: A 3 × 3 × 2 example.
X1 × X2 margin:
        X2
X1          1        2        3
1      164424   324745   127239
2      262784   601074  9369116
3      149654  7618489  1736281

X1 × X3 margin:
        X3
X1          1        2        3
1      163445    49395   403568
2     1151824   767866  8313284
3     1609500  6331023  1563901

X2 × X3 margin:
        X3
X2          1        2        3
1      184032   123585   269245
2      886393  6722333   935582
3     1854344   302366  9075926

Table 1.2: Two-dimensional marginal totals for a 3 × 3 × 3 example.
As the software LattE is at this time still under development, only a demo
version is available to us; for this reason we present here examples taken from
De Loera et al. (2003), instead of new examples.
1.2. ALGEBRAIC THEORY OF TORIC IDEALS 21
However, it is worth noting that for the statistical applications, it is often
sufficient to have an approximation of the cardinality of Ft.
When asymptotic methods may not be sufficient for such large problems,
we can use Monte Carlo estimation of the exact p−values. No formula exists
that can predict in advance how much time and memory are needed to compute
an exact p−value for a given problem. The time and memory
required depend on several factors, including which test is being performed,
the total sample size, the number of rows and columns, and the specific ar-
rangement of the observations into table cells. Moreover, for a fixed sample
size, time and memory requirements tend to increase as the number of rows
and columns increases, since this corresponds to an increase in the number
of tables in the reference set. Also for a fixed sample size, time and memory
requirements increase as the marginal row and column totals become more
homogeneous, see Agresti et al. (1990) for details.
1.2 Algebraic theory of toric ideals
Following the description of the log-linear models in Section 1.1, we remark
that binomials and power products play a fundamental role in the formal
representation of the log-linear models.
Experts in Commutative Algebra have studied extensively the algebraic
relations of power products, leading to the theory of toric ideals. In this section
we briefly review the theory of toric ideals with a particular emphasis on the
results more relevant for the statistical applications. Moreover, we present
some recent computational issues which lead to a feasible computation of toric
ideals.
In the following, some basic notions of Computational Commutative Al-
gebra are needed. In particular, we will use the notions of polynomial ideal,
Grobner basis, elimination ideal, saturation. The reader can refer to Kreuzer
& Robbiano (2000) or Cox et al. (1992). Many of the applications of the
theory of toric ideals in Probability and Statistics come from the fundamental
work by Sturmfels (1996).
For the theory of toric ideals we mainly refer to the papers Bigatti &
Robbiano (2001) and Bigatti et al. (1999). The computations in the examples
are obtained with the free software CoCoA, see Capani et al. (2000).
22 CHAPTER 1. STATISTICAL AND ALGEBRAIC BACKGROUND
1.2.1 Definitions and first properties
Let K be a numeric field and let K[y1, . . . , ys] and K[ξ1, . . . , ξq] be the polyno-
mial rings in the indeterminates y1, . . . , ys and ξ1, . . . , ξq, respectively. Recall
that a term in a polynomial ring is a power product and that a binomial is a
difference of terms. We denote by Ts the set of all terms in s indeterminates.
Definition 1.16 Let t1, . . . , tq be terms in K[y1, . . . , ys]. The toric ideal I =
I(t1, . . . , tq) ⊆ K[ξ1, . . . , ξq] associated to the set {t1, . . . , tq} is the ideal of all
polynomials p such that p(t1, . . . , tq) = 0.
Note that different sets of power products may produce the same toric ideal.
This remark will be analyzed from the point of view of probability models later
in the thesis.
In the following, we use the multi-index notation, i.e. y^d means y_1^{d_1} · · · y_s^{d_s}.
Given D = {d1, . . . , dq} ⊂ Ns such that ydi = ti for i = 1, . . . , q, consider the
semigroup homomorphism
T : Nq −→ Ns (1.29)
ei 7−→ di .
We can associate to this map a ring homomorphism
π : K[ξ1, . . . , ξq] −→ K[y1, . . . , ys] (1.30)
ξi 7−→ ti .
With these definitions, we refer to the toric ideal I as the toric ideal asso-
ciated to T or the toric ideal associated to π. It is worth noting that the toric
ideal I is the kernel of the ring homomorphism π.
Note that we have π(ξ^u) = y^v = y^{T(u)}, with u ∈ N^q and v ∈ N^s such that
v = ∑_{i=1}^q u_i d_i.
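The fact that I = ker(π) can be checked by direct substitution. The following sketch in Python with sympy uses the six terms of Example 1.21 below purely for illustration; `pi_of` is a hypothetical helper implementing π on monomials.

```python
from sympy import symbols, expand

y1, y2, y3, y4, y5 = symbols('y1:6')
t = [y1*y4, y1*y5, y2*y4, y2*y5, y3*y4, y3*y5]  # terms t_1, ..., t_6

def pi_of(u):
    """Image under pi of the monomial xi^u: substitute xi_i -> t_i."""
    m = 1
    for t_i, u_i in zip(t, u):
        m *= t_i**u_i
    return expand(m)

# the binomial xi_2*xi_3 - xi_1*xi_4 lies in ker(pi) = I:
u, v = (0, 1, 1, 0, 0, 0), (1, 0, 0, 1, 0, 0)
print(pi_of(u) - pi_of(v))  # 0
```

Here T(u) = T(v) = (1, 1, 0, 1, 1), so the two monomials map to the same power product in the y's and their difference lies in the toric ideal.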
The following results report the main properties of toric ideals that are of
statistical interest.
Proposition 1.17 Let T : Nq −→ Ns be a semigroup homomorphism and
let us denote by π : K[ξ1, . . . , ξq] −→ K[y1, . . . , ys] the corresponding ring
homomorphism. The toric ideal I associated to π is generated, as a vector space,
by the binomials of the set
{ξu − ξv : T (u) = T (v)} . (1.31)
Proof. Choose a term-ordering τ on T_q and suppose there exists a poly-
nomial g in I which cannot be written as a linear combination of polynomials in
{ξ^u − ξ^v : T(u) = T(v)}. Choose g ≠ 0 such that the leading term LT(g) = ξ^a
is minimal with respect to τ.
As g ∈ ker(π), we have g(t1, . . . , tq) = 0. In particular, the term y^{T(a)}
must be cancelled, hence g contains another monomial ξ^b ≺_τ ξ^a such that
T(b) = T(a).
Now, the polynomial g′ = g − ξ^a + ξ^b is non-zero, LT(g′) ≺_τ LT(g), and it
cannot be written as a linear combination of polynomials in {ξ^u − ξ^v : T(u) =
T(v)}. This is a contradiction, because we supposed LT(g) minimal.
Now we define
T : Zq −→ Zs (1.32)
the extension of T to the group Z^q. As any u ∈ Z^q can be written in the
form u = u+ − u−, where u+ and u− are the positive and negative parts of u
respectively, the above result together with the standard theory of Grobner
bases leads to the following proposition.
Proposition 1.18 For any term-ordering τ on Tq, there exists a finite set of
vectors Gτ ⊂ ker(T ) such that the reduced Grobner basis of I with respect to τ
is
{ξ^{u+} − ξ^{u−} : u ∈ Gτ} . (1.33)
Proof. The result follows from Proposition 1.17 and from standard results
about Grobner bases.
Theorem 1.19 Let t1, . . . , tq be terms in the indeterminates y1, . . . , ys and let
J be the ideal in K[ξ1, . . . , ξq, y1, . . . , ys] generated by {ξ1− t1, . . . , ξq− tq}. We
have that
1. I = J ∩K[ξ1, . . . , ξq];
2. if Gτ is a Grobner basis of J with respect to an elimination ordering τ for
{y1, . . . , ys}, then I is generated by the polynomials in Gτ ∩K[ξ1, . . . , ξq].
For the proof, see Bigatti & Robbiano (2001), Proposition 1.4.
Moreover, nice properties of toric ideals are summarized in the following
theorem. We will see later in the thesis the relevance of these properties.
Theorem 1.20 In the previous settings, consider any grading on the polyno-
mial ring K[ξ1, . . . , ξq, y1, . . . , ys] such that the degrees of the yj are arbitrary
integers and deg(ξi) = deg(ti), i = 1, . . . , q. Then
1. the ideal I is prime;
2. the ideal I is generated by pure binomials, i.e.
I = Ideal(t − t′ : π(t) = π(t′), gcd(t, t′) = 1) , (1.34)
where t and t′ denote terms in the ξ indeterminates;
3. the ideal I is homogeneous.
For the proof, see Bigatti & Robbiano (2001), Theorem 1.6.
The variety Variety(I) is the geometric counterpart of the ideal I. A
point (ξ1, . . . , ξq) belongs to Variety(I) if and only if p(ξ1, . . . , ξq) = 0 for all
polynomials p ∈ I. This condition can be verified by making use of a system of
generators of the ideal I. For details about the relationships between ideals
and varieties, see for example Cox et al. (1992).
1.2.2 Computation of toric ideals. First method
A simple algorithm to compute the Grobner basis of the toric ideal I =
I(t1, . . . , tq) is based on the elimination algorithm as described in Theorem
1.19.
We first consider the homomorphism π : K[ξ1, . . . , ξq] −→ K[y1, . . . , ys]
defined by ξi 7−→ ti; then, we consider the ideal J in K[ξ1, . . . , ξq, y1, . . . , ys]
generated by the set of polynomials
{ξ1 − t1, . . . , ξq − tq} . (1.35)
Using the results from Theorem 1.19, we have that a (reduced) Grobner basis
of I is obtained computing a (reduced) Grobner basis of J with respect to
a term-ordering of elimination for y1, . . . , ys and keeping only the polynomials
not involving the y’s.
Example 1.21 Consider the polynomial ring K[y1, . . . , y5] and the following
set of six terms:
{y1y4, y1y5, y2y4, y2y5, y3y4, y3y5} (1.36)
We will consider this example later in the thesis, viewed from a statistical
point of view.
In order to compute the toric ideal I = I(y1y4, y1y5, y2y4, y2y5, y3y4, y3y5)
we introduce the polynomial ring K[ξ1, . . . , ξ6] and we define the ring homo-
morphism π as described in Equation (1.30). The toric ideal of interest I is
the kernel of π. To compute this kernel, we define the ideal
J = Ideal(ξ1 − y1y4, ξ2 − y1y5, ξ3 − y2y4, ξ4 − y2y5, ξ5 − y3y4, ξ6 − y3y5)
in the polynomial ring K[ξ1, . . . , ξ6, y1, . . . , y5]. Using CoCoA, the Grobner
basis of this ideal with respect to an elimination term ordering for the inde-
terminates y1, . . . , y5 is
G(J) = {−y3y5 + ξ6, −y3y4 + ξ5, −y2y5 + ξ4, −y2y4 + ξ3,
        −y1y5 + ξ2, −y1y4 + ξ1, ξ5y5 − ξ6y4, ξ3y5 − ξ4y4, ξ1y5 − ξ2y4,
        ξ4y3 − ξ6y2, ξ2y3 − ξ6y1, ξ2y2 − ξ4y1, ξ3y3 − ξ5y2, ξ1y3 − ξ5y1,
        ξ1y2 − ξ3y1, ξ4ξ5 − ξ3ξ6, ξ2ξ5 − ξ1ξ6, ξ2ξ3 − ξ1ξ4}

and the only polynomials which do not involve the y’s are the last three.
Thus
I = Ideal(ξ4ξ5 − ξ3ξ6, ξ2ξ5 − ξ1ξ6, ξ2ξ3 − ξ1ξ4) (1.37)
and the Grobner basis of the relevant toric ideal consists of three binomials.
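The same elimination can be reproduced with a general-purpose computer algebra system. The following sketch uses Python with sympy instead of CoCoA: a lex Grobner basis with the y's ordered first is an elimination ordering, and the basis elements free of the y's generate I.

```python
from sympy import symbols, groebner

y1, y2, y3, y4, y5 = symbols('y1:6')
x = x1, x2, x3, x4, x5, x6 = symbols('xi1:7')
terms = [y1*y4, y1*y5, y2*y4, y2*y5, y3*y4, y3*y5]

# J = Ideal(xi_i - t_i); lex with the y's first is an elimination ordering
J = [xi - t for xi, t in zip(x, terms)]
G = groebner(J, y1, y2, y3, y4, y5, *x, order='lex')

# the generators of the toric ideal are the basis elements free of the y's
toric = [g for g in G.exprs if not g.free_symbols & {y1, y2, y3, y4, y5}]
print(toric)  # three binomials, as in Equation (1.37)
```

Up to signs, the three surviving binomials are exactly those of Equation (1.37).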
This first method, although intuitive, is not the best from the computa-
tional point of view. Better methods for computing toric ideals can be found in
Bigatti et al. (1999). These methods are based on the theory of saturation.
We will illustrate such methods in Section 1.2.4.
1.2.3 Navigating inside the set of solutions of an integer linear system
The saturation of an ideal with respect to a polynomial is given by the fol-
lowing construction. Consider a polynomial p in K[ξ1, . . . , ξr] and an ideal
B ⊂ K[ξ1, . . . , ξr]. The ideal I is the saturation of B with respect to p if
I = Elim(v,B + Ideal(pv − 1)) (1.38)
where v is a new indeterminate and the computations are made in the poly-
nomial ring K[ξ1, . . . , ξr, v].
Let t1, . . . , tq be power products in the polynomial ring K[y1, . . . , ys], i.e.

t_i = y_1^{a_{i1}} · · · y_s^{a_{is}} , i = 1, . . . , q . (1.39)
We can define the matrix A = (a_{ij}) ∈ Mat_{q×s}(N) associated to the power
products, whose i-th row is the exponent vector of t_i; the matrix A has
non-negative entries. Conversely, any matrix A with non-negative entries defines
a set of power products through Equation (1.39).
Definition 1.22 Let A ∈ Matq×s(N). We define the toric ideal associated to
A as the toric ideal IA = I(t1, . . . , tq), where the ti are defined in Equation (1.39).
Now, we investigate the relation between the toric ideal IA and the kernel
of A viewed as Z-module. First, we need some technical definitions.
Consider the homogeneous system of Diophantine equations associated to
A:

a_{11} z_1 + · · · + a_{q1} z_q = 0
· · ·                                (1.40)
a_{1s} z_1 + · · · + a_{qs} z_q = 0

where a_{ij} is the (i, j)-th element of the matrix A. The set of integer solutions
ker(A) of such system is a Z-module. In the sequel, we will use the decompo-
sition of a vector into its positive and negative parts, that is v = v+ − v−.
Now, let Bin(K[ξ1, . . . , ξq]) be the set of all binomials in K[ξ1, . . . , ξq] and
define the maps ρ′ : Z^q −→ K[ξ1, . . . , ξq] and η′ : Bin(K[ξ1, . . . , ξq]) −→ Z^q as

ρ′(u) = ξ^{u+} − ξ^{u−} (1.41)

and

η′(ξ^u − ξ^v) = u − v . (1.42)
Definition 1.23 Let B = {v1, . . . , vr} be a set of vectors. If B generates
ker(A) as Z-module, the ideal I(B) = I(ρ′(v1), . . . , ρ′(vr)) is called a lattice
ideal associated to ker(A).
Let π = ∏_{i=1}^q ξ_i be the product of all the ξ indeterminates. The main
theorem about the application of saturation for computing toric ideals is
the following.
Theorem 1.24 Let B = {v1, . . . , vr} ⊆ ker(A). The following conditions are
equivalent.
a) I(B) : π^∞ = I(A);
b) I(B)Pπ = I(A)Pπ;
c) B is a set of generators of ker(A), i.e. I(B) is a lattice ideal.
For the proof, see Bigatti & Robbiano (2001), Theorem 2.10. For further
details about Theorem 1.24, see also Sturmfels (1996).
Now, let A ∈ Matq×s(N) and let b = (b1, . . . , bs) be a vector in Ns. We
want to find the non-negative solutions of the system
a_{11} z_1 + · · · + a_{q1} z_q = b_1
· · ·                                (1.43)
a_{1s} z_1 + · · · + a_{qs} z_q = b_s
The following result easily leads to a method for navigating inside the set
of solutions of a Diophantine system.
Theorem 1.25 Let S be a Diophantine system as defined in Equation (1.43).
Let t_i, i = 1, . . . , q, be the power products associated to the matrix A and let
t = y_1^{b_1} · · · y_s^{b_s}. Moreover, let (α_1, . . . , α_q) ∈ N^q. Let J be the ideal in the
polynomial ring K[ξ1, . . . , ξq, y1, . . . , ys] generated by the set of binomials
{ξ1 − t1, . . . , ξq − tq}. The following conditions are equivalent.
a) The vector (α_1, . . . , α_q) is a solution of S.
b) There is an equality of power products t_1^{α_1} · · · t_q^{α_q} = y_1^{b_1} · · · y_s^{b_s}.
c) The binomial y_1^{b_1} · · · y_s^{b_s} − ξ_1^{α_1} · · · ξ_q^{α_q} is in J.
For the proof, see Bigatti & Robbiano (2001), Proposition 3.2.
We will see an important example of this kind of navigation in the following
chapter, in the framework of contingency tables.
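Condition (b) of Theorem 1.25 already gives a purely computational test for membership in the solution set. A sketch in Python with sympy, using the terms of Example 1.21; the function names and the chosen vectors are our illustrative assumptions.

```python
from sympy import symbols, expand

ys = symbols('y1:6')
terms = [ys[0]*ys[3], ys[0]*ys[4], ys[1]*ys[3],
         ys[1]*ys[4], ys[2]*ys[3], ys[2]*ys[4]]  # the t_i of Example 1.21

def power_product(bases, exps):
    m = 1
    for base, e in zip(bases, exps):
        m *= base**e
    return m

def is_solution(alpha, b):
    """Condition (b): t_1^a_1 ... t_q^a_q == y_1^b_1 ... y_s^b_s."""
    return expand(power_product(terms, alpha) - power_product(ys, b)) == 0

# alpha picks t_1 = y1y4, t_4 = y2y5, t_6 = y3y5; their product is y1y2y3y4y5^2
print(is_solution((1, 0, 0, 1, 0, 1), (1, 1, 1, 1, 2)))  # True
```

The test is a monomial comparison, so no Grobner basis computation is needed for a single membership check.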
1.2.4 Computation of toric ideals. Second method
In view of the results in Section 1.2.3, and in particular Theorem 1.24, we can
compute the toric ideal with the following steps:
• consider the matrix representation A of the power products;
• compute a lattice basis B of ker(A) and then the lattice ideal I(B);
• compute the saturation of I(B) with respect to the polynomial π = ∏_{i=1}^q ξ_i.
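The first step can be carried out with plain linear algebra, as noted at the end of this section. A sketch in Python with sympy for the matrix of Example 1.26 below: a basis of the kernel of Aᵀ over Q is computed and scaled to integer vectors. In general a genuine lattice basis may require Hermite-normal-form techniques, but for this example the scaled vectors suffice.

```python
from sympy import Matrix, ilcm

# matrix representation A of the power products (Equation (1.44))
A = Matrix([[1, 0, 0, 1, 0],
            [1, 0, 0, 0, 1],
            [0, 1, 0, 1, 0],
            [0, 1, 0, 0, 1],
            [0, 0, 1, 1, 0],
            [0, 0, 1, 0, 1]])

# integer vectors z with z A = 0: kernel of A^T over Q, denominators cleared
basis = []
for v in A.T.nullspace():
    scale = ilcm(*[entry.q for entry in v])  # lcm of the denominators
    basis.append(list(v * scale))
print(basis)
```

Each basis vector corresponds, through the map ρ′ of Equation (1.41), to one binomial generator of the lattice ideal I(B).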
Example 1.26 Consider again the problem of Example 1.21. Now we apply
the saturation-based algorithm.
The matrix representation of the power products in Equation (1.36) is
A =

1 0 0 1 0
1 0 0 0 1
0 1 0 1 0
0 1 0 0 1
0 0 1 1 0
0 0 1 0 1
                    (1.44)
A lattice basis of ker(A) is formed by the two vectors
(1,−1,−1, 1, 0, 0) and (1,−1, 0, 0,−1, 1) .
Thus, the lattice ideal I(B) is Ideal(ξ1ξ4 − ξ2ξ3, ξ1ξ6 − ξ2ξ5). The ideal J =
I(B) + (πv − 1) is generated by the polynomials

G(J) = {−ξ2ξ5 + ξ1ξ6, −ξ2ξ3 + ξ1ξ4, −ξ1²ξ3ξ4ξ6²v + 1,
        −ξ1²ξ3²ξ6³v + ξ5, −ξ1³ξ4²ξ6²v + ξ2, ξ4ξ5 − ξ3ξ6}       (1.45)
and, eliminating the v indeterminate, we obtain the same result as in Example
1.21, but with a relevant computational improvement.
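The saturation step of Equation (1.38) is itself an elimination, so it can also be sketched with sympy; the names are illustrative, and CoCoA's Toric function implements a more efficient version of the same idea.

```python
from sympy import symbols, groebner

x = x1, x2, x3, x4, x5, x6 = symbols('xi1:7')
v = symbols('v')

# the lattice ideal I(B) of Example 1.26, plus the polynomial pi*v - 1
pi = x1*x2*x3*x4*x5*x6
J = [x1*x4 - x2*x3, x1*x6 - x2*x5, pi*v - 1]

# eliminating v computes the saturation I(B) : pi^infinity = I(A)
G = groebner(J, v, *x, order='lex')
toric = [g for g in G.exprs if v not in g.free_symbols]
print(toric)  # the three binomials of Example 1.21
```

Only one extra indeterminate is introduced, instead of the s = 5 indeterminates y1, . . . , y5 needed by the first method; this is one source of the computational improvement mentioned above.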
We will consider again Examples 1.21 and 1.26, after the discussion of the
Diaconis-Sturmfels algorithm.
Note that the first step of the algorithm, i.e. the computation of a basis
of ker(A), can be made using the theory of vector spaces, with a notable
saving of time. A comparison of the different timings between the algorithms
introduced here can be found in Bigatti et al. (1999). The saturation-based
algorithm is implemented in the CoCoA function Toric, which computes a
system of generators of the toric ideal, starting from the matrix representation
of the power products. We will use CoCoA for all symbolic computations
presented in this thesis.
Chapter 2
The Diaconis-Sturmfels algorithm
In the present chapter we describe an algorithm for sampling from conditional
distributions on finite sample spaces, first introduced by Diaconis & Sturmfels
(1998). While in the original paper by Diaconis & Sturmfels (1998) the at-
tention is restricted to exponential models, in this chapter we show that the
algorithm can be applied to a more general class of probability models, namely
all probability models which verify a linearity property of the sufficient statis-
tic.
In Section 2.1, we introduce the multivariate hypergeometric distribution,
showing how that distribution is essential in the problems of estimation and
hypothesis testing. In Section 2.2, we describe the algorithm, and we discuss
the relevance of the use of toric ideals. The algorithm is a Markov chain
Monte Carlo one, based on the key notion of Markov basis. The links between the
theorems proving the validity of the algorithm and the theory of toric ideals
presented in Chapter 1 will be pointed out. In Section 2.3 we briefly present
some results about the rates of convergence of the Markov chain and in Section
2.4 we present two simple examples.
2.1 Hypergeometric distribution
This Section is a slight generalization of Diaconis & Sturmfels (1998), Section
1, where this theory is presented under the exponential model.
Let X be a finite set and let
P[x] = ϕ(T (x)) x ∈ X , (2.1)
be a probability model with sufficient statistic T : X −→ Ns and such that the
distribution of a sample of independent and identically P-distributed random
variables X = (X1, . . . , XN) is of the form
P^{∗N} = ψ(T_N) where T_N = ∑_{k=1}^N T(X_k) . (2.2)
This means that the sufficient statistic of X is the sum of the sufficient statistics
of the one-dimensional random variables Xk, k = 1, . . . , N .
This model contains the usual exponential models of the form

Pθ[x] = c(θ) e^{⟨θ, T(x)⟩} ,

but other ones are included, such as the models of the form

Pθ[(x, y)] = c(θ) e^{θx + θ²y} .
We denote

Y_t = { (x1, . . . , xN) : ∑_{k=1}^N T(x_k) = t } , (2.3)

i.e., the set of all samples with fixed value t of the sufficient statistic T_N. It is
known that the distribution of X given {T_N = t} is uniform on Y_t. In fact

P^{∗N}[X = x | T_N = t] = P^{∗N}[x] / ∑_{y : T_N(y)=t} P^{∗N}[y] = 1 / #{x : T_N(x) = t} = 1 / #Y_t

for all x = (x1, . . . , xN) ∈ Y_t.
As presented in the previous chapter, standard statistical techniques, such
as bootstrap methods, need uniformly distributed samples on Yt. Even in the
simplest cases, the set Yt is very large and difficult to enumerate, and it is not
easy to sample directly from Yt with classical Monte Carlo methods. Refer in
particular to the discussion in Chapter 1, Section 1.1.5.
As we are in the finite case, the problem can be reduced to a more suitable
form, in terms of counting. Then, we consider the reference set Ft introduced
in Chapter 1
F_t = { f : X −→ N : ∑_{x∈X} f(x) T(x) = t } , (2.4)
i.e., the set of all frequency tables obtained from samples with value t of the
sufficient statistic TN .
The relation between Yt and Ft is given by the map F : Yt −→ Ft defined
by
F(x1, . . . , xN) = ∑_{x∈X} e_x ∑_{k=1}^N I_{x_k = x} (2.5)
which associates to every sample the corresponding frequency table. Here
(ex)x∈X denotes the canonical basis of RX . Note that the additivity condition
(2.2), together with the conditions imposed in the definitions of Yt and Ft, is
essential for the following composition: the map F : Y_t −→ F_t followed by the
map f ↦ ∑_{x∈X} f(x) T(x) from F_t to R^s coincides with T_N on Y_t.
The image probability of P^{∗N}[ · | T_N = t] induced by F is

H_t(f) = P^{∗N}[F^{−1}(f) | T_N = t] = #{x : F(x) = f} / #Y_t , (2.6)

which is by definition the hypergeometric distribution on F_t. Simple compu-
tations show that

H_t(f) = (N! / #Y_t) ∏_{x∈X} (f(x)!)^{−1} . (2.7)
For further details on hypergeometric distribution, see Bishop et al. (1975) and
Agresti (2002). However, the MCMC methods presented in this thesis do not
need the computation of #Yt, as will become clear in the next Section.
2.2 The algorithm
As pointed out in Chapter 1, in order to perform hypothesis testing for con-
tingency tables, the usual approach is the asymptotic one, which involves chi-
squared distributions, but in many cases, especially when the table is sparse,
the chi-squared approximation may not be adequate (for further details on
this topic see, for example, Appendix IV of Fienberg (1980)). We can ob-
tain approximations of test statistics via Monte Carlo methods, drawing an iid
hypergeometric sample of contingency tables in the reference set Ft.
The problem is then reduced to sample from the hypergeometric distri-
bution on Ft, or equivalently from the uniform distribution on Yt. The litera-
ture suggests avoiding the enumeration problem through Markov chain Monte
Carlo (MCMC) methods. In particular, we are interested in the Metropolis–
Hastings algorithm, which is based on a set of moves for constructing the rel-
evant Markov chain. A review on the Metropolis–Hastings algorithm can be
found in Chib & Greenberg (1995).
The following is a key definition in that theory.
Definition 2.1 A Markov basis of Ft is a set of functions m1, . . . , mL : X −→ Z,
called moves, such that for any 1 ≤ i ≤ L

∑_{x∈X} m_i(x) T(x) = 0 , (2.8)

where T is the sufficient statistic, and for any f, f′ ∈ Ft there exists a sequence
of moves (m_{i_1}, . . . , m_{i_A}) and a sequence (ε_j)_{j=1}^A with ε_j = ±1 such that

f′ = f + ∑_{j=1}^A ε_j m_{i_j} (2.9)

and

f + ∑_{j=1}^a ε_j m_{i_j} ≥ 0 (2.10)

for all 1 ≤ a ≤ A.
The condition (2.8) implies that a move is a table with integer (possibly
negative) entries such that the value of the sufficient statistic is constant for
every table obtained by adding moves in {m1, . . . , mL}. Note once again the importance
of the linearity condition for the sufficient statistic T . In particular, if the
components of the sufficient statistic are the margins, then every move is a
table with null margins.
For example, if we consider the 3 × 3 tables with the marginal totals as
components of the sufficient statistic, a move is a table such as the following:

 0  +2   0
+1  −1   0
 0  −1  +1
From this definition it is clear that a Markov basis is the main tool to define
a random-walk-like Markov chain on Ft. It is well known that a connected,
reversible and aperiodic Markov chain converges to its stationary distribution.
In particular, we use here the following theorem.
Theorem 2.2 Given a Markov basis {m1, . . . ,mL}, generate a Markov chain
on Ft by choosing I uniformly in {1, . . . , L}. If the chain is currently at g ∈ Ft,
determine the set of j ∈ Z such that g + jmI ∈ Ft. Choose j in this set with
probability proportional to

∏_{x∈C_I} {(g(x) + j m_I(x))!}^{−1} (2.11)

with C_I = {x : m_I(x) ≠ 0}. This is a connected, reversible, aperiodic Markov
chain with stationary distribution Ht.
Proof. It is easy to see that the product in Equation (2.11) is proportional
to the stationary distribution Ht, and then the Markov chain is reversible with
respect to Ht. As M is a Markov basis, from the property (2.9) it follows that
the Markov chain is connected. To show that the chain is aperiodic, choose a
move and apply it until some cell count reaches zero, obtaining a strictly
positive holding probability.
In practice, to obtain a sample from the distribution of interest σ(f) on
Ft, the Markov chain is performed as follows:
(a) at time 0 the chain is in f;
(b) choose a move m uniformly in the Markov basis and ε = ±1 with prob-
ability 1/2 each independently from m;
(c) if f + εm ≥ 0 then move the chain from f to f + εm with probability
min{σ(f + εm)/σ(f), 1}; in all other cases, stay at f .
The transition matrix of this Markov chain is given by

P(f, f + m) = (1/2L) min{σ(f + m)/σ(f), 1}   if f + m ≥ 0
P(f, f − m) = (1/2L) min{σ(f − m)/σ(f), 1}   if f − m ≥ 0

for all m ∈ {m1, . . . , mL} and P(f, f) chosen in order to have a stochastic
matrix. For example, if σ is the uniform distribution, we have

P(f, f + m) = 1/2L   if f + m ≥ 0
P(f, f − m) = 1/2L   if f − m ≥ 0
and P(f, f) is simply the ratio of the number of inadmissible moves among
{±m1, . . . , ±mL} over the total number of moves.
Moreover, note that in the hypergeometric case the computation of the
ratio σ(f + m)/σ(f) leads to some simplifications. In fact

H_t(f + m)/H_t(f) = ∏_{x∈X} f(x)! / ∏_{x∈X} (f(x) + m(x))! = ∏_{x : m(x)≠0} f(x)!/(f(x) + m(x))! . (2.12)
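Steps (a)–(c), with σ = Ht and the basic 2 × 2 moves for two-way tables with fixed margins (the moves appearing in Theorem 2.5 below), can be sketched in a few lines of Python; the function name is our illustrative choice, and the simplified ratio of Equation (2.12) is used.

```python
import random

def ds_chain_step(f, rng):
    """One Metropolis step (steps (b)-(c)) on the fiber of tables with
    the margins of f, using a random 2x2 move and the ratio (2.12)."""
    I, J = len(f), len(f[0])
    i1, i2 = rng.sample(range(I), 2)
    j1, j2 = rng.sample(range(J), 2)
    eps = rng.choice((1, -1))
    # the move m: +eps at (i1,j1),(i2,j2) and -eps at (i1,j2),(i2,j1)
    delta = {(i1, j1): eps, (i2, j2): eps, (i1, j2): -eps, (i2, j1): -eps}
    if any(f[i][j] + d < 0 for (i, j), d in delta.items()):
        return  # f + eps*m has a negative cell: stay at f
    # Ht(f+m)/Ht(f): product of f(x)!/(f(x)+m(x))! over the touched cells
    ratio = 1.0
    for (i, j), d in delta.items():
        ratio *= 1.0 / (f[i][j] + 1) if d == 1 else f[i][j]
    if rng.random() < min(ratio, 1.0):  # accept with prob min{ratio, 1}
        for (i, j), d in delta.items():
            f[i][j] += d

rng = random.Random(0)
table = [[3, 1, 2], [0, 4, 1], [2, 2, 2]]  # starting table f
for _ in range(1000):
    ds_chain_step(table, rng)
# the chain never leaves the fiber: the margins of f are preserved
```

Note that only the cells touched by the move enter the acceptance ratio, so each step costs O(1) regardless of the table size; the number of steps needed to approach stationarity is a separate matter, discussed in Section 2.3.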
Let us recall here the basic convergence theorem for the Metropolis–Hastings
algorithm as it applies in our framework.
Theorem 2.3 Let σ(f) be a positive function on Ft. Given a Markov basis
{m1, . . . , mL}, the Markov chain generated following the described algorithm is
connected, reversible and aperiodic on Ft with stationary distribution propor-
tional to σ(f).
Proof. The proof is the same as for Theorem 2.2, considering a general
stationary distribution σ(f) instead of the hypergeometric distribution Ht.
Now, consider an indeterminate for each sample point, that is q indeter-
minates ξ1, . . . , ξq, and let K[ξ1, . . . , ξq] be the polynomial ring in the inde-
terminates ξ1, . . . , ξq with coefficients in the numeric field K. Moreover, let
M = {m1, . . . , mL} be a set of moves. We define the ideals

I_T = Ideal(ξ^a − ξ^b : T(a) = T(b)) (2.13)

I_M = Ideal(ξ^{m_i^+} − ξ^{m_i^−} : i = 1, . . . , L) (2.14)
Note that the ideal IT is a toric ideal as T is a semigroup homomorphism.
The application of the theory of toric ideals comes from the following result.
Theorem 2.4 The Markov chain with moves in M is connected if and only
if IM = IT .
Proof. “only if”: Notice that IM ⊆ IT is obvious. Moreover, if the
Markov chain is connected, then for any a, b with T(a) = T(b) the term
ξ^a can be written as

ξ^a = ξ^b + ∑_{j=1}^A ε_j (ξ^{m_{i_j}^+} − ξ^{m_{i_j}^−}) (2.15)

for a suitable sequence of moves (m_{i_1}, . . . , m_{i_A}) ∈ M and a suitable sequence
of signs (ε_1, . . . , ε_A). This implies that ξ^a − ξ^b ∈ IM, and then IT ⊆ IM.
“if”: For all g, g′ such that ∑_{x∈X} (g(x) − g′(x)) T(x) = 0, there is a repre-
sentation of the form

ξ^g − ξ^{g′} = ∑_{j=1}^L ε_j ξ^{h_j} (ξ^{f_{i_j}^+} − ξ^{f_{i_j}^−}) (2.16)

with ε_j = ±1 for all j. If L = 1, the above Equation reduces to the con-
nectedness condition. If L > 1 the result follows by induction. In fact, from
Equation (2.16) it follows that either ξ^g = ξ^{h_r} ξ^{f_{i_r}^+} or ξ^g = ξ^{h_r} ξ^{f_{i_r}^−} for some
r = 1, . . . , L. Suppose for example ξ^g = ξ^{h_r} ξ^{f_{i_r}^−}. Then g − f_{i_r}^− is non-negative
and so g + f_{i_r} is non-negative. Subtracting ξ^{h_r}(ξ^{f_{i_r}^+} − ξ^{f_{i_r}^−}) from both sides
and using h_r + f_{i_r}^+ = g + f_{i_r}, we obtain an expression for ξ^{g+f_{i_r}} − ξ^{g′} having
length L − 1. By induction, g + f_{i_r} can be connected to g′ in L − 1 steps. The
proof is now complete.
Theorem 2.4 also gives the relation between moves and binomials. If m is
a move, m = m+ − m−, the corresponding binomial is ξ^{m+} − ξ^{m−}, and vice
versa. For example, the move presented above for the 3 × 3 tables

 0  +2   0
+1  −1   0
 0  −1  +1

is represented by the binomial

g_m = ξ12² ξ21 ξ33 − ξ22 ξ32

in the polynomial ring Q[ξ11, ξ12, ξ13, ξ21, ξ22, ξ23, ξ31, ξ32, ξ33].
Thus, in view of Theorem 2.4, in order to compute a Markov basis, one
computes the toric ideal associated to the sufficient statistic and one defines
the moves from the binomials following the above rule.
Note that the theory of Markov bases does not need the notion of Grobner
basis; it is sufficient to consider the ideals and sets of generators. Grobner
bases will be used below as a computational tool. Moreover, the above theory
can be applied to the contingency tables observing that: 1) the (reduced)
Grobner basis of a toric ideal is formed only by binomials; 2) applying the
division algorithm to a monomial with respect to a binomial Grobner basis,
the dividends and the Normal Form are all monomials: this means that at
every step of the algorithm we have a monomial, or equivalently the algebraic
rules transform a contingency table into another contingency table.
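Point 2) can be observed directly: reducing a monomial (a table) modulo a binomial Grobner basis returns a monomial (another table with the same value of the sufficient statistic). A sketch with sympy, using the toric ideal of Example 1.21:

```python
from sympy import symbols, groebner

x1, x2, x3, x4, x5, x6 = symbols('xi1:7')

# the binomial basis of Equation (1.37); it is a Grobner basis in lex order
G = groebner([x4*x5 - x3*x6, x2*x5 - x1*x6, x2*x3 - x1*x4],
             x1, x2, x3, x4, x5, x6, order='lex')

# divide the monomial xi1*xi4 (a "table") by the binomial basis
coeffs, nf = G.reduce(x1*x4)
print(nf)  # the Normal Form is again a monomial: xi2*xi3
```

The reduction ξ1ξ4 → ξ2ξ3 is exactly one application of the move associated to the binomial ξ2ξ3 − ξ1ξ4.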
2.3 Rates of convergence
In general, the Markov chain described in Section 2.2 needs some running time
to reach the stationary distribution.
There are no general results for the rates of convergence of the chain,
but there are a number of specific results in special settings.
Roughly speaking, the rate of convergence depends on the diameter γ of the
random walk, i.e., the diameter of the graph with the points of Ft as vertices,
where two points f and f′ are joined by an edge if and only if f′ can be reached
from f in one step.
Here is an example of such a theorem, taken from Diaconis & Sturmfels
(1998). Consider a sample space of the form {1, . . . , I} × {1, . . . , J}, the row
sums and the column sums as components of the sufficient statistic, and let U
be the uniform distribution on the reference set Ft.
Theorem 2.5 In the previous settings, let K(f, g) be the connected Markov
chain on Ft based on the moves of the form
+1 −1
−1 +1

for any 2 × 2 minor of the table and zero otherwise. Then,

‖K^h(x, f) − U‖_TV ≤ A_1 exp(−A_2 c) (2.17)

for h = cγ², where ‖ · ‖_TV denotes the total variation distance.
Refer to Diaconis & Sturmfels (1998), Section 2.3, for a general survey on
other results and pointers to literature. However, in the case of sample spaces
with a more complex geometrical structure, no results are available.
2.4 Two examples
We conclude this chapter with the illustration of two examples.
Example 2.6 Consider a sample space formed by 6 points, say
X = {a(1), . . . , a(6)}
and consider the 5−dimensional sufficient statistic with components
T1 = F (a(1)) + F (a(2))
T2 = F (a(3)) + F (a(4))
T3 = F (a(5)) + F (a(6))
T4 = F (a(1)) + F (a(3)) + F (a(5))
T5 = F (a(2)) + F (a(4)) + F (a(6))
The matrix representation of this sufficient statistic produces the matrix an-
alyzed in Example 1.26 in Chapter 1. The Grobner basis of the toric ideal
is
{ξ4ξ5 − ξ3ξ6, ξ2ξ5 − ξ1ξ6, ξ2ξ3 − ξ1ξ4}

corresponding to the three moves
(0, 0, 1,−1,−1, 1)
(1,−1, 0, 0,−1, 1)
(1,−1,−1, 1, 0, 0)
We will consider again this example in the next chapter in the framework of
the independence model.
The above example belongs to a special class in which the components of
the sufficient statistic are counts over subsets of the sample space. The
following example is not in that class; in the statistical literature such models
are known as logistic regression models.
Example 2.7 Suppose that the sample space is
X = {(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)} ⊂ N2
and consider the 5-dimensional sufficient statistic with components

T1 = F(1, 1) + F(1, 2)
T2 = F(2, 1) + F(2, 2)
T3 = F(3, 1) + F(3, 2)
T4 = F(1, 2) + F(2, 2) + F(3, 2)
T5 = F(1, 2) + 2F(2, 2) + 3F(3, 2)
Using the matrix representation of this sufficient statistic and computing the
toric ideal associated to that matrix, we obtain the following Gröbner basis

{ξ11 ξ22² ξ31 − ξ12 ξ21² ξ32}

corresponding to a Markov basis with only one move

+1 −1
−2 +2
+1 −1
Chapter 3
Two-way contingency tables
In the present chapter, we restrict our attention to two-way contingency ta-
bles. In particular, we consider the application of the algorithm presented in
the previous chapter to a wide class of log-linear models. Throughout this
chapter we will present some new results which make explicit the computation
of Markov bases for some classical log-linear models for two-way contingency
tables, such as quasi-independence and quasi-symmetry. For a discussion of
the meaning of the different log-linear models in the two-way case, the reader
can refer to Fienberg (1980), Fingleton (1984) and Agresti (2002), where a
number of applications of log-linear models to problems from
biology, psychology and econometrics are presented.
The notation is the same as in the previous chapters, except for the cell
frequencies, where we use the standard notation of contingency table
analysis, e.g. we write fij instead of f(i, j). This choice also
saves space in the proofs.
In Section 3.1 we adapt the notation of Chapters 1 and 2 to the framework
of the contingency tables, starting with the independence model for complete
tables, as it represents an easy prototype, both from the algebraic and statisti-
cal point of view. In Section 3.2 we present new characterizations or computa-
tions of the Markov bases for the most widely used models for two-way contin-
gency tables, including models for tables with structural zeros. In Section 3.3
we analyze three examples of goodness of fit tests for the quasi-independence
and the quasi-symmetry models. The first one is a standard case where the
asymptotic approximation is accurate; the second one is a case where the Monte
Carlo p-value leads to a different decision than the asymptotic p-value; the third
one, from Agresti (1996), shows a case where the asymptotics dramatically fail.
Finally, Section 3.4 shows the application of the methodology presented here
to a class of problems which frequently arise in biological and medical
research.
3.1 Two-way contingency tables: Independence
In this Section we consider I×J contingency tables formed by considering the
counts generated from two categorical variables.
Let us denote the observed count for the (i, j) cell by fij and the totals
for the ith row and jth column by fi+ and f+j respectively. Finally, let
N = ∑_{i=1}^I fi+ = ∑_{j=1}^J f+j be the sample size. We emphasize that this notation is
easily extendable to the multi-way contingency tables and also the topics of
this Section can be extended to the multi-way case. We will see examples of
log-linear models for multi-way contingency tables in Chapter 4.
Using the same notation as in Chapters 1 and 2, the sample space is X =
{(i, j) : 1 ≤ i ≤ I, 1 ≤ j ≤ J} and the vectors r = (f1+, . . . , fI+) and
c = (f+1, . . . , f+J) are the components of the sufficient statistic T . In the
independence case, the probability model can be written as

∏_{i=1}^I pr(i)^{fi+} ∏_{j=1}^J pc(j)^{f+j} ,     (3.1)

where pr(i) and pc(j), with i = 1, . . . , I and j = 1, . . . , J , are the parameters
of the marginal distributions.
Each margin, as a component of the sufficient statistic, is additive, and so
this probability model is consistent with the theory above. If we denote the set
of all contingency tables with fixed margins by F(r,c), the following result holds
true.
Proposition 3.1 The probability distribution on F(r,c) is the hypergeometric
distribution, and its explicit formula is

H(r,c)(f) = ∏_{j=1}^J C(f+j ; f1j, . . . , fIj) / C(N ; f1+, . . . , fI+) ,     (3.2)

where C(n ; k1, . . . , km) = n!/(k1! · · · km!) denotes the multinomial coefficient.
Proof. The probability of (f11, . . . , f1J) in the first row of the table is

C(f+1, f11) · · · C(f+J, f1J) / C(N, f1+) = ∏_{j=1}^J C(f+j, f1j) / C(N, f1+) ,

where C(n, k) denotes the binomial coefficient,
and by straightforward computation the result is proved.
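Formula (3.2) is easy to evaluate numerically. A minimal sketch in Python (assuming Python 3.8+ for math.comb and math.prod; function names are illustrative), writing the multinomial coefficient as a product of binomials:

```python
from math import comb, prod

def multinomial(n, ks):
    """Multinomial coefficient n! / (k1! ... km!), with sum(ks) == n,
    computed as a product of binomial coefficients."""
    out, rest = 1, n
    for k in ks:
        out *= comb(rest, k)
        rest -= k
    return out

def hypergeometric_prob(table):
    """Probability (3.2): product over columns j of the multinomial
    coefficient (f_+j; f_1j, ..., f_Ij), divided by (N; f_1+, ..., f_I+)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    num = prod(multinomial(cols[j], [r[j] for r in table])
               for j in range(len(cols)))
    den = multinomial(sum(rows), rows)
    return num / den

# The two tables with margins r = (1, 1) and c = (1, 1) each get mass 1/2:
assert hypergeometric_prob([[1, 0], [0, 1]]) == 0.5
assert hypergeometric_prob([[0, 1], [1, 0]]) == 0.5
```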
As it is difficult to count the tables in F(r,c) or to enumerate them
completely, we approximate the distribution of a generic test
statistic C conditional on the sufficient statistic using the algorithm described
in Chapter 2.
Recalling the definition of Markov basis, see Chapter 2, Definition 2.1, it is
clear that the Markov basis for the independence model has to be formed by
moves with null margins. Moreover, the set of moves must give a connected
Markov chain.
Then, using this Markov basis, one can apply the algorithm described in
the previous chapter in order to obtain a Monte Carlo approximation of the
distribution of the test statistic.
The likelihood of any I × J contingency table f can be written in the monomial
form

ξ11^{f11} · · · ξIJ^{fIJ} ,     (3.3)

where the ξij’s are functions of the model parameters. The sufficient statistic
in the independence case is a map

T : N^{I×J} −→ N^{I+J}     (3.4)
f ↦ (f1+, . . . , fI+, f+1, . . . , f+J)     (3.5)

Then, for I × J contingency tables the ring homomorphism of interest is

π : K[ξ11, . . . , ξIJ ] −→ K[y1+, . . . , yI+, y+1, . . . , y+J ]     (3.6)
ξij ↦ yi+ y+j     (3.7)
The following result characterizes a well-known object, namely the set of 2 × 2
minors, as the Gröbner basis of the toric ideal I associated to π.
Theorem 3.2 Given any term-ordering τ , the reduced Gröbner basis of I is

Gτ = {ξil ξjk − ξik ξjl : 1 ≤ i < j ≤ I, 1 ≤ k < l ≤ J}     (3.8)
Proof. First, we prove that Gτ is a set of generators for ker(π). Every
polynomial of the form ξil ξjk − ξik ξjl is in ker(π), so Ideal(Gτ ) ⊆ ker(π). For
the converse, consider a monic polynomial g ∈ ker(π). Letting ξ^a = LT(g) be the
leading term of g and π(ξ^a) = y^u, there exists in g at least another term ξ^b
such that π(ξ^b) = y^u. The polynomial ξ^a − ξ^b is such that the exponents verify

∑_{i=1}^I aij = ∑_{i=1}^I bij  for all j = 1, . . . , J

and

∑_{j=1}^J aij = ∑_{j=1}^J bij  for all i = 1, . . . , I .

With these constraints, after long but straightforward computations one proves that
ξ^a − ξ^b ∈ Ideal(Gτ ). Now, the polynomial g′ = g − ξ^a + ξ^b is also in ker(π) and
LT(g′) ≺τ LT(g). The result is then proved by applying the same procedure
to g′, which terminates after a finite number of steps.
The proof that Gτ is a Gröbner basis is a simple application of the notion
of syzygy, a standard tool in commutative algebra. The reader can
find the theory of syzygies in Kreuzer & Robbiano (2000), Section 2.3.
The following example is the continuation of Example 2.6 of Chapter 2.
Example 3.3 The reduced Gröbner basis for the 3 × 2 contingency tables
gives the three moves

+1 −1     +1 −1      0  0
−1 +1      0  0     +1 −1
 0  0     −1 +1     −1 +1
This example can be used to show that the standard analysis of contingency
tables by means of the theory of vector spaces alone is not sufficient to
find a Markov basis, see also Pistone et al. (2001a). The matrix representation
of the homomorphism T in Equation (3.5) is

S1 =
1 1 0 0 0 0
0 0 1 1 0 0
0 0 0 0 1 1
1 0 1 0 1 0
0 1 0 1 0 1
     (3.9)
Note that we use the notation S1 in order to preserve here the notation of
the statistical literature, but the matrix S1 is just the matrix A introduced in
Chapter 1, Section 1.2.3.
We extend its action from N^6 to R^6, viewing S1 as a linear map from R^6 to
R^5, and compute its kernel through the system of linear equations

a11 + a12 = 0
a21 + a22 = 0
a31 + a32 = 0
a11 + a21 + a31 = 0
a12 + a22 + a32 = 0
     (3.10)
As the system has rank 4, the kernel has dimension 2. Two solutions for this
system are
(a11, a12, a21, a22, a31, a32) = (1,−1, 0, 0,−1, 1) , (3.11)
(a11, a12, a21, a22, a31, a32) = (0, 0, 1,−1,−1, 1) . (3.12)
Then, writing Z1 = S1^T, a basis of the orthogonal complement of the columns
of Z1 is given by

Z2 =
 1  0
−1  0
 0  1
 0 −1
−1 −1
 1  1
     (3.13)
but the polynomials ξ11ξ32 − ξ12ξ31 and ξ21ξ32 − ξ22ξ31 are not a Markov basis.
For example, starting from the table

3 5
6 4
0 0
     (3.14)

the Markov chain based on the two moves

+1 −1      0  0
 0  0     +1 −1
−1 +1     −1 +1

does not produce any other table. In fact, the polynomial ξ11ξ22 − ξ12ξ21
represents a non-trivial move, but it does not lie in the ideal Ideal(ξ11ξ32 −
ξ12ξ31, ξ21ξ32 − ξ22ξ31). Although this table is statistically trivial, this example
allows us to write the homomorphism explicitly in a simple case. Other
models provide more interesting examples, as we will see in Section 3.3.
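The failure is easy to check mechanically. In the sketch below (plain Python, names illustrative), the table (3.14) and the two moves are written as flat vectors over the cells (1,1), (1,2), (2,1), (2,2), (3,1), (3,2):

```python
# Table (3.14) and the two kernel-basis moves, as flat vectors.
table = (3, 5, 6, 4, 0, 0)
moves = [(1, -1, 0, 0, -1, 1), (0, 0, 1, -1, -1, 1)]

def neighbours(f, moves):
    """Tables reachable in one step: f + eps * m, eps in {+1, -1},
    keeping only results with nonnegative entries."""
    out = []
    for m in moves:
        for eps in (1, -1):
            g = tuple(x + eps * y for x, y in zip(f, m))
            if min(g) >= 0:
                out.append(g)
    return out

# Both moves touch the zero third row, so every step would create a
# negative entry: the chain started at this table never moves.
assert neighbours(table, moves) == []
```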
Using the saturation-based algorithm presented in Chapter 1, Section 1.2.4,
the Markov basis can be found through the computation of the Gröbner basis
of the ideal

Ideal(ξ11ξ32 − ξ12ξ31, ξ21ξ32 − ξ22ξ31, hu − 1) ∩ K[ξ11, ξ12, ξ21, ξ22, ξ31, ξ32]     (3.15)

where h is the product of all the ξ indeterminates and u is a new auxiliary indeterminate.
This operation leads to the three moves described above. Refer to Chapter 1
for details on the computation of toric ideals as well as for a discussion on the
symbolic software we use.
Finally, note that in the independence case the Markov basis consists of all
tables of the type

+1 −1
−1 +1

for any 2 × 2 minor of the table and zero otherwise. Then, the computation of
the ratio H(r,c)(f + m)/H(r,c)(f) in the rejection probability of the Metropolis–
Hastings algorithm is again simplified. In fact, referring to Equation (2.12),
we obtain

H(r,c)(f + m)/H(r,c)(f) = ∏_{i,j : mij ≠ 0} fij! / (fij + mij)!     (3.16)
  = ∏_{i,j : mij = +1} fij! / (fij + 1)! · ∏_{i,j : mij = −1} fij! / (fij − 1)!     (3.17)
  = ∏_{i,j : mij = +1} (fij + 1)^{−1} · ∏_{i,j : mij = −1} fij     (3.18)

and only four numbers are involved.
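Putting things together, one step of the Metropolis–Hastings chain for the independence model can be sketched as follows (a hedged sketch in Python; names are illustrative, and the proposal picks a random 2 × 2 minor with a random sign):

```python
import random

def mh_step(f, rng):
    """One Metropolis-Hastings step on the tables with the margins of f.
    Proposes a basic move (+1/-1 on a random 2x2 minor) and accepts with
    probability min(1, ratio), the ratio being computed as in (3.18)."""
    i, j = rng.sample(range(len(f)), 2)
    k, l = rng.sample(range(len(f[0])), 2)
    eps = rng.choice((1, -1))
    plus = [(i, k), (j, l)] if eps == 1 else [(i, l), (j, k)]
    minus = [(i, l), (j, k)] if eps == 1 else [(i, k), (j, l)]
    if any(f[a][b] == 0 for a, b in minus):
        return f  # the move would create a negative entry: reject
    ratio = 1.0
    for a, b in plus:
        ratio /= f[a][b] + 1  # factor (f_ab + 1)^(-1) from (3.18)
    for a, b in minus:
        ratio *= f[a][b]      # factor f_ab from (3.18)
    if rng.random() >= min(1.0, ratio):
        return f
    g = [row[:] for row in f]
    for a, b in plus:
        g[a][b] += 1
    for a, b in minus:
        g[a][b] -= 1
    return g

# The margins are invariant along the chain:
rng = random.Random(0)
f = [[3, 5], [6, 4], [1, 2]]
r, c = [sum(x) for x in f], [sum(x) for x in zip(*f)]
for _ in range(200):
    f = mh_step(f, rng)
assert [sum(x) for x in f] == r and [sum(x) for x in zip(*f)] == c
```

Only the four changed cells enter the acceptance ratio, exactly as noted above.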
3.2 Some models for incomplete and square tables
We have considered in the previous sections the independence model as it
represents an easy prototype. We analyze now the Markov bases for other
models, with special attention to the most widely used models for two-way
contingency tables. A survey on these models in the framework of the log-
linear models can be found in Haberman (1978).
3.2.1 Incomplete tables
Incomplete tables are contingency tables with structural zeroes. In this case
the sample space X is a subset of {(i, j) : 1 ≤ i ≤ I, 1 ≤ j ≤ J}. A complete
description of a Markov basis for incomplete tables is not easy. For example,
we consider a 3×3 table with the missing entry (1, 3), again with the row sums
and the column sums as components of the sufficient statistic. Using CoCoA
for computing a Gröbner basis of the corresponding toric ideal, we obtain these
five moves in the Markov basis:

+1 −1  •     +1 −1  •      0  0  •
−1 +1  0      0  0  0     +1 −1  0
 0  0  0     −1 +1  0     −1 +1  0

 0  0  •      0  0  •
+1  0 −1      0 +1 −1
−1  0 +1      0 −1 +1
In this case, the Markov basis seems to be obtained from the corresponding
Markov basis of the complete table by deleting the moves which involve the
missing entry. But this is false in general, as the following example shows.
If we consider a 3 × 3 table with all the diagonal cells as missing entries, the
algebraic computation shows that the Markov basis has only one move:

 • +1 −1
−1  • +1
+1 −1  •
In this example, the ring homomorphism of interest is

π : K[ξ12, ξ13, ξ21, ξ23, ξ31, ξ32] −→ K[y1+, y2+, y3+, y+1, y+2, y+3]     (3.19)
ξij ↦ yi+ y+j     (3.20)

with i ≠ j.
3.2.2 The quasi-independence model
For square tables, the quasi-independence model is often useful. The log-linear
form of this model is

log mij = λ + λi^X + λj^Y + δi I{i=j}     (3.21)

where the λi^X’s are the effects of the first variable X with values 1, . . . , I, the
λj^Y ’s are the effects of the second variable Y with values 1, . . . , I and the δi’s
are the effects of the diagonal cells. Here the indicator I{i=j} equals 1 when
i = j and 0 otherwise. The usual constraints of this model are

∑_{i=1}^I λi^X = 0  and  ∑_{j=1}^I λj^Y = 0 .     (3.22)
This model assumes independence except on the diagonal cells, which
are fitted exactly. In this model, the components of the sufficient statistic
are the row sums and the column sums together with the diagonal entries
d = (f11, . . . , fII). As the diagonal entries are themselves components of the
sufficient statistic, the model fits these cells exactly, and then no move of the
Markov basis can modify the diagonal. This observation leads us to the following result.
Proposition 3.4 The reduced Grobner basis of the toric ideal I for the quasi-
independence model is the Grobner basis for the corresponding incomplete table
with missing diagonal entries.
Proof. An equivalent minimal sufficient statistic has components d, r =
(f1+, . . . , fI+) and c = (f+1, . . . , f+I), where fi+ = fi1 + . . . + f_{i,i−1} + f_{i,i+1} +
. . . + fiI is the i-th row sum with the i-th diagonal cell excluded,
and the same definition holds for the column sums f+j. With this sufficient
statistic, the equations defining the kernel of π are

ξij = yi+ y+j  for i, j = 1, . . . , I, i ≠ j     (3.23)
ξii = fii  for i = 1, . . . , I     (3.24)

A Gröbner basis G1 of the ideal J1 = Ideal(ξij − yi+ y+j , i, j = 1, . . . , I, i ≠ j)
is the Gröbner basis coming from the incomplete table with missing diagonal
entries. A Gröbner basis G2 of the ideal J2 = Ideal(ξii − fii, i = 1, . . . , I) is
{ξii − fii, i = 1, . . . , I}.
Now, observe that the first group of equations involves only the
indeterminates ξij, i ≠ j, yi+ and y+j, i, j = 1, . . . , I, whereas the second group of
equations involves the indeterminates ξii and fii, i = 1, . . . , I. As a
consequence, it is a standard fact in polynomial algebra that a Gröbner basis of the
ideal J = J1 + J2 is G1 ∪ G2.
As we have previously seen, the Gröbner basis of the toric ideal I = J ∩
K[ξij : i ≠ j] is obtained from G1 ∪ G2 by deleting the polynomials involving
the y and f indeterminates. So, the polynomials in G2 are all deleted
and the result is proved.
As the computation of a Gröbner basis of a toric ideal is computationally
very intensive and contingency tables require a large number of indeterminates
and equations, the above result is especially useful from a computational point
of view, as it reduces the number of equations used in the computation. Moreover,
note that the above proposition can be extended to other models where some
entries are components of the sufficient statistic.
3.2.3 The symmetry model
Following the notation of log-linear models, the symmetry model can be
written in the form

log mij = λ + λi + λj + λij     (3.25)

with λij = λji for all i, j = 1, . . . , I and with the constraint ∑_{i=1}^I λi = 0. This
model implies marginal homogeneity. The sufficient statistic is s = (sij =
fij + fji : 1 ≤ i ≤ j ≤ I).
The Markov basis for the symmetry model is characterized by the following
result.
Theorem 3.5 Given any term-ordering τ , the reduced Gröbner basis of I for
the symmetry model is

Gτ = {ξij − ξji : 1 ≤ i < j ≤ I}     (3.26)

Proof. Writing the kernel of the ring homomorphism explicitly, we obtain
a polynomial system which contains the pairs of equations

ξij = sij ,  ξji = sij     (3.27)

for all 1 ≤ i < j ≤ I, and the equations

ξii = sii     (3.28)

for all i = 1, . . . , I. As the corresponding polynomials are a Gröbner basis,
straightforward elimination of the sij indeterminates gives the result.
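The moves ξij − ξji are transpositions: add 1 to cell (i, j) and subtract 1 from cell (j, i). A minimal check in Python (names illustrative) that such a step leaves the statistic s unchanged:

```python
def symmetry_stat(table):
    """Sufficient statistic of the symmetry model:
    s_ij = f_ij + f_ji for 1 <= i <= j <= I."""
    I = len(table)
    return [table[i][j] + table[j][i] for i in range(I) for j in range(i, I)]

# The move corresponding to xi_12 - xi_21: +1 in cell (1,2), -1 in (2,1).
f = [[2, 3, 0], [1, 4, 5], [2, 0, 1]]
g = [row[:] for row in f]
g[0][1] += 1
g[1][0] -= 1
assert symmetry_stat(g) == symmetry_stat(f)
```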
3.2.4 The quasi-symmetry model
The quasi-symmetry model has the log-linear form

log mij = λ + λi^X + λj^Y + λij     (3.29)

with λij = λji for all i, j = 1, . . . , I and with the constraints

∑_{i=1}^I λi^X = 0  and  ∑_{j=1}^I λj^Y = 0 .     (3.30)
This model is more general than the previous one, as it does not imply marginal
homogeneity. The components of the sufficient statistic are the row sums, the
column sums and s = (sij = fij + fji : 1 ≤ i ≤ j ≤ I). In this model the
Markov basis can be computed with CoCoA. For example, for a 4 × 4 table,
the seven moves are
 0  0  0  0      0  0 −1 +1      0 −1  0 +1
 0  0 −1 +1      0  0  0  0     +1  0  0 −1
 0 +1  0 −1     +1  0  0 −1      0  0  0  0
 0 −1 +1  0     −1  0 +1  0     −1 +1  0  0

 0  0 −1 +1      0 −1  0 +1      0 +1 −1  0
 0  0 +1 −1     +1  0 −1  0     −1  0  0 +1
+1 −1  0  0      0 +1  0 −1     +1  0  0 −1
−1 +1  0  0     −1  0 +1  0      0 −1 +1  0

 0 −1 +1  0
+1  0 −1  0
−1 +1  0  0
 0  0  0  0
It should be noted that the Markov basis is easily computable also in
models where the maximum likelihood equations are not in explicit form and
a numerical method, such as the Newton-Raphson method, must be used to
calculate the maximum likelihood estimate, see Chapter 1.
3.2.5 Computational notes
The application of the above theory leads to two computational problems.
First, the computation of the Gröbner bases of toric ideals involves a large
number of indeterminates, and this number grows quickly with the dimensions
of the table. For example, the computation of the Gröbner basis of
the toric ideal in the independence model for complete I × J tables involves
IJ + I + J indeterminates.
For other models the number of indeterminates is even greater. For example,
the quasi-symmetry model for an I × I table involves I² + 2I + I(I + 1)/2
indeterminates. The most common symbolic software packages, such as CoCoA
or Maple, run into computational problems if the number of indeterminates
is too high. Thus, for large tables, the theorems of the previous sections are
essential, as they provide the Gröbner basis without explicit computation.
However, we have found in our computations that CoCoA, with its
function Toric for the computation of toric ideals, is the package which handles
the largest number of indeterminates.
Second, the number of moves increases quickly when I and J increase.
Considering once again the independence model, the number of polynomials of the
Gröbner basis, or equivalently the number of moves of the Markov basis, is
C(I, 2) · C(J, 2), where C(n, 2) = n(n − 1)/2. In Table 3.1, we report the number
of moves of the Markov basis for the independence model.
I\J    2    3    4    5    6    7    8    9   10
 2     1    3    6   10   15   21   28   36   45
 3     3    9   18   30   45   63   84  108  135
 4     6   18   36   60   90  126  168  216  270
 5    10   30   60  100  150  210  280  360  450
 6    15   45   90  150  225  315  420  540  675
 7    21   63  126  210  315  441  588  756  945
 8    28   84  168  280  420  588  784 1008 1260
 9    36  108  216  360  540  756 1008 1296 1620
10    45  135  270  450  675  945 1260 1620 2025
Table 3.1: Number of moves for the independence model.
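The entries of Table 3.1 can be reproduced by enumerating the moves of Theorem 3.2 directly. A minimal sketch in Python (names illustrative):

```python
from itertools import combinations
from math import comb

def independence_moves(I, J):
    """All moves of the Markov basis for the independence model on I x J
    tables: +1/-1 on a 2x2 minor, zero elsewhere (Theorem 3.2)."""
    moves = []
    for i, j in combinations(range(I), 2):
        for k, l in combinations(range(J), 2):
            m = [[0] * J for _ in range(I)]
            m[i][k] = m[j][l] = 1
            m[i][l] = m[j][k] = -1
            moves.append(m)
    return moves

# The count is C(I,2) * C(J,2), in agreement with Table 3.1.
assert len(independence_moves(3, 2)) == 3
assert len(independence_moves(10, 10)) == comb(10, 2) * comb(10, 2)  # 2025
```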
In this case, the main problem is how to store the results of the symbolic
computations in matrix form in order to use them with statistical
software.
3.3 Examples
In this section we present three different examples of application of the Dia-
conis-Sturmfels algorithm to goodness of fit tests for log-linear models.
Example 3.6 Consider the categorical data in Fingleton (1984), page 142,
reported in Table 3.2. These data describe the place of residence of a
subgroup of British migrants in 1966 and 1971. The data are simplified in
order to consider only four places, labelled A, B, C and D. For the quasi-
symmetry model, the Pearson statistic is C = 2.5826. If we compare this
value with the chi-squared distribution with (I − 1)(I − 2)/2 = 3 degrees of
freedom, we obtain an asymptotic approximate p-value of 0.4607.
Running the Markov chain based on the seven moves found in Section 3.2
for the quasi-symmetry model, with 50,000 burn-in steps and sampling every
50 steps for a total of 10,000 values, the Monte Carlo approximate p-value
is 0.4714. As the sample size is large, the classical asymptotic approximation
works well and the two approximations are similar.
            1971
1966      A      B      C      D
A       118     12      7     23
B        14   2127     86    130
C         8     69   2548    107
D        12    110     88   7712

Table 3.2: British migrants data.
The sample size is chosen in order to have a 95%-confidence interval for
the exact p-value shorter than 0.02. In fact, assuming near independence,
the width of the 95%-confidence interval is 2 · 1.96 · √(p(1 − p)/B), where p is
the estimated p-value and B is the size of the Monte Carlo sample. In the
inequality 2 · 1.96 · √(p(1 − p)/B) < 0.02, the left-hand side has a maximum at
p = 1/2, so the inequality reduces to 2 · 1.96 · √(1/(4B)) < 0.02. From the last
inequality we easily obtain B > 9,604.
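This worst-case sample-size bound is a one-line computation; a sketch (names illustrative):

```python
def min_mc_samples(width, z=1.96):
    """Worst-case bound for the Monte Carlo sample size B: the 95% CI
    width 2*z*sqrt(p(1-p)/B) is maximised at p = 1/2, and requiring it
    to stay below `width` gives B > (z / width)**2."""
    return (z / width) ** 2

# Reproduces the bound in the text: B > 9604 for a width of 0.02.
assert round(min_mc_samples(0.02)) == 9604
```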
Example 3.7 As a second example, we present a quasi-independence model
for complete tables. We analyze the medical data in Agresti (1996), page 242
and reported in Table 3.3.
        Y
X       1    2    3    4
1      22    2    2    0
2       5    7   14    0
3       0    2   36    0
4       0    1   17   10

Table 3.3: Carcinoma data.
This table shows ratings by two pathologists, X and Y , who separately
classified 118 slides regarding the presence and extent of carcinoma of the
uterine cervix. The rating scale has four ordered categories, 1, 2, 3 and 4.
Under the quasi-independence model, the Pearson statistic is C = 11.5, leading
to an asymptotic approximate p-value of 0.0423. Applying the algebraic MCMC
algorithm as in the previous example, the Monte Carlo approximate p-value
is 0.0080. If we perform a 1%-level goodness of fit test, the conclusions
differ between the two approaches.
Example 3.8 Finally, we consider again the data of the previous example,
but under the quasi-symmetry model. The Pearson statistic for this table
under the quasi-symmetry model is C = 0.6348. Agresti (1996) compares this
value with the chi-squared distribution with 2 degrees of freedom, obtaining
an asymptotic approximate p-value of 0.7280. Running the Markov chain as
above, the Monte Carlo approximate p-value is exactly 1, and in this case the
two approximations are not similar. We used the definition of the p-value as the
smallest significance level at which the model is rejected with probability 1.
But there is more. Using the seven moves of the Markov basis, we can
compute for this table the exact distribution of C under the model. In fact,
starting from the original table (denoted here by f1), only the last move

 0 −1 +1  0
+1  0 −1  0
−1 +1  0  0
 0  0  0  0
with negative sign, can be applied and this move gives the table f2 below.
Again, only the last move can be used: if we choose the positive sign we come
back to the table f1, whereas if we choose the negative sign we obtain the table
f3.
f2 =
22  3  1  0
 4  7 15  0
 1  1 36  0
 0  1 17 10

f3 =
22  4  0  0
 3  7 16  0
 2  0 36  0
 0  1 17 10
     (3.31)
Now, only the last move can be applied, with positive sign, coming back to
table f2.
There are only three tables with non-negative entries and with the appro-
priate values of the sufficient statistic and it is not difficult to compute the
corresponding hypergeometric probabilities. Applying the formula (3.16), we
obtain the equations H(f2)/H(f1) = 4/9 and H(f3)/H(f2) = 1/32, together
with the constraint H(f1) +H(f2) +H(f3) = 1. Solving this simple system,
we find H(f1) = 72/105, H(f2) = 32/105, H(f3) = 1/105. The Pearson
statistic values for these three tables are C(f1) = 0.6348, C(f2) = 1.8405,
C(f3) = 12.3206.
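The little linear system above can be solved in exact arithmetic; a sketch in Python using fractions (names illustrative):

```python
from fractions import Fraction

# Ratios along the two applications of the move, from formula (3.16):
r12 = Fraction(4, 9)   # H(f2)/H(f1)
r23 = Fraction(1, 32)  # H(f3)/H(f2)

# Normalisation H(f1) + H(f2) + H(f3) = 1:
h1 = 1 / (1 + r12 + r12 * r23)
h2, h3 = r12 * h1, r23 * r12 * h1
assert (h1, h2, h3) == (Fraction(72, 105), Fraction(32, 105), Fraction(1, 105))

# Exact p-value for the observed table f1: C(f1) = 0.6348 is the smallest
# of the three values of the Pearson statistic, so every table in the
# reference set satisfies C >= C(f1) and the exact p-value is 1.
C = [0.6348, 1.8405, 12.3206]
p_value = sum(h for h, c in zip((h1, h2, h3), C) if c >= C[0])
assert p_value == 1
```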
Moreover, following Example 3.3, the matrix S1 for this model has size 18 × 16
and its kernel has dimension 3. A basis of the kernel as a vector space is then
given by three linearly independent moves, for instance the first three. If
the last move is excluded, the Markov chain starting from f1 cannot reach the
other two tables.
3.4 An application. The h-sample problem
In this section we present a first application of the above theory, in particular
of the exact test for the independence model.
We want to compute exact p-values for tests on the equality of h multinomial
distributions, through the test of independence on h × k tables. It is
worth noting that when the p-value is not computed exactly, classical
approximations based on the asymptotic chi-squared distribution of the test statistic
are used. As for log-linear models, asymptotic approximations are inaccurate in
many situations, and the problem of computing exact p-values has been
widely considered in the recent literature.
Mehta & Patel (1983) provide a network algorithm to inspect the orbit of
h × k contingency tables with fixed margins. However, the entire set of tables is
often too large, and consequently approximation methods have been proposed.
Baglivo et al. (1992) used an FFT algorithm to compute the exact p-value on
h × k tables with fixed margins. Hirji & Johnson (1996) compared the two
algorithms above and concluded that the first is superior to the second
with respect to computing speed and accuracy.
The algorithm proposed by Mehta et al. (1988) provides an approximation
of the exact p-value; it is based on a complicated Monte Carlo method
which generates random tables whose statistics belong to the critical
region of the test.
Recently, Strawderman & Wells (1998) used saddlepoint approximations to
the exact distribution of the conditional maximum likelihood estimate (MLE)
for several 2 × 2 tables, providing approximate p-values, power calculations
and confidence intervals. Booth & Butler (1999) proposed a Monte Carlo
importance sampling for conducting exact conditional tests in log-linear models.
Moreover, Mehta et al. (2000) proposed a Monte Carlo method for
approximating the exact p-values in conditional logistic regression and foresee the
applicability of algebraic methods in this framework. The main problem
in Mehta et al. (1988), and in most of the cited works, was how to generate
random tables with fixed margins. Here we give a new solution to the h-sample
problem by applying the technique described in Diaconis & Sturmfels (1998), and
the computer program can be written in a few lines of code. Moreover,
we provide a solution not only for 2 × k tables but, more generally, for
h × k tables.
The h-sample problem can be stated as follows. Let X1, . . . , Xh be random
variables with values in {1, . . . , k} defined on h different groups, and consider
h samples drawn from these groups: (X1,1, . . . , X1,n1) from the multinomial
distribution D1, . . ., (Xh,1, . . . , Xh,nh) from the multinomial distribution Dh.
Let N = ∑_{i=1}^h ni be the total sample size. The distribution Di has parameters
pi1, . . . , pik, for i = 1, . . . , h, with the constraints pij ≥ 0 and ∑_{j=1}^k pij = 1 for
all i = 1, . . . , h. The usual test in this situation is the test for equality of
proportions, where the null hypothesis is

H0 : p1j = . . . = phj  for j = 1, . . . , k     (3.32)
against the composite alternative hypothesis of different proportions.
The components of the sufficient statistic are the sums of the observations
in each group and the sums of the observations for each possible value of the
variables, that is

S = ( ∑_{u,j} I{X1,u = j} , . . . , ∑_{u,j} I{Xh,u = j} , ∑_{i,u} I{Xi,u = 1} , . . . , ∑_{i,u} I{Xi,u = k} )     (3.33)

for i = 1, . . . , h and j = 1, . . . , k, where IA is the indicator function of A. Of
course, the sample can be summarized in a contingency table with h rows and
k columns; we denote this table by F. In other words, Fij = ∑_{u=1}^{ni} I{Xi,u = j}.
The table F has raw parameters p*ij = pij ni/N. Moreover, we denote by fi+ and
by f+j the row sums and the column sums, respectively. The sufficient statistic
S is represented by the margins of this table, and the maximum likelihood
estimate of the parameters is

p*ij = fi+ f+j / N²  for i = 1, . . . , h , j = 1, . . . , k .     (3.34)
For this hypothesis the most commonly used test statistics are the Pearson
statistic C = ∑_{i,j} (Fij − N p*ij)² / (N p*ij) and the log-likelihood ratio
L = 2 ∑_{i,j} Fij log(Fij / (N p*ij)); see for example Agresti (1996).
There are many nonparametric algorithms, based on Monte Carlo
approximations of the p-value, which do not involve limit theorems. The algorithm
presented here is still valid when the test statistic is of the form

T(F) = ∑_{i,j} aij(Fij) ,     (3.35)
where the aij’s are real-valued functions on the set N of nonnegative integers.
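Both C and L are of the form (3.35). A sketch computing them from an observed table, with fitted counts N p*_ij = f_{i+} f_{+j} / N (names illustrative):

```python
from math import log

def pearson_and_lr(table):
    """Pearson statistic C and log-likelihood ratio L for the test of
    equality of h multinomials, with fitted counts f_{i+} f_{+j} / N."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    N = sum(rows)
    C = L = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = rows[i] * cols[j] / N  # fitted count under H0
            C += (f - e) ** 2 / e
            if f > 0:                  # convention 0 * log 0 = 0
                L += 2 * f * log(f / e)
    return C, L

# A table with exactly proportional rows fits H0 perfectly:
assert pearson_and_lr([[1, 2], [2, 4]]) == (0.0, 0.0)
```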
As usual in this framework, we consider a problem of conditional inference,
where only the tables with the same margins as the original table are
relevant. So, we restrict the analysis to the reference set Ft. Using the
algorithm described above with the Markov basis for the independence model,
the approximate Monte Carlo p-value is simply the number of sampled
tables whose value of the test statistic is greater than or equal to that of
the original table, divided by the total number B of sampled tables.
Note that for the two-sample problem there is an algorithm proposed by
Mehta et al. (1988), an importance sampling algorithm based
on the network representation of the reference set Ft. It was also used to
approximate the power of the test, but it is difficult to rewrite it for the h-
sample problem. Moreover, our method allows us to approximate the p-value
directly as stated above, without also having to account for the hypergeometric
probability weights of the sampled tables.
It is interesting to note that Mehta et al. (2000) also considered the
possibility of applying algebraic methods: “An investigation is needed to
establish conditions under which the probability of connectedness of the Markov
chain is high”. But the theorems presented in Diaconis & Sturmfels (1998),
and described in Chapter 2 of this thesis, guarantee that the algebraic algorithm
described above defines a connected Markov chain.
As in this kind of application only the independence model is involved, it
is worth noting that the algebraic step of the algorithm requires no symbolic
computations. In fact, the relevant Markov bases are completely characterized
in Theorem 3.2.
Finally, we emphasize that classical methods, for example Mehta et al.
(1988), sample a few well-chosen tables at great
computational effort. Our method, on the other hand, easily generates tables
from the reference set.
Chapter 4
Rater agreement models
In this chapter we apply the algebraic technique for sampling from condi-
tional distributions presented in Chapter 2 in order to define an exact Markov
Chain Monte Carlo (MCMC) algorithm for conditional hypothesis testing in
the problems of rater agreement. Moreover, we extend the results and the
computations carried out in Chapter 3 to the multi-dimensional case, i.e. to
the case of d-way tables, with d > 2. These new results allow us to compute
Markov bases also in non-trivial cases and some of the most important cases
are presented explicitly.
In Section 4.1, we present the medical framework where models for rater
agreement apply. In Section 4.2 we review the main methods for measuring
rater agreement, with particular attention to the multi-observer case. The
methods considered here are Cohen’s κ and goodness of fit tests for
appropriate log-linear models. In Section 4.3, we describe the conditional in-
ference for these problems, using the Diaconis-Sturmfels algorithm and the
notion of Markov basis, while in Section 4.4 a careful presentation of the con-
ditioning structures arising from complex problems is given. Section 4.5 deals
with the computation of some relevant Markov bases coming from agreement
problems, through some new theoretical results useful for simplifying the sym-
bolic computations. Finally, in Section 4.6, an example on a real data set
shows the practical applicability of the algebraic procedures.
4.1 A medical problem
The material presented in this chapter was inspired by medical research on
medical imaging drugs and biological products, in particular contrast agents;
see CDER (2000). Contrast agents are drugs that improve the visualization of
tissues, organs or physiologic processes by increasing the relative difference of
imaging signal intensities in adjacent parts of the body.
From the statistical point of view, the effectiveness of a contrast agent is
assessed by evaluating, through quantitative statistical studies, the agent’s
ability to provide a better reading of the medical images. The validity
of a new product over existing ones is generally assessed by comparing
radiographs, computed tomography scans and magnetic resonance images taken
with the new product and with the old product, when available.
In this medical setting, the most commonly used statistical models are models for agreement among raters. For the case of two raters the proposed methods are Cohen's κ and similar indices; see Landis & Koch (1975) for a survey. More recently, the application of log-linear models, in particular the quasi-independence model and the quasi-symmetry model, has been proposed; see for example Agresti (1992).
In some particular clinical settings, Government Agencies can require models for the agreement among many raters, or the analysis of samples stratified with respect to other covariates, for example when the study includes subjects that represent the full spectrum of the population. For example, in a study intended to demonstrate the validity of a contrast agent to assess bronchiectasis, the sample should be stratified with respect to age classes of the patients and to the presence or absence of the main related pulmonary disorders, e.g., chronic bronchitis, pneumonia, asthma, cystic fibrosis; see CDER (2000).
In these situations, the use of conditional inference is common practice. In fact, conditional inference seems preferable to an unconditional approach; see Klar et al. (2000) for a discussion and further references. In addition, different structures of conditioning arise in the inferential procedures, as will become clear in the next sections. For example, it is usual to define different indices of agreement for each pair of observers, e.g., different κ's for each pair of observers and for each stratum. As will be clear from the following, such a study can be much more informative than the study of an overall index of agreement for the sample.
4.2 Multidimensional rater agreement models
Suppose that d observers rate the same set of N subjects.
In this section, we briefly review the main methods for measuring rater
agreement, starting with the case of d = 2 observers, say A and B. We present here a wide range of models for binary, nominal and ordinal rating scales, and we denote the set of possible responses by {1, . . . , I}. We do not consider continuous responses, for which parametric or nonparametric ANOVA methods can be applied; see for example Landis & Koch (1975).
The data are summarized in a two-way contingency table. We write π_{ij} = P{A = i, B = j} for the probability that observer A classifies the subject in category i and observer B in category j, with i, j = 1, . . . , I. Moreover, we denote by {π_{i+}, i = 1, . . . , I} and {π_{+j}, j = 1, . . . , I} the marginal distributions for observers A and B, respectively. Note that in the previous chapters the cell probabilities were denoted by p, as usual in the analysis of contingency tables, but in the rater agreement literature the notation π is the standard one.
Probably the most widely used index of agreement for nominal and ordinal categories is Cohen's κ, defined as
$$\kappa = \frac{\sum_{i=1}^{I} \pi_{ii} - \sum_{i=1}^{I} \pi_{i+}\pi_{+i}}{1 - \sum_{i=1}^{I} \pi_{i+}\pi_{+i}} \qquad (4.1)$$
leading to the model
$$\pi_{ij} = (1-\kappa)\,\pi_{i+}\pi_{+j} \quad \text{for } i \neq j \, ,$$
$$\pi_{ii} = (1-\kappa)\,\pi_{i+}\pi_{+i} + \kappa/I \quad \text{for } i = j \, .$$
This index measures observer agreement beyond chance, in the sense that it equals zero if the agreement equals that expected by chance under the independence assumption; see Landis & Koch (1975) for further details, or the original work of Cohen (1960).
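The plug-in computation behind Equation (4.1) can be sketched in a few lines of Python. This is our own illustrative code, not part of the thesis software, and the function name is hypothetical:

```python
import numpy as np

def cohen_kappa(table):
    """Cohen's kappa of Equation (4.1) from an I x I table of counts:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                         # observed proportions pi_ij
    observed = np.trace(p)                  # sum_i pi_ii
    chance = p.sum(axis=1) @ p.sum(axis=0)  # sum_i pi_{i+} pi_{+i}
    return (observed - chance) / (1.0 - chance)

# A diagonal table gives kappa close to 1 (perfect agreement); a rank-one
# table (independent raters) gives kappa close to 0.
```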
Another version of the κ index implies marginal homogeneity, and its definition assumes the form
$$\kappa_h = \frac{\sum_{i=1}^{I} \pi_{ii} - \sum_{i=1}^{I} (\pi_{i+}+\pi_{+i})^2/4}{1 - \sum_{i=1}^{I} (\pi_{i+}+\pi_{+i})^2/4} \, . \qquad (4.2)$$
There are also several versions of weighted κ for ordinal scales; see Landis & Koch (1975) or Fleiss (1981) for details and discussion. As the Markov bases do not change when weights are considered, we restrict our presentation to the case of standard weights, as in (4.1) and (4.2). Moreover, some authors define separate κ's for each pair of responses; see for example Agresti (1992).
An agreement index is a normalized index which equals 1 in the case of perfect agreement and 0 in the case of independence. For d > 2 raters, one can find different versions of Cohen's κ. Here we use the following formula:
$$\kappa(d) = \frac{\binom{d}{2}^{-1}\Bigl\{\sum_{1\le h<k\le d}\ \sum_{i_h=i_k} \pi_{i_1,\ldots,i_h,\ldots,i_k,\ldots,i_d} - \sum_{1\le h<k\le d}\ \sum_{i=1}^{I} \pi_k(i)\pi_h(i)\Bigr\}}{1 - \binom{d}{2}^{-1}\sum_{1\le h<k\le d}\ \sum_{i=1}^{I} \pi_k(i)\pi_h(i)} \qquad (4.3)$$
where $\pi_k(i)$ is the proportion of the i-th level in the k-th marginal distribution; see Landis & Koch (1975).
With the marginal homogeneity hypothesis we obtain the formula
$$\kappa_h(d) = \frac{\binom{d}{2}^{-1}\Bigl(\sum_{1\le h<k\le d}\ \sum_{i_h=i_k} \pi_{i_1,\ldots,i_h,\ldots,i_k,\ldots,i_d}\Bigr) - \sum_{i=1}^{I} \pi(i)^2/d^2}{1 - \sum_{i=1}^{I} \pi(i)^2/d^2} \qquad (4.4)$$
where $\pi(i) = \sum_{h=1}^{d} \pi_h(i)$ is the proportion of ratings in the i-th category by the d observers.
It is known that the maximum likelihood estimates of the κ parameters are obtained simply by substituting the observed proportions for the theoretical proportions in the above formulae. We denote by κ̂ the maximum likelihood estimators.
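The plug-in estimation of κ(d) in Equation (4.3) amounts to averaging observed and chance pairwise agreement over all pairs of observers. A minimal sketch, assuming a d-way numpy array of counts; function and variable names are ours:

```python
import numpy as np
from itertools import combinations

def kappa_d(table):
    """Plug-in estimate of kappa(d) as in Equation (4.3): average pairwise
    observed agreement beyond average pairwise chance agreement, allowing a
    different marginal distribution for each observer."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                       # observed proportions
    d = p.ndim                            # number of observers
    pairs = list(combinations(range(d), 2))
    agree = chance = 0.0
    for h, k in pairs:
        other = tuple(a for a in range(d) if a not in (h, k))
        m = p.sum(axis=other)             # two-way margin of observers h, k
        agree += np.trace(m)              # P(observer h = observer k)
        chance += m.sum(axis=1) @ m.sum(axis=0)   # sum_i pi_h(i) pi_k(i)
    agree /= len(pairs)
    chance /= len(pairs)
    return (agree - chance) / (1.0 - chance)
```

For d = 2 this reduces to the two-observer coefficient of Equation (4.1).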
As the interpretation of Cohen's κ is not always straightforward, see Fleiss (1981), more recently the use of log-linear models for describing observer agreement has been proposed. For the specific application of log-linear models to measuring agreement, Agresti (1992) illustrates the use of the quasi-independence and quasi-symmetry models. In particular, the p-value of the goodness of fit test is used to measure the agreement.
Let {n_{ij} | i, j = 1, . . . , I} and {m_{ij} | i, j = 1, . . . , I} denote the observed cell counts and the expected cell counts, respectively. The quasi-independence model assumes the form
$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \delta_i I_{\{i=j\}} \qquad (4.5)$$
where μ is a global mean parameter, $\lambda_i^A, \lambda_j^B$, i, j = 1, . . . , I, are the marginal effects with the constraints $\sum_{i=1}^{I} \lambda_i^A = 0$, $\sum_{j=1}^{I} \lambda_j^B = 0$, and $\delta_i$, i = 1, . . . , I, are the main diagonal parameters.
The expression of the independence model, which gives the expected agreement by chance, has the form
$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B \, .$$
As pointed out in Agresti (1992), the quasi-independence model decomposes the agreement between the two raters into two parts: a first term representing the agreement by chance, and a second term representing the agreement beyond chance. For further details on the use of the quasi-independence model for measuring rater agreement, see Bishop et al. (1975) or Tanner & Young (1985).
We can also define a simplified quasi-independence model with a unique parameter δ for the main diagonal cells instead of the sequence {δ_i, i = 1, . . . , I}. This makes the fit of the main diagonal unsaturated, and the model has the form
$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \delta I_{\{i=j\}} \, . \qquad (4.6)$$
Another log-linear model to assess rater agreement is the quasi-symmetry model, defined by the equations
$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \lambda_{ij} \qquad (4.7)$$
with the constraints $\sum_{i=1}^{I} \lambda_i^A = 0$, $\sum_{j=1}^{I} \lambda_j^B = 0$, and $\lambda_{ij} = \lambda_{ji}$ for all i, j = 1, . . . , I; see Agresti (1992). Other related models can also be found in Becker (1990).
The extension of the theory of log-linear models to higher-dimensional tables is presented for example in Fienberg (1980). However, as pointed out in the introduction, concepts such as independence and conditional independence have been studied extensively in the statistical literature, while the notions of quasi-independence, symmetry and quasi-symmetry are still underdeveloped. Some models of quasi-independence for many raters can be found in Tanner & Young (1985). A generalization of the quasi-symmetry model to higher-order tables is presented in Rudas & Bergsma (2002) for marginal log-linear models. Roughly speaking, marginal models are generalizations of classical log-linear models defined by restrictions on appropriate marginal distributions of the table; see Bergsma & Rudas (2002) for details.
We consider here 4 models. First, the quasi-independence model given by
$$\log m_{i_1 \ldots i_d} = \lambda + \lambda_{i_1}^{A_1} + \ldots + \lambda_{i_d}^{A_d} + \delta_i I_{\{i_1 = \ldots = i_d\}} \qquad (4.8)$$
see Tanner & Young (1985). From this model we can also generalize the simplified quasi-independence model by setting δ_i = δ for all i = 1, . . . , I.
Tanner & Young (1985) also proposed a non-homogeneous quasi-independence model, defined by measuring the agreement among the $\binom{d}{k}$ subsets of k raters. The model is specified by the equation
$$\log m_{i_1 \ldots i_d} = \lambda + \lambda_{i_1}^{A_1} + \ldots + \lambda_{i_d}^{A_d} + \delta_{i_1 \ldots i_d} \qquad (4.9)$$
with
$$\delta_{i_1 \ldots i_d} = I_1 \delta_1 + \ldots + I_{\binom{d}{k}} \delta_{\binom{d}{k}}$$
where $I_g = 1$ for those cells corresponding to agreement among the k observers of the g-th k-tuple of observers, and $I_g = 0$ otherwise. In particular, when k = 2 this model measures the pairwise agreement.
The generalization of the quasi-symmetry model is less straightforward. The model in Equation (4.10) below is a prototype and generalizes a model presented in Rudas & Bergsma (2002):
$$\log m_{i_1 \ldots i_d} = \lambda + \lambda_{i_1}^{A_1} + \ldots + \lambda_{i_d}^{A_d} + \sum_{1 \le h < k \le d} \lambda_{i_k i_h}^{A_k A_h} \qquad (4.10)$$
with $\lambda_{i_k i_h}^{A_k A_h} = \lambda_{i_h i_k}^{A_h A_k}$ for all h, k = 1, . . . , d. In the same paper other models are introduced. Our choice is motivated by the fact that in medical applications the hypothesis of pairwise agreement is often more useful than an overall quasi-symmetry condition on the d observers. More restrictive quasi-symmetry models can be defined by adding to the previous model other parameters for higher-order symmetries.
In some clinical studies the repeated evaluation of the images by the same reader is of great interest. In fact, medical studies usually require the assessment of both the inter-reader variability and the intra-reader variability. In particular, intra-reader agreement models are useful for the study of category distinguishability and for a precise definition of the rating scales, see Darroch & McCloud (1986), where quantitative measures are welcome, see CDER (2000). However, the resulting statistical structure and modeling for intra-reader agreement is the same as in the inter-reader setting above.
Finally, when a truth standard is available, the analysis usually concerns non-square tables, in most cases 2 × I tables. In such situations, the statistical analysis is usually performed using Receiver Operating Characteristic (ROC) curves; see for example Hanley (1998). We do not consider this application here, limiting ourselves to square tables.
4.3 Exact inference
In this section we introduce the application of the algebraic algorithm for sampling from conditional distributions described in Chapter 2 in order to define exact inferential procedures for Cohen's κ and for the log-linear models presented in the previous section. As mentioned above, we define our procedures in the framework of conditional inference.
In the recent literature, some methods for estimating κ indices with non-asymptotic techniques are referred to as generalized estimating equations methods; see Klar et al. (2000) and Williamson et al. (1994) for details. In the latter paper, a simple simulation study is given. This approach is a generalization of Cohen's κ, and it is useful in particular when each subject is rated by a different number of observers. Other studies approach the problem from the perspective of latent class analysis; see Guggenmoos-Holzmann & Vonk (1998).
Some procedures for hypothesis testing on κ have recently been developed, see for example Donner et al. (2000), but they are limited to two raters or dichotomous variables. Other works on the same subject can be found in Donner & Klar (1996). Note that under the conditional approach the distribution of the estimator κ̂ is difficult to compute; in fact, even its support depends on the observed marginal distributions.
Here, we use the algebraic algorithms presented previously in order to define
appropriate MCMC approximations of the test statistics. Note that specific
work on the quasi-independence model can also be found in Smith et al. (1996).
The method we present here applies to general multi-dimensional tables, allowing both more than 2 levels of the rating scale and more than 2 raters. Let X = {1, . . . , I}^d be the sample space. Denote the cell probabilities by π and the sample proportions by p. Moreover, let f be a generic contingency table, that is, a function f : X −→ N.
As discussed in the previous chapters, for log-linear models the goodness of fit tests are performed conditionally on the sufficient statistics, whereas in the κ approach the distribution of the estimator κ̂ is computed conditionally on the one-dimensional margins of the table. These statements are well known in the two-observer case. In the multi-observer case the situation is less well understood, and it will be discussed extensively later.
We call conditioning statistic the function T : X −→ N^s under which we
compute the conditional distributions. In the log-linear case this is equivalent to the notion of sufficient statistic, but in a general approach that terminology could lead to misunderstandings, especially when we compute conditional distributions of the estimators.
Now, the p-values and the conditional distributions of the estimators can be computed by applying the algorithm described in Chapter 2.
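To fix ideas, the following sketch shows how such a chain can look in the simplest case, the independence model on a two-way table, using the 2 × 2 basic moves as Markov basis and a Metropolis step targeting the hypergeometric distribution. This is our own simplified Python rendering of the Chapter 2 algorithm, with no burn-in or thinning and illustrative names and parameters:

```python
import numpy as np
from math import lgamma
from itertools import combinations

rng = np.random.default_rng(0)

def basic_moves(I, J):
    """Degree-2 moves (2x2 swaps) preserving row and column sums."""
    moves = []
    for i1, i2 in combinations(range(I), 2):
        for j1, j2 in combinations(range(J), 2):
            m = np.zeros((I, J), dtype=int)
            m[i1, j1] = m[i2, j2] = 1
            m[i1, j2] = m[i2, j1] = -1
            moves.append(m)
    return moves

def chisq(f, e):
    return ((f - e) ** 2 / e).sum()

def log_hyper(f):
    # log of 1/prod(f_ij!), the hypergeometric weight up to a constant
    return -sum(lgamma(x + 1) for x in f.ravel())

def mcmc_pvalue(f0, steps=20000):
    e = np.outer(f0.sum(1), f0.sum(0)) / f0.sum()   # independence fit
    t0 = chisq(f0, e)
    moves, f = basic_moves(*f0.shape), f0.copy()
    extreme = 0
    for _ in range(steps):
        g = f + moves[rng.integers(len(moves))] * rng.choice([-1, 1])
        if (g >= 0).all() and np.log(rng.random()) < log_hyper(g) - log_hyper(f):
            f = g                                   # Metropolis acceptance
        extreme += chisq(f, e) >= t0 - 1e-9
    return extreme / steps
```

For a table with a strong diagonal the estimated conditional p-value is small, reflecting agreement beyond what the fixed margins explain.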
4.4 Conditioning
The problem of finding the conditioning statistic T for the models introduced above is relatively easy for d = 2. In fact, the distribution of the estimator κ̂ is computed with fixed margins if we allow different marginal distributions, i.e., using the coefficient defined in Equation (4.1). In this case the conditioning statistic is
$$T = \Bigl( \sum_{j=1}^{I} f_{ij},\ i = 1, \ldots, I;\ \ \sum_{i=1}^{I} f_{ij},\ j = 1, \ldots, I \Bigr) \, .$$
Using the definition of κ_h in Equation (4.2), where we impose marginal homogeneity, the appropriate conditioning statistic is
$$T = \Bigl( \sum_{j=1}^{I} f_{ij} + \sum_{r=1}^{I} f_{ri},\ i = 1, \ldots, I \Bigr) \, .$$
For the goodness of fit tests for log-linear models, we compute the distribution of the test statistics conditionally on the sufficient statistic; see for example Fienberg (1980) for details and formulae. For example, the conditioning statistic for the quasi-independence model as defined in Equation (4.5) is given by
$$T = \Bigl( \sum_{j=1}^{I} f_{ij},\ i = 1, \ldots, I;\ \ \sum_{i=1}^{I} f_{ij},\ j = 1, \ldots, I;\ \ f_{ii},\ i = 1, \ldots, I \Bigr) \, .$$
The situation presented above for the two-observer case is much less clear
in the multi-observer case.
Analyzing Equations (4.3) and (4.4), it follows that the conditioning statistics for Cohen's κ problem are those coming from the complete independence model for the estimation of κ(d), and from the complete independence
model together with the marginal homogeneity for the estimation of κ_h(d). In the first case, the conditioning statistic T is
$$T = \bigl( f_k(i),\ i = 1, \ldots, I,\ k = 1, \ldots, d \bigr)$$
where $f_k(i) = \sum f_{i_1,\ldots,i_d}\, I_{\{i_k=i\}}$, the sum running over all cells, is the i-th count of the k-th marginal distribution, while in the second case the conditioning statistic T is
$$T = \Bigl( \sum_{k=1}^{d} f_k(i),\ i = 1, \ldots, I \Bigr) \, .$$
From the log-linear model perspective, the conditioning statistic is
$$T = \bigl( f_k(i),\ i = 1, \ldots, I,\ k = 1, \ldots, d,\ f_{i \ldots i},\ i = 1, \ldots, I \bigr) \qquad (4.11)$$
for the quasi-independence model described by Equation (4.8), while for the non-homogeneous model we need the same conditioning statistic for all two-way marginal tables. Note that here our attention is restricted to the case k = 2 in Equation (4.9), i.e., to pairwise agreement, but the conditioning structure is the same for the agreement among h > 2 observers.
For the quasi-symmetry model in Equation (4.10) the appropriate conditioning statistic is given by
$$T = \Bigl( f_k(i),\ i = 1, \ldots, I,\ k = 1, \ldots, d;\ \ \sum \bigl( f_{i_1 \ldots i_j \ldots i_h \ldots i_d} + f_{i_1 \ldots i_h \ldots i_j \ldots i_d} \bigr),\ j < h \Bigr) \qquad (4.12)$$
where the inner sum runs over the indices other than $i_j$ and $i_h$.
In other words, the quasi-symmetry model presented here considers the quasi-symmetry of all two-way marginal tables. This is sufficient because many agreement models refer to pairwise agreement.
The marginal two-way agreement in a d-dimensional table is sometimes preferable to an overall test of quasi-independence or quasi-symmetry. In Agresti (1992) a discussion of this topic is presented for the quasi-symmetry models. In particular, the author focuses on the degrees of freedom of the test statistics and concludes that, in some cases where the constraints imposed in the model reduce the degrees of freedom, the test can be too conservative. Although we use a non-asymptotic approach, even in the exact framework the most complex models can lead to
results that are difficult to interpret, especially when the sample size is small and there are sampling zeros. To show this, we present a simple example on a two-way table. Consider the data shown in Table 4.1.
             Y
  X      1    2    3    4
  1     22    2    2    0
  2      5    7   14    0
  3      0    2   36    0
  4      0    1   17   10

Table 4.1: Carcinoma data.
These data have already been used in Example 3.8, Chapter 3. If we analyze these data under the quasi-symmetry model, there are only three tables in the appropriate reference set F_t, and the exact p-value of the goodness of fit test is 1. This result can be criticized because of the small cardinality of F_t. Moreover, if we slightly modify this table by setting the (1, 3) entry to 0, obtaining the data in Table 4.2, the reference set contains only the original table and no other ones. In such cases, the inference is probably infeasible.
             Y
  X      1    2    3    4
  1     22    2    0    0
  2      5    7   14    0
  3      0    2   36    0
  4      0    1   17   10

Table 4.2: Carcinoma data modified.
4.5 Computations
In this section, we compute the relevant Markov bases for the statistical models introduced above, using the same technique as in Chapter 3. Here the computations are much more complex, and new results are presented.
As pointed out in the previous chapter, in order to find Markov bases it is not necessary to consider Gröbner bases. We can consider any finite set of generators of the relevant ideals. Given an ideal, it is possible to determine
a minimal set of generators, that is, a set of polynomials which generates the ideal and has minimal cardinality; see Kreuzer & Robbiano (2000). Roughly speaking, a minimal set of generators is another basis of the ideal. For the application of the Diaconis-Sturmfels algorithm, the relevant object is the toric ideal I_T coming from the matrix representation A_T of the sufficient statistic, and the algorithm is independent of the particular representation of this ideal. While in simple cases there is no considerable difference between the Gröbner basis and the corresponding minimal set of generators of an ideal, in complex cases a minimal set of generators is preferable, because of the significant reduction in the number of binomials, leading to a simplification of the Monte Carlo step of the algorithm.
Nevertheless, Gröbner bases are essential in the symbolic computation of toric ideals, as they have nice algebraic properties: in particular, they guarantee the termination of the division algorithm needed in a great number of algebraic algorithms, including the one for the computation of toric ideals.
Here, we use the same notation as in Chapters 2 and 3: the ξ indeterminates for the cell probabilities and the y indeterminates for the components of the sufficient statistic, the latter being used as parameters. The minimal set of generators is computed in CoCoA through the function MinGens, starting from the Gröbner basis. Note that in the following we leave the resulting Markov bases in terms of binomials, in order to save space.
For the κ(d) coefficient in the case of 3 observers and a response variable with 3 levels, we simply write in CoCoA the matrix representation of the conditioning statistic and a few lines of statements, as follows.
Mat1:=[
  [1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0],
  [0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0],
  [0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1],
  [1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0],
  [0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0],
  [0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1],
  [1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1]];
T1:=Toric(Mat1);
MB1:=MinGens(T1);
We obtain 162 moves of the form $\xi_{121}\xi_{211} - \xi_{111}\xi_{221}$. As we can see, the algebraic complexity of the problem increases quickly with the number of observers: the Markov basis above consists of 162 moves, while the Markov basis for the corresponding two-observer problem consists of only 9 moves.
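We can check numerically that such a binomial is indeed a move for this problem: the following sketch (our own illustrative code) applies the move corresponding to $\xi_{121}\xi_{211} - \xi_{111}\xi_{221}$ to a 3 × 3 × 3 table and confirms that the three one-dimensional margins, i.e. the conditioning statistic, are unchanged:

```python
import numpy as np

# The binomial xi_121 xi_211 - xi_111 xi_221 corresponds to the move
# +1 on cells (1,2,1),(2,1,1) and -1 on cells (1,1,1),(2,2,1) of a
# 3x3x3 table (1-based indices).
move = np.zeros((3, 3, 3), dtype=int)
move[0, 1, 0] = move[1, 0, 0] = 1
move[0, 0, 0] = move[1, 1, 0] = -1

def margins(f):
    """The three one-dimensional margins of a 3-way table."""
    return [f.sum(axis=tuple(a for a in range(3) if a != k))
            for k in range(3)]

f = np.arange(27).reshape(3, 3, 3)
assert all((m1 == m0).all()
           for m0, m1 in zip(margins(f), margins(f + move)))
```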
If we consider the distribution of κ_h(d), the problem is implemented with
Mat2:=[
  [3,2,2,2,1,1,2,1,1,2,1,1,1,0,0,1,0,0,2,1,1,1,0,0,1,0,0],
  [0,1,0,1,2,1,0,1,0,1,2,1,2,3,2,1,2,1,0,1,0,1,2,1,0,1,0],
  [0,0,1,0,0,1,1,1,2,0,0,1,0,0,1,1,1,2,1,1,2,1,1,2,2,2,3]];
T2:=Toric(Mat2);
MB2:=MinGens(T2);
and we obtain 44 moves: 17 moves of degree one, that is, coming from binomials of degree one, and 27 moves of degree two. From a statistical point of view, the degree of a move represents the number of counts moved from one cell to another; see the correspondence between binomials and moves given in Chapter 2.
For higher-dimensional problems, i.e., with a greater number of observers and/or a greater number of levels of the response variable, it is enough to add appropriate rows and columns to the previous matrices of the conditioning statistics.
Note that the matrix representation of the conditioning statistic allows us to write down easily the expression of the underlying statistical model. In fact, as each column represents a raw probability and each row represents a component of the conditioning statistic, it is easy to write the multiplicative form of the model. For example, in the model for κ_h(d) above we have
$$\pi_{ijk} = y_i y_j y_k \qquad (4.13)$$
with i, j, k = 1, . . . , I, giving $\pi_{111} = y_1^3$ in the first column, $\pi_{112} = y_1^2 y_2$ in the second column, and so on.
Now consider the log-linear models presented in Section 4.2. For the models below we do not write explicitly the matrix representations of the conditioning statistics, but report the results directly. For the quasi-independence model and the quasi-symmetry model for two-way contingency tables, the computation of the Markov bases has been presented in the previous chapter, so we restrict our attention to higher-dimensional tables. In particular, we compute, as an example, the Markov bases for a 3 × 3 × 3 table.
For the quasi-independence model defined in Equation (4.8) with conditioning statistic given in Equation (4.11), we find a Markov basis with 105 moves, all of degree two.
It is worth noting that in the quasi-independence model single cells appear as components of the sufficient statistic. Hence, in the computation of the Markov bases we can apply the following result, which generalizes Proposition 3.4 of Chapter 3, valid for two-way tables.
Theorem 4.1 The Gröbner basis of the toric ideal I_T for the quasi-independence model is the Gröbner basis for the corresponding incomplete table with missing diagonal entries under the complete independence model.
Proof. The conditioning statistic for the quasi-independence model is
$$T = \bigl( f_k(i),\ i = 1, \ldots, I,\ k = 1, \ldots, d,\ f_{i \ldots i},\ i = 1, \ldots, I \bigr) \, .$$
Consider the equivalent conditioning statistic
$$T^* = \bigl( f^*_k(i),\ i = 1, \ldots, I,\ k = 1, \ldots, d,\ f_{i \ldots i},\ i = 1, \ldots, I \bigr)$$
where $f^*_k(i)$ denotes the marginal sum excluding the main diagonal cells. Using the conditioning statistic $T^*$, the set of binomials coming from the ring homomorphism $\xi_u \mapsto t_u = y^{T(u)}$ splits into two separate subsets of equations. The first subset does not involve the indeterminates $\xi_{i \ldots i}$ and $y_{i \ldots i}$, and is formed by the equations of the complete independence model for the table with missing diagonal entries, i.e., the binomials
$$\xi_{i_1 \ldots i_d} - \prod_{k=1}^{d} y_{k i_k}$$
for all non-diagonal cells. The second subset is formed by the binomials $\xi_{i \ldots i} - y_{i \ldots i}$ for i = 1, . . . , I.
Now, eliminating the y indeterminates, the Gröbner basis coming from the first subset of equations is the Gröbner basis for the complete independence model for the table with missing diagonal entries, while the binomials of the second subset are all deleted. This proves the theorem.
Usually, because of the reduction in the number of indeterminates, Theorem 4.1 speeds up the symbolic computation considerably, especially for high-dimensional tables.
Moreover, if we have the Markov basis for the complete independence model, we can also apply the following result.
Theorem 4.2 The Gröbner basis of the toric ideal I_T for the quasi-independence model is the Gröbner basis of the elimination ideal obtained from the toric ideal of the complete independence model by eliminating the indeterminates corresponding to the diagonal cells.
Proof. Let T be the conditioning statistic of the quasi-independence model defined in Equation (4.11), and let T′ be the conditioning statistic of the complete independence model for the same table. As the diagonal cells are components of the conditioning statistic T, the binomials $\xi_{i \ldots i} - y_{i \ldots i}$, i = 1, . . . , I, are generators of the ideal $J_T$. The ideal $J_T$ is then generated by the set of binomials corresponding to the complete independence model
$$\xi_{i_1 \ldots i_d} - \prod_{k=1}^{d} y_{k i_k} \qquad (4.14)$$
for all cells, together with the binomials $\xi_{i \ldots i} - y_{i \ldots i}$, i = 1, . . . , I. The binomials in Equation (4.14) are the generators of the ideal $J_{T'}$ of the complete independence model, and the corresponding toric ideal is computed as $I_{T'} = \mathrm{Elim}(y, J_{T'})$.
Following the theory in Chapter 2 of Kreuzer & Robbiano (2000), by the linearity of the last binomials we can replace $\xi_{i \ldots i}$ with $y_{i \ldots i}$ in the binomials (4.14).
Eliminating the y indeterminates from $J_T$, the binomials of the form $\xi_{i \ldots i} - y_{i \ldots i}$ are all deleted, and for the first set we obtain $\mathrm{Elim}(y, J_{T'})$. As we can replace $\xi_{i \ldots i}$ with $y_{i \ldots i}$, the following chain of equalities holds:
$$I_T = \mathrm{Elim}(y, J_T) = \mathrm{Elim}((y, \xi_{i \ldots i}), J_T) = \mathrm{Elim}((y, \xi_{i \ldots i}), J_{T'}) = \mathrm{Elim}(\xi_{i \ldots i}, I_{T'}) \, .$$
The proof is now complete.
We also consider the same problem under the simplified quasi-independence model, as defined in Equation (4.6), and we obtain a Markov basis formed by 164 moves: the 105 moves of the quasi-independence model above, together with 38 moves of degree 3 and 21 moves of degree 4.
For the non-homogeneous model defined in Equation (4.9) a slightly more complex situation arises. In fact, the Markov basis is formed by 3436 moves: 52 moves of degree 3; 774 moves of degree 4; 1064 moves of degree 5; 1503 moves of degree 6; 18 moves of degree 7; 27 moves of degree 8.
Under the quasi-symmetry model defined in Equation (4.10), with the conditioning statistic specified in Equation (4.12), we have a Markov basis formed by 219 moves: 13 moves of degree 3; 90 moves of degree 4; 54 moves of degree 5; 54 moves of degree 6; 18 moves of degree 8.
To illustrate the computational growth when moving from two- to three-dimensional tables, we give in Table 4.3 the number of moves for the 3-observer problems and the corresponding moves for the 2-observer problems of the previous models.
model                 2 observers   3 observers
κ                           9           162
κ with marg. hom.           9            44
q.i.                        1           105
simplified q.i.             9           164
non-hom. q.i.               1          3436
q.s.                        1           219

Table 4.3: Number of moves for a 3-level response and 2 or 3 observers.
In addition, it is interesting to show the importance of considering minimal sets of generators rather than Gröbner bases. This is clearly shown in Table 4.4, where we compare the number of moves coming from the Gröbner bases with the number of moves coming from the minimal sets of generators. Moreover, as the complexity of a move is represented by its degree, we also compare the maximum degree of the moves.
model                  GB    Max degree    MSG    Max degree
κ                     168        3         162        2
κ with marg. hom.      44        2          44        2
q.i.                  121        4         105        2
simplified q.i.       210        7         164        4
non-hom. q.i.        5995       17        3436        8
q.s.                  421       13         219        8

Table 4.4: Number of moves and maximum degree from Gröbner bases (GB) and from minimal sets of generators (MSG).
Finally, notice that non-trivial moves are involved. For example, the move of degree 13 coming from the Gröbner basis for the quasi-symmetry model is
represented by the polynomial
$$-\xi_{133}\,\xi_{212}^2\,\xi_{221}^2\,\xi_{322}^3\,\xi_{323}^2\,\xi_{331}^3 + \xi_{122}\,\xi_{213}\,\xi_{222}^2\,\xi_{231}\,\xi_{313}\,\xi_{321}^4\,\xi_{332}^2\,\xi_{333}$$
and it corresponds to the move displayed in Table 4.5. This move is not needed if we look at the minimal set of generators instead of the Gröbner basis.
 i\j   1   2   3     i\j   1   2   3     i\j   1   2   3
  1    0   0   0      1    0  +1   0      1    0   0  -1
  2    0  -2  +1      2   -2  +2   0      2   +1   0   0
  3    0  +4  -3      3    0  -3  +2      3   +1  -2  +1
       k = 1               k = 2               k = 3

Table 4.5: Move of degree 13 from the Gröbner basis for the quasi-symmetry model.
4.6 Example
In this section we apply the methodology described above to a data set taken from Dillon & Mulani (1984), shown in Table 4.6. This data set gives the ratings of 164 subjects by three observers. The rating scale has 3 levels: Positive (+), Neutral (N) and Negative (−).
           A2                  A2                  A2
 A1     +   N   −    A1     +   N   −    A1     +   N   −
  +    56  12   1     +     5  14   2     +     0   0   1
  N     1   2   1     N     3  20   1     N     0   4   1
  −     0   1   0     −     0   4   7     −     1   2  24
       A3 = +              A3 = N              A3 = −

Table 4.6: Data set from Dillon & Mulani (1984).
For all models, we compute the non-asymptotic p-values of the tests by running the Markov chain based on the appropriate Markov basis. Each Markov chain samples 50,000 tables after 50,000 burn-in steps, sampling every 50 tables, i.e., with the same Monte Carlo parameters as in the example of Chapter 3.
We consider two tests: a first test, without the constraint of marginal homogeneity, with null hypothesis H0 : κ(3) = 0 against the one-sided alternative hypothesis H1 : κ(3) > 0; and a second test, with the constraint of marginal homogeneity, with null hypothesis H0 : κh(3) = 0 against the one-sided alternative hypothesis H1 : κh(3) > 0. In the case of log-linear models we use Pearson's chi-squared statistic C for testing the goodness of fit of the 4 models presented in Section 4.2.
A few lines of statements are sufficient to produce the output in approximately ten seconds of CPU time on a standard PC, except for the non-homogeneous quasi-independence model, which requires some additional time for the symbolic computation.
In Table 4.7 we compare the non-asymptotic MCMC p-values with the classical ones based on asymptotically normal or chi-square distributions.

model                 df    asymptotic    MCMC
κ                           <0.0001       0
κ with marg. hom.           <0.0001       0
q.i.                  17     0.0031       0.0105
simplified q.i.       19     0.0001       0.0001
non-hom. q.i.         17     0.0016       0.0081
q.s.                  11     0.1296       0.1624

Table 4.7: Comparison of the results.
From Table 4.7, we remark that the differences between the two approaches are small in those cases where there is strong evidence against the null hypothesis, while there are notable differences when the significance of the test is less evident. We conclude that there are significant differences between the asymptotic p-values and the MCMC ones, especially when the exact p-values are near the usual α-levels of the tests and the sample size is small. These differences can also lead to different conclusions on the rejection of the null hypothesis.
Chapter 5
Sufficiency and estimation
The problems of sufficiency and estimation for log-linear models are important topics in the analysis of discrete data and, especially, in the analysis of contingency tables. In Chapter 1 we presented an introduction to log-linear models, with special attention to their structure. In particular, the literature suggests the maximum likelihood (ML) approach, as introduced in Chapter 1.
In the past few years, the application of new algebraic non-linear techniques
to Probability and Statistics has been presented. Here we follow the
polynomial representation of random variables on discrete sample spaces as
introduced in Pistone et al. (2001a), and we study some algebraic and
geometrical properties of the class of models introduced as toric models in Pistone
et al. (2001b), showing the links between toric models and log-linear models.
Polynomial algebra is used to describe the geometrical structure of
statistical toric models on finite sample spaces. The first works in this direction,
although limited to the analysis of graphical models, are Geiger et al. (2002)
and Garcia et al. (2003). Moreover, the work of Diaconis & Sturmfels (1998)
presented in Chapter 2 makes it possible to sample from conditional
distributions, and we will use such a technique in order to actually compute the
minimum-variance unbiased estimator of the cell counts.
In this chapter we consider a general finite sample space and we use algebraic
techniques in order to obtain a description of the notion of sufficiency and
to define an unbiased minimum-variance estimator. This estimator is defined
for all observed tables, even in the non-negative case, i.e. it avoids problems
with structural zeros. Moreover, we show that it is well defined also for sparse
tables, i.e. it avoids problems with sampling zeros. Starting from this theory,
we also define an algorithm to actually compute the estimate, using the
technique described in Chapter 2.
Working in the non-negative case, in Section 2 we introduce the class of
toric models, and we give a polynomial representation of the notion of sufficient
statistic, first for a single observation and then for an independent sample,
generalizing some known results. In Section 3 we study the geometry of toric
models and we show that toric models are disjoint unions of exponential models.
Moreover, we show that the parameterization plays a fundamental role, and we
define a special parameterization such that the model contains all the possible
exponential models. In Section 4 we define an unbiased minimum-variance
estimator of the cell counts, and we show its uniqueness and its relationship with
the ML estimate. Finally, in Section 5 we define an MCMC algorithm for the
computation of the minimum-variance estimate.
5.1 Introduction
Consider a statistical model on a finite sample space X . Although in
contingency tables the sample space is usually a Cartesian product (for example,
in the two-way case it is a space of the form {(i, j) | i = 1, . . . , I, j = 1, . . . , J}),
we assume here, without loss of generality, a generic list

X = {a(1), . . . , a(k)} .

A probability distribution on X is characterized by the parameters pi = P[a(i)]
with i = 1, . . . , k. Roughly speaking, a model M is a subset of the space of
parameters. As in the previous chapters, we use the vector notation, that is p
is the vector (p1, . . . , pk). The parameter space is the simplex defined by
pi ≥ 0 for i = 1, . . . , k and p1 + · · · + pk = 1.
If (x1, . . . , xN) is a sample of size N drawn from a vector (X1, . . . , XN) of
independent and identically P-distributed random variables with values in X ,
the ML approach considers the likelihood function L(p; x1, . . . , xN), and the
maximum likelihood estimate p̂ of p is defined by

L(p̂; x1, . . . , xN) = sup_{p∈M} L(p; x1, . . . , xN) .

The ML estimate may not exist or may not be unique for some observed
tables. These problems of non-existence or non-uniqueness are related to the
presence of sampling zeros or structural zeros.
If we restrict to the case of positive parameters, i.e. pi > 0
for all i = 1, . . . , k, and the conditions for the existence and uniqueness of
the ML estimate are satisfied, then the ML estimate is found by differentiating
the likelihood function.
In the log-linear models for contingency tables, the likelihood equations
can be classified in two types:
(a) the equation system has an explicit solution p̂ and no approximation is
needed. This is the case, for example, of the independence model;
(b) the equation system has no explicit solution and we must use a
numerical method in order to approximate it. This is the case, for
example, of the quasi-independence model.
For a discussion of this topic, see Section 3.7 of Kreuzer & Robbiano (2000),
where some symbolic methods for solving systems of polynomial equations are
presented.
As outlined in Chapter 1, in case (b) the most widely used methods
for approximating the ML estimate are Newton–Raphson and
Iterative Proportional Fitting. For details about these methods, see
Haberman (1974), Chapter 3. Many optimized versions of such algorithms are
implemented in various software packages. For example, a modified Newton–
Raphson method is used in PROC CATMOD of the SAS software; see SAS/STAT
User's Guide (2000) for a concise discussion of the computations and for
further references on numerical issues. Details on the applications of the
Newton–Raphson algorithm and of Fisher scoring are given, for example, in
McCullagh & Nelder (1989), Chapter 2. Again, the main problem in the
application of numerical methods is the presence in the contingency tables of
sampling or structural zeros. The algorithms then need slight modifications of
the observed counts in order to eliminate the null entries; see for example the
recent algorithms in Duffy (2002).
When the ML estimate exists, it is in general a biased estimate of the cell
counts, and in some applications one may ask for an unbiased estimate. The
minimum-variance unbiased estimator can be used in such situations, and also
when the ML estimate does not exist or is not unique.
5.2 Sufficiency and sampling
Let us consider a statistical model on the finite sample space X of the form
px = Pφ[x] = φ(T (x)), x ∈ X (5.1)
where T : X −→ Ns is the vector of integer valued components of the sufficient
statistic. Here φ ∈ Φ, where Φ is a subset of functions from Ns to R. In point
of fact, the range of T is a finite subset T ⊂ Ns. Thus, the set of functions
from T to R has a vector space structure with a finite basis. This remark will
be useful in the next sections.
The set Φ defines a subset M of the space of the probabilities with sufficient
statistic T , and we refer to this subset as a statistical model. In other words,
M is the subset defined by

M = {p : p = φ ◦ T, φ ∈ Φ} .
We specialize Equation (5.1) by assuming that there exists a parameterization
with parameters ζ1, . . . , ζs such that the probabilities assume the form

px = L(ζ, x) / ∑_{x∈X} L(ζ, x) ,     (5.2)

with L(ζ, x) = ζ^{T(x)}. Here T(x) = (T1(x), . . . , Ts(x)) and ∑_{x∈X} L(ζ, x) is the
normalizing constant. As in the previous chapters, we use the vector notation,
e.g. ζ^{T(x)} means ζ1^{T1(x)} · · · ζs^{Ts(x)}. The algebraic structure of the functions φ and
T in Equation (5.1) will be discussed later in Section 5.4. Following Pistone
et al. (2001a) and Pistone et al. (2001b), we can state the following definition.
Definition 5.1 If a statistical model M consists of all likelihoods of the form
in Equation (5.2), then we say that the model is a toric model.
The term “toric” comes from Commutative Algebra, because of the
algebraic structure of the probabilities. As we have seen in Chapter 1, toric ideals
describe the algebraic relations among power products, and in toric models the
probabilities are expressed in terms of power products. See also Sturmfels
(1996) and Bigatti & Robbiano (2001), where toric ideals are studied in
detail.
In Definition 5.1, the ζ parameters are unrestricted, except for the non-negativity
constraint. Note that in general a toric model is bigger than the exponential
model, as we do not assume positive probabilities; with the constraint
px > 0 for all x ∈ X it is a log-linear model. Let M>0 be the subset of the
toric model M with the restriction px > 0 for all x ∈ X . Then
log px = ∑_j (log ζj) Tj(x) + log ζ0 ,     (5.3)

where

ζ0 = ( ∑_{x∈X} L(ζ, x) )^{−1}     (5.4)

is the normalizing constant. M>0 is log-linear, hence it is an exponential
model with sufficient statistic T and canonical parameters log ζj,
j = 1, . . . , s.
Note that the representation of the toric model in Equation (5.2) refers
to the notion of multiplicative form of a log-linear model, see for example
Goodman (1979a).
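As an illustration of Equations (5.3), (5.4) and (5.7), the correspondence between the multiplicative and the log-linear form can be checked numerically; the sufficient statistic and the parameter values below are hypothetical, not taken from the thesis:

```python
import math

# Hypothetical sufficient statistic on a 3-point sample space, s = 2:
# row i of A_T holds (T_1(a(i)), T_2(a(i))).
A_T = [(1, 0), (0, 1), (1, 1)]
zeta = (0.5, 2.0)

# power products zeta^{T(x)} for each sample point
raw = [zeta[0] ** t1 * zeta[1] ** t2 for (t1, t2) in A_T]
zeta0 = 1.0 / sum(raw)               # normalizing constant, Equation (5.4)
p = [zeta0 * r for r in raw]         # probabilities, Equation (5.7)

# Equation (5.3): log p_x = sum_j T_j(x) log zeta_j + log zeta_0
log_p = [t1 * math.log(zeta[0]) + t2 * math.log(zeta[1]) + math.log(zeta0)
         for (t1, t2) in A_T]
```

Since all the ζ's are positive here, the model lies in M>0 and the two forms agree exactly.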
A first example of toric model is a model with sufficient statistic consisting
of the counts over the sets A1, . . . , As ⊆ X , possibly overlapping:
T : x 7−→ (IA1(x), . . . , IAs(x))
In most examples in the literature
X ⊆ {1, . . . , I1} × · · · × {1, . . . , Id}
is a d-way array, possibly incomplete.
A second example is a log-linear model of the type

log Pψ(x) = ∑_{i=1}^s ψi Ti(x) − k(ψ)     (5.5)

with integer-valued Ti's and parameters ψ = (ψ1, . . . , ψs). In such a case each

Ti(x) = ∑_{a∈X} Ti(a) Ia(x)     (5.6)

is an integer combination of all the counts, see Fienberg (1980).
Note that in Equation (5.1) the strict positivity implied by the log-linear
model in Equation (5.5) is not assumed.
Consider now the problem of sampling. As the probabilities can be written
as

px = ζ0 ζ1^{T1(x)} · · · ζs^{Ts(x)} ,     (5.7)

the probability of a sample (x1, . . . , xN) is

px1 · · · pxN = ζ0^N ζ1^{∑i T1(xi)} · · · ζs^{∑i Ts(xi)} ,     (5.8)

i.e., the sufficient statistic for the sample of size N is the sum of the sufficient
statistics of the N components of the sample. Note that this result is formally
the same as in the positive case, where the theory of exponential models applies.
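The factorization in Equation (5.8) can be verified numerically on a small hypothetical model; the matrix A_T, the parameter values and the sample below are illustrative only:

```python
import math

# Hypothetical model: 3 sample points, s = 2, T as in the rows below
A_T = [(1, 0), (0, 1), (1, 1)]
zeta = (0.5, 2.0)
raw = [zeta[0] ** t1 * zeta[1] ** t2 for (t1, t2) in A_T]
zeta0 = 1.0 / sum(raw)
p = [zeta0 * r for r in raw]

sample = [0, 2, 2, 1]                 # indices of the observed points, N = 4
N = len(sample)

# left-hand side of Equation (5.8): product of the cell probabilities
lhs = math.prod(p[x] for x in sample)

# the sufficient statistic of the sample is the sum of the T(x_i)
T_sum = [sum(A_T[x][j] for x in sample) for j in range(2)]

# right-hand side of Equation (5.8)
rhs = zeta0 ** N * zeta[0] ** T_sum[0] * zeta[1] ** T_sum[1]
```

The sample likelihood depends on the data only through T_sum, which is the algebraic content of sufficiency here.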
The j-th component of the sufficient statistic can be represented as

∑_i Tj(xi) = ∑_{a∈X} Tj(a) Fa(x1, . . . , xN) ,     (5.9)

where

Fa(x1, . . . , xN) = ∑_{j=1}^N Ia(xj)     (5.10)

is the count of cell a. This result generalizes the corresponding theorem
holding in the framework of log-linear models, which states that the sufficient
statistics for all log-linear models are linear combinations of the cell counts.
5.3 Geometry of toric models
Now, define the matrix AT in the following way: AT is a matrix with k rows
and s columns, and its generic element AT(i, j) is Tj(a(i)) for all i = 1, . . . , k
and j = 1, . . . , s.
If ηj = E[Tj] are the mean parameters and p is the row vector of the
probabilities, then

η = p AT .     (5.11)

The matrix AT can also be used in order to describe the geometric structure
of the statistical model. First, we can state the following result.
Proposition 5.2 Choose a parameter ζj and let X′ be the set of points x ∈ X
such that Tj(x) > 0. Suppose that X′ ≠ X . If we set ζj = 0, we obtain a model
on the remaining sample points, and this model is again a toric model.
Proof. Without loss of generality, suppose j = 1, with T1(a(i)) = 0 for
i = 1, . . . , k′ and T1(a(i)) > 0 for i = k′ + 1, . . . , k. The matrix AT can be
partitioned as

         | 0   A′T |
    AT = |         |     (5.12)
         | ∗    ∗  |

where the 0 block is a column of k′ zeros and ∗ denotes non-zero entries.
Now, the matrix A′T is non-zero and it is the representation of a toric model
on the first k′ sample points.
Thus, the geometric structure of the toric model is an exponential model
together with at most s toric models with (s − 1) parameters on appropriate
subsets of the sample space X . Moreover, by applying the above proposition
recursively we obtain the following result.
Theorem 5.3 Let ζj1 , . . . , ζjr be a set of parameters and let X′ be the set of
points x ∈ X such that Tjb(x) > 0 for at least one b ∈ {1, . . . , r}, so that
Tjb(x) = 0 for all b = 1, . . . , r and x ∈ X − X′. Suppose that X′ ≠ X . If we
set ζjb = 0 for all b = 1, . . . , r, we obtain a toric model on the remaining
sample points.

Proof. Apply Proposition 5.2 r times.
These results lead to a geometrical characterization of a toric model.
Theorem 5.4 A toric model is the disjoint union of exponential models.
Proof. The result follows from the application of Theorem 5.3 to all
possible sets of parameters ζj1 , . . . , ζjr for which X′ ≠ X .
Example 5.5 Consider the independence model for 3× 2 tables analyzed in
Example 3.3 of Chapter 3. The matrix representation of the sufficient statistic
is
     | 1 0 0 1 0 |
     | 1 0 0 0 1 |
AT = | 0 1 0 1 0 |
     | 0 1 0 0 1 |
     | 0 0 1 1 0 |
     | 0 0 1 0 1 |
and the toric model is, apart from the normalizing constant,
p11 = ζ1ζ4
p12 = ζ1ζ5
p21 = ζ2ζ4
p22 = ζ2ζ5
p31 = ζ3ζ4
p32 = ζ3ζ5
If ζ > 0 we have the log-linear model. Moreover, we can choose the following
sets X ′ in Theorem 5.3:
• a) X ′ = {(1, 1), (1, 2)} corresponding to ζ1 = 0;
b) X ′ = {(2, 1), (2, 2)} corresponding to ζ2 = 0;
c) X ′ = {(3, 1), (3, 2)} corresponding to ζ3 = 0.
In these cases we obtain the independence model for the 2×2 tables X \ X′;
• d) X ′ = {(1, 1), (2, 1), (3, 1)} corresponding to ζ4 = 0;
e) X ′ = {(1, 2), (2, 2), (3, 2)} corresponding to ζ5 = 0.
In these cases we obtain the multinomial model for the second and the first
column, respectively;
• f) X′ = {(1, 1), (1, 2), (2, 1), (2, 2)} corresponding to ζ1 = ζ2 = 0;
g) X′ = {(1, 1), (1, 2), (3, 1), (3, 2)} corresponding to ζ1 = ζ3 = 0;
h) X′ = {(2, 1), (2, 2), (3, 1), (3, 2)} corresponding to ζ2 = ζ3 = 0.
In these cases we obtain the Bernoulli model for the third, the second and
the first row, respectively;
• i) X ′ = {(1, 1), (1, 2), (2, 1), (3, 1)} corresponding to ζ1 = ζ4 = 0;
j) X ′ = {(1, 1), (1, 2), (2, 2), (3, 2)} corresponding to ζ1 = ζ5 = 0;
k) X ′ = {(2, 1), (2, 2), (1, 1), (3, 1)} corresponding to ζ2 = ζ4 = 0;
l) X ′ = {(2, 1), (2, 2), (1, 2), (3, 2)} corresponding to ζ2 = ζ5 = 0;
m) X ′ = {(3, 1), (3, 2), (1, 1), (2, 1)} corresponding to ζ3 = ζ4 = 0;
n) X ′ = {(3, 1), (3, 2), (1, 2), (2, 2)} corresponding to ζ3 = ζ5 = 0.
In these cases we obtain 6 Bernoulli models. These models are the 6 models
on the columns of the 3 independence models found above;
• corresponding to the six conditions ζ1 = ζ2 = ζ4 = 0, ζ1 = ζ2 = ζ5 = 0,
ζ1 = ζ3 = ζ4 = 0, ζ1 = ζ3 = ζ5 = 0, ζ2 = ζ3 = ζ4 = 0 and ζ2 = ζ3 = ζ5 = 0
we obtain 6 trivial distributions on one sample point.
The toric model is then formed by 21 models: the log-linear model on 6 points,
3 models on 4 points, 2 models on 3 points, 9 models on 2 points and 6 trivial
models on 1 point. Apart from the trivial models, the remaining 15 models
are represented in Table 5.1, where 0 denotes a structural zero and ∗ denotes
a cell with non-zero probability.
The procedure in Theorem 5.3, applied as in Example 5.5 can also be used
to define the admissible sets of structural zeros.
∗ ∗    ∗ ∗    ∗ ∗    0 0    0 ∗
∗ ∗    ∗ ∗    0 0    ∗ ∗    0 ∗
∗ ∗    0 0    ∗ ∗    ∗ ∗    0 ∗

∗ 0    0 0    0 0    ∗ ∗    0 0
∗ 0    0 0    ∗ ∗    0 0    0 ∗
∗ 0    ∗ ∗    0 0    0 0    0 ∗

0 0    0 ∗    ∗ 0    0 ∗    ∗ 0
∗ 0    0 0    0 0    0 ∗    ∗ 0
∗ 0    0 ∗    ∗ 0    0 0    0 0

Table 5.1: Nontrivial models (each 3 × 2 pattern of ∗'s and 0's is one model).
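The reduced models of Example 5.5 can be checked numerically; this sketch (an illustration, not part of the thesis) sets ζ3 = 0 in the 3 × 2 independence parameterization and verifies that the surviving cells still form an independence model, as in item c) above:

```python
# A_T of Example 5.5; rows index the cells
# (1,1), (1,2), (2,1), (2,2), (3,1), (3,2).
A_T = [(1, 0, 0, 1, 0),
       (1, 0, 0, 0, 1),
       (0, 1, 0, 1, 0),
       (0, 1, 0, 0, 1),
       (0, 0, 1, 1, 0),
       (0, 0, 1, 0, 1)]

def probabilities(zeta):
    # power-product probabilities; 0.0**0 == 1.0, so zeta_j = 0 kills
    # exactly the cells x with T_j(x) > 0
    raw = [1.0] * 6
    for i, row in enumerate(A_T):
        for z, e in zip(zeta, row):
            raw[i] *= z ** e
    s = sum(raw)
    return [r / s for r in raw]

# zeta_3 = 0: the third row of the table becomes a structural zero,
# as in item c) of Example 5.5; the other values are arbitrary
p = probabilities((2.0, 3.0, 0.0, 1.0, 5.0))
```

The two cells of row 3 receive probability zero, while the first four cells satisfy the 2 × 2 independence binomial.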
Definition 5.6 A subset X ′ ⊂ X is an admissible set of structural zeros if
there exist parameters ζi1 , . . . , ζir such that the condition of Theorem 5.3 holds.
Now, we consider the behavior of the η-parameterization in the different
exponential models. Starting from the representation of the η parameters as
functions of the ζ parameters in Equation (5.11), we can easily prove that the
η parameterization is coherent, as stated in the following proposition.
Proposition 5.7 The η parameters for the reduced models are the same as in
the exponential case, provided that a component is fixed to zero.
Proof. For the proof, it is enough to combine the linear relation between η
and ζ given in Equation (5.11) and Theorem 5.3.
Remark 5.8 The definition of a toric model as a disjoint union of exponential
models has a counterpart in differential geometry, namely in the definition of
differential manifolds with corners. For the construction of a differential
manifold with corners, consider the positive quadrant in Rk. It is an open
variety and its boundary is formed by k quadrants of dimension k − 1. Add
one or more of these quadrants to the variety and iterate the procedure k times.
At the added boundary points the derivatives are one-sided and, in general,
such derivatives differ from the limit of the derivatives computed in Rk. Every
real variety diffeomorphic to a variety defined as above is a differential
manifold with corners. For details about differential manifolds with corners,
see Margalef Roig & Outerelo Dominguez (1992).
Now, we state a differential relationship between the ζ parameters and the
η parameters.

Theorem 5.9 In the positive case the following equation holds:

ζj ∂ζ0/∂ζj = −ηj ζ0²     (5.13)

for all j = 1, . . . , s.

Proof. By straightforward computation, we have

∂ζ0/∂ζj = − [ ∑_{x∈X} Tj(x) ζ^{T(x)}/ζj ] / ( ∑_{x∈X} ζ^{T(x)} )² = −ζ0² ηj/ζj ,

and the proof is complete.
Note that the previous equation holds in all exponential models contained
in the toric model.
The geometric representation of the structural zeros as presented in the
previous definition needs some further discussion. Let us consider a simple
example.
Example 5.10 Consider the simple example of the independence model for
2× 2 contingency tables. Using the notations of the previous chapters, a first
parameterization is
p11 = ζ0
p12 = ζ0ζ1
p21 = ζ0ζ2
p22 = ζ0ζ1ζ2
(5.14)
This parameterization is used in Pistone et al. (2001a). A second
parameterization is the following one, based on the matrix representation of the
sufficient statistic as explained in Chapter 3:

p11 = ζ1ζ3
p12 = ζ1ζ4
p21 = ζ2ζ3
p22 = ζ2ζ4
(5.15)
Eliminating the ζ indeterminates in the above systems of equations, in both
cases we obtain only one polynomial, namely the binomial
p11p22 − p12p21 = 0 (5.16)
However, the toric models in Equation (5.14) and in Equation (5.15) are
different. In fact, the second parameterization contains, for example, the
Bernoulli model on the two points (2, 1) and (2, 2), obtained by setting ζ1 = 0,
while the first parameterization does not, since p11 = ζ0 can vanish only
together with the whole distribution.
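The binomial (5.16) can be checked numerically for both parameterizations; the parameter values below are arbitrary, and the second parameterization is taken with p22 = ζ2ζ4, as the matrix representation of the sufficient statistic dictates:

```python
z0, z1, z2, z3, z4 = 0.7, 1.3, 2.1, 0.4, 3.5   # arbitrary positive values

# first parameterization, Equation (5.14)
p11, p12, p21, p22 = z0, z0 * z1, z0 * z2, z0 * z1 * z2

# second parameterization, Equation (5.15), with p22 = z2 * z4
q11, q12, q21, q22 = z1 * z3, z1 * z4, z2 * z3, z2 * z4

b1 = p11 * p22 - p12 * p21   # binomial (5.16), first parameterization
b2 = q11 * q22 - q12 * q21   # binomial (5.16), second parameterization
```

Both vanish identically, even though, as noted above, the two parametric toric models are not the same set.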
Now, consider the binomial equations obtained from a toric model by
elimination of the ζ indeterminates, as described in Chapter 1. In this way we
obtain a set of binomials which defines a statistical model, but in general the
binomial form differs from the parametric form. The binomial form is
independent of the parameterization and it can include some boundary points
excluded by the parametric form, as in the previous example. In the following,
we will use the terms “parametric toric model” and “binomial toric model”.
Moreover, we can also restate the definition of set of structural zeros for a
toric model in binomial form.
Definition 5.11 Let B be the set of binomials defined by elimination as
described in Chapter 1. A subset X′ ⊂ X is an admissible set of structural zeros
independent of the parameterization if the polynomial system B = 0, together
with px = 0 for all x ∈ X′, has a non-negative normalized solution.
In general, it is difficult to find a parameterization such that the parametric
form of the toric model contains all the exponential sub-models.
Definition 5.12 If a parameterization defines a parametric toric model which
contains all the exponential sub-models, we say that the parameterization is a
full parameterization.
In view of Example 5.10 and Definition 5.12, it follows that not all matrix
representations of the sufficient statistic are equivalent. In general there is an
infinite number of non-negative integer-valued bases of the sub-space spanned
by T , but not all of them contain all the exponential sub-models. This
happens because the columns of the matrix AT are defined in the vector space
framework, but in the power product representation we need non-negative
exponents, and then we need linear combinations with non-negative integer
coefficients. In general it is difficult to find a full parameterization, but it is
easy to characterize all the parameterizations which lead to a given binomial
toric model.
Theorem 5.13 Let AT be the model matrix. If v1, . . . , vs is a non-negative
integer system of generators of the range of AT, then

px = ζ1^{v1(x)} · · · ζs^{vs(x)}     (5.17)

for x ∈ X is a parameterization, and all parameterizations of this kind lead to
the same binomial toric model.

Proof. Let Av be the matrix

Av = (v1, . . . , vs) .

The kernels of the systems Av = 0 and AT = 0 are the same. As a
consequence of the construction of the toric ideals presented in Section 2 of Chapter
1, the toric models defined through AT and Av are represented by the same
binomials.
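For small models the kernel in the proof can be explored by brute force. This sketch (an illustration, not from the thesis) lists the integer vectors with entries in {−1, 0, 1} in the kernel of the 3 × 2 independence matrix of Example 5.5; they are exactly the ± versions of the three basic moves of the independence model:

```python
from itertools import product

# A_T of Example 5.5 for the 3x2 independence model; rows index
# the cells (1,1), (1,2), (2,1), (2,2), (3,1), (3,2)
A_T = [(1, 0, 0, 1, 0),
       (1, 0, 0, 0, 1),
       (0, 1, 0, 1, 0),
       (0, 1, 0, 0, 1),
       (0, 0, 1, 1, 0),
       (0, 0, 1, 0, 1)]

def in_kernel(m):
    # m lies in the kernel iff it leaves every component of the
    # sufficient statistic (every column of A_T) unchanged
    return all(sum(mi * row[j] for mi, row in zip(m, A_T)) == 0
               for j in range(5))

moves = [m for m in product((-1, 0, 1), repeat=6)
         if any(m) and in_kernel(m)]
```

Six vectors survive: the three rectangular swaps between pairs of rows, each with both signs.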
Among others, one can consider the range of AT as a lattice and then
consider a Hilbert basis of such a lattice.

Definition 5.14 Let S ⊆ Nk be the set of integer solutions of a diophantine
system AT p = 0. A set of integer vectors H = {v1, . . . , vh} is a Hilbert basis
of S if for all β ∈ S

β = ∑_{v∈H} cv v     (5.18)

where cv ∈ N.

It is known that such a set H exists and is unique. The number of elements
in H in general differs from the dimension of the range of AT as a vector
sub-space. For details about the properties of Hilbert bases and the algorithms
for their computation, see Sturmfels (1993) and Kreuzer & Robbiano (2003).
The application of the theory of Hilbert bases to the geometrical description of
toric models is currently in progress.
5.4 Algebraic representation of the sufficient
statistics
Let us now introduce the algebraic representation of the sufficient statistic for
toric models. Let T = T (X ) be the image of the sufficient statistic T , i.e. the
set of all values of the sufficient statistic. As noted in Section 5.2, the set T is
a finite subset of Ns.
Proposition 5.15 Let (T1(X ), . . . , Ts(X )) = T ⊆ Ns be the range of the
sufficient statistic. As T is a finite subset of Qs, there exists a list of monomials

{t^α : α ∈ L}     (5.19)

such that every function φ : T −→ R is expressed as

φ(t) = ∑_{α∈L} φα t^α .     (5.20)

For the proof, see e.g. Pistone et al. (2001a). Implemented algorithms can
be used to derive such a list.
The model M has the polynomial representation

px = φ(T1(x), . . . , Ts(x)) = ∑_{α∈L} φα T^α(x) .     (5.21)

The functions T^α, with α ∈ L, are a sufficient statistic equivalent to T1, . . . , Ts.
If the φα are unrestricted except for the normalizing constraint, this means that
the model contains all the probabilities with sufficient statistic T . The
advantage of this representation is the linearity of Equation (5.21). This remark
prompts the following definition.
Definition 5.16 The set T1, . . . , Ts′ is a basis for the statistical model M if

P∗N(x1, . . . , xN) = φN ( ∑_i Tj(xi), j = 1, . . . , s′ )     (5.22)

for all N ≥ 2.

The T^α in Equation (5.21) are a basis for all models with sufficient statistic
T .
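In one dimension, Proposition 5.15 reduces to finite interpolation: if, say, T = {0, 1, 2} ⊂ N, every function on T is a combination of the monomials t^0, t^1, t^2. A hypothetical sketch:

```python
# Hypothetical illustration of Proposition 5.15 with s = 1 and
# T = {0, 1, 2}: every function on T is a combination of the
# monomials t^0, t^1, t^2.
phi = {0: 5.0, 1: -1.0, 2: 7.0}      # an arbitrary function on T

# solve the 3x3 Vandermonde system by hand for the coefficients
# phi_alpha in phi(t) = c0 + c1 t + c2 t^2
c0 = phi[0]
c2 = (phi[2] - 2 * phi[1] + phi[0]) / 2
c1 = phi[1] - c0 - c2

recovered = {t: c0 + c1 * t + c2 * t ** 2 for t in phi}
```

The coefficients (φα) exist and are unique because the range T is finite, which is all the proposition uses.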
5.5 Minimum-variance estimator
In this section, we define an unbiased minimum-variance estimator of the cell
counts. This estimator is defined as the mean of the hypergeometric distribu-
tion, whose parameters depend on the observed value of the sufficient statistic
TN . Moreover we show that, under some extra conditions, this estimate is
equal to the ML estimate of the vector of parameters Np.
In the classical log-linear theory, where the attention is restricted to the
positive case, the most commonly used theorem of existence and uniqueness
of the ML estimate is based on the vector space representation of log-linear
models, see Chapter 1 for an introduction and Haberman (1974), Theorem 2.2,
for a detailed proof.
From this point of view, an unbiased minimum-variance estimator has two
advantages. It is well defined also in cases where the ML estimator does not
exist or its computation is difficult. Moreover, in some applications it could
be desirable to use an unbiased estimator instead of a biased one.
The difficulty of the computation of the minimum-variance estimate will
be avoided with an algebraic algorithm presented in the next section.
From Definition 5.16 it follows that the basis is the algebraic counterpart
of the complete sufficient statistic.
Definition 5.17 Define

F̂x = Eφ[Fx(X1, . . . , XN) | T^α_N, α ∈ L] .     (5.23)

F̂x is the minimum-variance unbiased estimator of the expected cell count
Npx.
In fact, Eφ[F̂x] = Eφ[Fx] = Npx, and F̂x is minimum-variance as it is a
function of the sufficient statistic. Moreover, if the T^α_N form a basis, see
Definition 5.16, then the estimator F̂x is a linear combination of the T^α_N.
Theorem 5.18 If T1, . . . , Ts are a vector space basis of the space spanned by
the sufficient statistic, the estimator F̂x is the unique unbiased estimator of the
count Npx which is a function of T1, . . . , Ts.

Proof. Following the technique used in Lehmann (1983), let F̃x be another
unbiased estimator of Npx which is a function of T1, . . . , Ts. We have

F̂x = ∑_α cα T^α

and

F̃x = ∑_α dα T^α .

Thus, we can write

0 = Eφ[F̂x − F̃x] = ∑_α (cα − dα) Eφ[T^α]

for all φ ∈ Φ. As the φα are unrestricted, the previous equation leads to

0 = Eφ[F̂x − F̃x] = ∑_α φα E0[(F̂x − F̃x) T^α] ,

where E0 is the mean with respect to a dominating measure. As the terms
E0[(F̂x − F̃x) T^α] must all be zero, we conclude that F̂x = F̃x.
Remark 5.19 Note that the result above can be easily proved if we restrict to
the model M>0, as in the exponential model the sufficient statistic is complete
and the result in Theorem 5.18 is a consequence of the completeness.
The estimator F̂x exists even when the ML estimate is not well defined. The
link between this estimator and the ML estimator is given by the following
result.

Proposition 5.20 If the ML estimate exists and the vector F̂/N belongs to the
model, then the ML estimate Np̂ is equal to F̂.
Proof. A classical result from Birch (1963), generalized by Bishop et al.
(1975), states that there is a unique configuration of the cell probabilities which
lies in the model and verifies the constraints of the sufficient statistic, and this
configuration is the ML estimate of the cell probabilities.
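For a 2 × 2 table under independence, the conditional mean in Definition 5.17 can be computed exactly by enumerating the fiber of the margins; the sketch below is an illustration of the definition, not the thesis's algorithm (which is the MCMC method of the next section):

```python
from math import comb
from fractions import Fraction

def exact_estimate(f):
    """Minimum-variance estimate of the cell counts of a 2x2 table
    under independence: average the counts over the fiber of the
    margins with hypergeometric weights (Definition 5.17)."""
    f11, f12, f21, f22 = f
    r1, r2 = f11 + f12, f21 + f22            # row sums
    c1 = f11 + f21                           # first column sum
    total = Fraction(0)
    mean = [Fraction(0)] * 4
    # the fiber: all non-negative tables with the same margins
    for a in range(max(0, c1 - r2), min(r1, c1) + 1):
        w = Fraction(comb(r1, a) * comb(r2, c1 - a))   # hypergeometric weight
        table = (a, r1 - a, c1 - a, r2 - c1 + a)
        total += w
        mean = [m + w * t for m, t in zip(mean, table)]
    return [m / total for m in mean]

est = exact_estimate((3, 1, 1, 3))
```

On the table (3, 1, 1, 3) the estimate coincides with the ML fit r_i c_j / N = 2 in every cell, in agreement with Proposition 5.20.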
5.6 Algebraic method
From the above theory, together with the use of the algorithm presented in
Chapter 2, we can derive a Monte Carlo method for the computation of the
estimates fx for all x ∈ X . As in the previous Chapters, let Yt be the set of all
samples with fixed value t of the sufficient statistic TN and Ft be the set of all
frequency tables obtained from samples with value t of the sufficient statistic
TN . As P∗Nφ = φN(TN), then the distribution of the sample given {TN = t} is
uniform on Yt. The distribution of F = (Fx)x∈X is the image of the uniform
distribution, i.e. it is the hypergeometric distribution on Ft, see Chapter 2 for
details.
Given t, we define a Markov chain with state space Ft and with stationary
distribution Ht. The Markov chain is a random walk on Ft with moves in an
appropriate Markov basis M = {m1, . . . , mL}, computed starting from the
matrix representation of the sufficient statistic T . Thus, as we are interested
in the mean value of the cell counts, the Markov chain is constructed following
a Metropolis–Hastings-type algorithm and it is performed as follows:
(a) let S = 0;
(b) at time 0 the chain is at f ;
(c) choose a move m uniformly in the Markov basis and ε = ±1 with prob-
ability 1/2 each independently of m;
(d) if f + εm ≥ 0 then move the chain from f to f + εm with probability
min{Ht(f + εm)/Ht(f), 1} . (5.24)
In all other cases, stay at f ;
(e) let S = S + f ;
(f) repeat B times the steps (c)–(e);
(g) the (approximated) mean is S/B.
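Steps (a)–(g) can be sketched in Python for a 2 × 2 independence model, whose Markov basis is the single move ±(1, −1, −1, 1); the table and chain length below are hypothetical:

```python
import math
import random

def mh_mean(f0, moves, B=20000, seed=1):
    """Steps (a)-(g): approximate the mean of the cell counts under
    the hypergeometric distribution on the fiber of the observed
    table f0 (flattened), with the given Markov basis."""
    random.seed(seed)
    f = list(f0)                              # step (b): start at f
    S = [0.0] * len(f)                        # step (a)
    for _ in range(B):                        # step (f)
        m = random.choice(moves)              # step (c)
        eps = random.choice((-1, 1))
        g = [fj + eps * mj for fj, mj in zip(f, m)]
        if min(g) >= 0:                       # step (d)
            # H_t(g)/H_t(f) = prod_j f_j!/g_j!, computed via lgamma
            ratio = math.exp(sum(math.lgamma(fj + 1) - math.lgamma(gj + 1)
                                 for fj, gj in zip(f, g)))
            if random.random() < min(ratio, 1.0):
                f = g
        S = [s + fj for s, fj in zip(S, f)]   # step (e)
    return [s / B for s in S]                 # step (g)

# 2x2 table (f11, f12, f21, f22); the Markov basis for independence
# is the single move (1, -1, -1, 1)
est = mh_mean([3, 1, 1, 3], [(1, -1, -1, 1)])
```

Every move preserves the margins, so the estimates keep the observed sufficient statistic exactly, while each cell estimate converges to its conditional mean (here 2 in every cell).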
The convergence of this Markov chain has been proved in Chapter 2. As
in many cases we are interested in the computation of the estimates for
two-way or multi-way contingency tables, we can use the results for the
computation of the Markov bases presented in Chapters 3 and 4.
Appendix A
Programs
In this Appendix we show the programs, both symbolic and numerical, needed
for the implementation of the MCMC algorithm for the goodness of fit tests
for log-linear models presented in Chapters 3 and 4. We will use the data and
the quasi-symmetry model for 4×4 tables in Example 3.8, Chapter 3, as the
main example.
The first step of the algorithm needs the computation of the relevant
Markov basis for the model. In order to do that, we use CoCoA. We set
an appropriate polynomial ring, we write the matrix representation of the
sufficient statistic and we use the CoCoA function Toric.
We use a polynomial ring with an indeterminate for each sample point, i.e.,
in our example, with 16 indeterminates. Thus, we write:
Use A::=Q[x[1..16]];
T:=Mat[[1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0],
       [0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0],
       [0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0],
       [0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1],
       [1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0],
       [0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0],
       [0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0],
       [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1],
       [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
       [0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0],
       [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0],
       [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
       [0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0],
       [0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0],
       [0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0],
       [0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0],
       [0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0],
       [0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0]];
I:=Toric(T);
G:=GBasis(I);
G;
The first 4 rows of the matrix represent the row sums, the second 4 rows
represent the column sums, the next 4 rows represent the diagonal cells, and
the remaining rows represent the sums of the pairs of symmetric off-diagonal
cells.
Two remarks are useful for the implementation of this part of the algorithm.
First, we use linear indices for the sample points, i.e. we use the indeterminates
ξ1, . . . , ξ16 instead of the two-way representation ξ1,1, . . . , ξ4,4. This choice
does not present any problems in the multi-way case, as shown by the examples
in Chapter 4. Second, if we want to work with a minimal set of generators
instead of a Grobner basis of the ideal, it is sufficient to replace the command
G:=GBasis(I) with the command G:=MinGens(I).
Now, we have a Grobner basis of the relevant toric ideal. In our example,
the CoCoA output is

[x[2]x[8]x[13] - x[4]x[5]x[14],
 x[2]x[7]x[9] - x[3]x[5]x[10],
 -x[7]x[12]x[14] + x[8]x[10]x[15],
 x[3]x[12]x[13] - x[4]x[9]x[15],
 x[2]x[7]x[12]x[13] - x[4]x[5]x[10]x[15],
 x[3]x[8]x[10]x[13] - x[4]x[7]x[9]x[14],
 -x[3]x[5]x[12]x[14] + x[2]x[8]x[9]x[15]]
Now, we use these binomials for the definition of the Markov basis. We
transform each binomial into a vector, as described in Chapter 2. We can use
the following simple CoCoA function:
Define GensToMat(G);
  -- input: a list of binomials;
  -- output: a matrix containing the vector representation
  -- of the binomials;
  L:=Len(G);
  M:=NewMat(L,NumIndets(),0);
  For I:=1 To L Do
    ListMon:=Monomials(G[I]);
    L1:=Log(ListMon[1]);
    L2:=Log(ListMon[2]);
    M[I]:=L1-L2;
  End;
  Return M;
EndDefine;
Applying this function to the Grobner basis computed above, we obtain
the following matrix
Mat[[0, 1, 0, -1, -1, 0, 0, 1, 0, 0, 0, 0, 1, -1, 0, 0],
    [0, 1, -1, 0, -1, 0, 1, 0, 1, -1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, -1, 0, -1, 0, 1, 0, 1, -1, 0],
    [0, 0, 1, -1, 0, 0, 0, 0, -1, 0, 0, 1, 1, 0, -1, 0],
    [0, 1, 0, -1, -1, 0, 1, 0, 0, -1, 0, 1, 1, 0, -1, 0],
    [0, 0, 1, -1, 0, 0, -1, 1, -1, 1, 0, 0, 1, -1, 0, 0],
    [0, -1, 1, 0, 1, 0, 0, -1, -1, 0, 0, 1, 0, 1, -1, 0]]
Here each row represents a move. This matrix is a concise representation
of the 7 moves found in Chapter 3, Section 3.2.4.
Now, we can use this matrix for the second step of the algorithm, i.e. the
simulation step. For this step we use the software Matlab. After using CoCoA,
we need an easy ASCII manipulation in order to obtain a suitable input format
for Matlab.
function simul=simul(tab,mle,m,N,bis,step);
%tab  = the observed table;
%mle  = maximum likelihood estimate;
%m    = the matrix representation of the moves;
%N    = number of MCMC replicates;
%bis  = number of burn-in steps;
%step = reduction of correlation step;
nc=length(tab);
p=0;
c=zeros(N,1);
chisqref=chisq(tab,mle);
mm=-m;
m=[m; mm];
[nmoves,ncm]=size(m);
numit=N*step+bis;
for i=1:numit
  tabp=zeros(1,nc);
  r=ceil(nmoves*rand(1));
  tabp=tab+m(r,:);
  if tabp>=zeros(1,nc)
    mhr=1;
    for j=1:nc
      if m(r,j)~=0
        mhr=mhr*prod(1:tab(j))/prod(1:tabp(j));
      end;
    end;
    alpha=rand(1);
    if mhr>=alpha
      tab=tabp;
    end;
  end;
  if (rem(i,step)==0) & (i>bis)
    c((i-bis)/step)=chisq(tab,mle);
    if c((i-bis)/step)>=chisqref
      p=p+1;
    end;
  end;
end;
p=p/N;
simul=p;
The value of the Pearson statistic is computed through the following func-
tion:
function chisq=chisq(tab,MLE);
c=0;
l=length(tab);
for i=1:l
  if MLE(i)~=0
    c=c+(tab(i)-MLE(i))*(tab(i)-MLE(i))/MLE(i);
  end;
end;
chisq=c;
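For readers without Matlab, the two functions can be sketched in Python as well. This is a hypothetical, self-contained analogue, not part of the original programs: the names chisq_stat and mc_p_value, and the toy 2 × 2 example at the end, are ours.

```python
import math
import random

def chisq_stat(tab, mle):
    """Pearson chi-square statistic, skipping cells with zero MLE,
    as in the Matlab chisq function above."""
    return sum((t - m) ** 2 / m for t, m in zip(tab, mle) if m != 0)

def mc_p_value(tab, mle, moves, n_rep, burn_in, step, seed=0):
    """Metropolis-Hastings walk over non-negative tables connected
    by the given moves, returning the Monte Carlo p-value of the
    Pearson statistic under the hypergeometric distribution."""
    rng = random.Random(seed)
    # use each move in both directions, as m=[m; -m] does in Matlab
    all_moves = moves + [[-x for x in mv] for mv in moves]
    ref = chisq_stat(tab, mle)
    tab = list(tab)
    count = 0
    n_iter = n_rep * step + burn_in
    for i in range(1, n_iter + 1):
        mv = rng.choice(all_moves)
        new = [t + d for t, d in zip(tab, mv)]
        if all(c >= 0 for c in new):
            # hypergeometric Metropolis-Hastings ratio: product of
            # tab(j)!/tabp(j)! over the cells touched by the move
            mhr = 1.0
            for t, tp, d in zip(tab, new, mv):
                if d != 0:
                    mhr *= math.factorial(t) / math.factorial(tp)
            if rng.random() <= mhr:
                tab = new
        if i > burn_in and (i - burn_in) % step == 0:
            if chisq_stat(tab, mle) >= ref:
                count += 1
    return count / n_rep

# Toy 2x2 independence example: observed table (3,1,1,3), fitted
# counts all equal to 2, single basic move (+1,-1,-1,+1).
print(chisq_stat([3, 1, 1, 3], [2, 2, 2, 2]))  # → 2.0
p = mc_p_value([3, 1, 1, 3], [2, 2, 2, 2], [[1, -1, -1, 1]], 200, 50, 2)
```

The burn-in and thinning follow the same scheme as the Matlab code: N*step + bis iterations in total, sampling every step-th iteration after the burn-in.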
Some remarks about the parameters of the Matlab simulation function:
• the maximum likelihood estimate of the cell counts mle is computed
with statistical software, for example SAS, see SAS/STAT User’s Guide
(2000), unless a closed form is available;
• the number of Monte Carlo replicates N is chosen according to the
precision we want to reach. In our examples we have used N=50,000, and
the motivation for this choice is discussed in Example 3.6, Chapter 3;
• the number of burn-in steps bis and the step for the reduction of the
correlation step should be chosen for each particular problem, based on
inspection of the empirical distribution of the random variables and of
the trace of the autocorrelation function, respectively.
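For the choice of step, one simple empirical check is to compute the lag-k autocorrelation of the sampled statistics and pick the smallest lag at which it is close to zero. The helper below is a hypothetical sketch (the name autocorr is ours):

```python
def autocorr(xs, k):
    """Empirical lag-k autocorrelation of a sequence of sampled
    statistics; values near zero at lag k suggest step = k yields
    roughly uncorrelated draws."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + k] - mean) for i in range(n - k))
    return cov / var

# An alternating sequence is negatively correlated at lag 1 and
# positively correlated at lag 2.
print(autocorr([1, 2, 1, 2, 1, 2, 1, 2], 1))  # → -0.875
print(autocorr([1, 2, 1, 2, 1, 2, 1, 2], 2))  # → 0.75
```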
Finally, with a slight modification of the above program, one can obtain
the approximate distribution of the test statistic.
Bibliography
Agresti, A. (1992). Modelling patterns of agreement and disagreement.
Statistical Methods in Medical Research 1, 201–218.
Agresti, A. (1996). An Introduction to Categorical Data Analysis. New
York: Wiley.
Agresti, A. (2001). Exact inference for categorical data: Recent advances
and continuing controversies. Statist. Med. 20, 2709–2722.
Agresti, A. (2002). Categorical Data Analysis. New York: Wiley, 2nd ed.
Agresti, A., Mehta, C. R. & Patel, N. R. (1990). Exact inference for
contingency tables with ordered categories. J. Amer. Statist. Assoc. 85,
453–458.
Baglivo, J., Olivier, D. & Pagano, M. (1992). Methods for exact
goodness-of-fit tests. J. Amer. Statist. Assoc. 87, 464–469.
Becker, M. P. (1990). Quasisymmetric models for the analysis of square
contingency tables. J. R. Statist. Soc. 52, 369–378.
Bergsma, W. P. & Rudas, T. (2002). Marginal models for categorical data.
Ann. Statist. 30, 140–159.
Bigatti, A., La Scala, R. & Robbiano, L. (1999). Computing toric
ideals. J. Symb. Comput. 27, 351–365.
Bigatti, A. & Robbiano, L. (2001). Toric ideals. Mat. Contemp. 21, 1–25.
Birch, M. W. (1963). Maximum likelihood in three-way contingency tables.
J. R. Statist. Soc. 25, 220–233.
Bishop, Y. M., Fienberg, S. & Holland, P. W. (1975). Discrete multi-
variate analysis: theory and practice. Cambridge: MIT Press.
Booth, J. G. & Butler, R. W. (1999). An importance sampling algorithm
for exact conditional tests in log-linear models. Biometrika 86, 321–331.
Capani, A., Niesi, G. & Robbiano, L. (2000). CoCoA, a system for doing
Computations in Commutative Algebra. Available via anonymous ftp from
cocoa.dima.unige.it, 4th ed.
CDER (2000). Developing Medical Imaging Drugs and Biological products.
Guidance for industry. U.S. Department of Health and Human Services,
Food and Drug Administration.
Chib, S. & Greenberg, E. (1995). Understanding the Metropolis–Hastings
algorithm. Amer. Statist. 49, 327–335.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational
and Psychological Measurement 20, 37–46.
Collombier, D. (1980). Recherches sur l’Analyse des Tables de Contingence.
PhD thesis, Universite Paul Sabatier de Toulouse.
Cox, D., Little, J. & O’Shea, D. (1992). Ideals, Varieties, and Algorithms.
New York: Springer Verlag.
Darroch, J. N. & McCloud, P. I. (1986). Category distinguishability and
observer agreement. Australian Journal of Statistics 28, 371–388.
De Loera, J. A., Hemmecke, R., Tauzer, J. & Yoshida, R. (2003).
Effective lattice point counting in rational convex polytopes. Preprint.
Diaconis, P. & Sturmfels, B. (1998). Algebraic algorithms for sampling
from conditional distributions. Ann. Statist. 26, 363–397.
Dillon, W. R. & Mulani, N. (1984). A probabilistic latent class model
for assessing inter-judge reliability. Multivariate Behavioral Research 19,
438–458.
Dobson, A. J. (1990). An Introduction to Generalized Linear Models. Lon-
don: Chapman & Hall.
Donner, A. & Klar, N. (1996). The statistical analysis of kappa statistics
in multiple samples. Journal of Clinical Epidemiology 49, 1053–1058.
Donner, A., Shoukri, M. M., Klar, N. & Bartfay, E. (2000). Testing
the equality of two dependent kappa statistics. Statist. Med. 19, 373–387.
Duffy, D. (2002). The gllm package. Available from
http://cran.r-project.org, 0th ed.
Fienberg, S. (1980). The Analysis of Cross-Classified Categorical Data.
Cambridge: MIT Press.
Fingleton, B. (1984). Models of Category Counts. Cambridge: Cambridge
University Press.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. New
York: Wiley, 2nd ed.
Garcia, L. D., Stillman, M. & Sturmfels, B. (2003). Algebraic
geometry of Bayesian networks. Preprint.
Geiger, D., Meek, C. & Sturmfels, B. (2002). On the toric algebra of
graphical models. Preprint.
Goodman, L. A. (1979a). Multiplicative models for square contingency tables
with ordered categories. Biometrika 66, 413–418.
Goodman, L. A. (1979b). Simple models for the analysis of association in
cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74,
537–552.
Goodman, L. A. (1985). The analysis of cross-classified data having ordered
and/or unordered categories: Association models, correlation models, and
asymmetry models for contingency tables with or without missing entries.
Ann. Statist. 13, 10–59.
Guggenmoos-Holzmann, I. & Vonk, R. (1998). Kappa-like indices of
agreement viewed from a latent class perspective. Statist. Med. 17, 797–
812.
Haberman, S. J. (1974). The Analysis of Frequency Data. Chicago and
London: The University of Chicago Press.
Haberman, S. J. (1978). Analysis of Qualitative Data. New York: Academic
Press.
Hanley, J. A. (1998). Receiver Operating Characteristic (ROC) curves.
In Encyclopedia of Biostatistics, P. Armitage & T. Colton, eds. Wiley, pp.
3738–3745.
Hirji, K. F. & Johnson, T. D. (1996). A comparison of algorithms for
exact analysis of unordered 2 × k contingency tables. Comput. Statist. Data
Anal. 21, 419–429.
Klar, N., Lipsitz, S. R. & Ibrahim, J. G. (2000). An estimating equation
approach for modelling Kappa. Biom. J. 42, 45–58.
Kreuzer, M. & Robbiano, L. (2000). Computational Commutative Algebra
1. New York: Springer.
Kreuzer, M. & Robbiano, L. (2003). Computational Commutative Algebra
2. New York: Springer. In preparation.
Landis, R. J. & Koch, G. G. (1975). A review of statistical methods in
the analysis of data arising from observer reliability studies, Parts I and II.
Statist. Neerlandica 29, 101–123, 151–161.
Lehmann, E. L. (1983). Theory of Point Estimation. New York: John Wiley
& Sons.
Lehmann, E. L. (1987). Testing Statistical Hypotheses. New York: John
Wiley & Sons.
Margalef Roig, J. & Outerelo Dominguez, E. (1992). Differential
Topology. No. 173 in North-Holland Mathematics Studies. Amsterdam:
North-Holland Publishing Co.
McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models. Lon-
don: Chapman & Hall, 2nd ed.
Mehta, C. R. & Patel, N. R. (1983). A network algorithm for performing
Fisher’s exact test in r × c contingency tables. J. Amer. Statist. Assoc. 78,
427–434.
Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (1988). Importance
sampling for estimating exact probabilities in permutational inference. J.
Amer. Statist. Assoc. 83, 999–1005.
Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (1991). Exact strati-
fied linear rank tests for binary data. In Computing Science and Statistics:
Proceedings of the 23rd Symposium on the Interface, E. M. Keramidas, ed.
American Mathematical Society, pp. 200–207.
Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (2000). Efficient
Monte Carlo methods for conditional logistic regression. J. Amer. Statist.
Assoc. 95, 99–108.
Pistone, G., Riccomagno, E. & Wynn, H. P. (2001a). Algebraic Statis-
tics: Computational Commutative Algebra in Statistics. Boca Raton: Chap-
man&Hall/CRC.
Pistone, G., Riccomagno, E. & Wynn, H. P. (2001b). Computational
commutative algebra in discrete statistics. In Algebraic Methods in Statistics
and Probability, M. A. G. Viana & D. S. P. Richards, eds., vol. 287 of
Contemporary Mathematics. American Mathematical Society, pp. 267–282.
Rudas, T. & Bergsma, W. P. (2002). On generalised symmetry. Tech. rep.,
Paul Sabatier University, Toulouse, France. Preprint.
SAS/STAT User’s Guide (2000). Version 8. Cary, NC: SAS Institute, 1st ed.
Smith, P. W. F., Forster, J. J. & McDonald, J. W. (1996). Monte
Carlo exact tests for square contingency tables. J. R. Statist. Soc. 159,
309–321.
Strawderman, R. L. & Wells, M. T. (1998). Approximately exact
inference for the common odds ratio in several 2 × 2 tables. J. Amer.
Statist. Assoc. 93, 1294–1307.
Sturmfels, B. (1993). Algorithms in Invariant Theory. Texts and Mono-
graphs in Symbolic Computation. New York: Springer.
Sturmfels, B. (1996). Grobner bases and convex polytopes, vol. 8 of Univer-
sity lecture series (Providence, R.I.). American Mathematical Society.
Tanner, M. A. & Young, M. A. (1985). Modelling agreement among
raters. J. Amer. Statist. Assoc. 80, 175–180.
Williamson, J. M., Manatunga, A. K. & Lipsitz, S. R. (1994). Model-
ing kappa for measuring dependent categorical agreement data. Biostatistics
1, 191–202.