
arXiv:1210.1016v1 [stat.ME] 3 Oct 2012

Statistical Science

2012, Vol. 27, No. 3, 412–433

DOI: 10.1214/12-STS396

© Institute of Mathematical Statistics, 2012

Models for Paired Comparison Data: A Review with Emphasis on Dependent Data

Manuela Cattelan

Abstract. Thurstonian and Bradley–Terry models are the most commonly applied models in the analysis of paired comparison data. Since their introduction, numerous developments have been proposed in different areas. This paper provides an updated overview of these extensions, including how to account for object- and subject-specific covariates and how to deal with ordinal paired comparison data. Special emphasis is given to models for dependent comparisons. Although these models are more realistic, their use is complicated by numerical difficulties. We therefore concentrate on implementation issues. In particular, a pairwise likelihood approach is explored for models for dependent paired comparison data, and a simulation study is carried out to compare the performance of maximum pairwise likelihood with other limited information estimation methods. The methodology is illustrated throughout using a real data set about university paired comparisons performed by students.

Key words and phrases: Bradley–Terry model, limited information estimation, paired comparisons, pairwise likelihood, Thurstonian models.

1. INTRODUCTION

Paired comparison data originate from the comparison of objects in couples. This type of data arises in numerous contexts, especially when the judgment of a person is involved. Indeed, it is easier for people to compare pairs of objects than to rank a list of items. There are other situations that may be regarded as comparisons from which a winner and a loser can be identified without the presence of a judge. Both these instances can be analyzed by the techniques described in this paper.

Manuela Cattelan is Postdoctoral Research Fellow, Department of Statistical Sciences, University of Padua, via C. Battisti 241, 35121 Padova, Italy e-mail: [email protected].

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2012, Vol. 27, No. 3, 412–433. This reprint differs from the original in pagination and typographic detail.

The objects involved in the paired comparisons can be beverages, carbon typewriter ribbons, lotteries, players, moral values, physical stimuli and many more. Here, the elements that are compared are called objects or sometimes stimuli. The paired comparisons can be performed by a person, an agent, a consumer, a judge, et cetera, so the terms subject or judge will be employed to denote the person that makes the choice.

The bibliography by Davidson and Farquhar (1976), which includes more than 350 papers related to paired comparison data, testifies to the widespread interest in this type of data. This interest is still present and extensions of models for paired comparison data have been proposed. This paper focuses on recent extensions of the two traditional models, the Thurstone (1927) and the Bradley–Terry (Bradley and Terry (1952)) model, especially those subsequent to the review by Bradley (1976) and the monograph by David (1988), including in particular the work that has been done in the statistical and the psychometric literature.


Section 2 reviews models for independent data. After the introduction of the two classical models for the analysis of paired comparison data and a survey of different areas of application, Sections 2.3 and 2.4 review extensions for ordinal paired comparison data and for inclusion of explanatory variables. Section 3 reviews models that allow for dependence among the observations and outlines the inferential problems related to such an extension. Here, a pairwise likelihood approach is proposed to estimate these models, and a simulation study is performed in order to compare the estimates produced by maximum likelihood, a common type of limited information estimation and pairwise likelihood. Section 4 reviews existing R (R Development Core Team (2011)) packages for the statistical analysis of paired comparison data, and Section 5 concludes.

2. INDEPENDENT DATA

2.1 Traditional Models

Let $Y_{sij}$ denote the random variable associated with the result of the paired comparison between objects $i$ and $j$, $j > i = 1, \ldots, n$, made by subject $s = 1, \ldots, S$, and let $Y_s = (Y_{s12}, \ldots, Y_{s\,n-1\,n})$ be the vector of the results of all paired comparisons made by subject $s$. When $S = 1$ or the difference between judges is not accounted for in the model, the subscript $s$ will be dropped. If each possible paired comparison is performed, they number $N = n(n-1)/2$, and $SN = Sn(n-1)/2$ in a multiple judgment sampling scheme, that is, when all paired comparisons are made by all $S$ subjects. Different sampling schemes are possible. When each paired comparison is performed by a different subject, the outcomes are independent. In other instances, a subject performs more than one paired comparison; in this case, it is conceivable that results of several paired comparisons performed by the same subject will not be independent. In Section 2, independence among observations is assumed while Section 3 addresses the issue of dependent data, assuming that each subject performs all $N$ paired comparisons, except for Section 3.3 which considers the case of dependence not induced by judges.

Let $\mu_i \in \mathbb{R}$, $i = 1, \ldots, n$, denote the notional worth of the objects. Traditional models were developed assuming only two possible outcomes of each comparison, so $Y_{ij}$ is a binary random variable, and $\pi_{ij}$, the probability that object $i$ is preferred to object $j$, depends on the difference between the worth of the two objects

$$\pi_{ij} = F(\mu_i - \mu_j), \tag{2.1}$$

where $F$ is the cumulative distribution function of a zero-symmetric random variable. Such models are called linear models by David (1988). When $F$ is the normal cumulative distribution function, formula (2.1) defines the Thurstone (1927) model, while if $F$ is the logistic cumulative distribution function, then the Bradley–Terry model (Bradley and Terry (1952)) is recovered. Other specifications are possible; for example, Stern (1990) suggests modeling the worth parameters as independent gamma variables with the same shape parameter and different scale parameters. The Thurstone model is also known as the Thurstone–Mosteller model since Mosteller (1951) presented some inferential techniques for the model, while the Bradley–Terry model was also proposed independently by Zermelo (1929) and Ford (1957).

Model (2.1) is called an unstructured model, and the aim of the analysis is to make inference on the vector $\mu = (\mu_1, \ldots, \mu_n)'$ of worth parameters, which can be used to determine a final ranking of all the objects compared. Note that the specification of model (2.1) through all the pairwise differences $\mu_i - \mu_j$ implies that a constraint is needed in order to identify the parameters. Various constraints can be specified: the most common are the sum constraint, $\sum_{i=1}^{n} \mu_i = 0$, and the reference object constraint, $\mu_i = 0$ for one object $i \in \{1, \ldots, n\}$.

The comparative nature of the data poses inferential and interpretational problems. Consider two different studies, for example, about beverages. If subjects were requested to express an absolute measure of like/dislike for each drink on a categorical scale, then the data obtained from the two studies might be analyzed all together. On the contrary, if the subjects express preferences in paired comparisons, the data can be combined only if at least one object is common to both studies; otherwise the data must be analyzed separately, and no conclusions can be drawn about relationships between objects in the two different studies. Indeed, the lack of origin implies that no absolute statement can be made about the data, and two subjects can provide the same sets of preferences, but one may dislike all items while the other may like all of them. The identification of an origin may be useful for understanding the underlying psychological process, for discriminating between desirable and undesirable objects and for identifying the degree of an option's desirability in different conditions. However, it is not possible to recover the origin without further choice experiments and/or further assumptions (Thurstone and Jones (1957); Böckenholt (2004)). Despite all their limits, paired comparison data are widespread because of their ease of performance and their discriminatory ability, since objects that may be judged in the same like/dislike category may be differentiated when compared pairwise.

If the reference object constraint is employed, the identified worth parameters are differences with respect to the reference object. Hence, inference will typically regard differences between estimated worth parameters, with the related statistical problems. For example, for testing $H_0: \mu_i = \mu_j$ by means of the Wald test statistic $(\hat\mu_i - \hat\mu_j)/\widehat{\mathrm{var}}(\hat\mu_i - \hat\mu_j)^{1/2}$, where $\hat\mu_i$ is the maximum likelihood estimator of $\mu_i$, the covariance between the estimators of the worth parameters is needed. In general, the whole covariance matrix of the worth parameters should be reported in order to allow the final users to perform the tests they are interested in. However, it is very inconvenient to report that matrix, and a useful alternative may be to report quasi-standard errors (Firth and de Menezes (2004)) instead of the usual standard errors, since they allow approximate inference on any of the contrasts. Let $c$ be a vector of zero-sum constants. If the components of $\hat\mu$ were independent, then the estimated standard error of $c'\hat\mu$ would be $(\sum_{i=1}^{n} c_i^2 v_i)^{1/2}$, where $v_i$ denotes the estimated variance of $\hat\mu_i$. Quasi-variances are a vector of constants $q$ such that

$$\widehat{\mathrm{var}}(c'\hat\mu) \simeq \sum_{i=1}^{n} c_i^2 q_i,$$

so they have the property that they add over the components of $\hat\mu$, and hence can be used to approximate variances of contrasts of estimated worth parameters as if they were independent. Let $p(q_i + q_j, \widehat{\mathrm{var}}(\hat\mu_i - \hat\mu_j))$ be a penalty function which depends on the quasi-variances and the estimated variance of the difference $\hat\mu_i - \hat\mu_j$; then quasi-variances are computed through minimization of the sum of the penalty function over all contrasts; see Firth and de Menezes (2004, Section 2.1).

Further statistical problems arising from the comparative nature of the data are discussed in Section 3.2.2.
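As a minimal sketch of model (2.1) and of the quasi-variance approximation above, the following Python fragment computes $\pi_{ij}$ under the probit (Thurstone) and logit (Bradley–Terry) links and the approximate standard error of a contrast. The function names and all numerical worths and quasi-variances are hypothetical, chosen only for illustration; they are not taken from the paper or from any package.

```python
import math
from statistics import NormalDist


def preference_probability(mu_i, mu_j, link="probit"):
    """pi_ij = F(mu_i - mu_j) as in model (2.1)."""
    diff = mu_i - mu_j
    if link == "probit":   # F normal: Thurstone model
        return NormalDist().cdf(diff)
    if link == "logit":    # F logistic: Bradley-Terry model
        return 1.0 / (1.0 + math.exp(-diff))
    raise ValueError("link must be 'probit' or 'logit'")


def contrast_se(c, q):
    """Approximate s.e. of the contrast c'mu_hat from quasi-variances q."""
    if abs(sum(c)) > 1e-12:
        raise ValueError("the constants in c must sum to zero")
    return math.sqrt(sum(ci ** 2 * qi for ci, qi in zip(c, q)))


# Hypothetical worths for three objects (reference object constraint: mu_3 = 0).
mu = [0.8, 0.3, 0.0]
p12 = preference_probability(mu[0], mu[1], link="logit")

# Only differences matter: shifting every worth by the same constant
# leaves pi_ij unchanged, which is why an identifying constraint is needed.
shifted = [m + 5.0 for m in mu]
p12_shifted = preference_probability(shifted[0], shifted[1], link="logit")

# Hypothetical quasi-variances; contrast mu_1 - mu_2.
q = [0.0010, 0.0009, 0.0011]
se_12 = contrast_se([1, -1, 0], q)
```

The check on `sum(c)` reflects the requirement that $c$ be a vector of zero-sum constants; the quasi-variance approximation is only meant for contrasts.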

Example. A program supported by the European Union offers an international degree in Economics and Management. Twelve universities take part in this program, and in order to receive a degree, a student in the program must spend a semester at another university taking part in the program. Usually, some universities receive more preferences than others, and this may cause organizational problems. A study was carried out among 303 students of the Vienna University of Economics, who were asked at which university they would prefer to spend the period abroad, with six universities, situated in Barcelona (Escuela Superior de Administración y Dirección de Empresas), London (London School of Economics and Political Science), Milan (Università Luigi Bocconi), Paris (Hautes Études Commerciales), St. Gallen (Hochschule St. Gallen) and Stockholm (Stockholm School of Economics), compared pairwise. This example will be used throughout the paper as an illustration. For an exhaustive analysis of the data refer to Dittrich, Hatzinger and Katzenbeisser (1998, 2001). The data set is available in both the prefmod (Hatzinger (2010)) and the BradleyTerry2 (Turner and Firth (2010a)) R packages; see Section 4. Table 1 reports the aggregated data on the 15 paired comparisons. For example, the first row shows that in the paired comparison between London and Paris, 186 students prefer London, 91 students prefer Paris and 26 students do not have a preference between the two universities. Moreover, 91 students unintentionally overlooked the comparison between Paris and Milan, which has only 212 answers.

Table 1

Universities paired comparison data. 1 and 2 refer to the number of choices in favor of the university in the first and the second column, respectively, while X denotes the number of no preferences expressed

| First university | Second university | 1 | X | 2 |
|---|---|---|---|---|
| London | Paris | 186 | 26 | 91 |
| London | Milan | 221 | 26 | 56 |
| Paris | Milan | 121 | 32 | 59 |
| London | St. Gallen | 208 | 22 | 73 |
| Paris | St. Gallen | 165 | 19 | 119 |
| Milan | St. Gallen | 135 | 28 | 140 |
| London | Barcelona | 217 | 19 | 67 |
| Paris | Barcelona | 157 | 37 | 109 |
| Milan | Barcelona | 104 | 67 | 132 |
| St. Gallen | Barcelona | 144 | 25 | 134 |
| London | Stockholm | 250 | 19 | 34 |
| Paris | Stockholm | 203 | 30 | 70 |
| Milan | Stockholm | 157 | 46 | 100 |
| St. Gallen | Stockholm | 155 | 50 | 98 |
| Barcelona | Stockholm | 172 | 41 | 90 |

The second column of Table 2 shows the estimates of the worth parameters for the six universities using the Thurstone model and adding half of the number of no preferences to each university in the paired comparison. In Section 2.3 a better way to handle no preference data will be discussed.

Table 2

Estimates (Est.), standard errors (S.E.) and quasi-standard errors (Q.S.E.) of the universities worth parameters employing a two-categorical Thurstone model (Thurstone) and a cumulative extension of the Thurstone model (cumulative Thurstone)

| | Thurstone Est. | Thurstone S.E. | Thurstone Q.S.E. | Cumulative Thurstone Est. | Cumulative Thurstone S.E. | Cumulative Thurstone Q.S.E. |
|---|---|---|---|---|---|---|
| Barcelona | 0.333 | 0.043 | 0.030 | 0.332 | 0.041 | 0.028 |
| London | 0.982 | 0.045 | 0.033 | 0.998 | 0.043 | 0.031 |
| Milan | 0.240 | 0.044 | 0.031 | 0.241 | 0.041 | 0.029 |
| Paris | 0.561 | 0.044 | 0.031 | 0.566 | 0.042 | 0.030 |
| St. Gallen | 0.325 | 0.043 | 0.030 | 0.324 | 0.040 | 0.028 |
| Stockholm | 0 | – | 0.031 | 0 | – | 0.029 |
| $\tau_2$ | – | – | – | 0.153 | 0.007 | – |

The reference object constraint is used, and the worth parameter of Stockholm is set to zero. All estimates are positive, so we can conclude that Stockholm is the least preferred university, while London is the most preferred one, followed by Paris, Barcelona, St. Gallen and Milan. The estimated probability that London is preferred to Paris is $\Phi(0.982 - 0.561) = 0.66$, where $\Phi$ denotes the cumulative distribution function of a standard normal random variable. If it is of interest to test whether the worth of St. Gallen is significantly higher than the worth of Milan, the standard error of the difference between these two worth parameters can be approximated by means of the quasi-standard errors as $(0.030^2 + 0.031^2)^{1/2} = 0.043$. Quasi-standard errors are lower than standard errors, thus accounting for the positive covariance between parameter estimates. The value of the test statistic is $(0.325 - 0.240)/0.043 = 1.98$, which yields a p-value of 0.02; hence the hypothesis of equal worth parameters between St. Gallen and Milan is not supported by the data.
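The arithmetic of this example can be retraced with a short Python sketch that uses only the estimates and quasi-standard errors in the Thurstone columns of Table 2. The variable names are illustrative, and the p-value is computed as one-sided, which is an interpretation of the test described above. Without intermediate rounding of the standard error the statistic is about 1.97 rather than 1.98, which leaves the conclusion unchanged.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal cumulative distribution function

# Two-category Thurstone estimates and quasi-standard errors from Table 2.
est = {"Barcelona": 0.333, "London": 0.982, "Milan": 0.240,
       "Paris": 0.561, "St. Gallen": 0.325, "Stockholm": 0.0}
qse = {"Barcelona": 0.030, "London": 0.033, "Milan": 0.031,
       "Paris": 0.031, "St. Gallen": 0.030, "Stockholm": 0.031}

# Ranking of the universities by estimated worth, most preferred first.
ranking = sorted(est, key=est.get, reverse=True)

# Estimated probability that London is preferred to Paris.
p_london_paris = Phi(est["London"] - est["Paris"])

# Approximate s.e. of the St. Gallen - Milan difference from quasi-s.e.'s.
se_diff = sqrt(qse["St. Gallen"] ** 2 + qse["Milan"] ** 2)

# Wald statistic and one-sided p-value for "St. Gallen worth > Milan worth".
z = (est["St. Gallen"] - est["Milan"]) / se_diff
p_value = 1.0 - Phi(z)

print(ranking)
print(round(p_london_paris, 2), round(se_diff, 3), round(z, 2), round(p_value, 2))
```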

2.2 Applications

There are many different areas in which paired comparison data arise. Here, a number of recent applications are described, and further references can be found in Bradley (1976), Davidson and Farquhar (1976) and David (1988).

Despite their simplicity, the basic Bradley–Terry and Thurstone models have found a wide range of applications. Choisel and Wickelmaier (2007) analyze pairwise evaluations of sounds through a standard Bradley–Terry model, while Bäuml (1994) and Kissler and Bäuml (2000) present applications involving facial attractiveness. In Mazzucchi, Linzey and Bruning (2008) the standard Bradley–Terry model is applied to a reliability problem. A panel of wiring experts is asked to state which is the riskier of different scenarios compared pairwise, in order to determine the probability of wire failure as a function of influencing factors in an aircraft environment. Stigler (1994) uses the traditional Bradley–Terry model for ranking scientific journals, and the same model is exploited in genetics by Sham and Curtis (1995).

Maydeu-Olivares and Böckenholt (2008) list 10 reasons to use Thurstone’s model for analyzing subjective health outcomes, including the ease for respondents, the existence of extensions for modeling inconsistent choices and for including covariates and the possibility to investigate which aspects influence the choices of subjects.

In many applications there are more than two possible outcomes of the comparisons. Henery (1992) employs a Thurstone model for ranking chess players and adapts it to three possible results: win, draw and loss. Böckenholt and Dillon (1997a) consider a five-response-categories model for applications to taste testing of beverages and to preferences for brands of cigarettes. Dittrich, Hatzinger and Katzenbeisser (2004) consider motives to start a Ph.D. program using three response categories in the log-linear version of the Bradley–Terry model.

It is often of interest to investigate whether some covariates affect the results of the comparisons. Ellermeier, Mader and Daniel (2004) employ a Bradley–Terry model to analyze pairwise evaluations of sounds and include sound-related covariates, for example, roughness, sharpness, et cetera, to evaluate which of them contribute to the unpleasantness of sounds. Duineveld, Arents and King (2000) use the log-linear formulation of the Bradley–Terry model to investigate consumer preference data on orange soft drinks, including an analysis of the factorial design for the drinks compared, while Francis et al. (2002) include subject-specific covariates in the analysis of value orientation of people in different European countries. Applications of the Bradley–Terry model are present also in zoological data in order to investigate aspects of animal behavior considering animal-specific covariates (Stuart-Fox et al. (2006); Whiting et al. (2006); Head et al. (2008)). Agresti [(2002), Chapter 10] extends the Bradley–Terry model to account for the home advantage effect in baseball data.

Sometimes it is more realistic to include dependence among observations. Object-specific random effects can be used to introduce correlation between comparisons with common objects, for example, in sports data (Cattelan (2009)). When all judges perform all paired comparisons, random effects can introduce correlation between preferences expressed by the same subject involving a common object, as shown in Böckenholt and Tsai (2007) for the university preference data.

When paired comparisons are performed over prolonged time periods, it may be necessary to account for the time dimension. McHale and Morton (2011) estimate a Bradley–Terry model in which tennis matches distant in time are down-weighted since the aim is to predict the results of future matches. Further dynamic extensions for sports data have been proposed by Barry and Hartigan (1993), Fahrmeir and Tutz (1994), Knorr-Held (2000) and Cattelan, Varin and Firth (2012). In tournaments it may happen that a player wins all the comparisons in which he is involved. In this case a standard Bradley–Terry or Thurstone model would estimate an infinite worth parameter for this player. Mease (2003) proposes a penalization of the likelihood which overcomes this problem. The method proposed by Firth (1993) to reduce the bias of the maximum likelihood estimates is an alternative technique to obtain finite estimates in this instance. Finally, the case in which the margin of victory in sport contests is not discrete, but continuous, is analyzed in Stern (2011).

In the context of the log-linear specification of the Bradley–Terry model, Dittrich et al. (2012) account also for missing responses in a study about the qualities of a good teacher.

2.3 Ordinal Paired Comparisons

Sometimes subjects are requested to express a degree of preference. Suppose that objects $i$ and $j$ are compared, and the subject can express strong preference for $i$ over $j$, mild preference for $i$, no preference, mild preference for $j$ over $i$ or strong preference for $j$. If $H$ denotes the number of grades of the scale, then in this example, $H = 5$.

Let $Y_{ij} = 1, \ldots, H$, where 1 denotes the least favorable response for $i$, and $H$ is the most favorable response for $i$. Agresti (1992) shows how two models for the analysis of ordinal data can be adapted to ordinal paired comparison data. The cumulative link models exploit the latent random variable representation. Let $Z_{ij}$ be a continuous latent random variable, and let $\tau_1 < \tau_2 < \cdots < \tau_{H-1}$ denote thresholds such that $Y_{ij} = h$ when $\tau_{h-1} < Z_{ij} \leq \tau_h$. Then,

$$\mathrm{pr}(Y_{ij} \leq y_{ij}) = F(\tau_{y_{ij}} - \mu_i + \mu_j), \tag{2.2}$$

where $-\infty = \tau_0 < \tau_1 < \cdots < \tau_{H-1} < \tau_H = \infty$, and $F$ is the cumulative distribution function of the latent variable $Z_{ij}$. $F$ is usually assumed to be either the logistic or the normal distribution function, leading to the cumulative logit or the cumulative probit model, respectively. The symmetry of the model imposes that $\tau_h = -\tau_{H-h}$, $h = 1, \ldots, H$, and $\tau_{H/2} = 0$ when $H$ is even. When $H = 3$ there are two threshold parameters, $\tau_1$ and $\tau_2$, such that $\tau_1 = -\tau_2$, and model (2.2) corresponds to the extension of the Bradley–Terry model introduced by Rao and Kupper (1967) when a logit link is considered, and the extension of the Thurstone model by Glenn and David (1960) when the probit link is employed.

An alternative model proposed by Agresti (1992) is the adjacent categories model. In this case the link is applied to adjacent response probabilities, rather than cumulative probabilities, and reduces to the Bradley–Terry model when only 2 categories are allowed and to the model proposed by Davidson (1970) when 3 categories are allowed. The adjacent categories model is simpler to interpret than cumulative link models since the odds ratio refers to a given outcome instead of referring to groupings of outcomes (Agresti (1992)). The adjacent categories model, as well as the Bradley–Terry model, has also a log-linear representation (Dittrich, Hatzinger and Katzenbeisser (2004)).

An application of the adjacent categories model to market data is illustrated in Böckenholt and Dillon (1997b). Böckenholt and Dillon (1997a) note that a bias may be caused by the usage of the scale because subjects may use only subsets of all categories. The threshold parameters $\tau_h$ can account for the selection bias; for example, in the cumulative probit model the quantity $\Phi(\tau_h) - \Phi(\tau_{h-1})$ gives the category selection bias since it is the probability of selecting category $h$ when the two stimuli are equal. Different latent classes of consumers with different threshold values and worth parameters can be identified. If subjects share the same worth parameters but have different thresholds, it is possible to let thresholds depend on subject-specific covariates and to have a random part (Böckenholt (2001b)). It is also possible to define thresholds that depend on the objects compared, as in Henery (1992).
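The cumulative link model (2.2) and the category selection bias can be sketched in Python as follows. The function name, the threshold values for $H = 5$ and the worths are hypothetical and serve only as an illustration; the probit link is the default and any other distribution function could be passed instead.

```python
from statistics import NormalDist


def category_probabilities(mu_i, mu_j, thresholds, cdf=NormalDist().cdf):
    """Return [pr(Y_ij = 1), ..., pr(Y_ij = H)] implied by model (2.2).

    thresholds holds tau_1 < ... < tau_{H-1}; tau_0 = -inf and tau_H = +inf
    are handled by fixing the cumulative probabilities at 0 and 1.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be increasing")
    cumulative = [0.0]
    cumulative += [cdf(tau - mu_i + mu_j) for tau in thresholds]
    cumulative.append(1.0)
    # pr(Y = h) = pr(Y <= h) - pr(Y <= h - 1)
    return [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]


# Hypothetical symmetric thresholds for H = 5 (tau_h = -tau_{H-h}).
taus = [-1.0, -0.3, 0.3, 1.0]

# Category selection bias: category probabilities when the stimuli are equal.
bias = category_probabilities(0.0, 0.0, taus)

# Hypothetical worths; category H is the most favorable response for i.
probs_ij = category_probabilities(0.6, 0.1, taus)
probs_ji = category_probabilities(0.1, 0.6, taus)
```

With symmetric thresholds, swapping the roles of $i$ and $j$ reverses the order of the category probabilities, which is the symmetry property mentioned above.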

Example. In the paired comparisons of universities, students were allowed to express no preference between two universities. Therefore, the data should be analyzed by means of a model for ordinal data. Columns 5–7 in Table 2 show the estimates of a cumulative probit extension of the Thurstone model for the university data. The estimated threshold parameter $\hat\tau_2 = 0.153$ is highly significant. In this particular case, the estimates of the worth parameters and their standard errors are very similar to those of the model with two categories, and the ranking of universities remains the same, but in general, especially when the number of no preferences is large, results can be different. Moreover, in this case it is possible to estimate the probability of no preference between London and Paris, which is $\Phi(0.153 - 0.998 + 0.566) - \Phi(-0.153 - 0.998 + 0.566) = 0.11$, and the estimated probability that London is preferred to Paris reduces to $1 - \Phi(0.153 - 0.998 + 0.566) = 0.61$; hence the estimated probability that Paris is preferred to London is 0.28. There is not much difference from the previous result in the test of equality of worth parameters for the universities in St. Gallen and Milan.
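The three probabilities of this example follow directly from model (2.2) with $H = 3$ and $\tau_1 = -\tau_2$. A minimal Python sketch, using only the cumulative Thurstone estimates reported in Table 2 (variable names are illustrative):

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal cumulative distribution function

# Cumulative probit estimates from Table 2.
tau2 = 0.153
mu_london, mu_paris = 0.998, 0.566
diff = mu_london - mu_paris

# Three ordered outcomes for the pair (London, Paris):
# Paris preferred, no preference, London preferred.
p_paris = Phi(-tau2 - diff)
p_no_preference = Phi(tau2 - diff) - Phi(-tau2 - diff)
p_london = 1.0 - Phi(tau2 - diff)

print(round(p_london, 2), round(p_no_preference, 2), round(p_paris, 2))
```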

2.4 Explanatory Variables

In many instances, it is of interest to investigate whether some explanatory variables affect the results of the comparisons. Explanatory variables can be related to the objects compared, to the subjects performing the comparisons or they can be comparison-specific.

Let $x_i = (x_{i1}, \ldots, x_{iP})'$ be a vector of $P$ explanatory variables related to object $i$ and $\beta = (\beta_1, \ldots, \beta_P)$ be a $P$-dimensional parameter vector. Then, in the context of the Bradley–Terry model, Springall (1973) proposes to describe the worth parameters as the linear combination

$$\mu_i = x_{i1}\beta_1 + \cdots + x_{iP}\beta_P, \quad i = 1, \ldots, n. \tag{2.3}$$

A paired comparison model with explanatory variables is called a structured model. The same extension can be applied to the Thurstone model. Note that since only the differences $\mu_i - \mu_j = (x_i - x_j)'\beta$ enter the linear predictor, an intercept cannot be identified. In some instances, both worth parameters of objects and further object-specific covariates are included, hence the linear predictor assumes the form $\mu_i - \mu_j + (x_i - x_j)'\beta$; see Stern (2011).
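The structured worth (2.3) and the reason why an intercept cannot be identified can be sketched in a few lines of Python. The covariate values, the coefficients and the function names below are hypothetical and are used only to illustrate the cancellation in the differences.

```python
def worth(x, beta):
    """mu_i = x_i1 * beta_1 + ... + x_iP * beta_P, as in (2.3)."""
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    return sum(xp * bp for xp, bp in zip(x, beta))


def linear_predictor(x_i, x_j, beta):
    """(x_i - x_j)'beta, the quantity that enters F in a structured model."""
    return worth(x_i, beta) - worth(x_j, beta)


# Hypothetical design: P = 2 object-specific covariates for n = 3 objects.
X = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
beta = [0.7, -0.4]
eta_12 = linear_predictor(X[0], X[1], beta)

# Adding a constant column (an intercept) shifts every worth by the same
# amount, so it cancels in each difference whatever its coefficient is.
X_int = [[1.0] + row for row in X]
eta_12_a = linear_predictor(X_int[0], X_int[1], [2.0] + beta)
eta_12_b = linear_predictor(X_int[0], X_int[1], [-3.0] + beta)
```

Both `eta_12_a` and `eta_12_b` coincide with `eta_12` up to floating point error, so the data carry no information about the intercept coefficient.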

Model (2.3) has been extended to more flexible models, such as additive combinations of spline smoothers (De Soete and Winsberg (1993)); however, large data sets may be necessary to estimate nonlinearities reliably, even though there is no investigation about this issue.

In case worth parameters are specified as in (2.3), standard errors for the worth parameters can be computed through the delta method, while when both the worth parameters and covariates are included in the linear predictor, quasi-standard errors can be computed for the worth parameters.

The results of the comparisons can be influenced also by characteristics of the subject that performs the paired comparisons. In the log-linear representation of the Bradley–Terry model, Dittrich, Hatzinger and Katzenbeisser (1998) show how to include categorical subject-specific covariates, while Francis et al. (2002) tackle the problem of continuous subject-specific covariates and consider also the case in which some of these covariates have a smooth nonlinear relationship.

Dillon, Kumar and De Borrero (1993) consider a marketing application and divide subjects into latent classes to which they belong with a probability that depends on their explanatory variables.

Covariates can be added in the linear predictor (2.3) if they are subject-object interaction effects. For example, the knowledge of a foreign language may influence the preference for a university. An interaction effect can account for whether the student knows, for example, Spanish and one object in the comparison is the university in Barcelona. Unfortunately, subject covariates that do not interact with objects, such as age of respondents, cannot be included, since they cancel in the difference $\mu_i - \mu_j$.

A semiparametric approach which accounts for subject-specific covariates is proposed by Strobl, Wickelmaier and Zeileis (2011), who suggest a methodology to partition recursively the subjects that perform the paired comparisons on the basis of their covariates. The procedure tests whether structural changes in the parameters occur for subjects with different values of the covariates. Subjects are split according to the test and a different unstructured Bradley–Terry model is fitted for each subgroup. The method allows us to identify which covariates influence the worth parameters without the need to assume a model for them and finds the best cut point in case of continuous covariates. Moreover, it is possible to include subject-specific covariates, not only interaction effects. Attention is needed in setting the minimum number of subjects per class and in setting the significance level of the test in order to avoid overfitting for large data sets. Unlike the usual latent class models, the method allows subjects to be divided on the basis of their covariates; however, if some important subject-specific covariates are not available, it may be expected that the usual latent class model will perform better. In Strobl, Wickelmaier and Zeileis (2011) an unstructured Bradley–Terry model is estimated for each subgroup, but it seems possible to extend the method also to structured models.

Finally, there may be also comparison-specific covariates which are related to the objects, but change from comparison to comparison. An example of a comparison-specific covariate is the home advantage effect in sport tournaments since it depends on whether one of the players competes in the home field. This effect may be accounted for by adding a further term in the linear predictor (2.3). Another example is the experience effect in contests between animals which, in Stuart-Fox et al. (2006), is accounted for through a covariate that counts the number of previous contests fought by animals.
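A comparison-specific covariate such as the home advantage can be sketched by adding one term to the linear predictor of a Bradley–Terry model. The parameter name `delta`, the coding of the `home` indicator and the numerical values are assumptions made for illustration; they are not the specification of any of the papers cited above.

```python
import math


def win_probability(mu_i, mu_j, delta=0.0, home=0):
    """Bradley-Terry probability that i beats j with a home-advantage term.

    home = +1 if i competes at home, -1 if j competes at home,
    0 on neutral ground (hypothetical coding).
    """
    if home not in (-1, 0, 1):
        raise ValueError("home must be -1, 0 or 1")
    eta = mu_i - mu_j + delta * home
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical worths and home-advantage effect.
mu_a, mu_b, delta = 0.4, 0.1, 0.3
p_a_home = win_probability(mu_a, mu_b, delta, home=1)
p_neutral = win_probability(mu_a, mu_b, delta, home=0)
p_a_away = win_probability(mu_a, mu_b, delta, home=-1)
```

Because the indicator changes from one comparison to the next, the same pair of objects can have different preference probabilities in different comparisons, which is what distinguishes this term from an object-specific covariate.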

Example. In the universities paired comparisons, it may be of interest to assess whether some object-specific covariates influence the results of the comparisons. The universities in London and Milan specialize in economics, the universities in Paris and Barcelona specialize in management science and the remaining two in finance. This aspect may influence the decisions of students. Another element that may affect the comparisons is the location of the universities; in this respect, they can be divided into universities in Latin countries (Italy, France and Spain) and universities in other countries.

Some features of the students that performed the universities paired comparisons were collected, too. In particular, it is known whether students have good knowledge of English, Italian, Spanish and French and which is the main topic of their studies. It is conceivable that, for example, students with a good knowledge of French are more inclined to prefer the university in Paris. Table 3 shows the estimates of a model with a linear predictor that includes object-specific covariates and subject-object interaction effects. Universities in non-Latin countries are preferred to those in Latin countries, and universities that specialize in finance seem less appealing to students. The good knowledge of a foreign language induces students to choose the university situated in the country where that foreign language is spoken.

Table 3

Estimates (Est.) and standard errors (S.E.) of universities data with subject- and object-specific covariates

| | Est. | S.E. |
|---|---|---|
| Economics | 0.757 | 0.066 |
| Management | 0.789 | 0.080 |
| Latin country | −0.835 | 0.071 |
| Discipline:Management | 0.238 | 0.054 |
| English:London | 0.141 | 0.075 |
| French:Paris | 0.652 | 0.049 |
| Italian:Milan | 1.004 | 0.094 |
| Spanish:Barcelona | 0.831 | 0.095 |
| $\tau_2$ | 0.160 | 0.007 |

Consider a student with a good knowledge of both English and French and whose main discipline of study is management; then the estimated probability that this student prefers London to Paris is $1 - \Phi\{0.160 - (0.141 + 0.757 - 0.652 - 0.789 + 0.835 - 0.238)\} = 0.46$, while the estimated probabilities of no preference and preference for Paris are 0.13 and 0.41, respectively. If this student’s main discipline of study was not management, which is the subject in which Paris specializes, then the above estimated probabilities of preferring London, no preference and preferring Paris would become 0.55, 0.12 and 0.33, respectively.
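The probabilities in this example can be retraced with the sketch below, which builds the worths of London and Paris from the estimates in Table 3 and applies the three-category cumulative probit model. The function and variable names are illustrative. Because the published coefficients are rounded, the results agree with the values reported above only to within about 0.01.

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal cumulative distribution function

# Estimates from Table 3 that are relevant to the pair (London, Paris).
coef = {
    "Economics": 0.757, "Management": 0.789, "Latin country": -0.835,
    "Discipline:Management": 0.238, "English:London": 0.141,
    "French:Paris": 0.652,
}
tau2 = 0.160


def london_vs_paris(studies_management):
    """Return (pr London preferred, pr no preference, pr Paris preferred)
    for a student with good knowledge of both English and French."""
    # London: specializes in economics, non-Latin country, English spoken.
    mu_london = coef["Economics"] + coef["English:London"]
    # Paris: specializes in management, Latin country, French spoken.
    mu_paris = coef["Management"] + coef["Latin country"] + coef["French:Paris"]
    if studies_management:
        mu_paris += coef["Discipline:Management"]
    diff = mu_london - mu_paris
    p_paris = Phi(-tau2 - diff)
    p_none = Phi(tau2 - diff) - p_paris
    p_london = 1.0 - Phi(tau2 - diff)
    return p_london, p_none, p_paris


management_student = london_vs_paris(studies_management=True)
other_student = london_vs_paris(studies_management=False)
```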

3. MODELS FOR DEPENDENT DATA

3.1 Intransitive Preferences

The models presented so far are estimated assuming independence among all observations. The inclusion of a dependence structure is not only more realistic, but also has an impact on the transitivity properties of the model. Intransitive choices occur when object $i$ is preferred to $j$, and object $j$ is preferred to $k$, but in the paired comparison between $i$ and $k$, the latter is preferred. These are also called circular triads. Paired comparison models can present different transitivity properties. Assume that $\pi_{ij} \geq 0.5$ and $\pi_{jk} \geq 0.5$; then a model satisfies:

• weak stochastic transitivity if $\pi_{ik} \geq 0.5$;
• moderate stochastic transitivity if $\pi_{ik} \geq \min(\pi_{ij}, \pi_{jk})$;
• strong stochastic transitivity if $\pi_{ik} \geq \max(\pi_{ij}, \pi_{jk})$.
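The three conditions can be checked for a given triple of preference probabilities with the Python sketch below. The function names and the worth values are hypothetical; the function returns the strongest condition that holds, since strong transitivity implies moderate transitivity, which in turn implies weak transitivity.

```python
import math


def transitivity_level(p_ij, p_jk, p_ik):
    """Strongest stochastic transitivity condition met by one triple.

    Requires p_ij >= 0.5 and p_jk >= 0.5, as in the definitions above.
    """
    if p_ij < 0.5 or p_jk < 0.5:
        raise ValueError("the definitions assume p_ij >= 0.5 and p_jk >= 0.5")
    if p_ik >= max(p_ij, p_jk):
        return "strong"
    if p_ik >= min(p_ij, p_jk):
        return "moderate"
    if p_ik >= 0.5:
        return "weak"
    return "none"


def bradley_terry(mu_a, mu_b):
    """Bradley-Terry preference probability F(mu_a - mu_b), F logistic."""
    return 1.0 / (1.0 + math.exp(-(mu_a - mu_b)))


# Hypothetical worths with mu_i >= mu_j >= mu_k.
mu_i, mu_j, mu_k = 1.0, 0.4, 0.0
level = transitivity_level(
    bradley_terry(mu_i, mu_j),
    bradley_terry(mu_j, mu_k),
    bradley_terry(mu_i, mu_k),
)
```

For any worths ordered in this way the difference $\mu_i - \mu_k$ is at least as large as both $\mu_i - \mu_j$ and $\mu_j - \mu_k$, so a monotone $F$ always yields the strong condition.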

The Bradley–Terry and Thurstone models as presented so far satisfy strong stochastic transitivity. This property may be desirable sometimes, for example, when asking wiring experts which is the riskier situation between different scenarios in an aircraft environment. In this case it is desirable that choices are consistent, so Mazzucchi, Linzey and Bruning (2008) use transitivity to check the level of reliability of experts. However, in some situations choices can be systematically intransitive, for example, when the same objects have more than one aspect of interest, and different aspects prevail in different comparisons.

Causeur and Husson (2005) propose a two-dimensional Bradley–Terry model in which the worth parameter of each object is bidimensional and can thus be represented on a plane. A further multidimensional extension is proposed by Usami (2010). However, this methodology does not provide a final ranking of all objects.

A different method that allows the inclusion in the model even of systematic intransitive comparisons while yielding a ranking of all the objects consists of modeling the dependence structure among comparisons. The development of inferential techniques for dependent data has recently allowed an investigation of models for dependent observations.

3.2 Multiple Judgment Sampling

The assumption of independence is questioned in the case of multiple judgment sampling, that is, when $S$ people make all the $N$ paired comparisons. It seems more realistic to assume that the comparisons made by the same person are dependent. This aspect has received much attention in the literature during the last decade.

3.2.1 Thurstonian models. The original model proposed by Thurstone (1927) includes correlation among the observations. The model was developed for analyzing sensorial discrimination and assumes that the stimuli $T = (T_1, \ldots, T_n)'$ compared in a paired comparison experiment follow a normal distribution, $T \sim N(\mu, \Sigma_T)$, with mean $\mu = (\mu_1, \ldots, \mu_n)'$ and variance $\Sigma_T$. Thurstone (1927) proposes different models with different covariance matrices of the stimuli, so the set of models which assume a normal distribution of the stimuli is called Thurstonian models. The single realization $t_i$ of the stimulus $T_i$ can vary, and the result of the paired comparison between the same two stimuli can be different on different occasions. Assume that only a preference for $i$ or a preference for $j$ can be expressed. Then object $i$ is preferred in a paired comparison when $T_i > T_j$ or, equivalently, a win for $i$ is observed when the latent random variable $Z_{ij} = T_i - T_j$ is positive; otherwise a win for $j$ occurs. In the context of multiple judgment sampling, Takane (1989) proposes to include a vector of pair-specific errors. Let $Z_s = (Z_{s12}, \ldots, Z_{s\,n-1\,n})'$ be the vector of all latent continuous random variables pertaining to subject $s$, then

$$Z_s = A T + e_s, \tag{3.1}$$

where $e_s = (e_{s12}, e_{s13}, \ldots, e_{s\,n-1\,n})'$ is the vector of pair-specific errors, which has zero mean and covariance $\Omega$ and is independent of $T$ and of $e_{s'}$ for any other subject $s' \neq s$, and $A$ is the design matrix of paired comparisons whose rows identify the paired comparisons and whose columns correspond to the objects. For example, if $n = 4$, and the paired comparisons are (1,2), (1,3), (1,4), (2,3), (2,4) and (3,4), then

$$A = \begin{pmatrix}
1 & -1 & 0 & 0 \\
1 & 0 & -1 & 0 \\
1 & 0 & 0 & -1 \\
0 & 1 & -1 & 0 \\
0 & 1 & 0 & -1 \\
0 & 0 & 1 & -1
\end{pmatrix}.$$
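The design matrix for any number of objects follows the same pattern as the $n = 4$ example. A minimal sketch, with hypothetical function names and a hypothetical realization of the stimuli, builds $A$ for all pairs in lexicographic order and forms the error-free latent differences $At$.

```python
from itertools import combinations


def design_matrix(n):
    """Rows: pairs (i, j) with i < j in lexicographic order; columns: objects."""
    rows = []
    for i, j in combinations(range(n), 2):
        row = [0] * n
        row[i], row[j] = 1, -1  # +1 for the first object, -1 for the second
        rows.append(row)
    return rows


def matvec(matrix, vector):
    """Matrix-vector product using plain lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


A = design_matrix(4)
# Hypothetical realization t of the stimuli; A t gives the differences t_i - t_j.
t = [0.5, 0.0, -0.5, 0.0]
differences = matvec(A, t)
print(A)
print(differences)
```

The number of rows equals $n(n-1)/2$, so the length of $Z_s$ grows quadratically with the number of objects.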

A similar model is employed by Bockenholt and Tsai (2001), who assume that $e_s \sim N(0, \omega^2 I_N)$. The more general analysis of covariance structure proposed by Takane (1989) can accommodate both the wandering vector and the wandering ideal point models (Carroll and De Soete (1991)), which are models with different assumptions about the mechanism originating the data. The wandering vector and wandering ideal point models do not impose the number of dimensions, which is determined from the data alone, so they are powerful models for analyzing human choice behavior and inferring perceptual dimensions.

The model thus specified is over-parametrized. To reduce the number of parameters, Thurstone (1927) proposes different restrictions on the covariance matrix $\Sigma_T$, while Takane (1989) proposes a factor model. Nonetheless, these models with a reduced number of parameters need further identification restrictions; see Section 3.2.2.

A further extension of model (3.1) is proposed by Tsai and Bockenholt (2008), who unify Tsai and Bockenholt (2006) with Takane (1989) to obtain a general class of models that can account simultaneously for transitive choice behavior and systematic deviations from it. In this case the latent variable is

$$Z_s = A T + B V_s, \tag{3.2}$$

where $V_s = (V_{s1(2)}, V_{s1(3)}, \ldots, V_{s2(1)}, V_{s2(3)}, \ldots, V_{sn(n-1)})'$ is a vector of zero mean random effects designed so as to capture the random variation in judging an object when compared to another specific object, and $B$ is a matrix with rows corresponding to the paired comparisons and columns corresponding to the elements of $V_s$. For example, if $n = 3$, then $V_s = (V_{s1(2)}, V_{s1(3)}, V_{s2(1)}, V_{s2(3)}, V_{s3(1)}, V_{s3(2)})'$ and

$$B = \begin{pmatrix}
1 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 1 & 0 & -1
\end{pmatrix}.$$

It is assumed that $V_s$, the within-judge variability, is normally distributed with mean 0 and covariance $\Sigma_V$, so that $Z_s \sim N(A\mu, A\Sigma_T A' + B\Sigma_V B')$.
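The matrix $B$ can be generated for any $n$ once an ordering of the elements of $V_s$ is fixed. The sketch below assumes the ordering suggested by the $n = 3$ example (for each object $i$, all other objects $j$ in increasing order); the function names are hypothetical.

```python
from itertools import combinations


def pair_effect_labels(n):
    """Labels (i, j) of the effects V_{s i(j)}, objects numbered from 1."""
    return [(i, j) for i in range(1, n + 1) for j in range(1, n + 1) if j != i]


def b_matrix(n):
    """Rows: paired comparisons (i, j), i < j; columns: elements of V_s."""
    labels = pair_effect_labels(n)
    column = {label: c for c, label in enumerate(labels)}
    rows = []
    for i, j in combinations(range(1, n + 1), 2):
        row = [0] * len(labels)
        row[column[(i, j)]] = 1   # effect of judging i when compared with j
        row[column[(j, i)]] = -1  # effect of judging j when compared with i
        rows.append(row)
    return rows


B = b_matrix(3)
print(B)
```

For $n$ objects the vector $V_s$ has $n(n-1)$ elements, twice the number of paired comparisons.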

In the remainder of the paper it will be assumed that there are only two possible outcomes of the comparisons, but it is easy to extend this model to ordinal data through the introduction of threshold parameters with a specification analogous to (2.2).

3.2.2 Identification. Psychometricians are interested in understanding the relations between stimuli; hence they are primarily interested in the unstructured and unrestricted Thurstonian models. Unfortunately, due to the comparative nature of the data, some identification restrictions on the covariance matrix are needed. The necessary identification restrictions to estimate model (3.1) are discussed in Maydeu-Olivares (2001, 2003), Tsai and Bockenholt (2002) and Tsai (2003). Consider the covariance matrix $\Sigma_Z = \mathrm{Cov}(Z_s) = A\Sigma_T A' + \Omega$, where $\Sigma_T$ is an unrestricted covariance matrix. Because of the difference structure of the judgments, $\Sigma_T$ and $\Sigma_T + d1' + 1d'$, where $1$ is a vector of $n$ ones and $d$ is an $n$-dimensional vector of constants such that the matrix remains positive definite, are not distinguishable (Tsai (2000)). Indeed, let $K = [I_{n-1} \mid -1]$ be an identity matrix of dimension $n-1$ to which a column of elements equal to $-1$ is added, then only $K\mu$ and $K\Sigma_T K'$ are identifiable. For example, the matrices

$$\Sigma_{T,1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
\Sigma_{T,2} = \begin{pmatrix} 0.750 & 0.125 & 0 \\ 0.125 & 1.5 & 0.375 \\ 0 & 0.375 & 1.250 \end{pmatrix}$$

are not distinguishable because $K\Sigma_{T,1}K' = K\Sigma_{T,2}K' = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$, where the second matrix is obtained from the first one by setting $d = (-1/8, 1/4, 1/8)$. This consideration remains valid for any generic matrix of contrasts that may be used instead of $K$. The specifications of the covariance matrix $\Sigma_T$ with a reduced number of parameters proposed by Thurstone (1927) cannot be recovered from the data and only covariance classes can be considered.

Tsai (2003) shows that $n+2$ constraints are needed in order to identify model (3.1), including the constraint on the worth parameters. As for the mean parameters, many different constraints can be imposed on the covariance matrix. For example, Bockenholt and Tsai (2001), Tsai and Bockenholt (2002) and Maydeu-Olivares (2003) set all the diagonal elements of $\Sigma_T$ equal to 1 and either one of the diagonal elements of $\Omega$ to 1 or one of the nondiagonal elements of $\Sigma_T$ equal to zero. However, if $\Sigma_T$ is fixed to be a correlation matrix, the set of matrices that produce the same sets of probabilities is limited. Maydeu-Olivares and Bockenholt (2005) set all the covariances involving the last latent utility to zero, which corresponds to assuming independence between the last stimulus and the others, and the variance of the first and last item to one. Maydeu-Olivares and Hernandez (2007) suggest setting all diagonal elements of $\Sigma_T$ equal to one and the sum of the correlations between the first and the other latent variables to one. With these constraints positive entries in the correlation matrix imply that strong preference for one stimulus is associated with strong preference for the other stimulus, while negative entries indicate that strong preference for one stimulus is associated with weak preference for the other stimulus. Thus, it is not necessary to fix any element in the matrix $\Omega$, since the constraint $\omega = 1$ in $\Omega = \omega^2 I_N$ could lead to a nonpositive definite matrix $\Sigma_T$. After estimation it is possible to recover the class of covariance matrices that produce the same probabilities (Maydeu-Olivares and Hernandez (2007)). However, the initial identification constraints pose limits on the set of covariance matrices that identify the same model.

There is no discussion or results about the identification restrictions necessary to estimate model (3.2). In order not to incur identification problems, Tsai and Bockenholt (2008) assume that the matrix $\Sigma_V$ depends on very few parameters.
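The numerical example of indistinguishable covariance matrices given in this section can be verified directly. The following sketch, with hypothetical helper names, shifts $\Sigma_{T,1}$ by $d1' + 1d'$ and compares the identifiable parts $K\Sigma_T K'$.

```python
def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def contrast_matrix(n):
    """K = [I_{n-1} | -1]: identity of order n-1 with a column of -1 appended."""
    return [[1 if c == r else 0 for c in range(n - 1)] + [-1] for r in range(n - 1)]


def shift_covariance(sigma, d):
    """Sigma + d 1' + 1 d', computed entry by entry."""
    n = len(d)
    return [[sigma[i][j] + d[i] + d[j] for j in range(n)] for i in range(n)]


def identifiable_part(sigma):
    """K Sigma K', the part of the covariance that the data can identify."""
    k = contrast_matrix(len(sigma))
    return matmul(matmul(k, sigma), transpose(k))


sigma_1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
d = [-1 / 8, 1 / 4, 1 / 8]
sigma_2 = shift_covariance(sigma_1, d)

print(sigma_2)
print(identifiable_part(sigma_1))
print(identifiable_part(sigma_2))
```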

3.2.3 Models with logit link. The dependence between evaluations made by the same judge has also been introduced in models employing logit link functions. Different specifications have been used for this purpose.

A first inclusion of dependence in logit models is proposed by Lancaster and Quade (1983), who consider multiple judgments by the same person and introduce correlation in the Bradley–Terry model assuming that the worth parameters are random variables following a beta distribution with shape parameters $a_{ij}$ and $b_{ij}$. The Bradley–Terry model is imposed on the means of the beta distributions, that is, $E(\pi_{ij}) = a_{ij}/(a_{ij} + b_{ij}) = \pi_i/(\pi_i + \pi_j)$, but such a model introduces correlation only between comparisons of the same judge on the same pair of objects, while the other comparisons remain independent. The extension by Matthews and Morris (1995), who consider three possible response categories, has the same limitation.

Two different methods have been used for introducing dependence among comparisons made by the same person involving one common object in logit models. The first method exploits the usual association measure for binary data: the odds ratio. Bockenholt and Dillon (1997a) consider the adjacent categories model for preference data with $H$ categories and suggest a parametrization in terms of log-odds ratios to account for dependence between observations, while Dittrich, Hatzinger and Katzenbeisser (2002) adopt a similar approach in a two-categorical model using the log-linear formulation of the Bradley–Terry model. This specification is convenient because it allows one to estimate the model through standard software developed for log-linear models, but the number of added parameters can be quite large (Dittrich, Hatzinger and Katzenbeisser (2002)).

Another method used for introducing dependence among observations is the inclusion of random effects in the linear predictor. Bockenholt (2001a) describes the worth of object $i$ for subject $s$ as

$$\mu_{si} = \mu_i + \sum_{p=1}^{P} \beta_{ip} x_{ip} + U_{si},$$

where $U_{si}$ is a random component, and $x_i$ is a vector of $P$ subject-specific (and possibly item-specific) covariates. Bockenholt (2001a) employs a logit link function and assumes that $U_s = (U_{s1}, \ldots, U_{sn})'$ follows a multivariate normal distribution with mean 0 and covariance $\Sigma_U$.

Francis, Dittrich and Hatzinger (2010) consider the log-linear representation of the Bradley–Terry model and introduce random effects for each respondent in order to account for residual heterogeneity that is not included in subject-specific covariates. The inclusion of random effects in the linear predictor introduces difficulties in the estimation of the model.
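To make the random-effects specification of the worth concrete, the sketch below evaluates $\mu_{si}$ for two objects and converts the difference into a preference probability through the logit link. All numbers and function names are hypothetical, and the random effects are fixed at given values rather than drawn from $N(0, \Sigma_U)$.

```python
from math import exp


def subject_worth(mu_i, beta_i, x, u_si):
    """mu_i + sum_p beta_ip * x_p + U_si for one subject and one object."""
    return mu_i + sum(b * v for b, v in zip(beta_i, x)) + u_si


def logit_preference(worth_i, worth_j):
    """Probability that i is preferred to j under a logit link."""
    return 1.0 / (1.0 + exp(-(worth_i - worth_j)))


# Hypothetical inputs: two covariates, object-specific coefficients,
# and one realized random effect per object for this subject.
x = [1.0, 0.0]
worth_1 = subject_worth(0.4, [0.3, -0.1], x, u_si=0.2)
worth_2 = subject_worth(0.1, [0.0, 0.2], x, u_si=-0.1)

p_12 = logit_preference(worth_1, worth_2)
p_21 = logit_preference(worth_2, worth_1)
print(round(p_12, 3), round(p_21, 3))
```

In an actual analysis the random effects are unobserved and must be integrated out, which is the source of the estimation difficulties mentioned above.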

3.2.4 Choice models. The work by Thurstone has great importance in the development of models for analyzing discrete choices, not only from a psychometric point of view, but also in economic choice theory. When the idea that choices may be random and not fixed started to develop, the use of the model proposed by Thurstone was suggested (Marschak (1960)). As the Nobel laureate McFadden (2001) states, “when the perceived stimuli are interpreted as levels of satisfaction, or utility, this can be interpreted as a model for economic choice.”

According to the economic theory, models for discrete choice are required to satisfy the utility maximization assumption, which states that subjects maximize their utility when making decisions. Let $\Upsilon_{si}$ denote the utility of subject $s$ from alternative $i$, which can be decomposed as $\Upsilon_{si} = M_{si} + \varepsilon_{si}$, where $M_{si}$ denotes a function which relates a set of alternative attributes and subject attributes to the utility gain and $\varepsilon_{si}$ denotes factors that affect utility, but are not included in $M_{si}$. The probability that subject $s$ chooses alternative $i$ is equal to the probability that the utility gained from $i$ is higher than the utility from every other object in the choice set: $\mathrm{pr}(\Upsilon_{si} > \Upsilon_{sj}, \forall j \neq i) = \mathrm{pr}(\varepsilon_{sj} - \varepsilon_{si} < M_{si} - M_{sj}, \forall j \neq i)$. These models are called random utility models. For each person, a choice is described as $n - 1$ paired comparisons between the preferred alternative and all other options. Note that paired comparisons do not really occur, so inconsistent choices cannot be observed.

From the above specification, different models have been developed depending on the assumptions about the distribution of the errors and the formulation of the mean term $M_{si}$. If the $\varepsilon_{si}$'s are independent and follow a Gumbel distribution, the choice model is a logit model and, when $M_{si} = x_{si}'\beta$, it corresponds to the structured Bradley–Terry model. A particular concern is caused by the independence from irrelevant alternatives (Luce (1959)) property which characterizes the Bradley–Terry model. Indeed, in the Bradley–Terry model the ratio between the probabilities of choosing one option over another is independent of the other available alternatives. Often, this property is not satisfied in real data. This limit is somehow overcome by assuming a type of generalized extreme value distribution for the errors. In the resulting nested logistic model, independence from irrelevant alternatives holds for sets of alternatives within the same subset and not for alternatives in different subsets (Train (2009)). The advantage of these specifications is that models can be estimated easily, but they cannot account for random taste variation or unobserved factors correlated over time.
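The independence from irrelevant alternatives property can be illustrated numerically. The sketch below relies on the standard result that independent Gumbel errors yield choice probabilities proportional to $\exp(M_{si})$; the mean utilities and names are hypothetical. Removing an alternative from the choice set leaves the ratio of the two remaining probabilities unchanged.

```python
from math import exp


def logit_choice_probs(mean_utilities):
    """Choice probabilities exp(M_i) / sum_k exp(M_k) for one subject."""
    top = max(mean_utilities.values())  # subtract the maximum for stability
    weights = {k: exp(v - top) for k, v in mean_utilities.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical mean utilities M_si for three alternatives.
full_set = {"a": 1.0, "b": 0.5, "c": -0.2}
reduced_set = {k: v for k, v in full_set.items() if k != "c"}

p_full = logit_choice_probs(full_set)
p_reduced = logit_choice_probs(reduced_set)

ratio_full = p_full["a"] / p_full["b"]
ratio_reduced = p_reduced["a"] / p_reduced["b"]
print(ratio_full, ratio_reduced)
```

With only two alternatives the choice probability reduces to the logistic function of the difference in mean utilities, which is the paired comparison case.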


A further proposal is to assume a multivariate normal distribution for the errors $\varepsilon_{si}$. This model is very flexible since it allows for random taste variation and, when necessary, for temporally correlated errors, but its estimation is not straightforward. The resulting model is a multivariate probit model, like the Thurstone model. In economic choice models it is of interest to consider the influence on decisions of covariates that are included in the mean term $M_{si}$. Explanatory variables can be considered also in psychometric models (Tsai and Bockenholt (2002)), even though interest is focused on the parameters $\mu$, which are always included in the linear predictor.

Other extensions include further random elements in the mean term $M_{si}$, so as to allow flexible disturbances or to account for different attitudes and perceptions of different people. All these elements add difficulties in the estimation of the model.

An important aspect in choice theory is the distinction between stated and revealed preferences. This problem has not received much attention in the psychometric literature, but there may be differences between what people say they would choose in a questionnaire survey and what they really choose. The former are called stated preferences and the latter revealed preferences. If both types of preferences are available, it may be useful to analyze them all together. Walker and Ben-Akiva (2002) propose a model that incorporates many of the above extensions; however, care is needed when specifying the model because it may be difficult to understand which parameters can be identified. Moreover, the inclusion of additional disturbances and unobserved covariates requires the approximation of integrals whose dimension can be high.

Random utility models are very useful and widespread; however, some doubts have been raised about their basic assumption that people act so as to maximize their utility, since sometimes consumers do not make rational choices (Bockenholt (2006)).

3.3 Object-Related Dependencies

In multiple judgment sampling the dependence among observations derives from repeated comparisons made by the same person, usually involving a common object. When paired comparisons are not performed by a judge, correlation may arise from the fact that the same object is involved in multiple paired comparisons. For example, when contests among animals are analyzed, it is realistic to assume that comparisons involving the same animal are correlated. In this perspective, Firth (2005) suggests setting

$$\mu_i = x_i'\beta + U_i, \tag{3.3}$$

where $U_i$ is a zero mean object-specific random effect. This approach is investigated in Cattelan (2009). The results of comparisons are related to observed characteristics of the animal and to unobserved quantities that are captured by the random effect $U_i$.

In this case, the latent random variable can be written as

$$Z = AX\beta + AU + \eta,$$

where $U = (U_1, \ldots, U_n)'$ is the vector of all object-specific random effects, $X$ is the matrix of covariates whose $i$th row is $x_i'$, $\eta$ is a vector of independent normally distributed errors with mean 0 and variance 1, and $A$ is the design matrix of the paired comparisons, with rows that describe which comparisons are observed, not necessarily all possible paired comparisons. If it is assumed that $U$ is multivariate normal with mean 0 and covariance $\sigma^2 I_n$, then $Z \sim N(AX\beta, \sigma^2 AA' + I_d)$, where $d$ is the number of paired comparisons observed. Again, this model is a multivariate probit model. However, this type of data presents some different features with respect to multiple judgment sampling. In psychometric applications $n$ is not very large, because it is unlikely that a person will make all the paired comparisons when $n > 10$, whereas a large $n$ is typical of sport tournaments and of paired comparison data about animal behavior. Moreover, in the multiple judgment sampling scheme $S$ independent replications of all the comparisons are available, but in other contexts this does not occur, adding further difficulties.
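The covariance $\sigma^2 AA' + I_d$ shows which comparisons are correlated: only those that share an object. The sketch below computes it for a hypothetical, incomplete set of contests among four objects; the function names and the value of $\sigma^2$ are illustrative assumptions.

```python
def design_rows(pairs, n):
    """One row per observed comparison (i, j): +1 for i, -1 for j (0-based)."""
    rows = []
    for i, j in pairs:
        row = [0] * n
        row[i], row[j] = 1, -1
        rows.append(row)
    return rows


def latent_covariance(pairs, n, sigma2):
    """Covariance sigma2 * A A' + I_d of the latent variables Z."""
    a = design_rows(pairs, n)
    d = len(a)
    return [
        [sigma2 * sum(x * y for x, y in zip(a[r], a[c])) + (1.0 if r == c else 0.0)
         for c in range(d)]
        for r in range(d)
    ]


# Hypothetical contests: not all possible pairs are observed.
observed_pairs = [(0, 1), (0, 2), (2, 3)]
cov = latent_covariance(observed_pairs, n=4, sigma2=0.5)
for row in cov:
    print(row)
```

Comparisons (0, 1) and (0, 2) share object 0 in the same position and are positively correlated; (0, 2) and (2, 3) share object 2 in opposite positions and are negatively correlated; (0, 1) and (2, 3) share no object and are uncorrelated.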

3.4 Inference

3.4.1 Estimation. In this section, the multiple judgment sampling scheme is mainly investigated, and only some comments are made about the case of object-related dependencies. There are different methods for estimating models for dependent paired comparison data. A first approach to the computation of the likelihood function requires integrating out the latent variables $T$ from the joint distribution of $Y$ and $T$. This integral has dimension $n$, the number of items, but rewriting it in terms of differences $T_i - T_n$, $i = 1, \ldots, n-1$, the dimension can be reduced to $n-1$, which nonetheless may still be quite large when methods such as the Gauss–Hermite quadrature are employed.


Alternatively, it is possible to represent the joint distribution of the observations as a multivariate probit model. Let $Z_s^* = D(Z_s - A\mu)$ be the standardized version of the latent variable $Z_s$, where $D = [\mathrm{diag}(\Sigma_Z)]^{-1/2}$ and $\Sigma_Z$ denotes the covariance matrix of $Z_s$ expressed as in model (3.1) or in model (3.2). Then, $Z_s^*$ follows a multivariate normal distribution with mean 0 and correlation matrix $\Sigma_{Z^*} = D\Sigma_Z D$. Object $i$ is preferred to object $j$ when $z^*_{sij} \geq \tau^*_{ij}$, where the vector of the thresholds is given by $\tau^* = -DA\mu$. The likelihood function is the product of the probabilities of the observations for each judge,

$$L(\psi; Y) = \prod_{s=1}^{S} L_s(\psi; Y_s),$$

where

$$L_s(\psi; Y_s) = \int_{R_{s12}} \cdots \int_{R_{s\,n-1\,n}} \phi_N(z_s^*; \Sigma_{Z^*})\, dz_s^*,$$

$\phi_N(\cdot\,; \Sigma_{Z^*})$ denotes the density function of an $N$-dimensional normal random variable with mean 0 and correlation matrix $\Sigma_{Z^*}$ and

$$R_{sij} = \begin{cases} (-\infty, \tau^*_{ij}) & \text{if } Y_{sij} = 1, \\ (\tau^*_{ij}, \infty) & \text{if } Y_{sij} = 2. \end{cases}$$

Note that this approach requires the approximation of $S$ integrals whose dimension is equal to $N = n(n-1)/2$, the number of paired comparisons, so it grows quadratically with the number of objects. However, there is a large literature about methods for approximate inference in multivariate probit models. The algorithm proposed by Genz and Bretz (2002) to approximate multivariate normal probabilities is based on quasi-Monte Carlo methods, and Craig (2008) warns against the randomness of this method for likelihood evaluation. A deterministic approximation is developed by Miwa, Hayter and Kuriki (2003), but it is available only for integrals of dimension up to 20 since even for such a dimension its computation is very slow. Approximations based on Monte Carlo methods can be used (Chib and Greenberg (1998)), but they may be computationally expensive if the dimension of the integral is very large. Bockenholt and Tsai (2001) use an EM algorithm, while in econometric theory a maximum simulated likelihood approach in which multivariate normal probabilities are simulated through the Geweke–Hajivassiliou–Keane algorithm is employed (Train (2009)). A further approach may be based on data cloning (Lele, Nadeem and Schmuland (2010)). When the dimension of the integrals is very large, and the approximation is computationally demanding and time-consuming, it is possible to resort to limited information estimation methods, which are estimation procedures based on low dimensional margins. Here, we compare two different methods. The first one is widely applied in the context of multiple judgment sampling (Maydeu-Olivares, 2001, 2002; Maydeu-Olivares and Bockenholt 2005) and will be called limited information estimation; the second is proposed in the context of object-specific dependencies in Cattelan (2009) and is called pairwise likelihood.

The limited information estimation procedure considered here consists of three stages. In the first stage the threshold parameters $\tau^*$ are estimated exploiting the empirical univariate proportions of wins. In the second stage the elements of $\Sigma_{Z^*}$, which are tetrachoric correlations, are estimated employing the bivariate proportions of wins. Finally, in the third stage the model parameters $\psi$ are estimated by minimizing the function

$$G = \{\hat\kappa - \kappa(\psi)\}' W \{\hat\kappa - \kappa(\psi)\}, \tag{3.4}$$

where $\hat\kappa$ denotes the thresholds and tetrachoric correlations estimated in the first and second stages, $\kappa(\psi)$ denotes the thresholds and tetrachoric correlations under the restrictions imposed on those parameters by the model parameters $\psi$, and $W$ is a nonnegative definite matrix. Let $\Xi$ denote the asymptotic covariance matrix of $\hat\kappa$. Then it is possible to use $W = \Xi^{-1}$ (Muthen (1978)), $W = [\mathrm{diag}(\Xi)]^{-1}$ (Muthen, Du Toit and Spisic (1997)) or $W = I$ (Muthen (1993)). The last two options seem more stable in data sets with a small number of objects (Maydeu-Olivares (2001)). This method is very fast, and Maydeu-Olivares (2001) states that it may have an edge over full information methods because it uses only the one- and two-dimensional marginals of a large and sparse contingency table.

Pairwise likelihood (Le Cessie and Van Houwelingen (1994)) is a special case of the broader class of composite likelihoods (Lindsay (1988); Varin, Reid and Firth (2011)). The pairwise likelihood of all the observations is the product of the pairwise likelihoods relative to the single judges, $L_{\mathrm{pair}}(\psi; Y) = \prod_{s=1}^{S} L^{s}_{\mathrm{pair}}(\psi; Y_s)$, where

$$L^{s}_{\mathrm{pair}}(\psi; Y_s) = \prod_{i=1}^{n-2} \prod_{j=i+1}^{n-1} \prod_{k=i}^{n-1} \prod_{l=j+1}^{n} \mathrm{pr}(Y_{sij} = y_{sij}, Y_{skl} = y_{skl}).$$
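Each factor of the pairwise likelihood is a bivariate normal rectangle probability defined by two thresholds and one correlation. A minimal numerical sketch is given below; it is not the implementation used in the paper. It assumes the coding $Y = 1$ for $Z^* < \tau^*$ and $Y = 2$ for $Z^* > \tau^*$ as in the regions $R_{sij}$, evaluates the bivariate distribution function by Simpson's rule on a one-dimensional conditional representation, and, as a simplifying assumption, sums over every pair of distinct comparisons. All function names are hypothetical.

```python
from itertools import combinations
from math import erf, exp, log, pi, sqrt


def std_normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def bvn_cdf(a, b, rho, steps=2000):
    """pr(Z1 <= a, Z2 <= b) for a standard bivariate normal, |rho| < 1.

    Integrates phi(z) * Phi((b - rho z) / sqrt(1 - rho^2)) over (-8, a)
    with the composite Simpson rule (steps must be even).
    """
    lower = -8.0
    if a <= lower:
        return 0.0
    a = min(a, 8.0)
    h = (a - lower) / steps
    scale = sqrt(1.0 - rho * rho)

    def integrand(z):
        return std_normal_pdf(z) * std_normal_cdf((b - rho * z) / scale)

    total = integrand(lower) + integrand(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lower + k * h)
    return total * h / 3.0


def pair_prob(y1, y2, tau1, tau2, rho):
    """pr(Y1 = y1, Y2 = y2) with Y = 1 if Z* < tau and Y = 2 if Z* > tau."""
    both_below = bvn_cdf(tau1, tau2, rho)
    if (y1, y2) == (1, 1):
        return both_below
    if (y1, y2) == (1, 2):
        return std_normal_cdf(tau1) - both_below
    if (y1, y2) == (2, 1):
        return std_normal_cdf(tau2) - both_below
    if (y1, y2) == (2, 2):
        return 1.0 - std_normal_cdf(tau1) - std_normal_cdf(tau2) + both_below
    raise ValueError("outcomes must be 1 or 2")


def subject_log_pairwise(y, tau, corr):
    """Log pairwise likelihood of one judge over all pairs of comparisons."""
    return sum(
        log(pair_prob(y[r], y[c], tau[r], tau[c], corr[r][c]))
        for r, c in combinations(range(len(y)), 2)
    )
```

Only bivariate probabilities are needed, whatever the number of comparisons, which is the source of the computational saving over the $N$-dimensional integral of the full likelihood.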

Let $\ell^{s}_{\mathrm{pair}}(\psi; Y_s) = \log L^{s}_{\mathrm{pair}}(\psi; Y_s)$ denote the logarithm of the pairwise likelihood for subject $s$ and $\ell_{\mathrm{pair}}(\psi; Y) = \sum_{s=1}^{S} \ell^{s}_{\mathrm{pair}}(\psi; Y_s)$ be the whole pairwise log-likelihood. Under usual regularity conditions on the log-likelihood of univariate and bivariate margins, the maximum pairwise likelihood estimator is consistent and asymptotically normally distributed with mean $\psi$ and covariance matrix $H(\psi)^{-1} J(\psi) H(\psi)^{-1}$, where $J(\psi) = \mathrm{var}\{\nabla \ell_{\mathrm{pair}}(\psi; Y)\}$ and $H(\psi) = E\{-\nabla^2 \ell_{\mathrm{pair}}(\psi; Y)\}$ (Molenberghs and Verbeke (2005); Varin, Reid and Firth (2011)). Unfortunately, the analogue of the likelihood ratio test based on pairwise likelihood does not follow the usual chi-square distribution (Kent (1982)). In the multiple judgment sampling context, it is natural to consider asymptotic properties of pairwise likelihood estimators computed as the number of subjects increases, that is, as $S \to \infty$. When the number of paired comparisons per subject is bounded, the above properties are satisfied (Zhao and Joe (2005)). Pairwise likelihood noticeably reduces the computational effort since it requires only the computation of bivariate normal probabilities. The standard errors can be computed straightforwardly by exploiting the independence between the observations of different judges. In fact, $H(\psi)$ can be estimated by the negative Hessian matrix of the pairwise log-likelihood computed at the maximum pairwise likelihood estimate, while the cross-product $\sum_{s=1}^{S} \nabla \ell^{s}_{\mathrm{pair}}(\psi; Y_s) \nabla \ell^{s}_{\mathrm{pair}}(\psi; Y_s)'$ can be used to estimate $J(\psi)$.

The case of object-related dependencies is not considered in the following simulation study; however, note that some different difficulties arise. As already pointed out, in this context there is a large $n$ and small $S$, so the limited information estimation method cannot be applied, but pairwise likelihood can still be employed (Cattelan (2009)). However, it is more problematic to consider the asymptotic behavior of the maximum pairwise likelihood estimator when data are a long sequence of dependent observations; see, for example, Cox and Reid (2004). In the context of paired comparison data, results of simulations for increasing $n$ when all possible paired comparisons are performed are encouraging (Cattelan (2009)); however, theoretical results for this instance are still lacking.
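For the multiple judgment sampling case, the sandwich form $H(\psi)^{-1} J(\psi) H(\psi)^{-1}$ described in this section can be assembled from the negative Hessian and the per-judge score vectors. The sketch below is a generic illustration with hypothetical numbers for a two-parameter model and three judges; it is not output from any model fitted in the paper.

```python
from math import sqrt


def invert(matrix):
    """Inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def sandwich_covariance(neg_hessian, subject_scores):
    """H^{-1} J H^{-1}, with J the sum over judges of score outer products."""
    p = len(neg_hessian)
    j_hat = [[sum(s[r] * s[c] for s in subject_scores) for c in range(p)]
             for r in range(p)]
    h_inv = invert(neg_hessian)
    return matmul(matmul(h_inv, j_hat), h_inv)


# Hypothetical inputs; the scores sum to zero, as they would at the estimate.
neg_hessian = [[4.0, 1.0], [1.0, 3.0]]
subject_scores = [[1.0, -0.5], [-1.5, 0.5], [0.5, 0.0]]

cov = sandwich_covariance(neg_hessian, subject_scores)
std_errors = [sqrt(cov[k][k]) for k in range(len(cov))]
print(std_errors)
```

The cross-product estimate of $J(\psi)$ relies on independence between judges, so it does not carry over directly to the object-related setting with a single long sequence of dependent comparisons.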

3.4.2 Simulation studies. Simulation studies were performed considering models (3.1) and (3.2). It is assumed that $n = 4$; hence a full likelihood approach based on the algorithm by Miwa, Hayter and Kuriki (2003) can also be used, since the integral has dimension 6.

The first simulation setting is the same as that proposed in Maydeu-Olivares (2001), where the model $Z_s = AT + e_s$ is assumed with

$$\mu = \begin{pmatrix} 0.5 \\ 0 \\ -0.5 \\ 0 \end{pmatrix}, \qquad
\Sigma_T = \begin{pmatrix} 1 & & & \\ 0.8 & 1 & & \\ 0.7 & 0.6 & 1 & \\ 0.8 & 0.7 & 0.6 & 1 \end{pmatrix}$$

and the covariance matrix of $e$ is $\Omega = \omega^2 I_6$. For identification purposes the diagonal elements of $\Sigma_T$ are set equal to 1, $\mu_4 = 0$ and $\omega^2 = 1$. Hence, in this case $\Sigma_T$ is actually a correlation matrix. Table 4 shows the means and medians of the simulated estimates on 1000 data sets assuming $S = 100$ judges. Moreover, the average of model-based standard errors and the simulation standard deviations are reported. In limited information estimation, the matrix $W = I$ is employed. In this setting all the methods seem to perform comparably well. Table 5 shows the empirical coverages of confidence intervals based on the normal approximation.

Table 4

Average (Mn) and median (Md) simulated estimates, average model-based standard errors (s.e.) and simulation standard deviations (s.d.) of parameters estimated by maximum likelihood (ML), limited information estimation (LI) and pairwise likelihood (PL)

                 True     ML                      LI                              PL
                 value    Mn      s.e.    s.d.    Mn      Md      s.e.    s.d.    Mn      Md      s.e.    s.d.
$\mu_1$           0.5     0.51    0.13    0.13    0.51    0.50    0.13    0.13    0.50    0.50    0.13    0.13
$\mu_2$           0       0.01    0.12    0.13    0.01    0.01    0.12    0.13    0.01    0.01    0.12    0.13
$\mu_3$          −0.5    −0.49    0.15    0.15   −0.50   −0.48    0.15    0.15   −0.49   −0.48    0.15    0.15
$\sigma_{12}$     0.8     0.80    0.12    0.14    0.78    0.80    0.13    0.14    0.79    0.80    0.13    0.15
$\sigma_{13}$     0.7     0.70    0.17    0.17    0.69    0.71    0.17    0.17    0.69    0.71    0.18    0.18
$\sigma_{14}$     0.8     0.79    0.13    0.14    0.78    0.79    0.13    0.14    0.78    0.80    0.14    0.15
$\sigma_{23}$     0.6     0.58    0.19    0.20    0.57    0.60    0.19    0.20    0.57    0.60    0.19    0.20
$\sigma_{24}$     0.7     0.68    0.16    0.16    0.66    0.67    0.16    0.17    0.67    0.68    0.16    0.17
$\sigma_{34}$     0.6     0.58    0.21    0.20    0.57    0.60    0.20    0.20    0.57    0.60    0.20    0.20

Table 5

Empirical coverage of confidence intervals for model parameters of limited information estimator (LI) and pairwise likelihood estimator (PL) at nominal levels 95%, 97.5% and 99%

                 0.950            0.975            0.990
                 LI      PL       LI      PL       LI      PL
$\mu_1$          0.947   0.958    0.982   0.978    0.992   0.992
$\mu_2$          0.960   0.964    0.978   0.976    0.988   0.988
$\mu_3$          0.941   0.930    0.969   0.972    0.995   0.991
$\sigma_{12}$    0.959   0.985    0.975   0.997    0.989   1.000
$\sigma_{13}$    0.934   0.939    0.961   0.967    0.968   0.985
$\sigma_{14}$    0.941   0.968    0.967   0.996    0.988   1.000
$\sigma_{23}$    0.965   0.970    0.973   0.980    0.987   0.995
$\sigma_{24}$    0.943   0.933    0.951   0.959    0.967   0.973
$\sigma_{34}$    0.953   0.946    0.969   0.966    0.977   0.989

The second simulation setting considers model (3.2) proposed by Tsai and Bockenholt (2008). Here, we consider differences with a reference object, so we compute means and variances of the differences $T_i - T_n$ for $i = 1, \ldots, n-1$. The assumed worth parameters of these differences are $\mu = (-0.2, 1, -1.5)$ while the covariance matrix is

$$\begin{pmatrix} 1.5 & 1 & 1.3 \\ 1 & 4 & 2.5 \\ 1.3 & 2.5 & 3 \end{pmatrix},$$

and $\sigma_{ij}$ is used to denote the element in row $i$ and column $j$ of the above reduced matrix. Differently from the previous setting, this specification of the model allows one to estimate also the variances of the differences $T_i - T_n$ and to check whether they are different for the various objects. Tsai and Bockenholt (2008) propose a specification of the matrix $B$ which depends only on one parameter $b$ whose value is set equal to 0.5.

Table 6 presents the results of the simulations.

Table 6

Average (Mn) and median (Md) simulated estimates, average model-based standard errors (s.e.) and simulation standard deviations (s.d.) of parameters estimated by maximum likelihood (ML), limited information estimation (LI) and pairwise likelihood (PL)

                 True     ML                      LI                              PL
                 value    Mn      s.e.    s.d.    Mn      Md      s.e.    s.d.    Mn      Md      s.e.    s.d.
$\mu_1$          −0.2    −0.21    0.19    0.18   −0.23   −0.21    0.21    0.22   −0.22   −0.20    0.19    0.19
$\mu_2$           1       1.00    0.30    0.31    1.07    1.07    0.42    0.47    1.03    1.00    0.33    0.33
$\mu_3$          −1.5    −1.51    0.31    0.32   −1.59   −1.59    0.49    0.51   −1.54   −1.51    0.36    0.35
$\sigma_1^2$      1.5     1.53    0.83    0.81    2.06    1.58    1.97    1.64    1.70    1.44    1.05    0.95
$\sigma_2^2$      4       3.98    1.73    1.75    5.34    4.37    4.42    4.48    4.45    3.92    2.42    2.15
$\sigma_3^2$      3       3.01    1.41    1.42    3.91    3.19    3.17    3.25    3.32    3.04    1.93    1.73
$\sigma_{12}$     1       0.98    0.70    0.64    1.34    1.06    1.44    1.30    1.12    0.97    0.87    0.77
$\sigma_{13}$     1.3     1.29    0.73    0.71    1.72    1.39    1.48    1.49    1.43    1.27    0.95    0.84
$\sigma_{23}$     2.5     2.49    1.09    1.09    3.35    2.72    2.67    2.77    2.77    2.48    1.53    1.33
$b$               0.5     0.53    0.41    0.39    0.72    0.58    0.82    0.98    0.58    0.50    0.50    0.51

Table 7

Empirical coverage of confidence intervals for model parameters of limited information estimator (LI) and pairwise likelihood estimator (PL) at nominal levels 95%, 97.5% and 99%

                 0.950            0.975            0.990
                 LI      PL       LI      PL       LI      PL
$\mu_1$          0.955   0.935    0.981   0.965    0.994   0.983
$\mu_2$          0.962   0.960    0.973   0.974    0.986   0.986
$\mu_3$          0.920   0.938    0.941   0.960    0.961   0.977
$\sigma_1^2$     0.932   0.922    0.947   0.936    0.959   0.966
$\sigma_2^2$     0.932   0.924    0.949   0.945    0.964   0.961
$\sigma_3^2$     0.936   0.937    0.949   0.953    0.963   0.964
$\sigma_{12}$    0.932   0.937    0.951   0.956    0.966   0.970
$\sigma_{13}$    0.915   0.912    0.929   0.933    0.939   0.945
$\sigma_{23}$    0.922   0.920    0.941   0.937    0.953   0.951
$b$              0.936   0.936    0.946   0.953    0.963   0.963

Maximum likelihood based on numerical integration is the method that performs best; however, maximization of the likelihood was not always straightforward, and sometimes the optimization algorithms employed stopped at a point where the Hessian matrix was not negative definite.

Pairwise likelihood estimation seems to perform quite well, especially if compared to limited information estimation, which seems not satisfactory in this case with $S = 100$, as already noticed in Tsai and Bockenholt (2008). Estimating the parameters of the covariance matrix appears more problematic than the estimation of the worth parameters, and the average of the simulated estimates is particularly influenced by some large values, but the median shows a better performance. In particular, while the average simulated estimates for limited information estimation show a maximum percentage bias equal to 44.1%, for the median it reduces to 15.4%. The maximum bias for the mean of the simulated estimates using pairwise likelihood is 16.1%, while for the median it is 4%. In both cases, pairwise likelihood shows lower bias. The standard errors of pairwise likelihood estimates are lower, thus yielding shorter confidence intervals. Table 7 reports the empirical coverage of Wald-type confidence intervals for limited information estimation and pairwise likelihood. The coverage rates of the two methods are very similar, and in both cases the actual coverage for parameters of the covariance matrix appears systematically lower than the nominal levels. In order to obtain accurate coverage probabilities, we may need to resort to a bootstrap procedure for detecting the distribution of the statistic, while with pairwise likelihood it may be possible to obtain intervals based on the pairwise likelihood function.
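The maximum percentage biases quoted for the second simulation setting can be recomputed from the entries of Table 6. The sketch below uses the rounded table values, so the results match the figures in the text only approximately (roughly 44, 16, 16 and 4 percent); the variable and function names are hypothetical.

```python
# Parameter order: mu1, mu2, mu3, sigma1^2, sigma2^2, sigma3^2,
# sigma12, sigma13, sigma23, b (true values from Table 6).
TRUE_VALUES = [-0.2, 1.0, -1.5, 1.5, 4.0, 3.0, 1.0, 1.3, 2.5, 0.5]

# Mean (Mn) and median (Md) columns of Table 6 for LI and PL.
ESTIMATES = {
    "LI mean": [-0.23, 1.07, -1.59, 2.06, 5.34, 3.91, 1.34, 1.72, 3.35, 0.72],
    "LI median": [-0.21, 1.07, -1.59, 1.58, 4.37, 3.19, 1.06, 1.39, 2.72, 0.58],
    "PL mean": [-0.22, 1.03, -1.54, 1.70, 4.45, 3.32, 1.12, 1.43, 2.77, 0.58],
    "PL median": [-0.20, 1.00, -1.51, 1.44, 3.92, 3.04, 0.97, 1.27, 2.48, 0.50],
}


def max_percentage_bias(estimates, true_values):
    """Largest absolute relative bias, in percent, over all parameters."""
    return max(
        100.0 * abs(est - true) / abs(true)
        for est, true in zip(estimates, true_values)
    )


max_bias = {
    label: max_percentage_bias(values, TRUE_VALUES)
    for label, values in ESTIMATES.items()
}
for label, value in max_bias.items():
    print(label, round(value, 1))
```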

Example. We fit model (3.1) to the universities' data by means of pairwise likelihood. A full likelihood approach based on numerical approximation implies computing 303 integrals of dimension 5 when a university is used as reference object, both for the mean and covariance structure, but methods such as Gauss–Hermite quadrature are affected by the curse of dimensionality. A multivariate probit approach would be very slow because the algorithm by Miwa, Hayter and Kuriki (2003) would take very long to approximate 303 integrals of dimension 15. It is assumed that $\Omega = \omega^2 I_{15}$. Table 8 displays the estimates obtained under two different sets of constraints. The lower triangle of the covariance matrix shown in Table 8 reports the estimates obtained using the constraints proposed in Maydeu-Olivares and Hernandez (2007); see Section 3.2.2. The estimate of the threshold parameter (with standard error in brackets) is $\tau_2 = 0.205$ (0.018), while the variance parameter is $\omega^2 = 0.180$ (0.026). A high correlation is estimated between Barcelona and Milan, so a strong preference for Barcelona is associated with a strong preference for Milan. Even though some correlations do not seem significant, it appears that a strong preference for St. Gallen is associated with a weak preference for all the other universities but Stockholm. The worth parameters yield the same ranking of all universities as the one arising from Table 2. However, note that the estimated worth parameters cannot be considered as absolute measures of worth of items; indeed, it is possible to obtain alternative solutions that give an equivalent fit. The mean parameters that can be identified in the model are standardized differences, that is, $(\mu_i - \mu_6)/\sqrt{\sigma_i^2 + \sigma_6^2 - 2\sigma_{i6} + \omega^2}$, $i = 1, \ldots, 5$, where $\mu_6$ and $\sigma_6^2$ are the mean and variance of the latent variable referring to Stockholm, the reference university.
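The standardized differences can be evaluated directly from the reported estimates. The Python sketch below is only an illustration: it plugs in the means and the correlations with Stockholm from the lower triangle of Table 8 together with the estimate of $\omega^2$, and it assumes unit variances for all latent variables, in line with the entries marked as fixed in Table 8. The variable and function names are hypothetical.

```python
from math import sqrt

# Estimates copied from Table 8 (lower triangle) and from the text.
mu = {"Barcelona": 0.405, "London": 1.346, "Milan": 0.308,
      "Paris": 0.748, "St. Gallen": 0.371}
corr_with_stockholm = {"Barcelona": 0.350, "London": 0.316, "Milan": 0.339,
                       "Paris": 0.144, "St. Gallen": 0.287}
omega2 = 0.180   # estimated variance of the pair-specific errors
mu_ref = 0.0     # mean for Stockholm, the reference university (fixed)
var_i = 1.0      # assumption: unit variances (correlation matrix)
var_ref = 1.0


def standardized_difference(name):
    """(mu_i - mu_6) / sqrt(sigma_i^2 + sigma_6^2 - 2 sigma_i6 + omega^2)."""
    cov = corr_with_stockholm[name]  # equals the covariance under unit variances
    return (mu[name] - mu_ref) / sqrt(var_i + var_ref - 2.0 * cov + omega2)


std_diff = {name: standardized_difference(name) for name in mu}
ranking = sorted(std_diff, key=std_diff.get, reverse=True)
```

With these inputs the ordering of the standardized differences coincides with the ordering of the estimated means, with Stockholm (difference zero) last.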

Table 8

Estimates and standard errors (in brackets) of mean and correlation parameters of model (3.1) for universities data using constraints proposed by Maydeu-Olivares and Hernandez (2007) (lower triangle). In italics (upper triangle) the estimates and standard errors of a model with fixed correlation between Paris and St. Gallen

             Barcelona  London   Milan    Paris    St. Gallen  Stockholm  µ
Barcelona    1          −0.064   0.688    0.063    −0.472      0.265      0.405
             (fixed)    (0.183)  (0.085)  (0.158)  (0.146)     (0.145)    (0.073)
London       0.058      1        0.079    −0.069   −0.287      0.227      1.346
             (0.084)    (fixed)  (0.185)  (0.224)  (0.147)     (0.154)    (0.087)
Milan        0.724      0.185    1        0.244    −0.466      0.253      0.308
             (0.062)    (0.097)  (fixed)  (0.174)  (0.137)     (0.160)    (0.074)
Paris        0.171      0.054    0.331    1        −0.690      0.033      0.748
             (0.094)    (0.117)  (0.113)  (fixed)  (fixed)     (0.267)    (0.086)
St. Gallen   −0.303     −0.139   −0.298   −0.496   1           0.194      0.371
             (0.113)    (0.139)  (0.144)  (0.157)  (fixed)     (0.135)    (0.081)
Stockholm    0.350      0.316    0.339    0.144    0.287       1          0
             (0.079)    (0.091)  (0.097)  (0.113)  (0.130)     (fixed)    (fixed)


From the identified parameters, different covariance matrices of the universities can be recovered. For example, in this instance, where the matrix $\Sigma_T$ can be interpreted as a correlation matrix, it can be shown that the worth parameters $\sqrt{c}\,\mu$, the correlation matrix $c\Sigma_T + (1 - c)\mathbf{1}\mathbf{1}'$ and the covariance matrix of the pair-specific errors $c\Omega$ produce the same fit of the model for any positive constant $c$ such that the correlation matrix remains positive definite (Maydeu-Olivares and Hernandez (2007)). It is possible to set one of the parameters of the correlation matrix according to some assumption; for example, we may presume that a strong preference for Paris is associated with a weak preference for St. Gallen, and determine the value of $c$ which minimizes the correlation between the two universities while yielding a positive definite correlation matrix. The value is $c = 1.13$, which produces a correlation between Paris and St. Gallen equal to −0.690. The estimates of the correlation matrix with this fixed value of the correlation between Paris and St. Gallen are shown in the upper triangle of the matrix in Table 8. The worth parameters can be computed by multiplying the estimates shown in Table 8 by $\sqrt{1.13}$. The fit of the two models is the same, but in the second case estimation is based on some previous theory about the correlation between a certain pair of universities.
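The rescaling can be checked numerically against Table 8. The Python sketch below applies $c\Sigma_T + (1 - c)\mathbf{1}\mathbf{1}'$ with $c = 1.13$ to the off-diagonal entries of the lower triangle and compares the result with the upper triangle; the diagonal is unaffected because $c \cdot 1 + (1 - c) = 1$. It is an illustration only: the dictionary layout and variable names are hypothetical, and small discrepancies are expected because the table reports three decimals.

```python
from math import sqrt

# Off-diagonal entries of Table 8, indexed by (row, column) with
# 0 Barcelona, 1 London, 2 Milan, 3 Paris, 4 St. Gallen, 5 Stockholm.
lower = {(1, 0): 0.058, (2, 0): 0.724, (2, 1): 0.185,
         (3, 0): 0.171, (3, 1): 0.054, (3, 2): 0.331,
         (4, 0): -0.303, (4, 1): -0.139, (4, 2): -0.298, (4, 3): -0.496,
         (5, 0): 0.350, (5, 1): 0.316, (5, 2): 0.339, (5, 3): 0.144,
         (5, 4): 0.287}
upper = {(0, 1): -0.064, (0, 2): 0.688, (0, 3): 0.063, (0, 4): -0.472,
         (0, 5): 0.265, (1, 2): 0.079, (1, 3): -0.069, (1, 4): -0.287,
         (1, 5): 0.227, (2, 3): 0.244, (2, 4): -0.466, (2, 5): 0.253,
         (3, 4): -0.690, (3, 5): 0.033, (4, 5): 0.194}
mu = [0.405, 1.346, 0.308, 0.748, 0.371, 0.0]
c = 1.13

# Off-diagonal elements of c * Sigma_T + (1 - c) * 1 1', stored in the
# mirrored (upper-triangle) position so they can be compared with `upper`.
rescaled = {(j, i): c * rho + (1.0 - c) for (i, j), rho in lower.items()}

# Worth parameters sqrt(c) * mu under the second set of constraints.
rescaled_mu = [sqrt(c) * m for m in mu]

# Largest absolute gap between the rescaled and the reported correlations.
max_gap = max(abs(rescaled[key] - upper[key]) for key in upper)
```

For the Paris and St. Gallen pair, the computation gives 1.13 × (−0.496) − 0.13 ≈ −0.690, the value quoted in the text.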

This analysis is only illustrative; in particular, Bockenholt (2001b) finds that a model with thresholds that vary among subjects performs better than a model with a constant threshold parameter.

3.4.3 Model selection and goodness of fit. Paired comparison data can be arranged in a contingency table. In the case of multiple judgment sampling, the data can be arranged in a table of dimension $2^N$ when there are two possible outcomes and $H^N$ when the outcomes are $H$-categorical. As a result, the contingency table will typically be very sparse, especially if covariates are included so that paired comparisons are observed conditional on the values of the covariates. In this situation the likelihood ratio statistic and the Pearson statistic do not follow a $\chi^2$ distribution; nevertheless, these statistics are often employed to assess the model and for model selection. Differences between observed and expected frequencies for subsets of the data, such as the 2 × 2 subtables or triplets of comparisons, are sometimes considered in order to identify where the fit of the model is poor. In Dittrich et al. (2007) the deviance is used for selection between nested models, but the test of goodness of fit cannot be based on the asymptotic $\chi^2$ distribution, so a Monte Carlo procedure is employed.

Since the goodness of fit of the model cannot be assessed through the usual statistics and Monte Carlo procedures are computationally expensive, some statistics based on lower dimensional marginals of the contingency table have been proposed. In general, the statistics proposed are quadratic forms of the residuals

$$\{p_r - \pi_r(\psi)\}' C \{p_r - \pi_r(\psi)\}, \tag{3.5}$$

where $C$ is a weight matrix, $p_r$ denotes the sample marginal proportions and $r$ denotes a set of lower order marginals.

Maydeu-Olivares (2001) considers the statistic $G$ as in (3.4) employed for estimation, which corresponds to setting $C = W$ in (3.5) with $r$ denoting univariate and bivariate marginal probabilities. The statistic $SG$ is analyzed in order to test $H_0: \kappa = \kappa(\psi)$. When $W = \Xi^{-1}$, then $SG \stackrel{d}{\to} \chi^2_d$, where $d = N(N+1)/2 - q$ and $q$ is the number of model parameters. However, when $W = [\mathrm{diag}(\Xi)]^{-1}$ or $W = I$, the asymptotic distribution of the statistic is a weighted sum of $d$ chi-square random variables with one degree of freedom. Maydeu-Olivares (2001) proposes to rescale the test statistic in order to match the asymptotic chi-square distribution. The same procedure is followed in the proposal for testing $H_0: \pi_2 = \pi_2(\psi)$, where $\pi_2$ is the vector of all univariate and bivariate marginal probabilities. Maydeu-Olivares (2006) considers the testing of further hypotheses, but the issue of the asymptotic distribution being a weighted sum of chi-square distributions remains.

Maydeu-Olivares and Joe (2005) consider testing the hypothesis $H_0: \pi = \pi(\psi)$ in a multidimensional contingency table, where $\pi$ is the $2^N$-dimensional vector of joint probabilities. Again, the use of marginal residuals up to order $r$ is considered. Let $\dot{\pi}$ denote a vector which stacks all the marginal probabilities: univariate, bivariate, trivariate and so on. There is a one-to-one correspondence between $\pi$ and $\dot{\pi}$, so that $\dot{\pi} = \Lambda\pi$ for a particular matrix $\Lambda$ of 0's and 1's. If only marginal probabilities up to order $r$ are considered, then $\dot{\pi}_r = \Lambda_r\pi$ for a sub-matrix $\Lambda_r$ of $\Lambda$. Let $\Delta = \partial\pi/\partial\psi$ and $\Gamma = E - \pi\pi'$, where $E = \mathrm{diag}(\pi)$. Maydeu-Olivares and Joe (2005) propose the statistic

$$M_r = S\{p_r - \dot{\pi}_r(\psi)\}' C_r(\psi) \{p_r - \dot{\pi}_r(\psi)\}, \tag{3.6}$$

where

$$C_r(\psi) = F_r^{-1} - F_r^{-1}\Delta_r(\Delta_r' F_r^{-1}\Delta_r)^{-1}\Delta_r' F_r^{-1},$$

$F_r = \Lambda_r\Gamma\Lambda_r'$ and $\Delta_r = \Lambda_r\Delta$. $M_r$ is asymptotically distributed as a $\chi^2_{l-q}$ random variable, where $l$ is the length of $p_r$. The $M_r$ statistic asymptotically follows a chi-square distribution not only when $\psi$ is estimated by maximum likelihood, but also when it is estimated by a $\sqrt{S}$-consistent estimator, such as the limited information estimator and the pairwise likelihood estimator presented in Section 3.4.1. Since the marginals should not be sparse, Maydeu-Olivares and Joe (2005) suggest using $M_2$ when the model is identified using only univariate and bivariate information, also because only up to bivariate sample moments and four-way model probabilities are involved in the computation of $M_2$. As the number of cells gets larger, the dimension of the matrices involved in (3.6) increases noticeably, and tricks may be necessary to do the computations. Analysis and extensions of this type of test are considered in Maydeu-Olivares and Joe (2006), Reiser (2008) and Joe and Maydeu-Olivares (2010). All applications considered regard item response theory, so an investigation of their performance with paired comparison data is necessary to understand the sample size needed for obtaining accurate Type I errors using $M_2$.
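To make the structure of (3.6) concrete, the following Python sketch evaluates $M_2$ for a deliberately small toy case: three binary comparisons under an independence model with one probability per comparison. This toy model is an assumption made purely for illustration and is not one of the paired comparison models discussed in this paper; the parameter vector is treated as given, whereas in practice it would be replaced by a consistent estimate. All function names are hypothetical, and the small linear solver stands in for a numerical library.

```python
from itertools import combinations, product


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    a is an n x n list of lists, b an n x m list of lists.
    """
    n = len(a)
    aug = [list(map(float, a[i])) + list(map(float, b[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        head = aug[col][col]
        aug[col] = [v / head for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


N = 3                                     # number of binary comparisons (toy size)
CELLS = list(product([0, 1], repeat=N))   # the 2^N response patterns
PAIRS = list(combinations(range(N), 2))


def indicators(x):
    """Univariate and bivariate indicators of pattern x (one column of Lambda_2)."""
    return [x[i] for i in range(N)] + [x[i] * x[j] for i, j in PAIRS]


U = [indicators(x) for x in CELLS]
L_DIM = N + len(PAIRS)                    # l, the length of p_r


def cell_probs(theta):
    """Joint probabilities pi(psi) under the toy independence model."""
    probs = []
    for x in CELLS:
        p = 1.0
        for xi, t in zip(x, theta):
            p *= t if xi else 1.0 - t
        probs.append(p)
    return probs


def marginals(probs):
    """Lambda_2 applied to a vector of cell probabilities or proportions."""
    return [sum(p * u[k] for p, u in zip(probs, U)) for k in range(L_DIM)]


def m2(p_cells, theta, sample_size):
    """Return (M_2, degrees of freedom) following the structure of (3.6)."""
    pi = cell_probs(theta)
    pi_r, p_r = marginals(pi), marginals(p_cells)
    # F_2 = Lambda_2 (diag(pi) - pi pi') Lambda_2'
    f = [[sum(p * u[a] * u[b] for p, u in zip(pi, U)) - pi_r[a] * pi_r[b]
          for b in range(L_DIM)] for a in range(L_DIM)]
    # Delta_2: derivatives of the marginals, analytic for the independence model
    delta = [[1.0 if k == i else 0.0 for k in range(N)] for i in range(N)]
    delta += [[theta[j] if k == i else theta[i] if k == j else 0.0
               for k in range(N)] for i, j in PAIRS]
    e = [[a - b] for a, b in zip(p_r, pi_r)]       # residual column vector
    f_inv_e = solve(f, e)
    f_inv_delta = solve(f, delta)
    d_t = transpose(delta)
    d_f_e = matmul(d_t, f_inv_e)                   # Delta' F^-1 e
    d_f_d = matmul(d_t, f_inv_delta)               # Delta' F^-1 Delta
    quad = (matmul(transpose(e), f_inv_e)[0][0]
            - matmul(transpose(d_f_e), solve(d_f_d, d_f_e))[0][0])
    return sample_size * quad, L_DIM - len(theta)


theta0 = [0.5, 0.5, 0.5]
exact_fit = m2(cell_probs(theta0), theta0, sample_size=100)
# All mass on patterns (0,0,0) and (1,1,1): strongly dependent responses.
dependent = [0.5 if x in ((0, 0, 0), (1, 1, 1)) else 0.0 for x in CELLS]
misfit = m2(dependent, theta0, sample_size=100)
```

When the observed proportions coincide with the model probabilities the statistic is zero, while the strongly dependent toy proportions give a positive value to be compared with a chi-square distribution on $l - q = 6 - 3 = 3$ degrees of freedom. The sketch builds the full set of $2^N$ patterns, which is exactly what becomes infeasible as $N$ grows.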

4. SOFTWARE

Fitting models to paired comparison data is facilitated by some R packages which allow fitting of the classical models and, in some cases, also of more complicated models.

The eba package (Wickelmaier and Schmid (2004)) fits elimination by aspects models (Tversky (1972)) to paired comparison data. The elimination by aspects model assumes that different objects present various aspects. The worth of each object is the sum of the worth associated with each aspect possessed by the object. When all objects possess only one relevant aspect, the elimination by aspects model reduces to the Bradley–Terry model. Therefore, when only one aspect per object is specified, the function eba can be used to fit model (2.1) with logit link, while when the link is probit the function thurstone can be used. The function strans checks how many violations of weak, moderate and strong stochastic transitivity are present in the data.

The prefmod package (Hatzinger (2010)) fits Bradley–Terry models exploiting their log-linear representation. Ordinal paired comparisons are allowed, but the software reduces the total number of categories to three or two, depending on whether there is a no preference category or not.

There are three different functions for estimating models for paired comparison data: the llbt.fit function, which estimates the log-linear version of the Bradley–Terry model through the estimation algorithm described in Hatzinger and Francis (2004); the llbtPC.fit function, which estimates the log-linear model exploiting the gnm (Turner and Firth (2010b)) function for fitting generalized nonlinear models; and the pattPC.fit function, which fits paired comparison data using a pattern design, that is, all possible patterns of paired comparisons. The latter function also handles some cases in which the responses are missing not at random; see Section 5. A difficulty of this approach is that the response table grows dramatically with the number of objects since, in the case of only two possible outcomes, the number of patterns is $2^N$, so no more than six objects can be included with two response categories, and no more than five with three response categories. Finally, the function pattnpml.fit fits a mixture model to overdispersed paired comparison data using nonparametric maximum likelihood.

The BradleyTerry2 package (Turner and Firth (2010a)) expands the previous BradleyTerry (Firth (2008)) package and allows one to fit the unstructured model (2.1) and extension (2.3) with logit, probit and cauchit link functions, including also comparison-specific covariates. Model fitting is by maximum likelihood, penalized quasi-likelihood or bias-reduced maximum likelihood (Firth (1993)). In the case of object-specific random effects, as in model (3.3), penalized quasi-likelihood (Breslow and Clayton (1993)) is used, while when an object wins or loses all the paired comparisons in which it is involved, so that its estimated worth parameter is infinite, bias-reduced maximum likelihood produces finite estimates. If there are missing explanatory variables, an additional worth parameter for the object with missing covariates is estimated. Order effects and more general comparison-specific covariates can be included, but only win-loss responses are allowed.

The package psychotree (Strobl, Wickelmaier and Zeileis (2011)) implements the method for recursive partitioning of the subjects on the basis of their explanatory variables and estimates an unstructured Bradley–Terry model for each of the final subgroups of subjects; see Section 2.4.

Although the available packages have many useful features, a combination of those provided by the different packages and also some additional features could be of practical help. The prefmod and BradleyTerry2 packages were built with the aim of analyzing multiple judgment data and tournament-like data, respectively. This is reflected in the different characteristics of the packages. A function that can handle responses with at least three categories, thus allowing for the “no preference” category, that includes different link functions, and that allows an easy implementation of object-, subject- and comparison-specific covariates in a linear model framework would be useful. The available methods for including dependencies between observations work only in a log-linear framework, through the introduction of further parameters in the predictor, or through object-related random effects, which are estimated by means of penalized quasi-likelihood, a method that does not perform well with binary data. At present, there are no available packages for the analysis of paired comparison data that allow the fitting of models such as those presented in Section 3.2.1. However, implementation of pairwise likelihood estimation for those models is straightforward since it implies only the computation of bivariate normal probabilities.
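As a rough indication of why only bivariate normal probabilities are needed, the Python sketch below evaluates a pairwise log-likelihood for binary comparisons. It is a simplified illustration under stated assumptions, not the implementation used in this paper: responses are binary (no threshold or "no preference" category), each latent difference is standardized to unit variance, and the bivariate normal distribution function is obtained by one-dimensional numerical integration instead of a library routine. The names `bvn_cdf`, `pair_probability` and `pairwise_loglik`, and all toy values, are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bvn_cdf(a, b, rho, n_steps=2000, lower=-8.0):
    """P(Z1 <= a, Z2 <= b) for a standard bivariate normal with |rho| < 1.

    Integrates phi(z) * Phi((b - rho * z) / sqrt(1 - rho^2)) over [lower, a]
    with the composite Simpson rule (n_steps must be even).
    """
    if a <= lower:
        return 0.0
    scale = sqrt(1.0 - rho * rho)

    def integrand(z):
        return STD_NORMAL.pdf(z) * STD_NORMAL.cdf((b - rho * z) / scale)

    h = (a - lower) / n_steps
    total = integrand(lower) + integrand(a)
    for k in range(1, n_steps):
        total += (4.0 if k % 2 else 2.0) * integrand(lower + k * h)
    return total * h / 3.0


def pair_probability(y1, y2, m1, m2, rho):
    """P(Y1 = y1, Y2 = y2), where Y_k = 1 when a latent N(m_k, 1) variable is positive."""
    s1 = 1.0 if y1 else -1.0
    s2 = 1.0 if y2 else -1.0
    # Flipping the sign of a latent variable flips the sign of the correlation.
    return bvn_cdf(s1 * m1, s2 * m2, s1 * s2 * rho)


def pairwise_loglik(responses, means, corr):
    """Sum of log bivariate probabilities over subjects and pairs of comparisons."""
    total = 0.0
    for y in responses:
        for j in range(len(y)):
            for k in range(j + 1, len(y)):
                total += log(pair_probability(y[j], y[k], means[j], means[k],
                                              corr[j][k]))
    return total


# Hypothetical toy input: two subjects, three binary comparisons each.
toy_means = [0.3, -0.5, 0.1]
toy_corr = [[1.0, 0.4, 0.2], [0.4, 1.0, -0.3], [0.2, -0.3, 1.0]]
toy_responses = [(1, 0, 1), (0, 0, 1)]
value = pairwise_loglik(toy_responses, toy_means, toy_corr)
```

With three response categories, each term would instead be a rectangle probability, obtainable as a signed sum of four such distribution function values at the threshold limits. The standardized means and correlations would be functions of the model parameters, and the log-likelihood would be maximized numerically.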

5. CONCLUSIONS

This paper reviews some of the extensions proposed in the literature to the two most commonly applied models for paired comparison data, namely the Bradley–Terry and the Thurstone models. However, not every aspect could be considered here. Among the issues that have not been treated are the development of models for multi-dimensional data when objects are evaluated with respect to multiple aspects (Bockenholt (1988); Dittrich et al. (2006)), the temporal extension for comparisons repeated in time (Fahrmeir and Tutz (1994), Glickman (2001), Bockenholt (2002), Dittrich, Francis and Katzenbeisser (2008)), the estimation of abilities of individuals belonging to a team that performs the paired comparisons (Huang, Weng and Lin (2006); Menke and Martinez (2008)) and many more. Another important issue concerns the optimal design of the experiment. Graßhoff et al. (2004) show that maximizing the determinant of the information matrix in an unstructured Bradley–Terry model with the minimum sample size requires that every comparison is performed once. When objects are specified using factors with a certain number of levels, the required sample size grows exponentially, while the number of parameters grows linearly as the number of factors increases. Some designs that reduce the number of required comparisons are investigated in Graßhoff et al. (2004). Graßhoff and Schwabe (2008) characterize the locally optimal design for a two-factor design in a Bradley–Terry model, but for more complex situations it seems difficult to give general results. Goos and Grossmann (2011) also consider the problem when within-pair order effects are present. Investigations of these issues in other models do not seem to be present in the literature.

The methods for independent data are well established, and a lot of literature has been published about them. The problem of the asymptotic behavior of the maximum likelihood estimator has been tackled. The case of a fixed number of objects and increasing number of comparisons per couple does not seem to pose particular difficulties for standard arguments, while the case of a fixed number of comparisons per couple and increasing number of items appears more problematic. In the context of the unstructured Bradley–Terry model, Simons and Yao (1999) find a condition on the growth rate of the largest ratio between item worth parameters which assures that the maximum likelihood estimator is consistent and asymptotically normally distributed. Yan, Yang and Xu (2012) investigate the case in which the number of comparisons per couple is not fixed, and some comparisons may also be missing, and find a condition that assures normality of the maximum likelihood estimator. We are not acquainted with any other investigation of the asymptotic behavior of estimators in models different from the unstructured, independent Bradley–Terry model.

Particular attention has been focused on models for dependent data. Thurstonian models appear particularly suitable to account for dependence between observations. However, the problems posed by the identification restrictions are noticeable. The estimated model has to be interpreted with reference to a class of covariance matrices, and different identification restrictions may lead to different classes of matrices. It is possible to rotate the matrix according to a predefined hypothesis about the covariance between certain items (Maydeu-Olivares and Hernandez (2007)), but the estimated standard errors vary depending on the fixed parameters, and the significance of the other estimated parameters changes.

In the multiple judgment sampling scheme it is often stated that if a judge does not perform all paired comparisons, then it suffices to define subject-specific matrices $A_s$ (see Section 3.2.1) with rows corresponding only to the comparisons performed by judge $s$. However, it is expected that this may be problematic for limited information estimation, and there are no studies about the consequences of missing data in this estimation method.

Missing observations also cause problems for testing the goodness of fit since quadratic statistics such as (3.5) assume that all comparisons are performed by all subjects.

Missing data may derive from the design of the experiment, for example when $n$ is very large, and only a subset of all comparisons is presented to each subject. Otherwise, if many comparisons are performed by the same subject, it may be necessary to account for the fatigue of subjects and/or for the passing of time when comparisons take a long time to complete.

Dittrich et al. (2012) consider the problem of missing data in the context of the log-linear representation of the Bradley–Terry model since the study of the missing mechanism may shed light on the psychological process. It is assumed that the probability that a comparison is missing follows a logistic distribution since this facilitates the fitting of the model. However, the likelihood for such models is not easy to compute, and the function in the prefmod package allows one to compute it only for data with up to six objects. It is not easy to discriminate between different types of missing mechanisms, and a very large number of observations may be needed in order to discriminate between a missing completely at random and a missing not at random situation.

Economic theory points out some problems in choice data that have not been considered yet. The main aspects which may need to be incorporated in models include the influence that subjects can have on each other, the influence of one particular subject, who may be some sort of leader, over all the other judges and the dependence in choices caused by the social and cultural context. Inclusion of these aspects will inevitably lead to even more complicated models for paired comparison data.

Finally, methods for object-related dependencies present many open problems. Most of the issues are connected to the dependence among all comparisons which is typically present in this context. Moreover, the scheme of paired comparisons is often much less balanced than in psychometric experiments. Asymptotic theory in models for dependent data when the number of items compared increases has not been developed yet. Maximum pairwise likelihood estimation provided encouraging results, but more extensive studies seem necessary. In this case, computation of standard errors is problematic since there are no independent replications of the data, so a viable alternative lies in parametric bootstrap. Methods for model selection and goodness of fit described in Section 3.4.3 require independent replication of all comparisons; hence they cannot be employed in this setting.

ACKNOWLEDGMENTS

The author would like to thank Cristiano Varin and Alessandra Salvan for helpful comments and the anonymous referees and an Associate Editor for comments and suggestions that led to a substantial improvement of the manuscript.

REFERENCES

Agresti, A. (1992). Analysis of ordinal paired comparison data. J. R. Stat. Soc. Ser. C Appl. Stat. 41 287–297.

Agresti, A. (2002). Categorical Data Analysis, 2nd ed. Wiley, New York. MR1914507

Barry, D. and Hartigan, J. A. (1993). Choice models for predicting divisional winners in major league baseball. J. Amer. Statist. Assoc. 88 766–774.

Bauml, K. H. (1994). Upright versus upside-down faces: How interface attractiveness varies with orientation. Percept. Psychophys. 56 163–172.

Bockenholt, U. (1988). A logistic representation of multivariate paired-comparison models. J. Math. Psych. 32 44–63. MR0935673

Bockenholt, U. (2001a). Hierarchical modeling of paired comparison data. Psychol. Methods 6 49–66.

Bockenholt, U. (2001b). Thresholds and intransitivities in pairwise judgments: A multilevel analysis. Journal of Educational and Behavioral Statistics 26 269–282.

Bockenholt, U. (2002). A Thurstonian analysis of preference change. J. Math. Psych. 46 300–314. MR1920807

Bockenholt, U. (2004). Comparative judgments as an alternative to ratings: Identifying the scale origin. Psychol. Methods 9 453–465.

Bockenholt, U. (2006). Thurstonian-based analyses: Past, present, and future utilities. Psychometrika 71 615–629. MR2312235

Bockenholt, U. and Dillon, W. R. (1997a). Modeling within-subject dependencies in ordinal paired comparison data. Psychometrika 62 411–434.

Bockenholt, U. and Dillon, W. R. (1997b). Some new methods for an old problem: Modeling preference changes and competitive market structures in pretest market data. Journal of Marketing Research 34 130–142.

Bockenholt, U. and Tsai, R. C. (2001). Individual differences in paired comparison data. Br. J. Math. Stat. Psychol. 54 265–277.

Bockenholt, U. and Tsai, R. C. (2007). Random-effects models for preference data. In Handbook of Statistics (C. R. Rao and S. Sinharay, eds.) 26 447–468. Elsevier, Amsterdam.

Bradley, R. A. (1976). Science, statistics, and paired comparisons. Biometrics 32 213–232. MR0408132


Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs. I. The method of paired comparisons. Biometrika 39 324–345. MR0070925

Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. J. Amer. Statist. Assoc. 88 9–25.

Carroll, J. D. and De Soete, G. (1991). Toward a new paradigm for the study of multiattribute choice behavior. Spatial and discrete modeling of pairwise preferences. American Psychologist 46 342–351.

Cattelan, M. (2009). Correlation models for paired comparison data. Ph.D. thesis, Dept. Statistical Sciences, Univ. Padua.

Cattelan, M., Varin, C. and Firth, D. (2012). Dynamic Bradley–Terry modelling of sports tournaments. J. R. Stat. Soc. Ser. C Appl. Stat. To appear.

Causeur, D. and Husson, F. (2005). A 2-dimensional extension of the Bradley–Terry model for paired comparisons. J. Statist. Plann. Inference 135 245–259. MR2200468

Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit models. Biometrika 85 347–361.

Choisel, S. and Wickelmaier, F. (2007). Evaluation of multichannel reproduced sound: Scaling auditory attributes underlying listener preference. J. Acoust. Soc. Am. 121 388–400.

Cox, D. R. and Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika 91 729–737. MR2090633

Craig, P. (2008). A new reconstruction of multivariate normal orthant probabilities. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 227–243. MR2412640

David, H. A. (1988). The Method of Paired Comparisons, 2nd ed. Griffin’s Statistical Monographs & Courses 41. Griffin, London. MR0947340

Davidson, R. R. (1970). On extending the Bradley–Terry model to accommodate ties in paired comparison experiments. J. Amer. Statist. Assoc. 65 317–328.

Davidson, R. R. and Farquhar, P. H. (1976). A bibliography on the method of paired comparisons. Biometrics 32 241–252. MR0408134

De Soete, G. and Winsberg, S. (1993). A Thurstonian pairwise choice model with univariate and multivariate spline transformations. Psychometrika 58 233–256.

Dillon, W. R., Kumar, A. and De Borrero, M. S. (1993). Capturing individual differences in paired comparisons: An extended BTL model incorporating descriptor variables. Journal of Marketing Research 30 42–51.

Dittrich, R., Francis, B. and Katzenbeisser, W. (2008). Temporal dependence in longitudinal paired comparisons. Research report, Dept. Statistics and Mathematics, WU Vienna Univ. Economics and Business.

Dittrich, R., Hatzinger, R. and Katzenbeisser, W. (1998). Modelling the effect of subject-specific covariates in paired comparison studies with an application to university rankings. J. R. Stat. Soc. Ser. C Appl. Stat. 47 511–525.

Dittrich, R., Hatzinger, R. and Katzenbeisser, W. (2001). Corrigendum: “Modelling the effect of subject-specific covariates in paired comparison studies with an application to university rankings.” J. R. Stat. Soc. Ser. C Appl. Stat. 50 247–249. MR1833276

Dittrich, R., Hatzinger, R. and Katzenbeisser, W. (2002). Modelling dependencies in paired comparison data: A log-linear approach. Comput. Statist. Data Anal. 40 39–57. MR1921121

Dittrich, R., Hatzinger, R. and Katzenbeisser, W. (2004). A log-linear approach for modelling ordinal paired comparison data on motives to start a PhD program. Stat. Model. 4 1–13.

Dittrich, R., Francis, B., Hatzinger, R. and Katzenbeisser, W. (2006). Modelling dependency in multivariate paired comparisons: A log-linear approach. Math. Social Sci. 52 197–209. MR2257629

Dittrich, R., Francis, B., Hatzinger, R. and Katzenbeisser, W. (2007). A paired comparison approach for the analysis of sets of Likert-scale responses. Stat. Model. 7 3–28. MR2749821

Dittrich, R., Francis, B., Hatzinger, R. and Katzenbeisser, W. (2012). Missing observations in paired comparison data. Stat. Model. 12 117–143.

Duineveld, C. A. A., Arents, P. and King, B. M. (2000). Log-linear modelling of paired comparison data from consumer tests. Food Quality and Preference 11 63–70.

Ellermeier, W., Mader, M. and Daniel, P. (2004). Scaling the unpleasantness of sounds according to the BTL model: Ratio-scale representation and psychoacoustical analysis. Acta Acustica United with Acustica 90 101–107.

Fahrmeir, L. and Tutz, G. (1994). Dynamic stochastic models for time-dependent ordered paired comparison systems. J. Amer. Statist. Assoc. 89 1438–1449.

Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika 80 27–38. MR1225212

Firth, D. (2005). Bradley–Terry models in R. Journal of Statistical Software 12 1–12.

Firth, D. (2008). BradleyTerry: Bradley–Terry models. Available at http://CRAN.R-project.org/package=BradleyTerry.

Firth, D. and de Menezes, R. X. (2004). Quasi-variances. Biometrika 91 65–80. MR2050460

Ford, L. R. Jr. (1957). Solution of a ranking problem from binary comparisons. Amer. Math. Monthly 64 28–33. MR0097876

Francis, B., Dittrich, R. and Hatzinger, R. (2010). Modeling heterogeneity in ranked responses by nonparametric maximum likelihood: How do Europeans get their scientific knowledge? Ann. Appl. Stat. 4 2181–2202. MR2829952

Francis, B., Dittrich, R., Hatzinger, R. and Penn, R. (2002). Analysing partial ranks by using smoothed paired comparison methods: An investigation of value orientation in Europe. J. R. Stat. Soc. Ser. C Appl. Stat. 51 319–336. MR1920800

Genz, A. and Bretz, F. (2002). Comparison of methods for the computation of multivariate t probabilities. J. Comput. Graph. Statist. 11 950–971. MR1944269

Glenn, W. A. and David, H. A. (1960). Ties in paired-comparison experiments using a modified Thurstone–Mosteller model. Biometrics 16 86–109.



















Glickman, M. E. (2001). Dynamic paired comparison models with stochastic variances. J. Appl. Stat. 28 673–689. MR1862491

Goos, P. and Grossmann, H. (2011). Optimal design of factorial paired comparison experiments in the presence of within-pair order effects. Food Quality and Preference 22 198–204.

Graßhoff, U. and Schwabe, R. (2008). Optimal design for the Bradley–Terry paired comparison model. Stat. Methods Appl. 17 275–289. MR2425186

Graßhoff, U., Großmann, H., Holling, H. and Schwabe, R. (2004). Optimal designs for main effects in linear paired comparison models. J. Statist. Plann. Inference 126 361–376. MR2090864

Hatzinger, R. (2010). prefmod: Utilities to fit paired comparison models for preferences. Available at http://CRAN.R-project.org/package=prefmod.

Hatzinger, R. and Francis, B. J. (2004). Fitting paired comparison models in R. Research report, Univ. Wien. Available at http://epub.wu.ac.at/id/eprint/740.

Head, M. L., Doughty, P., Blomberg, S. P. and Keogh, S. (2008). Chemical mediation of reciprocal mother–offspring recognition in the Southern Water Skink (Eulamprus heatwolei). Australian Ecology 33 20–28.

Henery, R. J. (1992). An extension to the Thurstone–Mosteller model for chess. The Statistician 41 559–567.

Huang, T.-K., Weng, R. C. and Lin, C.-J. (2006). Generalized Bradley–Terry models and multi-class probability estimates. J. Mach. Learn. Res. 7 85–115. MR2274363

Joe, H. and Maydeu-Olivares, A. (2010). A general family of limited information goodness-of-fit statistics for multinomial data. Psychometrika 75 393–419. MR2719935

Kent, J. T. (1982). Robust properties of likelihood ratio tests. Biometrika 69 19–27. MR0655667

Kissler, J. and Bauml, K. H. (2000). Effects of the beholder’s age on the perception of facial attractiveness. Acta Psychol. (Amst) 104 145–166.

Knorr-Held, L. (2000). Dynamic rating of sports teams. The Statistician 49 261–276.

Lancaster, J. F. and Quade, D. (1983). Random effects in paired-comparison experiments using the Bradley–Terry model. Biometrics 39 245–249. MR0712751

Le Cessie, S. and Van Houwelingen, J. C. (1994). Logistic regression for correlated binary data. J. R. Stat. Soc. Ser. C Appl. Stat. 43 95–108.

Lele, S. R., Nadeem, K. and Schmuland, B. (2010). Estimability and likelihood inference for generalized linear mixed models using data cloning. J. Amer. Statist. Assoc. 105 1617–1625. MR2796576

Lindsay, B. G. (1988). Composite likelihood methods. In Statistical Inference from Stochastic Processes (Ithaca, NY, 1987). Contemp. Math. 80 221–239. Amer. Math. Soc., Providence, RI. MR0999014

Luce, R. D. (1959). Individual Choice Behavior: A Theoretical Analysis. Wiley, New York. MR0108411

Marschak, J. (1960). Binary-choice constraints and random utility indicators. In Mathematical Methods in the Social Sciences, 1959 (Arrow, K. J., Karlin, S. and Suppes, S., eds.) 312–329. Stanford Univ. Press, Stanford, CA. MR0118556

Matthews, J. N. S. and Morris, K. P. (1995). An application of Bradley–Terry-type models to the measurement of pain. J. R. Stat. Soc. Ser. C Appl. Stat. 44 243–255.

Maydeu-Olivares, A. (2001). Limited information estimation and testing of Thurstonian models for paired comparison data under multiple judgment sampling. Psychometrika 66 209–227. MR1836935

Maydeu-Olivares, A. (2002). Limited information estimation and testing of Thurstonian models for preference data. Math. Social Sci. 43 467–483. MR2073576

Maydeu-Olivares, A. (2003). Thurstonian covariance and correlation structures for multiple judgment paired comparison data. Working Papers Economia, Instituto de Empresa, Area of Economic Environment. Available at http://econpapers.repec.org/RePEc:emp:wpaper:wp03-04.

Maydeu-Olivares, A. (2006). Limited information estimation and testing of discretized multivariate normal structural models. Psychometrika 71 57–77. MR2272520

Maydeu-Olivares, A. and Bockenholt, U. (2005). Structural equation modeling of paired-comparison and ranking data. Psychometrika 10 285–304.

Maydeu-Olivares, A. and Bockenholt, U. (2008). Modeling subject health outcomes. Top 10 reasons to use Thurstone’s method. Medical Care 46 346–348.

Maydeu-Olivares, A. and Hernandez, A. (2007). Identification and small sample estimation of Thurstone’s unrestricted model for paired comparisons data. Multivariate Behavioral Research 42 323–347.

Maydeu-Olivares, A. and Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in $2^n$ contingency tables: A unified framework. J. Amer. Statist. Assoc. 100 1009–1020. MR2201027

Maydeu-Olivares, A. and Joe, H. (2006). Limited information goodness-of-fit testing in multidimensional contingency tables. Psychometrika 71 713–732. MR2312239

Mazzucchi, T. A., Linzey, W. G. and Bruning, A. (2008). A paired comparison experiment for gathering expert judgment for an aircraft wiring risk assessment. Reliability Engineering and System Safety 93 722–731.

McFadden, D. (2001). Economic choices. American Economic Review 91 351–378.

McHale, I. and Morton, A. (2011). A Bradley–Terry type model for forecasting tennis match results. International Journal of Forecasting 27 619–630.

Mease, D. (2003). A penalized maximum likelihood approach for the ranking of college football teams independent of victory margins. Amer. Statist. 57 241–248. MR2016258

Menke, J. E. and Martinez, T. R. (2008). A Bradley–Terry artificial neural network model for individual ratings in group competitions. Neural Computing & Applications 17 175–186.

Miwa, T., Hayter, A. J. and Kuriki, S. (2003). The evaluation of general non-centred orthant probabilities. J. R. Stat. Soc. Ser. B Stat. Methodol. 65 223–234. MR1959823

Molenberghs, G. and Verbeke, G. (2005). Models for Discrete Longitudinal Data. Springer, New York. MR2171048

Mosteller, F. (1951). Remarks on the method of paired comparisons. I. The least squares solution assuming equal standard deviations and equal correlations. II. The effect of an aberrant standard deviation when equal standard deviations and equal correlations are assumed. III. A test of significance for paired comparisons when equal standard deviations and equal correlations are assumed. Psychometrika 16 3–9, 203–218.

Muthen, B. (1978). Contributions to factor analysis of dichotomous variables. Psychometrika 43 551–560. MR0521904

Muthen, B. (1993). Goodness of fit with categorical and other non normal variables. In Structural Equation Models (K. A. Bollen, J. S. Long, eds.). 205–234. Sage, Newbury Park, CA.

Muthen, B., Du Toit, S. H. C. and Spisic, D. (1997). Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes. Technical report.

R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. Available at http://www.R-project.org.

Rao, P. V. and Kupper, L. L. (1967). Ties in paired-comparison experiments: A generalization of the Bradley–Terry model. J. Amer. Statist. Assoc. 62 194–204. MR0217963

Reiser, M. (2008). Goodness-of-fit testing using components based on marginal frequencies of multinomial data. British J. Math. Statist. Psych. 61 331–360. MR2649040

Sham, P. C. and Curtis, D. (1995). An extended transmission/disequilibrium test (TDT) for multi-allele marker loci. Ann. Hum. Genet. 59 323–336.

Simons, G. and Yao, Y.-C. (1999). Asymptotics when the number of parameters tends to infinity in the Bradley–Terry model for paired comparisons. Ann. Statist. 27 1041–1060. MR1724040

Springall, A. (1973). Response surface fitting using a generalization of the Bradley–Terry paired comparison model. J. R. Stat. Soc. Ser. C Appl. Stat. 22 59–68.

Stern, H. (1990). A continuum of paired comparisons models. Biometrika 77 265–273. MR1064798

Stern, S. E. (2011). Moderated paired comparisons: A generalized Bradley–Terry model for continuous data using a discontinuous penalized likelihood function. J. R. Stat. Soc. Ser. C Appl. Stat. 60 397–415. MR2767853

Stigler, S. M. (1994). Citation patterns in the journals of statistics and probability. Statist. Sci. 9 94–108.

Strobl, C., Wickelmaier, F. and Zeileis, A. (2011). Accounting for individual differences in Bradley–Terry models by means of recursive partitioning. Journal of Educational and Behavioral Statistics 36 135–153.

Stuart-Fox, D. M., Firth, D., Moussalli, A. and Whiting, M. J. (2006). Multiple signals in chameleon contests: Designing and analysing animal contests as a tournament. Animal Behavior 71 1263–1271.

Takane, Y. (1989). Analysis of covariance structures and probabilistic binary choice data. In New Developments in Psychological Choice Modeling (G. De Soete, H. Feger and K. C. Klauser, eds.). North-Holland, Amsterdam.

Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review 34 368–389.

Thurstone, L. L. and Jones, L. V. (1957). The rational origin for measuring subjective values. J. Amer. Statist. Assoc. 52 458–471.

Train, K. E. (2009). Discrete Choice Methods with Simulation, 2nd ed. Cambridge Univ. Press, Cambridge. MR2519514

Tsai, R.-C. (2000). Remarks on the identifiability of Thurstonian ranking models: Case V, Case III, or neither? Psychometrika 65 233–240. MR1763521

Tsai, R.-C. (2003). Remarks on the identifiability of Thurstonian paired comparison models under multiple judgment. Psychometrika 68 361–372. MR2272384

Tsai, R.-C. and Bockenholt, U. (2002). Two-level linear paired comparison models: Estimation and identifiability issues. Math. Social Sci. 43 429–449. MR2072966

Tsai, R.-C. and Bockenholt, U. (2006). Modelling intransitive preferences: A random-effects approach. J. Math. Psych. 50 1–14. MR2208061

Tsai, R.-C. and Bockenholt, U. (2008). On the importance of distinguishing between within- and between-subject effects in intransitive intertemporal choice. J. Math. Psych. 52 10–20. MR2407792

Turner, H. and Firth, D. (2010a). Bradley–Terry models in R: The BradleyTerry2 package. Available at http://CRAN.R-project.org/package=BradleyTerry2.

Turner, H. and Firth, D. (2010b). Generalized nonlinear models in R: An overview of the gnm package. Available at http://CRAN.R-project.org/package=gnm.

Tversky, A. (1972). Elimination by aspects: A theory of choice. Psychological Review 79 281–299.

Usami, S. (2010). Individual differences multidimensional Bradley–Terry model using reversible jump Markov chain Monte Carlo algorithm. Behaviormetrika 37 135–155.

Varin, C., Reid, N. and Firth, D. (2011). An overview of composite likelihood methods. Statist. Sinica 21 5–42. MR2796852

Walker, J. and Ben-Akiva, M. (2002). Generalized random utility model. Math. Social Sci. 43 303–343. MR2072961

Whiting, M. J., Stuart-Fox, D. M., O’Connor, D., Firth, D., Bennett, N. C. and Blomberg, S. P. (2006). Ultraviolet signals ultra-aggression in a lizard. Animal Behavior 72 353–363.

Wickelmaier, F. and Schmid, C. (2004). A Matlab function to estimate choice model parameters from paired-comparison data. Behavior Research Methods, Instruments, and Computers 36 29–40.

Yan, T., Yang, Y. and Xu, J. (2012). Sparse paired comparisons in the Bradley–Terry model. Statist. Sinica 22 1305–1318.

Zermelo, E. (1929). Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Math. Z. 29 436–460. MR1545015

Zhao, Y. and Joe, H. (2005). Composite likelihood estimation in multivariate data analysis. Canad. J. Statist. 33 335–356. MR2193979






