
Journal of Memory and Language 96 (2017) 78–92


A Bayesian approach to the mixed-effects analysis of accuracy data in repeated-measures designs

Yin Song a, Farouk S. Nathoo a,*, Michael E.J. Masson b,*
a Department of Mathematics and Statistics, University of Victoria, Canada
b Department of Psychology, University of Victoria, Canada

* Corresponding authors. E-mail addresses: [email protected] (F.S. Nathoo), [email protected] (M.E.J. Masson).

http://dx.doi.org/10.1016/j.jml.2017.05.002


Article history: Received 6 April 2016; revision received 3 May 2017

Keywords: Accuracy studies; Bayesian analysis; Behavioral data; Model selection; Repeated-measures

Many investigations of human language, memory, and other cognitive processes use response accuracy as the primary dependent measure. We propose a Bayesian approach for the mixed-effects analysis of accuracy studies using mixed binomial regression models. We present logistic and probit mixed models that allow for random subject and item effects, as well as interactions between experimental conditions and both items and subjects in either one- or two-factor repeated-measures designs. The effect of experimental conditions on accuracy is assessed through Bayesian model selection, and we consider two such approaches to model selection: (a) the Bayes factor via the Bayesian Information Criterion approximation and (b) the Watanabe-Akaike Information Criterion. Simulation studies are used to assess the methodology and to demonstrate its advantages over the more standard approach that consists of aggregating the accuracy data across trials within each condition, and over the contemporary use of logistic and probit mixed models with model selection based on the Akaike Information Criterion. Software and examples in R and JAGS for implementing the analysis are available at https://v2south.github.io/BinBayes/.

Crown Copyright © 2017 Published by Elsevier Inc. All rights reserved.

Introduction

Many types of behavioral data generated by experimental investigations of human language, memory, and other cognitive processes entail the measurement of response accuracy. For example, in studies of word identification, error rates in word-naming or lexical-decision tasks are analyzed to determine whether manipulated variables or item characteristics influence response accuracy (e.g., Chateau & Jared, 2003; Yap, Balota, Tse, & Besner, 2008). Similarly, in experiments on memory topics such as false memory and the avoidance of retroactive and proactive interference on recall, response errors or probability of accurate responding are the critical measures of performance (e.g., Arndt & Reder, 2003; Jacoby, Wahlheim, & Kelley, 2015).

The common treatment of accuracy or error-rate data has consisted, and to a large extent continues to consist, of aggregating data across trials within each condition for each subject to generate the equivalent of a proportion correct or incorrect score, ranging from 0 to 1. These scores are then analyzed using repeated-measures analysis of variance (ANOVA) or, in the simplest cases, a t test. Although this standard approach, hereafter termed the 'standard aggregating approach', has serious problems that have repeatedly been pointed out to researchers, it continues to be used. Here we illustrate a solution to these problems offered by Bayesian data analysis. We first (re)summarize the problems of the standard aggregating approach. Then we summarize one approach to this problem that has gained traction over the last decade (non-Bayesian generalized linear mixed models), followed by a brief review of some of the general pros and cons of Bayesian approaches. The rest of the paper then presents a Bayesian statistical modeling framework for repeated-measures accuracy data, simulation studies evaluating the proposed methodology, and an application to actual data arising from a single-factor repeated-measures design.
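Before turning to those problems, and for concreteness only, a minimal R sketch of the standard aggregating approach just described might look as follows. The data frame trials, with factors subject and condition and a 0/1 accuracy column acc, is a hypothetical stand-in for trial-level data, not data from this paper.

# Standard aggregating approach (sketch): average binary trials to proportions,
# then run a repeated-measures ANOVA ('trials' is a hypothetical data frame;
# subject and condition are assumed to be factors).
agg <- aggregate(acc ~ subject + condition, data = trials, FUN = mean)

fit <- aov(acc ~ condition + Error(subject / condition), data = agg)
summary(fit)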

The standard aggregating approach

To assess the validity of our characterization of how researchers typically analyze accuracy, error, or other classification data, we examined articles published in recent issues of four of the leading journals in the field of cognitive psychology: the Journal of Memory and Language (JML), the Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP), Cognition, and Cognitive Psychology. All articles appearing in issues with a publication date of January to August 2016 (up to the October 2016 issue for JML because later issues of that journal were available at the time the survey was conducted) were considered. Articles in which



accuracy was analyzed using a transformed measure such as d′, receiver operating characteristic curves, or parameters of computational models based on simulation of accuracy data were not included. A total of 180 articles across the four journals reported data expressed as proportions or the equivalent (e.g., accuracy, error, classification responses). Among these articles, 69 were on a topic related to language processing and the remaining 111 addressed other issues in memory and cognition. For each article, we determined whether the authors used standard methods of analyzing data that included aggregating performance across items or across subjects or whether generalized linear mixed models were used in which individual trials were the units of analysis. We included in the standard-analysis category any standard univariate method of analysis, such as analysis of variance, t-tests, correlation, and regression, in which data were aggregated over items or over subjects. The application of analysis of variance using subjects and items as random effects in separate analyses and reporting F1 and F2 was also classified as using a method of aggregation. This approach, used widely since Clark's (1973) seminal paper on item variability, relies on an analysis that aggregates across defined subsets of trials (items for F1 and subjects for F2), rather than analyzing data at the level of individual trials. Our assessment indicated that for articles on language-related topics, 37 (54%) applied some form of the standard aggregating approach (of these, 15 used methods that reported effects aggregated over subjects and effects aggregated over items; i.e., F1 and F2). For articles on other topics of memory and cognition, 99 (89%) relied on the standard aggregating approach (two of these reported F1 and F2 analyses). Overall, then, 76% of recently published articles in these four leading cognitive psychology journals analyzed accuracy or other binomial data in the historically standard way, which involves aggregating performance across items for at least a subset of the analyses. The remaining articles used generalized linear mixed models to analyze the data,1 which does not aggregate across items and which we discuss in detail below.

The shortcomings of what continues to be a widely applied method of analyzing accuracy data, and binomial data in general (i.e., aggregating across items), have been known for some time (Cochran, 1940) and have been reiterated in recent accounts of alternative approaches (e.g., Dixon, 2008; Jaeger, 2008; Quené & Van den Bergh, 2008). For instance, the proportions generated from binary observations (correct versus incorrect) need not be normally distributed, which violates one of the fundamental assumptions of ANOVA and t-tests. Moreover, the variance of accuracy scores will depend on the mean, with larger variance when the mean is closer to .5 and variance vanishing to zero as the mean approaches 0 or 1. This dependency implies that if effects are present (i.e., means vary across conditions), the assumption of homogeneity of variance on which ANOVA depends is likely to be violated. By aggregating data across trials, error variance is likely to be reduced, leading to an elevation of type I error probability in null-hypothesis significance testing (Quené & Van den Bergh, 2008). Finally, because proportion correct is bounded by 0 and 1, confidence intervals created from such data may well extend outside that range when the relevant mean approaches one of those limits, meaning that probability mass is being assigned to impossible values (Dixon, 2008; Jaeger, 2008).

A common strategy that is adopted to avoid these problems is the application of a data transformation such as the square-root arcsine transformation. Unfortunately, this approach makes the interpretation of the analysis more difficult as the hypothesis tests then correspond to the means of the transformed data and not to the actual accuracy data. Jaeger (2008) also shows that these transformations do not fix the problem when the mean proportions are close to 0 or 1. Furthermore, transforming the data after aggregating across items precludes the investigation of item effects.

1 In four of the articles reporting linear mixed model regression analyses of accuracy, it was not clear whether logistic regression was used or whether raw accuracy was the dependent measure.

Generalized linear mixed models

A viable solution to these difficulties with the standard aggregating approach to analyzing accuracy data involves using generalized linear mixed models of logistic regression (Dixon, 2008; Jaeger, 2008; Quené & Van den Bergh, 2008). In this setting a hierarchical model based on two levels is specified for the data, where, at the first level, the response variables are assumed to be generated from a Bernoulli distribution. At the second level of the model, the accuracy or error rates are converted to a logit scale (the logarithm of the odds of success or failure), $\mathrm{logit}(p) = \ln(p/(1-p))$, and the variability in the log-odds across subjects, items, and conditions is based on a mixed-effects model. We emphasize here that p is not computed from the data and does not correspond to the proportion of accurate responses aggregated over items for a given condition and subject; rather, p is an unknown parameter representing the probability of an accurate response for a given subject, item, and experimental condition. Rather than aggregating data over trials to obtain a single estimate of the proportion correct in a given condition for each subject, the individual binary accuracy trial scores are the unit of measurement. This level of granularity allows the assessment of possible random effects for both subjects and items. That is, effects of a manipulation may not be consistent from subject to subject or item to item, and a mixed-effects analysis can characterize the extent of these differences. Variance in effects across items can thus be assessed, which addresses the concern raised by Clark (1973) about the "language-as-a-fixed-effect fallacy" (Jaeger, 2008; Quené & Van den Bergh, 2008).
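As a point of reference, a minimal non-Bayesian version of such a trial-level analysis with the lme4 package might look as follows; the data frame trials (one row per trial, with a 0/1 column acc and factors condition, subject, and item) is again a hypothetical illustration.

library(lme4)

# Trial-level mixed-effects logistic regression: one row per binary trial, with
# random intercepts for subjects and items ('trials' is a hypothetical data frame).
fit <- glmer(acc ~ condition + (1 | subject) + (1 | item),
             data = trials, family = binomial(link = "logit"))
summary(fit)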

The proposed use of mixed-effects logistic regression for the analysis of accuracy data can be implemented either with or without significance tests. In the latter case, information criteria such as the Akaike Information Criterion (AIC) can be used for model selection. In the former case, these analyses continue to rely on the basic principles of null-hypothesis significance testing (NHST) for making decisions about whether independent variables are producing effects on performance. A number of recent reports in the psychological literature have highlighted potential deficiencies associated with NHST (e.g., Kruschke, 2013; Rouder, Morey, Speckman, & Province, 2012; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Wagenmakers, 2007). We will briefly mention only a few of those difficulties here.

First, the probability value associated with a significance test reflects the probability of obtaining an observed result, or one more extreme, on the assumption that the null hypothesis is true. An inference must then follow, establishing one's belief that the null hypothesis is false on those occasions where the obtained probability value is very low. Many researchers mistakenly interpret that probability as the likelihood that the null hypothesis is true, given the observed data (e.g., Haller & Krauss, 2002). That inference is not available through NHST, but it can, for example, be generated by a Bayesian analysis. Second, NHST is, by design, capable of providing evidence in favour of only the alternative hypothesis. When evidence does not allow rejection of the competing null hypothesis, no strong conclusion can be reached. Another potential advantage of the Bayesian approach is that it allows the strength of evidence in favor of either a null or an alternative hypothesis to be quantified. Although such reasoning can, and has been, accommodated under some non-Bayesian approaches, it follows naturally from the Bayesian perspective and Bayesian methods provide one principled approach to doing this. Finally,



although not an inherent problem of non-Bayesian approaches but rather a potential pitfall of their misuse, when using NHST researchers are susceptible to the problems caused by optional stopping during data collection. One may be tempted, for example, to collect additional data if a NHST applied to data currently in hand generates a p value that is just shy of significance. It has been clearly demonstrated that this approach to data collection substantially raises the probability of a type I error (e.g., Wagenmakers, 2007), whereas Bayesian analysis is not susceptible to this problem and will only yield increasingly accurate results as more data accumulate (Berger & Berry, 1988; Wagenmakers, 2007), assuming the methods are used adequately. Again, we emphasize that this is not an inherent problem of non-Bayesian approaches but one that can and often does arise when these procedures are misused.

Bayesian approaches

As a solution to this problem with NHST and to advance the use of mixed-effects analyses of binomial data, we propose the use of a Bayesian version of mixed-effects models of logistic and probit regression for the repeated-measures case. The nature of these modified versions of regression analysis is discussed in detail below. Although Bayesian analysis of generalized linear mixed models has been developed extensively in the statistical literature over the past decade, our interest is specifically in the application to behavioral data arising from accuracy studies, where a method combining the use of Bayesian analysis with generalized linear mixed models for binomial data has not been considered previously. We investigate two options for selecting between null and alternative hypotheses (models) using Bayesian analysis: the Bayes factor computed using a Bayesian Information Criterion (BIC) approximation, and the Watanabe-Akaike Information Criterion (WAIC). We provide R software to implement these approaches in addition to computing posterior distributions.

Although the Bayesian approach offers an exciting avenue for the analysis of memory and language data, there is a large body of work that debates the pros and cons of a Bayesian analysis. Efron (1986) discussed the potential problems with the Bayesian approach and provided examples where the frequentist approach provides easier solutions. The use of Bayesian approaches can lead to an increase in conceptual complexity of some aspects of the data analysis. Users must take the time to acquire the necessary background in order to use Bayesian methods appropriately. In addition, an important issue that has received a lot of attention in the literature is the choice of the prior distribution, which can have an influence on the results. Informative prior distributions can be chosen to reflect prior knowledge on certain parameters of a model, though formulating such priors can be very difficult and is often impractical. In our work, we favour the use of weakly informative priors that have high or infinite prior variance so that the prior plays a limited role in the inference. For Bayesian logistic regression the issue of priors and the development of weakly informative priors is discussed extensively in Gelman, Jakulin, Pittau, and Su (2008).

The use of logistic mixed models is an alternative to methods that involve aggregation over items or subjects. As demonstrated in the literature, this alternative is a more effective methodology for the analysis of repeated-measures accuracy studies. In the memory and language literature this has been considered primarily from a classical, non-Bayesian perspective (Dixon, 2008; Jaeger, 2008; Quené & Van den Bergh, 2008). In parallel, there is currently a shift towards the use of Bayesian methods for the analysis of cognitive studies (Wagenmakers, 2007; Rouder et al., 2012; Rouder et al., 2009). To date, much of this shift has focused on the analysis of continuous response variables. The goal of this project is to provide tools for memory and language researchers to combine the advantages of both logistic/probit mixed models and Bayesian methods for the analysis of repeated-measures studies of accuracy. In doing so, we allow for the evaluation of posterior distributions for the effects of interest, and we also offer two possible approaches for Bayesian model selection: (a) the Bayes factor based on the BIC approximation and (b) the more recently proposed WAIC. These two approaches are motivated by different utilities: the former assesses model performance using the marginal likelihood, while the latter assesses model performance by estimating a measure of out-of-sample prediction error.

Given that we offer two different approaches for Bayesian model selection, it is naturally of interest to consider comparisons between them. We therefore make these comparisons by evaluating the operating characteristics of both approaches using simulation studies. In order to place these comparisons within the larger field of methods that can be applied to repeated-measures accuracy data, we also evaluate two other approaches. One is a Bayesian analysis based on item aggregation with model selection determined by the Bayes factor. The other is logistic mixed modeling within the classical (non-Bayesian) setting with model selection based on the standard Akaike Information Criterion (AIC). The latter is arguably the current non-Bayesian state of the art. We use these evaluations to inform a discussion of the pros and cons of the Bayesian approach and its combination with logistic/probit mixed modeling.

In the first simulation study, comparisons are made within the context of testing for a fixed effect of the experimental conditions. In that study, we find that the proposed Bayesian approach and the current non-Bayesian state of the art perform equally well in the sense that the two exhibit identical power curves after calibrating for type I error. In the second simulation study, comparisons are made within the context of testing for a random effect of the experimental conditions. More specifically, we consider the scenario where the effect of the experimental manipulation varies across items. In that case, we find that the fully Bayesian approach based on the WAIC exhibits uniformly higher power than logistic mixed modeling within the classical (non-Bayesian) setting with model selection based on the standard AIC.

The primary contributions of our work are: (a) facilitating the combination of binomial mixed-effects modeling and Bayesian inference for repeated-measures analysis of accuracy studies, (b) simulation studies evaluating this approach relative to the standard aggregating approach and classical logistic mixed models, and (c) making available easy-to-use R software with examples facilitating such analysis.

Overview

The remainder of the article proceeds as follows. In the next section we present the statistical modeling framework and discuss Bayesian approaches for evaluating the effect of experimental conditions on accuracy. We then present two simulation studies evaluating the proposed methodology in comparison to a benchmark consisting of a Bayesian analysis (using the BayesFactor R package based on Rouder et al., 2012) of data aggregated in the standard way. Comparisons to logistic mixed modeling within the non-Bayesian setting with model selection based on the AIC are also made. This is followed by the description of an application to actual data, where we demonstrate how our methodology can be applied to a single-factor repeated-measures design. Two-factor repeated-measures designs can also be accommodated, and we provide an example of this analysis on a webpage (https://v2south.github.io/BinBayes/) associated with this paper. The final section concludes with a discussion and practical recommendations for the analysis of accuracy studies.



Method

Statistical models for repeated-measures accuracy studies using single-factor designs

We first consider a repeated-measures design involving K subjects, I experimental conditions corresponding to a single factor, and J items, where generally the number of experimental conditions will be much smaller than the number of items. For each subject the data consist of a binary response measuring accuracy on each of the J items. The response on each item occurs in a particular experimental condition and these conditions are randomly assigned across items. The structure of the data is illustrated in Table 1 for the case where there are I = 3 experimental conditions with labels b, g, or r, and these labels are indicated as subscripts on each of the binary measurements, the latter taking values either 0 or 1. It should be noted that our framework can accommodate cases where items are seen by the same subject in multiple conditions. In that case the data could be represented by adding additional columns for each item in Table 1. Furthermore, the studies considered here may use counterbalancing, so that for some subjects a particular item will be assigned to a particular condition whereas for other subjects that same item will be assigned to a different condition. We also do not require that a response is obtained on every item for each subject, so it is possible for different subsets of items to be presented to different subjects.

Table 1. Example data structure: I = 3 conditions, K subjects, and J items, where conditions are indicated as subscripts b, g, or r on each binary data value.

In the model specifications that follow, the notation $X \sim F$, where X is a random variable and F is a probability distribution, such as the standard normal distribution with mean 0 and standard deviation 1, $N(0, 1)$, denotes that the random variable X is drawn from the distribution F. Similarly, if Y is also a random variable, the notation $X \mid Y \sim F_Y$ denotes that the conditional distribution of X given the value of the random variable Y is $F_Y$. For example, $X \mid Y = y \sim N(y, 1)$ denotes that, given that Y = y, the conditional distribution of X is a normal distribution with mean y and standard deviation 1. In addition, if $X_1, \ldots, X_n$ is a collection of random variables, the notation $X_i \overset{\mathrm{iid}}{\sim} F$, $i = 1, \ldots, n$, is used to denote that these random variables are independent and identically distributed from the distribution F (i.e., independently selected from the same distribution, F).

We let $Y_{ijk}$ denote the binary response obtained from subject k when item j is assigned to condition i. The approach, hereafter termed the 'standard aggregating approach', often applied in the analysis of such data begins by averaging the response variables across the items for each condition to obtain accuracy scores corresponding to each subject and condition. Referring to Table 1, this averaging results in three response variables for each row of the table, one for each of the three conditions. A repeated-measures ANOVA is then applied to the aggregated data. In contrast, trial-based analyses like those pursued here (and in Dixon, 2008) avoid averaging over items and model each binary score using a Bernoulli distribution. We let $p_{ijk}$ denote the probability of an accurate response ($Y_{ijk} = 1$) when item j is assigned to condition i and subject k. The model assumes

$$Y_{ijk} \mid p_{ijk} \overset{\mathrm{ind}}{\sim} \mathrm{Bernoulli}(p_{ijk}),$$

and we emphasize that our modeling approach is specified at the level of binary observations (i.e., $Y_{ijk}$ is either 0 or 1) and the accuracy probability p is a parameter of the corresponding Bernoulli distribution with $p = \Pr(Y = 1)$. Each of these binary observations is specific to a particular subject, item, and level of the experimental condition. The analysis we propose evaluates a number of models, each corresponding to different assumptions on how this probability varies across items, subjects, and conditions. We note that the experimental design is such that we expect lack of independence of the data within subject (the rows of Table 1) and also within item (the columns of Table 1). As a result, all of the models that we consider include random effects for both subjects and items (Clark, 1973) to account for possible dependence.

These models rely on first transforming the accuracy probability ($p_{ijk}$) using a link function $g(p) : [0, 1] \to \mathbb{R}$. That is, the link function g takes a value of p from 0 to 1 and maps it onto an element of the set of real numbers. We consider two common link functions: $g(p) = \log\frac{p}{1-p}$, which corresponds to a logistic model, and $g(p) = \Phi^{-1}(p)$, which corresponds to a probit model, where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. In the latter model it is useful to think of the probability values, p, as being converted to a corresponding Z-score. The two link functions are depicted in the left panel of Fig. 1.
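In R, these two link functions are available directly as the quantile functions of the logistic and standard normal distributions; a quick illustrative check:

p <- 0.9
qlogis(p)   # logit link: log(p / (1 - p)), about 2.20
qnorm(p)    # probit link: the standard-normal quantile ("Z-score"), about 1.28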

The hierarchical Bayesian models for accuracy are presented below. The different models represent different possible effects. Following the recommendations of Gelman et al. (2008), Cauchy prior distributions are assigned to the fixed effects in all of the models. The center panel of Fig. 1 provides an illustration of the Cauchy prior distribution in relation to the normal distribution. As is typical in Bayesian generalized linear mixed models, the random effects are assigned normal distributions and the variance components corresponding to the random effects are assigned inverse-gamma distributions. The latter distribution is illustrated in the right panel of Fig. 1 and is a convenient choice as it means that the algorithm used to fit the model to the data has an analytical form that is easy to work with.

Fig. 1. Left panel: the logistic and probit link functions; center panel: the probability density function of a Cauchy distribution and a normal distribution; right panel: the probability density function of an inverse-gamma distribution.

1. (LM0 - Logit/PM0 - Probit) Baseline model with random subject and item effects and no effect of the experimental condition:

$$g(p_{ijk}) = \beta_0 + a^{(R)}_j + b^{(R)}_k$$

$$a^{(R)}_j \overset{\mathrm{iid}}{\sim} N(0, \sigma^2_a), \quad b^{(R)}_k \overset{\mathrm{iid}}{\sim} N(0, \sigma^2_b), \quad \beta_0 \sim \mathrm{Cauchy}(0, 10)$$

$$\sigma^2_a \sim \text{Inverse-Gamma}(\kappa_{\sigma_a}, s_{\sigma_a}), \quad \sigma^2_b \sim \text{Inverse-Gamma}(\kappa_{\sigma_b}, s_{\sigma_b})$$

Here $\beta_0$ is the model intercept, $a^{(R)}_j$ is the item random effect with variance $\sigma^2_a$, and $b^{(R)}_k$ is the subject random effect with variance $\sigma^2_b$. The hyper-parameters $\kappa_{\sigma_a}$, $s_{\sigma_a}$, $\kappa_{\sigma_b}$, $s_{\sigma_b}$ are fixed to values that make the inverse-gamma prior distributions weakly informative, with infinite variance. We reiterate that this model assumes that there is no effect of the experimental condition on the probability of an accurate response, and it is the only model where this assumption is made. This model corresponds to the null hypothesis.

2. (LMF - Logit/PMF - Probit) Fixed effect for the experimental condition:

$$g(p_{ijk}) = \beta_0 + \alpha_i + a^{(R)}_j + b^{(R)}_k$$

with $\alpha_1 = 0$, $\alpha_i \overset{\mathrm{iid}}{\sim} \mathrm{Cauchy}(0, 2.5)$, $i = 2, \ldots, I$, and with all other priors identical to model 1. The constraint $\alpha_1 = 0$ is imposed for model identification; as a result, one arbitrarily selected experimental condition is considered a baseline condition and the remaining fixed effects $\alpha_i$ represent the effect of condition i relative to that baseline. The value of the scale parameter (2.5) in the Cauchy prior distribution for the fixed effects is different from the corresponding value in the prior for the intercept (10), based on the work of Gelman et al. (2008), who recommend these choices for Bayesian logistic regression as weakly informative priors. The Cauchy prior is centered around zero so that there is no preference for either a positive or negative effect. Using cross-validation, Gelman et al. (2008) show that this class of priors outperforms Gaussian and Laplace priors, and we use it here for both logistic and probit regression. This model extends the first model by assuming that the probability of an accurate response depends on the experimental condition through a fixed effect $\alpha_i$.

3. (LMRs - Logit/PMRs - Probit) The effect of the experimental condition varies across subjects:

$$g(p_{ijk}) = \beta_0 + \alpha_i + a^{(R)}_j + b^{(R)}_k + (\alpha b)^{(R)}_{ik}$$

where, in this case, the effect of the experimental condition is represented by both the fixed effects $\alpha_i$ and the random effects $(\alpha b)^{(R)}_{ik}$, which represent an interaction between subject and condition, implying that the effect of condition varies across subjects. As before, constraints are imposed so that one arbitrarily selected condition is taken as a baseline condition, $(\alpha b)^{(R)}_{1k} = 0$, and the remaining random effects are assumed to be normally distributed, $(\alpha b)^{(R)}_{ik} \overset{\mathrm{iid}}{\sim} N(0, \sigma^2_{\alpha b})$, $i = 2, \ldots, I$, $k = 1, \ldots, K$. An inverse-gamma prior distribution is assumed for the corresponding variance component, $\sigma^2_{\alpha b} \sim \text{Inverse-Gamma}(\kappa_{\sigma_{\alpha b}}, s_{\sigma_{\alpha b}})$, with hyper-parameters $\kappa_{\sigma_{\alpha b}}$, $s_{\sigma_{\alpha b}}$ fixed to values that make the inverse-gamma prior distribution weakly informative, with infinite variance.

4. (LMRi - Logit/PMRi - Probit) The effect of the experimental condition varies across items:

$$g(p_{ijk}) = \beta_0 + \alpha_i + a^{(R)}_j + b^{(R)}_k + (\alpha a)^{(R)}_{ij}$$

with the constraint $(\alpha a)^{(R)}_{1j} = 0$ imposed so that one arbitrarily selected condition is taken as a baseline condition, and the remaining random effects are assumed to be normally distributed, $(\alpha a)^{(R)}_{ij} \overset{\mathrm{iid}}{\sim} N(0, \sigma^2_{\alpha a})$, $i = 2, \ldots, I$, $j = 1, \ldots, J$. An inverse-gamma prior distribution is assumed for the corresponding variance component, $\sigma^2_{\alpha a} \sim \text{Inverse-Gamma}(\kappa_{\sigma_{\alpha a}}, s_{\sigma_{\alpha a}})$, with $\kappa_{\sigma_{\alpha a}}$, $s_{\sigma_{\alpha a}}$ fixed to values that make the inverse-gamma prior distribution weakly informative, with infinite variance. All other prior distributions are identical to model 2. In this case the effect of the experimental condition is represented by both the fixed effects $\alpha_i$ and the random effects $(\alpha a)^{(R)}_{ij}$, which represent an interaction between item and condition (items potentially vary in the extent to which they exhibit effects of the experimental conditions).

5. (LMRs,i - Logit/PMRs,i - Probit) The effect of the experimental condition varies across items and subjects:

$$g(p_{ijk}) = \beta_0 + \alpha_i + a^{(R)}_j + b^{(R)}_k + (\alpha a)^{(R)}_{ij} + (\alpha b)^{(R)}_{ik}$$

with distributions for random effects (condition-by-item and condition-by-subject effects) and hyper-priors set as in models 3 and 4. This is the most general of the models considered for a single-factor repeated-measures design.

Each of the five models presented above represents different assumptions about the effect of the experimental conditions while explicitly modeling the binary response through the Bernoulli distribution and accounting for between-subject and between-item variability with random effects. Considering the two possible choices for the link function, logit or probit, there are ten possible models; these models are summarized in Table 2.
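To make the specification concrete, the following is a minimal JAGS sketch of the fixed-effect logistic model LMF as it might be written for use from R. This is an illustration rather than the BinBayes.R implementation; the data names (y, cond, item, subj) and the Gamma(0.5, 0.5) hyper-parameter values are assumptions.

# Minimal JAGS sketch of model LMF (illustrative; not the actual BinBayes.R code).
lmf_model <- "
model {
  for (n in 1:N) {                                 # one binary trial per row
    y[n] ~ dbern(p[n])
    logit(p[n]) <- b0 + a[cond[n]] + aR[item[n]] + bR[subj[n]]
  }
  b0 ~ dt(0, 1 / 10^2, 1)                          # Cauchy(0, 10) prior on the intercept
  a[1] <- 0                                        # baseline condition constrained to zero
  for (i in 2:I) { a[i] ~ dt(0, 1 / 2.5^2, 1) }    # Cauchy(0, 2.5) fixed effects
  for (j in 1:J) { aR[j] ~ dnorm(0, tau.a) }       # item random effects
  for (k in 1:K) { bR[k] ~ dnorm(0, tau.b) }       # subject random effects
  tau.a ~ dgamma(0.5, 0.5)   # gamma prior on precision = inverse-gamma on the variance
  tau.b ~ dgamma(0.5, 0.5)   # (shape/rate values here are illustrative assumptions)
}
"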

Analysis of two-factor designs

In the case of a design with two experimental factors, we let $Y_{hijk}$ denote the binary response obtained from subject k when item j is assigned to condition i of the first factor and condition h of the second factor, $k = 1, \ldots, K$, $j = 1, \ldots, J$, $i = 1, \ldots, I$, $h = 1, \ldots, H$. We assume

$$Y_{hijk} \mid p_{hijk} \overset{\mathrm{ind}}{\sim} \mathrm{Bernoulli}(p_{hijk})$$

where $p_{hijk}$ is the corresponding probability of an accurate response. Different models correspond to different assumptions on how this probability varies across items, subjects, and the levels of the two experimental factors.


Table 2. The full set of Bernoulli mixed models for single-factor designs representing different assumptions about effects on the accuracy probability.

Model    Link      Condition effect
LM0      Logistic  Null
LMF      Logistic  Fixed
LMRs     Logistic  Varies across subjects
LMRi     Logistic  Varies across items
LMRs,i   Logistic  Varies across subjects and items
PM0      Probit    Null
PMF      Probit    Fixed
PMRs     Probit    Varies across subjects
PMRi     Probit    Varies across items
PMRs,i   Probit    Varies across subjects and items


The most general model we consider for a two-factor design takes the form

$$g(p_{hijk}) = \beta_0 + \gamma_h + \alpha_i + (\gamma\alpha)_{hi} + a^{(R)}_j + b^{(R)}_k + (\gamma a)^{(R)}_{hj} + (\alpha a)^{(R)}_{ij} + (\gamma b)^{(R)}_{hk} + (\alpha b)^{(R)}_{ik}$$

where $\alpha_i$, $\gamma_h$, and $(\gamma\alpha)_{hi}$ are fixed effects corresponding to the first factor, the second factor, and their interaction, respectively. As before, $a^{(R)}_j$ and $b^{(R)}_k$ are random effects for items and subjects, while the random effects $(\gamma a)^{(R)}_{hj}$, $(\alpha a)^{(R)}_{ij}$, $(\gamma b)^{(R)}_{hk}$, and $(\alpha b)^{(R)}_{ik}$ allow the effects of the two experimental factors to vary across items and subjects. Just as with the single-factor case, we assume Cauchy priors for the fixed effects and adopt the identification constraints $\alpha_1 = \gamma_1 = (\gamma\alpha)_{1i} = (\gamma\alpha)_{h1} = 0$ so that the first levels of both factors are taken to be baseline conditions. The random effects are assigned normal priors as before and the corresponding variance components are assigned weakly informative inverse-gamma hyper-priors.

For simplicity, we do not consider models where the interaction between the two experimental factors itself interacts with either items or subjects. Although allowing such terms increases the flexibility of the model, the associated parameters can be only weakly estimable in practice and will therefore exhibit a low degree of Bayesian learning (prior-to-posterior movement), particularly with binary data. We note that there is some controversy in the literature with regard to the inclusion of high-order random effects in mixed models. For example, Barr, Levy, Scheepers, and Tily (2013) suggest that linear mixed models generalize best when the maximal random effects structure supported by the design is employed; however, Bates, Kliegl, Vasishth, and Baayen (2015) indicate problems with this suggestion, including convergence problems of numerical algorithms for fitting mixed models and the potential lack of model interpretability. Our choice to exclude models with random effects that represent three-way interactions is in line with the discussion in Bates et al.

Many different models can be obtained by removing certain terms in the model equation above, and either the logit or probit link can be employed. For a single-factor design, the set of ten possible models is summarized in Table 2, where, for example, LMRi (PMRi) denotes the logistic (probit) model where the effect of the experimental condition is represented as a random effect that varies across items, and we assume the presence of a fixed effect in any model that contains a corresponding random effect for the experimental condition. In the case of two factors, the set of models obtainable by removing appropriate terms from the general model equation above is considerably larger. For the logistic (probit) link we refer to specific models using the notation LMxNyIz (PMxNyIz), where x ∈ {0, F, Rs, Ri, Rs,i} denotes the model structure for the first factor as defined for single-factor designs in Table 2, y ∈ {0, F, Rs, Ri, Rs,i} similarly denotes the model structure for the second factor, and z ∈ {0, 1} specifies the presence or absence of an interaction between the two factors, with z = 1 (z = 0) indicating presence (absence). For example, LMRs,iNRs,iI1 denotes the most general model specified in the equation above with a logit link, PM0N0I0 denotes the null probit model where all terms corresponding to the effects of the two factors have been removed, and LMRsNRs,iI1 denotes a logistic model where the effect of the first factor varies across subjects, the effect of the second factor varies across subjects and items, and the interaction between the factors is included.

Model fitting and software

The posterior distribution of the model parameters (fixed effects, random effects, and variance components) associated with each model can be computed using standard Markov chain Monte Carlo (MCMC) sampling algorithms. These procedures can be implemented in the R (Ihaka & Gentleman, 1996) and JAGS (Plummer, 2003) programming languages in conjunction with the R package 'rjags' (Plummer, 2013), which provides an interface between the two. We have developed an R function 'BinBayes.R' that allows these models and algorithms to be used in a relatively straightforward manner requiring only very basic knowledge of the R language. The software, along with a detailed illustration of its use for single-factor and two-factor repeated-measures designs, sample data, and examples, is available for download at https://v2south.github.io/BinBayes/.
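As a rough illustration of this workflow (not the BinBayes.R interface itself), a model string such as the LMF sketch shown earlier could be fit with rjags roughly as follows; the trial-level vectors y, cond, item, and subj are assumed to already be in memory.

library(rjags)

# Rough sketch of fitting a JAGS model string such as 'lmf_model' above
# (y is a 0/1 response vector; cond, item, subj are integer codes; all assumed).
jags_data <- list(y = y, cond = cond, item = item, subj = subj,
                  N = length(y), I = max(cond), J = max(item), K = max(subj))
m <- jags.model(textConnection(lmf_model), data = jags_data, n.chains = 3)
update(m, 2000)                                   # burn-in
post <- coda.samples(m, variable.names = c("b0", "a"), n.iter = 10000)
summary(post)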

Bayesian model comparison

For a given dataset, we are able to summarize the posterior distribution for any of the models for single-factor and two-factor designs. An arguably more important task is the comparison of models, as each model represents different assumptions regarding the effect of the experimental condition on the probability of response accuracy. For example, and in reference to Table 2, a comparison of models LM0 and LMF corresponds to testing for a fixed effect of the experimental condition, whereas a comparison of models LM0 and LMRi corresponds to testing for an effect of the experimental condition that allows for this effect to vary across items. As the link function is a modeling choice, logit and probit models can also be compared (e.g., LMF and PMF) to determine which is more appropriate for the data at hand.

The traditional approach for model comparison in the Bayesian framework is based on the Bayes factor. Given two models denoted by M0 and M1, the Bayes factor comparing M0 to M1 is defined as

$$BF_{01} = \frac{\Pr(y \mid M_0)}{\Pr(y \mid M_1)}$$

where y denotes the data and $\Pr(y \mid M)$ denotes the probability of the data under M. A value of $BF_{01} > 1$ can be viewed as evidence in favour of model M0 over M1 in the sense that the probability of the data is higher under M0. Kass and Raftery (1995) provide a comprehensive review of the Bayes factor, including information about its interpretation, where it is suggested that a value of $BF_{01} \geq 3$ corresponds to positive evidence in favour of model M0 over M1, whereas decisive evidence corresponds to $BF_{01} > 150$.

In general, the Bayes factor can be difficult to compute and a great deal of research in the area of statistical computing has been dedicated to this problem (see e.g. Chib & Jeliazkov, 2001; Chen, 2005; Meng & Schilling, 2002; Meng & Wong, 1996; Raftery, Newton, Satagopan, & Krivitsky, 2007). A number of Monte Carlo algorithms can be applied for the computation of the Bayes factor; however, for the Bernoulli mixed models under consideration in this article, we have found that stable estimation of the Bayes factor is extremely time consuming (e.g. several hours to days on a


fast laptop for datasets of standard to large size). As a more practical alternative that is easy to compute in just a few seconds, we use the Bayesian information criterion (BIC), defined for a given model M by

$$\mathrm{BIC}(M) = -2 \log \hat{L} + p \log n,$$

where $\hat{L}$ is the maximized likelihood function for model M, p is the number of parameters in the model, and n is the sample size. In the case of generalized linear mixed models for repeated-measures designs, both the number of parameters p and the sample size n are not straightforward to define (Jones, 2011; Spiegelhalter, Best, Carlin, & Van Der Linde, 2002). We will assume that p excludes the random effects but includes the corresponding variance components and the number of fixed effects. Indeed, this definition for p is the default used in the computation of the BIC in the R package lme4 (Bates, Mächler, Bolker, & Walker, 2014). For the sample size, we assume that n = K, the number of subjects. The issue of effective sample size for BIC is considered in detail by Berger, Bayarri, and Pericchi (2014); see also Nathoo and Masson (2016).

Given the value of BIC for two competing models, the Bayes factor is approximated by $BF_{01} \approx \exp\{(\mathrm{BIC}(M_1) - \mathrm{BIC}(M_0))/2\}$, where this expression assumes $\Pr(M_0) = \Pr(M_1)$ a priori, and the accuracy of the approximation increases with the sample size. The approximation is based on a unit information prior for the model parameters (see e.g. Kass & Raftery, 1995; Masson, 2011; Nathoo & Masson, 2016; Wagenmakers, 2007). Alternatively, given a set of competing models such as those listed in Table 2, the BIC for each model can be computed in order to rank the models, with lower values corresponding to preferred models. Typically we require a difference in the BIC scores of two models to be at least $|\Delta \mathrm{BIC}| = 2$ in order to claim that there is positive evidence in favour of one model over the other, which corresponds to a Bayes factor of approximately 2.72.
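For the maximum-likelihood fits underlying this approximation, a minimal lme4 sketch is given below, again assuming a hypothetical trial-level data frame trials with a binary acc column and subject, item, and condition factors. Note that lme4's BIC() uses the total number of observations as n, whereas we take n = K above, so the two computations are not identical.

library(lme4)

# Sketch of the BIC approximation to the Bayes factor comparing LM0 and LMF
# ('trials' is a hypothetical data frame; see the caveat about n in the text above).
m0 <- glmer(acc ~ 1         + (1 | subject) + (1 | item),
            data = trials, family = binomial)
m1 <- glmer(acc ~ condition + (1 | subject) + (1 | item),
            data = trials, family = binomial)

BF01 <- exp((BIC(m1) - BIC(m0)) / 2)   # values > 1 favour the null model LM0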

Rather than comparing models based on Bayes factors, a second approach that can be used to compare models is based on an evaluation of how well each model can predict new or held-out data. By held-out data, we mean a subset of the data that is not used in the process of fitting the model but is subsequently used to evaluate the predictive ability of the model. In this context, cross-validation is a common approach for estimating the out-of-sample prediction error, which can then be used to compare models (Gelman, Hwang, & Vehtari, 2014). As cross-validation requires splitting the data into multiple partitions and then repeatedly fitting the model to subsets of the data, which is computationally demanding, alternative measures of predictive accuracy that in some sense approximate cross-validation have been proposed for model selection. One such approximation that has been applied extensively for model selection is the AIC (Akaike, 1998), which is computed using the maximum likelihood estimator. The computation of the AIC, like the BIC, requires a value for the number of model parameters p, which is not clearly defined in the case of hierarchical models as described above. An alternative, fully Bayesian approximation to cross-validation that avoids this problem is the WAIC, proposed by Watanabe (2010), which takes the form

$$\mathrm{WAIC} = -2 \sum_{k=1}^{K} \log E_\theta[\,p(y_k \mid \theta) \mid y_1, \ldots, y_K\,] + 2 \sum_{k=1}^{K} \mathrm{Var}_\theta[\,\log p(y_k \mid \theta) \mid y_1, \ldots, y_K\,]$$

where $y_k$ is the data collected from subject k, $\theta$ denotes the set of all unknown parameters, $p(y_k \mid \theta)$ denotes the probability mass function of $y_k$ conditional on $\theta$, and the expectation $E_\theta[\cdot]$ and variance $\mathrm{Var}_\theta[\cdot]$ are taken with respect to the posterior distribution of the model parameters. The first part of the formula for WAIC provides a measure of the fit of the model against the data and the second part is meant to capture the inherent complexity of the model, where the reasoning is that more complex models are a priori less likely (i.e., it is a form of Occam's razor).

In practice, these quantities are computed from the MCMC output used to fit the model; the WAIC is thus easily obtained as a by-product of model fitting. As discussed in Gelman, Carlin, Stern, and Rubin (2014) and Gelman, Hwang et al. (2014), the WAIC has the desirable property of averaging over the posterior distribution of the model parameters, that is, considering many possible values of the model parameters weighted by their posterior density, rather than conditioning on just a single value of the parameter (the maximum likelihood estimator), as is done with AIC. This is desirable because it captures the uncertainty a rational researcher should have in the estimates of the model, given the assumptions the researcher was willing to make about the prior and model structure.
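As an illustration of how this by-product computation might look (an assumed layout for the saved quantities, not the BinBayes.R code), suppose loglik is an S x K matrix holding log p(y_k | theta) for each of S posterior draws and K subjects:

# Sketch of WAIC from MCMC output: 'loglik' is an S x K matrix of log p(y_k | theta)
# values, one row per posterior draw and one column per subject (assumed layout).
waic <- function(loglik) {
  # lppd: sum over subjects of log E_theta[p(y_k | theta)], via log-sum-exp for stability
  lppd <- sum(apply(loglik, 2, function(l) max(l) + log(mean(exp(l - max(l))))))
  p_waic <- sum(apply(loglik, 2, var))   # penalty: posterior variance of the log-likelihood
  -2 * lppd + 2 * p_waic                 # smaller values indicate better expected prediction
}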

More importantly, the WAIC works well with so-called singular models, that is, models with hierarchical structures where the number of parameters increases with sample size. It is thus particularly well suited to the random-effect models we are considering in the present article, and unlike the penalties used by the BIC and AIC, the penalty used in the WAIC has been rigorously justified for model comparison in this setting (Watanabe, 2010).

Although a generally applicable calibration for differences in the WAIC scores of two models is currently lacking, a reasonable heuristic is to apply the criteria often used for the standard AIC (Burnham & Anderson, 1998; Dixon, 2008). One such criterion is to require that $|\Delta \mathrm{WAIC}| \geq 2$ in order to claim that there is positive evidence in favour of one model over the other, though other criteria may also be used. The operating characteristics of this decision rule are evaluated in comparison with the BIC and AIC through our simulation studies.

We emphasize that BIC and WAIC, although both Bayesian in their formulation, are constructed with different utilities in mind. The BIC is based on the notion of posterior model probabilities and the Bayes factor, whereas the WAIC is an estimate of the expected out-of-sample prediction error. Given the differing utilities, it is certainly possible that the two criteria will disagree, and we view the approaches as complementary. Detailed comparisons are made in the next section. In addition to computing the posterior distribution for a given model, our R function BinBayes.R will also compute the BIC and WAIC for any of the models in Table 2 or for models associated with the two-factor designs discussed above.

Simulation studies

In order to evaluate the methodology described in the previous section, we conducted two simulation studies. The studies examined the type I error and the power of specific decision rules based on the Bernoulli mixed models and the BIC or the WAIC for evaluating the effect of experimental conditions. These are compared with the type I error and the power of the standard aggregating approach after aggregating over items. Although we make our comparisons based on frequentist criteria, we note that it is not uncommon to evaluate Bayesian methods using such criteria (see e.g. Carlin & Louis, 2008; Gelman et al., 2013). In the non-Bayesian context, simulation studies somewhat similar to those presented here are conducted in Dixon (2008). For the standard aggregating approach, we let $\tilde{Y}_{ik}$ denote the proportion of accurate responses in the ith condition of the experiment for subject k. This is obtained by averaging that subject's binary response variables $Y_{ijk}$ over the items at condition i. The statistical model considered in this case is



$$\tilde{Y}_{ik} = \mu_i + b_k + \epsilon_{ik}, \quad \epsilon_{ik} \overset{\mathrm{iid}}{\sim} N(0, \sigma^2_\epsilon), \quad i = 1, \ldots, I, \; k = 1, \ldots, K \qquad (1)$$

where $b_k \overset{\mathrm{iid}}{\sim} N(0, \sigma^2_b)$ are subject-specific random effects that are assumed independent of the model errors $\epsilon_{ik}$, and $\mu_i$ is the fixed effect of the experimental condition. To evaluate the effect of the experimental condition, this model is compared with the simpler model that assumes no effect of the experimental condition and thus has $\mu_i = \mu$, $i = 1, \ldots, I$. For the standard aggregating approach we will again make this evaluation within the Bayesian paradigm and compute the Bayes factor using the BayesFactor package (Rouder et al., 2012) in the R programming language. The BayesFactor package implements Monte Carlo techniques for computing the Bayes factor for a class of Gaussian error models, of which (1) is a special case. Finally, we compare the three Bayesian approaches for model selection (the first Bayesian approach is the standard aggregating approach under a Bayesian framework) with the standard AIC criterion applied to logistic mixed models.
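A minimal sketch of this aggregated Bayesian benchmark (assuming a hypothetical data frame agg with one proportion-correct score per subject-by-condition cell, and subject and condition coded as factors) is:

library(BayesFactor)

# Item-aggregated Bayesian benchmark: Bayes factor for the condition effect with
# subject treated as a random effect ('agg' is a hypothetical aggregated data frame).
bf <- anovaBF(acc ~ condition + subject, data = agg, whichRandom = "subject")
bf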

In each study, we simulated binary response accuracy data of the type depicted in Table 1, with K = 72 subjects, J = 120 items, and I = 4 experimental conditions manipulated as a repeated-measures factor. In the first study we simulate data where the effect of the experimental condition does not depend on items or subjects, and in the second study this effect is not held constant but instead varies across the items. In each setting, 1500 simulation replicates were used to estimate each point of the power curve for comparing the null and alternative models as the effect size varies. To create power curves when the BIC approximation to the BF was used, our decision rule was to reject the null model whenever $BF > \exp(1)$ (which occurs when $\Delta \mathrm{BIC} = \mathrm{BIC}_{null} - \mathrm{BIC}_{alt} > 2$), and the same rule was used for both the WAIC and AIC. For the standard item-aggregated approach using the BF, the null model was rejected whenever the Bayes factor favouring the alternative model was greater than $\exp(1)$. Additionally, we also created power curves for each of the four methods where the decision rule was chosen for each so that the type I error rate was fixed at 0.05. Fixing the type I error rate for all four methods to have the same value has the advantage of making the power curves more directly comparable.

Simulation study I

We generated the simulated data using the logistic mixed model LMF, which contains a fixed effect for the experimental condition. The simulated data could also be generated from a probit mixed model; however, we simulated from the logistic model as it is more commonly used in practice. In this case, the effect of the experimental condition does not depend on subjects or items. The fixed effect is represented by $\alpha_i$, and we set $\alpha_1 = \alpha_2 = \alpha_3 = 0$ and $\alpha_4 = C$, where C ranged over a series of values from C = 0 to C = 1.5. This particular pattern of fixed effects, where the effect of only a single condition varies, was chosen so that the results are easier to summarize and visualize (i.e., in a two-dimensional plot). The intercept was set at $\beta_0 = 3.22$, corresponding to a baseline accuracy rate of 96% (representative of performance in speeded word identification experiments, for example). We also considered, and will summarize later, results from simulations in which performance is not near ceiling or floor. The variance components were set according to $\sigma_b = 1.045$ for the standard deviation of the subject random effects, and we considered three values for the standard deviation of the item random effects, $\sigma_a = 1.5$, 3, or 5. These choices for the simulation parameter values are guided by the model estimates obtained from the single-factor real-data example analyzed in the next section.

For each value of the fixed effect C and the item standard deviation σ_a, we simulated 1500 datasets, and for each dataset we fit model LMF which contains the fixed effect of the experimental condition, as well as model LM0 which has no effect for the experimental condition (the null model). The BIC approximation to the BF, AIC, and WAIC for both models were computed, and in addition, the standard model in Eq. (1) was applied after aggregating over items, and the Bayes factor comparing the models with and without a fixed effect for condition was computed. In all four cases, the decision rules described in the previous section were applied to create power curves for the different model selection criteria.
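A maximum-likelihood version of this model comparison can be sketched with lme4; note that the BIC reported by lme4 is used here only as a rough stand-in for the BIC approximation to the BF described above, which in the paper is computed within the Bayesian fitting framework.

# Sketch: fit the null and alternative logistic mixed models to the simulated
# data frame 'd' from the previous sketch and form the criterion differences.
library(lme4)

lm0 <- glmer(acc ~ 1 + (1 | subject) + (1 | item),
             data = d, family = binomial)
lmf <- glmer(acc ~ factor(condition) + (1 | subject) + (1 | item),
             data = d, family = binomial)

dAIC <- AIC(lm0) - AIC(lmf)   # > 2 rejects the null under the AIC rule
dBIC <- BIC(lm0) - BIC(lmf)   # > 2 corresponds roughly to BF > exp(1) in favour of LMF
c(dAIC = dAIC, dBIC = dBIC)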

The results of the simulation study are presented in Fig. 2, which shows the result of assessing the significance of the experimental manipulation by comparing models with and without fixed effects for each approach. In the case where the decision rules were set so that the type I error rate was fixed at 0.05 for all three Bayesian approaches as well as the AIC (second column of Fig. 2), a clear pattern emerges. When the between-item variability is at its lowest level, with σ_a = 1.5, all four approaches have power curves that are virtually identical. In this case there appears to be no advantage to applying the Bernoulli mixed model over the standard aggregating approach. However, as the between-item variability increases, the Bernoulli mixed models with BIC approximation to BF, AIC, or WAIC outperform the standard aggregating approach with BF and have uniformly higher power. Interestingly, the BIC approximation to the BF, AIC, and WAIC have identical power curves when they are calibrated to have the same type I error rate.

In practice, outside of a simulation study, it may not be possible to calibrate the decision rule so as to ensure a specific value for the type I error rate. The first column of Fig. 2 depicts the power curves when the decision rules ΔAIC > 2, ΔWAIC > 2, and BF > exp(1) are applied. In this case, it must be understood that a comparison of the power curves is not an 'apples-to-apples' comparison as the type I error rates are not the same. For all values of σ_a the type I error rate for the Bernoulli mixed model with BF > exp(1) (ΔBIC > 2) is 0. This produces a conservative rule that has uniformly lower power than the Bernoulli mixed model with ΔAIC > 2 and ΔWAIC > 2. As a tradeoff, the latter have a higher type I error rate, and we note that the power curves for WAIC and AIC are virtually identical in this case. The standard aggregating approach with rejection of the null when BF > exp(1) also has a type I error rate of 0 in all cases. It should be noted that although both the BIC approximation to the BF and the standard item-aggregated approach with BF have a type I error rate of 0, the corresponding adjusted power curves differ due to the fact that the two approaches require different decision rules in order to fix the type I error rates at 0.05.

The power curve associated with the standard aggregating approach is always below that of the WAIC and AIC, and is above the power curve of the BIC approximation to the BF when σ_a is set to its lowest value of 1.5. As the value of σ_a increases, the power of the Bernoulli mixed model with BIC approximation to the BF begins to improve relative to the standard aggregating approach. In the case where σ_a = 5 both the BIC approximation to the BF and the standard aggregating approach have a type I error rate of 0, but the power of the Bernoulli mixed model with BIC approximation to the BF is generally higher than that of the standard aggregating approach, particularly for higher values of the effect size C.

Simulation study II

We next consider the situation where the effect of the experimental condition varies across the items and where the objective is again to evaluate the data for the existence of a condition effect.


[Fig. 2 appears here: six panels of power curves plotting P(Reject H0) against effect size C, with rows σ_a = 1.5, 3, 5 and columns for uncalibrated versus calibrated decision rules. Legend: Standard aggregating, BF; Non-Bayes GLMM, AIC; Bayes GLMM, BF (via BIC); Bayes GLMM, WAIC.]

Fig. 2. Results from simulation study I. The left column corresponds to the decision rules ΔBIC > 2 (for Bayes GLMM, BF via BIC), ΔAIC > 2 (for non-Bayes GLMM, AIC), ΔWAIC > 2 (for Bayes GLMM, WAIC), and BF > exp(1) (for standard aggregating BF), whereas in the right column the decision rules were chosen to ensure that all four methods had a type I error rate of 0.05. The rows correspond to different values of the between-item variability σ_a = 1.5, 3, 5. Values of C represent the strength of the effect of the experimental conditions.


We generated data sets from the model LMRi which contains random effects for both subjects and items, a fixed effect for the experimental condition, and a random effect representing the interaction between experimental condition and items. The parameter values for the simulation were again guided by model estimates obtained from the single-factor dataset considered in the next section. For simulating data, we set the intercept to be β_0 = 3.21 corresponding to a baseline accuracy rate of 96%, and the variance components for the subject and item random effects were set based on σ_b = 1.04 and σ_a = 0.44 respectively, with these values being based on estimates obtained from fitting the model to the real data example discussed in the next section. The fixed effect for the experimental condition was again set based on α_1 = α_2 = α_3 = 0 and α_4 = C, and we considered three possible values, C = 0, 0.25, 0.5. The effect of the experimental condition varied across items through the random effect (aa)^(R)_ij ~ iid N(0, σ²_aa), and the magnitude of this variability depended on the parameter σ_aa, which we varied over a series of values ranging from 0 to 2. The values of C and σ_aa are varied factorially so that overall there can be a random effect even without a fixed effect (C = 0) and there can be a fixed effect without a random effect (σ_aa = 0). When both C = 0 and σ_aa = 0 there is no effect present.
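Continuing the simulation sketch from study I, the condition-by-item interaction can be added as follows; the value of σ_aa here is illustrative (and study II used β_0 = 3.21, σ_b = 1.04, and σ_a = 0.44), and the code reuses the objects defined in the earlier sketch.

# Sketch: add item-specific condition deviations with standard deviation
# sigma_aa to the linear predictor from the study I sketch.
sigma_aa <- 0.5
aa <- matrix(rnorm(J * I, 0, sigma_aa), nrow = J, ncol = I)   # (item, condition) deviations

eta2   <- b0 + alpha[d$condition] + b[d$subject] + a[d$item] +
          aa[cbind(d$item, d$condition)]
d$acc2 <- rbinom(nrow(d), size = 1, prob = plogis(eta2))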

For each value of C and σ_aa we again simulated 1500 datasets, and for each dataset we fit the model LMRi which contains an effect for the experimental condition that varies across the items, as well as the model LM0 which has no effect for the experimental condition (the null model). The comparison of models LMRi and LM0 then corresponds to an overall test for an effect of the experimental condition, where this effect can be either random and varying across items, fixed, or both. After aggregating over items, the standard aggregating approach is applied as in the previous section. We note that the standard aggregating approach is not sufficiently flexible to allow the condition effect to depend on items, and so as before, the effect of experimental condition was evaluated through the model in Eq. (1) and the Bayes factor comparing the models with and without a fixed effect for condition was computed. In all cases the decision rules used in the first simulation study were applied again.

The results are depicted in Fig. 3. In this case the effect of the experimental condition is represented by both the fixed effect C and the standard deviation of the random condition-by-item interaction σ_aa. We clarify that in this figure the power curves vary on the x-axis by σ_aa, whereas they vary on the x-axis by C in Fig. 2. As we are testing for an overall effect of the experimental conditions, a type I error can occur only when C = 0 and σ_aa = 0, that is, when both the fixed and random components of the experimental effect are absent, which is possible only in the left-most data point in the top panels of Fig. 3, where both C = 0 and σ_aa = 0. The second column of Fig. 3 compares the power curves when the decision rules were chosen so as to fix the probability of a type I error at 0.05.


[Fig. 3 appears here: six panels of power curves plotting P(Reject H0) against σ_aa, with rows C = 0, 0.25, 0.5 and columns for uncalibrated versus calibrated decision rules. Legend: Standard aggregating, BF; Non-Bayes GLMM, AIC; Bayes GLMM, BF (via BIC); Bayes GLMM, WAIC.]

Fig. 3. Results from simulation study II. The left column corresponds to the decision rules ΔBIC > 2 (for Bayes GLMM, BF via BIC), ΔAIC > 2 (for non-Bayes GLMM, AIC), ΔWAIC > 2 (for Bayes GLMM, WAIC), and BF > exp(1) (for standard aggregating BF), whereas in the right column the decision rules were chosen to ensure that all four methods had a type I error rate of 0.05. Note that a type I error can occur only when C = 0 and σ_aa = 0 (since otherwise the null is false), so the calibration of the type I error rates is based on the C = 0 and σ_aa = 0 case for all three panels in the right column.


In these cases, when C = 0 (so that the fixed effect of condition is absent but the random condition-by-item effect is not) the Bernoulli mixed model with any of the BIC approximation to the BF, AIC, or WAIC outperforms the standard aggregating approach rather significantly with respect to power. In this case the latter approach does not contain parameters in the model for a random effect, and can only detect this through the fixed effect for condition, though it is interesting to note that the model is able to do this once σ_aa is sufficiently large. This phenomenon can be thought of as a type of false alarm, because the approach detects the existence of a fixed effect when in fact the real effect that is present is a random effect. It also appears that WAIC has higher power than both the AIC and BIC approximation to the BF in this case. Thus we see here an advantage in terms of power for the fully Bayesian approach compared with the non-Bayesian approach when the same generalized linear mixed models are applied and it is only the model selection criterion that varies.

When C is increased so that C = 0.25 (so that there is now both a fixed effect of condition as well as a random condition-by-item effect) the power of all four approaches increases; however, the Bernoulli mixed model still generally outperforms the standard aggregating approach with a fairly large difference in power at most values of σ_aa. Considering the same model comparisons based on generalized linear mixed models (LMRi versus LM0) we again see that WAIC has higher power than both AIC and the BIC approximation to the BF in this case. When C = 0.5 the fixed effect is now sufficiently large that all four methods have very high power to detect an effect of experimental condition and the power curves are generally flat as σ_aa varies.

Turning to the first column of Fig. 3 where the methods are not calibrated to have the same type I error rate, we see that the Bernoulli mixed model with decision rule BF > exp(1) (ΔBIC > 2) results in an extremely conservative procedure, where the type I error is 0 but the power is generally lower (particularly with σ_aa small) than the other procedures, with the exception of AIC which is also very conservative and has a power curve matching that of the BIC approximation to the BF when C = 0.25 and C = 0.5. Interestingly, when C = 0 (so that the fixed effect of the experimental condition is absent but the random condition-by-item effect is not) the power curve for the AIC is higher than that of both the BIC approximation to the BF and the standard aggregating approach but is uniformly lower than that of the WAIC. The power of the standard aggregating approach improves as the value of C increases, but its power is uniformly dominated by the Bernoulli mixed model with decision rule ΔWAIC > 2 in all cases. We emphasize again that this is not an 'apples-to-apples' comparison as the two approaches are not calibrated to have the same type I error rate. The type I error rate of both approaches is indicated in the left-most data point in the first row and first column of Fig. 3 when σ_aa = 0, and we note that the Bernoulli mixed model with decision rule ΔWAIC > 2 has a slightly higher type I error than the other approaches. Nevertheless, its power to detect an experimental effect is much higher for most values of σ_aa, and this is particularly evident within a neighborhood of σ_aa = 0.5. Thus it is interesting to note that fully Bayesian WAIC has higher power than AIC when the same logistic mixed models are being compared.


[Fig. 4 appears here: two panels of power curves plotting P(Reject H0) against effect size C for σ_a = 5 and β_0 = 0, with uncalibrated (left) and calibrated (right) decision rules. Legend: Standard aggregating, BF; Non-Bayes GLMM, AIC; Bayes GLMM, BF (via BIC); Bayes GLMM, WAIC.]

Fig. 4. Results from simulation study I with β_0 = 0 corresponding to a baseline accuracy of 50%. The left panel corresponds to the decision rules ΔBIC > 2 (for Bayes GLMM, BF via BIC), ΔAIC > 2 (for non-Bayes GLMM, AIC), ΔWAIC > 2 (for Bayes GLMM, WAIC), and BF > exp(1) (for standard aggregating BF), whereas in the right panel the decision rules were chosen to ensure that all four methods had a type I error rate of 0.05. These settings correspond to the third row of Fig. 2 where the baseline accuracy rate is 96%.

[Fig. 5 appears here: two panels of power curves plotting P(Reject H0) against σ_aa for C = 0 and β_0 = 0, with uncalibrated (left) and calibrated (right) decision rules. Legend: Standard aggregating, BF; Non-Bayes GLMM, AIC; Bayes GLMM, BF (via BIC); Bayes GLMM, WAIC.]

Fig. 5. Results from simulation study II with β_0 = 0. The left figure corresponds to the decision rules ΔBIC > 2 (for Bayes GLMM, BF via BIC), ΔAIC > 2 (for non-Bayes GLMM, AIC), ΔWAIC > 2 (for Bayes GLMM, WAIC), and BF > exp(1) (for standard aggregating BF), whereas in the right figure the decision rules were chosen to ensure that all four methods had a type I error rate of 0.05. These settings correspond to the first row of Fig. 3 where the baseline accuracy rate is 96%.


In both simulation studies I and II we have assumed that the true value of the intercept in the logistic model is β_0 = 3.22. This is a fairly large value corresponding to a baseline accuracy rate of approximately 96%. We have also conducted additional simulation studies where the baseline accuracy rate was not so extreme, based on setting the true value to β_0 = 0, corresponding to a baseline accuracy rate of approximately 50%. The results for this more moderate accuracy rate are depicted for study I in Fig. 4, which corresponds to the third row of Fig. 2, and for study II in Fig. 5, which corresponds to the first row of Fig. 3.

Figs. 4 and 5 indicate that the comparison of the power curves at a baseline accuracy rate of 50% yields results that are quite similar to those already presented where the baseline accuracy rate was 96%. The primary difference is that the relative performance of the standard aggregating approach appears to drop in the case where the baseline accuracy rate is 50%. A baseline rate of 50% was also the value considered in Dixon (2008) where the intercept was taken to be zero in the simulations that evaluated generalized linear mixed models.

Example application: single-factor design

We now present an example application illustrating the use of Bernoulli mixed models for the analysis of response accuracy for a single-factor design. The analysis presented here can be reproduced using the software, data, and examples provided at: https://v2south.github.io/BinBayes/. The data considered here were taken from a study that investigated the influence of a semantic context on the identification of printed words shown either under clear (high contrast) or degraded (low contrast) conditions. The semantic context consisted of a prime word presented in advance of the target item. On critical trials, the target item was a word and on other trials the target was a nonword. The task was a lexical-decision procedure where the subject was instructed to classify the target on each trial as a word or a nonword, and the response was either accurate or inaccurate. Our interest was confined to trials with word targets. The prime word was either semantically related or unrelated to the target word (e.g., granite-STONE vs. attack-FLOWER), and the target word was presented either in clear (high contrast) or degraded (low contrast) form. Combining these two factors produced four conditions: related-clear (RC), unrelated-clear (UC), related-degraded (RD), unrelated-degraded (UD). The two factors are treated as a single factor with four levels for this example. For the current analysis, the accuracy of the response was the dependent measure.

The study comprised K = 72 subjects, I = 4 conditions, and J = 120 items, and the total number of binary observations was KJ = 8640. The overall rate of response accuracy was 95.4%.


Table 3
The BIC and WAIC values for each of the ten binomial mixed models presented in Table 2 after application to the study data. Note: the lowest (i.e., best) scores are in bold.

Model    Link    Condition effect                                   WAIC    BIC
LM0      Logit   Null                                               2827    2928
LMF      Logit   Condition: Fixed                                   2803    2925
LMRs     Logit   Condition × Subject: Random                        2804    3010
LMRi     Logit   Condition × Item: Random                           2800    2997
LMRs,i   Logit   Condition × Subject + Condition × Item: Random     2802    3078
PM0      Probit  Null                                               2827    2935
PMF      Probit  Condition: Fixed                                   2803    2934
PMRs     Probit  Condition × Subject: Random                        2806    3015
PMRi     Probit  Condition × Item: Random                           2803    3007
PMRs,i   Probit  Condition × Subject + Condition × Item: Random     2806    3088


We took the UD (unrelated-degraded) level of the experimental condition as the baseline condition and fit each of the ten Bernoulli mixed models listed in Table 2, and the resulting WAIC and BIC scores were obtained for each model. The model comparisons are presented in Table 3.
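The WAIC scores reported in Table 3 are computed from the pointwise log-likelihood evaluated over the MCMC draws. As a hedged illustration of that calculation (following Gelman, Hwang, & Vehtari, 2014), the sketch below uses a small simulated log-likelihood matrix in place of the actual posterior output:

# Sketch of the WAIC computation from a matrix of pointwise log-likelihoods
# (rows = posterior draws, columns = observations); the matrix here is a
# stand-in so that the code runs on its own.
set.seed(1)
log_lik <- matrix(dbinom(1, 1, plogis(rnorm(200 * 50, 2, 0.2)), log = TRUE),
                  nrow = 200, ncol = 50)

lppd   <- sum(log(colMeans(exp(log_lik))))   # log pointwise predictive density
p_waic <- sum(apply(log_lik, 2, var))        # effective number of parameters
waic   <- -2 * (lppd - p_waic)
waic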

According to the WAIC, the optimal model is the logistic model with random subject and item effects, and where the effect of the experimental condition depends on items. According to the BIC the optimal model is the logistic model with random subject and item effects, and where the effect of condition is fixed and does not depend on items. Taken together, both criteria point to the existence of an effect for the experimental condition; however, the fixed condition effect has the highest (approximate) posterior model probability (BIC), whereas the model with random condition effects depending on item is expected to make the best out-of-sample predictions (WAIC). Both model selection criteria taken together seem to provide evidence in support of the logistic link as opposed to the probit link. This is primarily the case with BIC, as the WAIC scores are more neutral towards the link function but do show some support in favor of the logistic link.

[Fig. 6 appears here: boxplots of the posterior distributions of the UC condition effect (logit scale) for each of the 120 items, ordered by estimated effect size; x-axis: item order, y-axis: UC condition effect.]

Fig. 6. The posterior distributions for the effects of the unrelated-clear (UC) condition on the probability of response accuracy. In this case there is a separate effect for each of the 120 items used in the experiment and the figure depicts the posterior distribution for condition UC across the items as a boxplot summarizing Markov chain Monte Carlo samples drawn from the posterior distribution. Items are ordered according to their estimated effect sizes. The effects depicted are on the logit scale. The black line represents the posterior median for each item, the green region represents the set of values between the first and third quartile of the posterior distribution, and the dotted bars extend out to the extreme values.

The BIC scores can be used to compute an approximate Bayes factor for comparing models. For example, comparing LM0 (null condition) versus LMF (fixed condition) we obtain a Bayes factor of

BF ≈ exp{(BIC(LM0) − BIC(LMF))/2} = 4.48

which indicates substantial evidence in favour of the model with a fixed condition effect when compared to the model with no condition effect.
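This value can be verified directly from the BIC entries in Table 3:

# BIC(LM0) = 2928 and BIC(LMF) = 2925 from Table 3.
exp((2928 - 2925) / 2)   # approximately 4.48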

Because it was chosen as the optimal model by the WAIC, we summarize the posterior distribution of model LMRi in more detail. In this case, the effect of experimental condition varies across items. The condition UD (unrelated-degraded) is taken as the baseline condition so that the item-dependent condition effects associated with the remaining three conditions are then interpreted relative to UD.
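The BinBayes software handles the model fitting; purely as an illustration of the structure of LMRi, a minimal JAGS specification along these lines could look as follows. The priors, variable names, and the coding of UD as condition 1 are assumptions for this sketch, not the authors' exact implementation.

library(rjags)  # interface to JAGS (Plummer, 2003)

lmri_string <- "
model {
  for (n in 1:N) {
    acc[n] ~ dbern(p[n])
    logit(p[n]) <- b0 + alpha[cond[n]] + b[subj[n]] + a[item[n]] + aa[item[n], cond[n]]
  }
  b0 ~ dnorm(0, 0.01)
  alpha[1] <- 0                                     # baseline condition (UD)
  for (i in 2:I) { alpha[i] ~ dnorm(0, 0.01) }
  for (k in 1:K) { b[k] ~ dnorm(0, tau.b) }         # subject random effects
  for (j in 1:J) {
    a[j] ~ dnorm(0, tau.a)                          # item random effects
    for (i in 1:I) { aa[j, i] ~ dnorm(0, tau.aa) }  # condition-by-item effects
  }
  sigma.b ~ dunif(0, 10)
  tau.b <- pow(sigma.b, -2)
  sigma.a ~ dunif(0, 10)
  tau.a <- pow(sigma.a, -2)
  sigma.aa ~ dunif(0, 10)
  tau.aa <- pow(sigma.aa, -2)
}"

# With 'dat' holding acc, cond, subj, and item as integer vectors plus N, I, J, K:
# m   <- jags.model(textConnection(lmri_string), data = dat, n.chains = 3)
# fit <- coda.samples(m, c("alpha", "aa", "sigma.a", "sigma.b", "sigma.aa"),
#                     n.iter = 10000)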

The posterior distributions for the item-dependent condition effects represent the information obtained about these effects from the data, the model, and Bayes rule. These are depicted as boxplots in Figs. 6–8, for the conditions UC, RD, and RC, respectively.

[Fig. 7 appears here: boxplots of the posterior distributions of the RD condition effect (logit scale) for each of the 120 items, ordered by estimated effect size; x-axis: item order, y-axis: RD condition effect.]

Fig. 7. The posterior distributions for the effects of the related-degraded (RD) condition on the probability of response accuracy. In this case there is a separate effect for each of the 120 items used in the experiment and the figure depicts the posterior distribution for condition RD across the items as a boxplot summarizing Markov chain Monte Carlo samples drawn from the posterior distribution. Items are ordered according to their estimated effect sizes. The effects are depicted on the logit scale. The black line represents the posterior median for each item, the green region represents the set of values between the first and third quartile of the posterior distribution, and the dotted bars extend out to the extreme values.


[Fig. 8 appears here: boxplots of the posterior distributions of the RC condition effect (logit scale) for each of the 120 items, ordered by estimated effect size; x-axis: item order, y-axis: RC condition effect.]

Fig. 8. The posterior distributions for the effects of the related-clear (RC) condition on the probability of response accuracy. In this case there is a separate effect for each of the 120 items used in the experiment and the figure depicts the posterior distribution for condition RC across the items as a boxplot summarizing Markov chain Monte Carlo samples drawn from the posterior distribution. Items are ordered according to their estimated effect sizes. The effects are depicted on the logit scale. The black line represents the posterior median for each item, the green region represents the set of values between the first and third quartile of the posterior distribution, and the dotted bars extend out to the extreme values.


These plots relate to the best linear unbiased predictors (BLUPs) plots for standard generalized linear mixed models. Overall, we see that all three conditions increase the probability of response accuracy relative to the baseline UD. This increase is roughly the same for RD and UC; however, it appears that RC leads to higher accuracy rates compared to all other conditions. Examining all three conditions, we see that the variability in the condition effects across the items is not substantial; however, it does appear that some items have relatively lower or higher effects, particularly in the RD condition. Examining Fig. 8, there is also one item in the RC condition that appears to have a substantially lower effect relative to all other items. Thus, it seems that the variability in the effect size across items picked up by the WAIC is driven primarily by just a few items. Our analysis enables the user to identify such items through the posterior distribution.

Bayesian interval estimates can be obtained by selecting those values of highest posterior probability density including the mode. This is sometimes called the 95% highest posterior density interval when the posterior probability associated with the interval is 0.95. With respect to the variance components for the random effects, the between-item effect standard deviation σ_a has a 95% highest posterior density (HPD) interval of (0.863, 1.261), the between-subject effect standard deviation σ_b has a 95% HPD interval of (0.280, 0.625), and the standard deviation for the condition-by-item random effects σ_aa has a 95% HPD interval of (0.109, 0.587).
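Given MCMC draws of a variance component, such an HPD interval can be obtained with the coda package; the draws below are simulated placeholders rather than the actual posterior samples.

# Sketch: 95% highest posterior density interval from posterior draws.
library(coda)
set.seed(123)
sigma_a_draws <- rlnorm(10000, meanlog = 0, sdlog = 0.1)   # stand-in posterior draws
HPDinterval(as.mcmc(sigma_a_draws), prob = 0.95)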

Conclusions and recommendations

We have introduced a collection of Bernoulli mixed models that can be used for the analysis of accuracy studies in the Bayesian framework. The set of models represents a number of different assumptions for the effect of the experimental condition on accuracy. These assumptions range from a null model to a model where the experimental effect varies across both items and subjects. The models we consider are based on random effects that are assumed normally distributed. Although one may consider generalized linear mixed models where the random effects are not normally distributed (see e.g., Nathoo & Ghosh, 2013), it is generally the case that inference in generalized linear mixed models is fairly robust to misspecification of the random effects distribution (e.g., McCulloch & Neuhaus, 2011).

For the models we have considered here, we have generally assumed the presence of a fixed effect in any model that contains a corresponding random effect. Florian Jaeger has suggested the alternative of assessing the significance of effects by comparing to a model without the fixed effect but, critically, with the subject- or item-based random slopes for the fixed effects. This comparison makes sure that the assessment of significance (e.g., through model comparison) does not confound the effect of the fixed-effect predictor with the variance captured by the random slopes for that predictor. The extent to which such confounding can cause problems for the model comparisons considered here is unclear but presents a potentially interesting avenue for investigation.

To compare possible models, we have investigated the AIC and both the BIC as a large-sample approximation to the Bayes factor and the WAIC as a large-sample approximation to Bayesian cross-validation. These have been compared to the standard, item-aggregated approach using simulation studies. An alternative approach for model selection that we did not consider in our simulation studies is that of nested comparison of models via a chi-square test over differences in model deviances. This approach, although arguably a current standard for model comparison across generalized linear mixed models in the field, is less flexible than the approaches we considered as it requires the models being compared to be nested and is not appropriate for testing random effects because the asymptotic chi-square distribution under the null is not valid when the null hypothesis lies on the boundary of the parameter space (as it does when testing random effects, where the null corresponds to setting a variance component to zero, see e.g., Lin, 1997).

Overall, we recommend the use of the WAIC as it appears to have power that is higher than or at least as high as that of the BIC approximation to the BF, the AIC, and the standard aggregating approach with BF when applied with industry-standard decision rules. The BIC approximation to the BF can be used alongside WAIC and the results treated as complementary when Bayes factors are of interest. While not typical, there could be specific cases where the choice of random effects structure and link function could be driven by theoretical considerations. In these cases, these considerations could be used to narrow the class of possible models and then combined with Bayesian model selection.

For the simulation settings considered here, we found that the performance of the Bernoulli mixed model approach improved relative to the standard repeated-measures approach as both the between-item variability σ_a (study I) and the item-condition variability σ_aa (study II) increased. In the latter case, the observed difference in performance when comparing our approach to the standard item-aggregated approach was rather substantial for a fairly large range of values for σ_aa. In general, we see that as the variability across items increases, the application of the proposed approach becomes increasingly more valuable and likely to detect effects, relative to the standard aggregating approach to evaluating the effects of independent variables on accuracy.

With respect to comparisons between AIC and the fully Bayesian WAIC, when the same logistic mixed models were applied the two approaches had identical power curves when testing for fixed effects of the experimental condition in study I.


Interestingly, the WAIC had higher power than the AIC when testing for random effects of the experimental conditions in study II. As with any simulation study, the conclusions drawn are specific to the particular simulation settings adopted. The conclusions we have drawn regarding the power of the Bayesian approach relative to its competitors can be guaranteed to hold only under the conditions assumed for the simulation studies. Nevertheless, these initial findings are very instructive.

We note that the WAIC is based on the priors included in our model specifications, whereas the BIC approximation to the BF is based on the unit information prior. The differences in performance seen when comparing these approaches are due in part to the differences in priors. It was clear that the BIC, when used with the decision rule based on ΔBIC > 2, results in a conservative approach. Aside from comparisons based on standard decision rules, we also made comparisons after controlling for type I error, in which case the performance of the BIC approximation to the BF improved with respect to power. Practically speaking, outside of a simulation study, it is currently not possible to control the type I error of a decision rule when using information criteria for model comparison; however, we are currently investigating an approach for doing so based on the parametric bootstrap as an avenue for future work.

We again note that there is a large body of work that debates the pros and cons of a Bayesian analysis. The use of Bayesian approaches requires researchers to digest new ideas that can be conceptually difficult. Users must take the time to acquire the necessary background in order to use Bayesian methods appropriately, and we refer the interested reader to the introductory textbook of Gelman, Carlin et al. (2014). We also note that our approach, which requires MCMC sampling, is more computationally intensive than either the standard aggregating approach or the generalized linear mixed model fit by maximum likelihood with AIC used for model comparisons. We have demonstrated that WAIC has power that is higher than or at least as high as AIC under certain settings, and we believe that this justifies the additional computation required. In addition, the advantages of a Bayesian analysis in the ability to construct posterior distributions for parameters of interest, and the availability of a user-friendly software implementation for both single-factor and two-factor designs, make our approach an exciting alternative to standard item aggregation approaches and contemporary mixed logit/probit procedures for the analysis of repeated-measures accuracy studies.

Software implementation

The software implementing the methodology presented in this paper, along with sample datasets and two examples illustrating its use, is available at: https://v2south.github.io/BinBayes/.

Acknowledgements

This work was supported by discovery grants to Farouk Nathoo and Michael Masson from the Natural Sciences and Engineering Research Council of Canada. Farouk Nathoo holds a Tier II Canada Research Chair in Biostatistics. Research was enabled in part by support provided by WestGrid (www.westgrid.ca) and Compute Canada (www.computecanada.ca). The authors thank Dr. Belaid Moa for assistance with the implementation of the simulation studies on the WestGrid high-performance computing clusters. We are also grateful to Peter Dixon, Florian Jaeger, Adam Krawitz, and Anthony Marley for very helpful comments on earlier versions of this article.

References

Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected papers of Hirotugu Akaike (pp. 199–213). Springer.

Arndt, J., & Reder, L. M. (2003). The effect of distinctive visual information on false recognition. Journal of Memory and Language, 48(1), 1–15.

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2014). Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823.

Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. arXiv preprint arXiv:1506.04967.

Berger, J., Bayarri, M., & Pericchi, L. (2014). The effective sample size. Econometric Reviews, 33(1–4), 197–217.

Berger, J. O., & Berry, D. A. (1988). The relevance of stopping rules in statistical inference. Statistical Decision Theory and Related Topics IV, 1, 29–47.

Burnham, K., & Anderson, D. (1998). Model selection and inference: A practical information-theoretic approach (pp. 60–64).

Carlin, B. P., & Louis, T. A. (2008). Bayesian methods for data analysis. CRC Press.

Chateau, D., & Jared, D. (2003). Spelling–sound consistency effects in disyllabic word naming. Journal of Memory and Language, 48(2), 255–280.

Chen, M.-H. (2005). Computing marginal likelihoods from a single MCMC output. Statistica Neerlandica, 59(1), 16–29.

Chib, S., & Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96(453), 270–281.

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335–359.

Cochran, W. G. (1940). The analysis of variance when experimental errors follow the Poisson or binomial laws. The Annals of Mathematical Statistics, 11(3), 335–347.

Dixon, P. (2008). Models of accuracy in repeated-measures designs. Journal of Memory and Language, 59(4), 447–456.

Efron, B. (1986). Why isn't everyone a Bayesian? The American Statistician, 40(1), 1–5.

Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., & Rubin, D. (2013). Bayesian data analysis. London: Chapman Hall.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2014). Bayesian data analysis (Vol. 2). Boca Raton, FL: Chapman & Hall/CRC.

Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6), 997–1016.

Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 1360–1383.

Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research, 7(1), 1–20.

Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3), 299–314.

Jacoby, L. L., Wahlheim, C. N., & Kelley, C. M. (2015). Memory consequences of looking back to notice change: Retroactive and proactive facilitation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 41(5), 1282.

Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59(4), 434–446.

Jones, R. H. (2011). Bayesian information criterion for longitudinal and clustered data. Statistics in Medicine, 30(25), 3050–3056.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573.

Lin, X. (1997). Variance component testing in generalised linear models with random effects. Biometrika, 309–326.

Masson, M. E. (2011). A tutorial on a practical Bayesian alternative to null-hypothesis significance testing. Behavior Research Methods, 43(3), 679–690.

McCulloch, C. E., & Neuhaus, J. M. (2011). Misspecifying the shape of a random effects distribution: Why getting it wrong may not matter. Statistical Science, 388–402.

Meng, X.-L., & Schilling, S. (2002). Warp bridge sampling. Journal of Computational and Graphical Statistics, 11(3), 552–586.

Meng, X.-L., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 831–860.

Nathoo, F. S., & Ghosh, P. (2013). Skew-elliptical spatial random effect modeling for areal data with application to mapping health utilization rates. Statistics in Medicine, 32(2), 290–306.

Nathoo, F. S., & Masson, M. E. (2016). Bayesian alternatives to null-hypothesis significance testing for repeated-measures designs. Journal of Mathematical Psychology, 72, 144–157.

Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Proceedings of the 3rd international workshop on distributed statistical computing (Vol. 124, p. 125). Vienna, Austria: Technische Universität Wien.

Plummer, M. (2013). rjags: Bayesian graphical models using MCMC. R package version 3.


Quené, H., & Van den Bergh, H. (2008). Examples of mixed-effects modeling with crossed random effects and with binomial data. Journal of Memory and Language, 59(4), 413–425.

Raftery, A. E., Newton, M. A., Satagopan, J. M., & Krivitsky, P. N. (2007). Estimating the integrated likelihood via posterior simulation using the harmonic mean identity. Bayesian Statistics, 8, 1–45.

Rouder, J. N., Morey, R. D., Speckman, P. L., & Province, J. M. (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56(5), 356–374.

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4), 583–639.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.

Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. The Journal of Machine Learning Research, 11, 3571–3594.

Yap, M. J., Balota, D. A., Tse, C.-S., & Besner, D. (2008). On the additive effects of stimulus quality and word frequency in lexical decision: Evidence for opposing interactive influences revealed by RT distributional analyses. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34(3), 495.

