
GENOMIC SELECTION

Priors in Whole-Genome Regression: The Bayesian Alphabet Returns

Daniel Gianola1

Department of Animal Sciences, Department of Biostatistics and Medical Informatics, and Department of Dairy Science, University of Wisconsin, Madison, Wisconsin 53706

ABSTRACT Whole-genome enabled prediction of complex traits has received enormous attention in animal and plant breeding and is making inroads into human and even Drosophila genetics. The term "Bayesian alphabet" denotes a growing number of letters of the alphabet used to denote various Bayesian linear regressions that differ in the priors adopted, while sharing the same sampling model. We explore the role of the prior distribution in whole-genome regression models for dissecting complex traits in what is now a standard situation with genomic data, where the number of unknown parameters (p) typically exceeds sample size (n). Members of the alphabet aim to confront this overparameterization in various manners, but it is shown here that the prior is always influential, unless n ≫ p. This happens because parameters are not likelihood-identified, so Bayesian learning is imperfect. Since inferences are not devoid of the influence of the prior, claims about genetic architecture from these methods should be taken with caution. However, all such procedures may deliver reasonable predictions of complex traits, provided that some parameters ("tuning knobs") are assessed via a properly conducted cross-validation. It is concluded that members of the alphabet have a place in whole-genome prediction of phenotypes, but have somewhat doubtful inferential value, at least when sample size is such that n ≪ p.

WHOLE-genome enabled prediction of complex traits has received much attention in animal and plant breeding (e.g., Meuwissen et al. 2001; Heffner et al. 2009; Lorenz et al. 2011; de los Campos et al. 2012a; Heslot et al. 2012) and is making inroads into human and even Drosophila genetics (e.g., de los Campos et al. 2010, 2012b; Makowsky et al. 2011; Ober et al. 2012; Vázquez et al. 2012). This approach is known as "genomic selection" in breeding of agricultural species. The term "Bayesian alphabet" was coined by Gianola et al. (2009) to refer to a (growing) number of letters of the alphabet used to denote various Bayesian linear regressions used in genomic selection that differ in the priors adopted while sharing the same sampling model: a Gaussian distribution with mean vector represented by a regression on p markers, typically SNPs, and a residual variance, $\sigma^2_e$. A recent review of some of these methods is in de los Campos et al. (2012a). In addition to prediction, this whole-genome approach lends itself to investigation of "genetic architecture," often defined as the number of genes affecting a quantitative trait, the allelic effects on phenotypes, and the frequency distribution spectrum of alleles at these genes (e.g., Hill 2012). If epistasis and pleiotropy are brought into the picture, this definition of genetic architecture needs to be expanded significantly.

Most researchers in genomic selection are familiar with most letters of the alphabet, but we provide a brief review of its ontogeny. The alphabet started with Bayes A and B (Meuwissen et al. 2001), but there has been rapid expansion since, as illustrated by the Bayes Cπ and Dπ methods (Habier et al. 2011). Apart from between-letter variation, there is also variation within letters, such as fast EM-Bayes A (Sun et al. 2012), fast Bayes B (Meuwissen et al. 2009), and BRR (Bayesian ridge regression on markers), which is equivalent to G-BLUP (VanRaden 2008) but with variance parameters estimated Bayesianly; the equivalence between G-BLUP and ridge regression is given, for example, in de los Campos et al. (2009a,b). The letter D has several variants: Bayes D0, D1, D2, and D3 (Wellmann and Bennewitz 2012).

Here, L is used to denote the Bayesian Lasso (Park and Casella 2008; de los Campos et al. 2009a,b), while L1 and L2 can be used to refer to variants due to Legarra et al. (2011).

Copyright © 2013 by the Genetics Society of America
doi: 10.1534/genetics.113.151753
Manuscript received February 16, 2013; accepted for publication April 13, 2013.
This article is dedicated to the late Professor George Casella (1951–2012), a great (and friendly) Bayesian statistician who made incursions into quantitative genetics at several points of his career.
1Address for correspondence: 1675 Observatory Dr., University of Wisconsin, Madison, WI 53706. E-mail: [email protected]

Genetics, Vol. 194, 573–596, July 2013


There is also the EL Bayesian Lasso of Mutshinda and Sillanpää (2010), with EL standing for "extended Lasso." An almost empty hiatus spans from letters D to R (Erbe et al. 2012), with Bayes RS emerging even more recently (Brondum et al. 2012). Wang et al. (2013) presented Bayes TA, TB, and TCπ, extensions of the corresponding letters to threshold models. The upper bound of the alphabet seems to have been defined by Larry Schaeffer (personal communication, Interbull Meeting, Guelph, 2011) when he threatened attendants of this conference with Bayes Z-D, although full details have not been published yet. The preceding review may not be comprehensive, as there may be other members of the alphabet that are unknown to the author. It is tempting to conjecture that there may be issues with individual members of the alphabet, as this continued growth is suggestive of a state of lack of satisfaction with any given letter.

This article explores the role of the prior distribution in whole-genome regression models for predicting or dissecting complex traits. In particular, we address a standard situation encountered in genomic selection: with genomic data, the number of unknown parameters exceeds sample size. Section General Setting presents the regression model and reminds readers that, for the preceding situation, regression coefficients on marker genotypes are not identified in the likelihood function, so that the data do not contain information for inference that is uncontaminated from the effects of the prior, except in a subspace. Bayesian methods for confronting the blatant overparameterization of genomic selection models are reviewed in this section, where it is shown that the prior is always influential in this setting. The section Bayesian Shrinkage discusses how ridge regression produces frequency-dependent shrinkage, while Bayes A, Bayes B, Bayes L, and Bayes R effect a type of shrinkage that is both allelic frequency and effect-size dependent. After establishing in the preceding sections, hopefully in a firm manner, that all members of the alphabet do not lead to inferences that are devoid of the influence of the prior, it is argued in the Discussion that all such methods may deliver reasonable predictions of complex traits, provided that some parameters ("tuning knobs") are assessed properly. It is concluded that, while members of the alphabet cannot be construed as providing solid inferences about "genetic architecture," they do have a place in whole-genome prediction of phenotypes.

General Setting

Let y be an n × 1 vector of target responses (e.g., phenotypes, preprocessed data). Using molecular markers, all members of the alphabet pose the same linear regression of phenotypes on marker codes, that is,

$$y = X\beta + e, \qquad (1)$$

where X is an n × p matrix of marker codes (e.g., −1, 0, 1 for aa, Aa, and AA genotypes, respectively); when additive action is assumed, $\beta = \{\beta_j\}$ is a vector of allelic substitution effects for each of the p markers, and e is a vector of residuals, typically assigned the normal distribution $e\,|\,\sigma^2_e \sim N(e\,|\,0, I\sigma^2_e)$, where $\sigma^2_e$ is the residual variance, defined earlier.

In the standard additive model of quantitative genetics (e.g., Falconer and Mackay 1996), the $\beta_j$ are fixed parameters, while the elements $x_{ij}$ of X are random variables; e.g., members of the jth column of X may be realizations from a Hardy–Weinberg distribution, with corealizations in columns j and j′ reflecting some linkage disequilibrium distribution. The maximum-likelihood estimator of β treats X as a fixed matrix and satisfies the system of equations

$$X'X\beta^{(0)} = X'y,$$

where $\beta^{(0)}$ may not be a unique solution (Searle 1971). If n < p, $X'X$ is singular, so the maximum-likelihood estimator is not unique, as there is an infinite number of solutions to the equations above. Letting $(X'X)^{-}$ be a generalized inverse of $X'X$, one solution is $\beta^{(0)} = (X'X)^{-}X'y$, with expectation $E(\beta^{(0)}\,|\,\beta) = (X'X)^{-}X'X\beta$, producing a biased estimator of β, with at least p − n of its elements being equal to 0. On the other hand, $E(y\,|\,\beta) = X\beta = g$ (the genetic signal captured by markers) is estimated uniquely because its estimator, $X\beta^{(0)}$, is unique, although this reproduces y exactly in the n < p situation. Fisher's information content about β in the sample is $X'X\sigma_e^{-2}$ and, because this matrix is singular, one cannot speak about information pertaining to individual marker effects in a strict sense. However, the information content about $g = \{g_i\}$ is $I\sigma_e^{-2}$, meaning that information about each genotypic value $g_i$ is proportional to that conveyed by a sample of size 1. Hence, in an n < p model, maximum likelihood cannot be used either as an inferential or as a predictive machine. In the latter case, it does not generalize to new samples, because it copies both noise and signal contained in model training data.
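As a purely numerical illustration of this point (a minimal sketch; the simulated dimensions and marker coding are arbitrary choices, not data from the article), one can verify that with n < p a generalized-inverse least-squares fit reproduces y exactly, while the solution for β is only one of infinitely many:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                                  # n < p, as in the text
X = rng.choice([-1, 0, 1], size=(n, p))         # marker codes -1, 0, 1
beta_true = rng.normal(0, 0.1, size=p)
y = X @ beta_true + rng.normal(0, 1, size=n)

# One solution based on the Moore-Penrose generalized inverse of X'X
b0 = np.linalg.pinv(X.T @ X) @ X.T @ y

# The fitted signal X b0 copies y exactly (no residual is left over),
# so the fit cannot generalize, even though b0 itself is not unique.
print(np.allclose(X @ b0, y))                   # True

# Any vector in the null space of X can be added without changing the fit
z = rng.normal(size=p)
b_alt = b0 + (np.eye(p) - np.linalg.pinv(X) @ X) @ z
print(np.allclose(X @ b_alt, y))                # True: same fit
print(np.allclose(b_alt, b0))                   # False: a different "solution"
```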

In their proposals for employing whole-genome markers in a linear regression model, Meuwissen et al. (2001) were inspired by the fact that animal breeders had dealt with the n ≪ p problem successfully in the context of predicting random effects via best linear unbiased prediction (BLUP); see Henderson (1984) for a review, with a gentler treatment in Mrode (2005). BLUP assumes that marker effects are drawn from some distribution with known variance components; only knowledge of the covariance structure is needed and the form of the distribution is immaterial, although a linear model must hold. An alternative is provided by the Bayesian treatment but, here, the meaning of probability and the manner in which unknowns are inferred are different from their frequentist counterparts (Gianola and Fernando 1986; Robinson 1991). The distinctions between these two views are emphasized next, but the two approaches confront n ≪ p by bringing external information to the problem, as noted early in the game by Robertson (1955).

The BLUP approach to whole-genome prediction assumes that β has a null mean vector and some known covariance matrix $V_\beta$; then E(y) = 0 and the best linear unbiased predictor of β is

$$\text{BLUP}(\beta) = V_\beta X'\left(XV_\beta X' + I\sigma^2_e\right)^{-1} y.$$

Here, β is regarded as a random draw from the distribution indicated above so, on average, $E_y[\text{BLUP}(\beta)] = 0$, meaning that BLUP(β) is unbiased with respect to the mean of the random effects distribution, $E(\beta\,|\,V_\beta)$. BLUP envisages a sampling scheme where one draws a different realization of marker effects in every repetition of the sampling, such that, over all repetitions, 0 is obtained on average. BLUP estimates zero without bias! However, when one is interested in individual marker effects (or in the genetic value of a given individual), the inference to be made pertains to the specific item of interest, and not to the average of their distribution. If so, BLUP is biased with respect to specific marker effects (the classical fixed model of quantitative genetics) because

$$E[\text{BLUP}(\beta)\,|\,\beta] = V_\beta X'\left(XV_\beta X' + I\sigma^2_e\right)^{-1} X\beta,$$

so that the bias is $\left[I_p - V_\beta X'\left(XV_\beta X' + I\sigma^2_e\right)^{-1} X\right]\beta$, where $I_p$ is an identity matrix of order p. The random effects treatment results in BLUP(β) being unique whether n ≪ p or not, but it produces a biased estimator of marker effects; this bias never disappears when n ≪ p. On the other hand, if n → ∞ and p stays fixed, the bias goes away, given that the model is true. A toy example of the bias of BLUP with respect to the true, fixed, substitution effects is shown in the Appendix (Bias of BLUP with respect to marker effects).
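The shrinkage and bias properties of BLUP just described can be illustrated numerically (a sketch with simulated data; the dimensions, the choice $V_\beta = I\sigma^2_\beta$, and the variance values are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 200
sigma2_e, sigma2_b = 1.0, 0.01
X = rng.choice([-1, 0, 1], size=(n, p)).astype(float)
Vb = sigma2_b * np.eye(p)                       # assumed prior covariance of beta

def blup(y):
    V = X @ Vb @ X.T + sigma2_e * np.eye(n)
    return Vb @ X.T @ np.linalg.solve(V, y)

reps = 500

# Fixed ("true") substitution effects: averaged BLUP is shrunk toward the prior mean 0
beta_fixed = rng.normal(0, np.sqrt(sigma2_b), size=p)
avg = np.zeros(p)
for _ in range(reps):
    y = X @ beta_fixed + rng.normal(0, np.sqrt(sigma2_e), size=n)
    avg += blup(y) / reps
print(np.abs(avg).mean() < np.abs(beta_fixed).mean())   # True: biased toward 0

# Random-effects view: drawing a new beta each repetition, the average BLUP is near 0
avg_rand = np.zeros(p)
for _ in range(reps):
    b = rng.normal(0, np.sqrt(sigma2_b), size=p)
    y = X @ b + rng.normal(0, np.sqrt(sigma2_e), size=n)
    avg_rand += blup(y) / reps
print(np.abs(avg_rand).mean())                  # small: unbiased for the prior mean 0
```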

For the n ≪ p situation, Fan and Li (2001) discuss estimators that induce sparsity. However, to meet their so-called "oracle properties" (e.g., asymptotic unbiasedness), Fisher's information matrix must be nonsingular for the $p_0$ nonzero parameters ($p_0 < n$), these being the "true" effects of some markers on quantitative traits. BLUP and most members of the Bayesian alphabet do not produce a sparse model automatically; rather, they produce shrinkage of regression coefficients. Consider a sequence of P models of increasing dimensionality fitted to the same data, with $p_0 < p_1 < p_2 < \ldots < p_P$. The size of the "true" signal is dictated by the $p_0$ "true" effects, and the sizes of the models could be viewed as corresponding to the numbers of SNPs in platforms of increasing density applied to the same data set. As marker density increases while n and $p_0$ remain fixed, estimates of marker effects must necessarily become smaller. How can "true" effects be learned properly if the model forces estimates to become smaller as p grows? Given the rhythm of technology, it is unlikely that we will reach the situation where n ≫ $p_P$. At this point there is not much hope for learning marker effects in a manner that is free from making additional untestable assumptions. A minor complication: for the oracle properties to hold, the true model must be "hit." This is probably an unrealistic proposal when dealing with complex traits, where many difficulties arise; for example, linkage disequilibrium creates ambiguity because many markers can act as proxies for others, and complex forms of epistasis are bound to produce havoc in a naive linear model on additive effects.

One way of tackling the n ≪ p problem is by introducing restrictions on the size of the regression coefficients, i.e., shrinkage or "regularization." In the machine-learning literature this is attained via ad hoc penalty functions that produce regularization (e.g., Bishop 2006; Hastie et al. 2009). Bayesian methods with proper priors produce regularization automatically, to an extent that depends on the prior adopted. The various members of the Bayesian alphabet effect shrinkage in different manners, an issue explored subsequently in this article. Let all unknown parameters of a model be represented by $\theta = (\theta_1, \theta_2)$, where $\theta_1$ and $\theta_2$ denote distinct parameters, e.g., marker effects and their apparent variances (the reason for these terms is made clear later), respectively, in Bayes A. The posterior distribution of θ (assume, for simplicity, that the residual variance is known) is

$$p(\theta_1, \theta_2\,|\,y, \sigma^2_e, H) \propto p(y\,|\,\theta_1, \theta_2, \sigma^2_e, H)\,p(\theta_1, \theta_2\,|\,H) \qquad (2)$$
$$\propto p(y\,|\,\theta_1, \sigma^2_e)\,p(\theta_1, \theta_2\,|\,H). \qquad (3)$$

Above, H is a set of more or less arbitrarily specified hyperparameters. Expression (3) results from the assumption that, given location effects $\theta_1$ (e.g., allelic substitution effects), the data are conditionally independent of $\theta_2$. Further,

$$p(\theta_1, \theta_2\,|\,H) = p(\theta_1\,|\,\theta_2, H)\,p(\theta_2\,|\,H) = p(\theta_1\,|\,\theta_2)\,p(\theta_2\,|\,H),$$

with the expression on the right side resulting because, given $\theta_2$, location effects $\theta_1$ do not depend on H (e.g., Sorensen and Gianola 2002). Note that $p(\theta_1\,|\,\theta_2)$ is a conditional prior distribution, while the marginal prior distribution actually assigned to $\theta_1$ is

$$p(\theta_1\,|\,H) = \int p(\theta_1\,|\,\theta_2, H)\,p(\theta_2\,|\,H)\,d\theta_2. \qquad (4)$$

Likewise, $p(\theta_1\,|\,y, H)$ and $p(g(\theta_1)\,|\,y, H)$ denote the marginal posterior distributions of $\theta_1$ and $g(\theta_1)$, the latter being any function of $\theta_1$. For example, if $\theta_1$ is the vector β, one may be interested in the posterior distribution of Xβ, the marked signal. The results of a Bayesian analysis should not be interpreted from a frequentist perspective, as the meaning of probability is different in the two camps (Bernardo and Smith 1994; O'Hagan 1994; Sorensen and Gianola 2002). For example, BLUP is an unbiased predictor in conceptual repeated sampling, but corresponds to the posterior mean of marker effects in a Bayesian Gaussian model with known covariance structure. In the latter, the data are fixed; in the BLUP setting, the data vary at random.

An important issue is the influence of priors on inference. Theory on Bayesian asymptotics dictates that, as sample size grows, the influence of the prior vanishes gradually. In the limit, the posterior distribution becomes normal, centered at the maximum-likelihood estimator and with covariance matrix given by the inverse of Fisher's information measure, so the prior matters little in large samples (Bernardo and Smith 1994). However, this result holds for parameters that are likelihood identifiable, i.e., when their maximum-likelihood estimator exists, but it must be kept in mind that markers are not QTL, so the marker-based model is arguably wrong. In an n ≪ p setting, true Bayesian learning can take place for at most n parameters or functions thereof, since p − n parameters are unidentified. Gelfand and Sahu (1999) show that one can learn about at most n linearly independent functions of marker effects, such as $x_i'\beta$. Carlin and Louis (1996) and Sorensen and Gianola (2002) give an example where the marginal posterior distributions of unidentified parameters exist if these are assigned proper priors; however, the priors will always matter and their influence will never vanish asymptotically. In the n ≪ p setting, inferences about marker effects (often referred to as learning genetic architecture, e.g., inferring effects of some QTL) are always influenced by the priors adopted, apart from the fact that the model is wrong, as argued above. This means that stories that can be made from posterior distributions will depend on stories that are made a priori. For example, Lehermeier et al. (2013) demonstrated the influence of priors on predictive ability from various Bayesian models (Bayes A, B, L) with simulated and empirical data. Also, Gianola et al. (2009) showed that the priors in Bayes A and B drive inferences on variances of marker-specific effects.

A formal verification that individual marker effects are not identified from a Bayesian perspective, using a definition by Dawid (1979), is presented in the Appendix (Marker effects are not identified from a Bayesian perspective in the n < p setting); this holds for any model, linear or nonlinear. A proof that is specific to the linear regression model on p markers with sample size n is given in the Appendix as well (Inferences in a linear model with unidentified parameters); there, it is shown that what is learned about β is a function of what is learned about Xβ. In other words, Bayesian learning occurs for n items, but then this knowledge is "distributed" into p pieces via the relationship between β and Xβ induced by the prior.

In summary, proper Bayesian learning from data in a linear regression model with n < p takes place only for linear combinations of marker effects that are identified in the likelihood, that is, estimable. Any other marker effects or linear combinations thereof are redundant in the sampling model, but their posterior distributions exist and the posterior mean will differ from the prior mean. It follows that mechanistic conjectures about genetic architecture in the n < p situation are, to a large extent, driven by prior assumptions and not by data. This observation has been corroborated empirically (e.g., Heslot et al. 2012; Ober et al. 2012; Lehermeier et al. 2013): Bayesian models differing in their prior produce different inferences about individual marker effects, but most often deliver similar predictive abilities if tuned properly. Not surprisingly, the posterior distributions of $x_i'\beta$ (the signal to be predicted for datum i) from varying models are more similar to each other than the corresponding priors, as this function is likelihood identifiable. In short, extant theory says that, given that a model is "true" (oracle principle 1, Fan and Li 2001), the posterior mean of an identifiable parameter or of a likelihood-identified combination of parameters will converge to its true value, including any "true zero," as sample size goes to infinity (oracle principle 2). This works for n > p or for some estimators where sparsity is built in automatically, but the model must be true; Fan and Li (2001) describe several such estimators.

A situation in which proper Bayesian learning can takeplace is presented in the Appendix (An example of properBayesian learning).

Bayesian Shrinkage

Given that learning about genetic architecture without contamination from effects of the prior does not take place whenever n ≪ p, a question is what the various members of the alphabet do. We examine ridge regression (BLUP), Bayes A, Bayes B, Bayes L, and Bayes R and also give a warning about a commonly used description of a specific prior; these procedures are prototypical, so there is no need to consider other letters of the alphabet. All these methods have been reported in the genomic selection literature. Since the marginal posterior distribution of marker effects (with the exception of that of BLUP under normality) cannot be arrived at analytically, the methods are appraised from a heuristic perspective.

BLUP (ridge regression)

The vector of marker effects β is assigned the normal prior $N(\beta\,|\,0, I\sigma^2_\beta)$. The structure of the problem is well known and the mixed model equations leading to BLUP satisfy

$$(X'X + I\lambda)\tilde{\beta} = X'y = X'X\beta^{(0)}, \qquad (5)$$

where $\tilde{\beta} = \text{BLUP}(\beta)$, $\lambda = \sigma^2_e/\sigma^2_\beta$ is the variance ratio, and $\beta^{(0)} = (X'X)^{-}X'y$ is as before. One can write

$$\tilde{\beta} = (X'X + I\lambda)^{-1}\left[X'X\beta^{(0)} + \lambda\cdot 0\right], \qquad (6)$$

so $\tilde{\beta}$ (unique) is a matrix-weighted average of solution $\beta^{(0)}$ (not unique) and of the prior mean 0, where the weights are $X'X$ and λ, respectively. For fixed p, as n increases, the rank of $X'X$ will increase, eventually attaining p and, in the limit, the posterior distribution will be centered at the unique maximum-likelihood estimator (by consistency, this will converge to the "true" value of β, given the model).
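A small numerical check of representations (5) and (6) (an illustrative sketch with arbitrary simulated data, not from the article) confirms that the ridge/BLUP solution is a matrix-weighted average of a least-squares solution $\beta^{(0)}$ and the prior mean 0:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 30, 120, 5.0                        # lam = sigma2_e / sigma2_beta (assumed value)
X = rng.choice([-1, 0, 1], size=(n, p)).astype(float)
y = rng.normal(size=n)

XtX = X.T @ X
b0 = np.linalg.pinv(XtX) @ X.T @ y              # one least-squares solution (not unique)

# Mixed-model / ridge equations (5): (X'X + I*lam) b_tilde = X'y
b_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)

# Representation (6): weighted average of b0 (weight X'X) and the prior mean 0 (weight lam)
b_avg = np.linalg.solve(XtX + lam * np.eye(p), XtX @ b0 + lam * 0.0)

print(np.allclose(b_ridge, b_avg))              # True
```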

Representation (5) suggests that the same amount of shrinkage is effected on all p markers (because the same variance ratio λ is added to every diagonal element of $X'X$), but this is not the case. This is clear from Equation 6 where, for each marker effect, the contributions from the data vary over markers; this is more transparent from inspection of the solutions in scalar form. For marker 1, as an example, the estimator of the substitution effect is

$$\tilde{\beta}_1 = \frac{\sum_{i=1}^{n} x_{i1}\left(y_i - x_{i2}\tilde{\beta}_2 - \ldots - x_{ip}\tilde{\beta}_p\right)}{\sum_{i=1}^{n} x_{i1}^2 + \lambda} = \frac{\sum_{i=1}^{n} x_{i1}^2\,\hat{\beta}_1 + \lambda\cdot 0}{\sum_{i=1}^{n} x_{i1}^2 + \lambda},$$

where

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_{i1}\left(y_i - x_{i2}\tilde{\beta}_2 - \ldots - x_{ip}\tilde{\beta}_p\right)}{\sum_{i=1}^{n} x_{i1}^2}.$$

Then, the BLUP $\tilde{\beta}_1$ of the allele substitution effect can be viewed, heuristically, as a weighted average of a "data driven" estimate $(\hat{\beta}_1)$ and of the mean of the prior distribution (0), where the respective weights are $\sum_{i=1}^{n} x_{i1}^2/(\sum_{i=1}^{n} x_{i1}^2 + \lambda)$ and $\lambda/(\sum_{i=1}^{n} x_{i1}^2 + \lambda)$. This suggests less shrinkage toward zero for markers (j, say) having larger values of $\sum_{i=1}^{n} x_{ij}^2$. Now, if for any marker genotypes are coded as −1, 0, 1 for aa, Aa, and AA, respectively, it follows (assuming Hardy–Weinberg proportions and centered marker codes) that $E(x_{ij}^2) = \mathrm{Var}(x_{ij}) = 2p_j(1 - p_j)$, so $E(\sum_{i=1}^{n} x_{ij}^2) = 2p_j(1 - p_j)n$, where $p_j$ is the frequency of the A-type allele at that locus. Hence, at fixed sample size n, BLUP effects less shrinkage toward zero of markers that have intermediate allelic frequencies, simply because $2p_j(1 - p_j)$ is maximum at $p_j = \tfrac{1}{2}$. To illustrate, we use this Hardy–Weinberg approximation and plot (Figure 1) the weight "assigned to the data" for marker j,

$$W_j = \frac{\sum_{i=1}^{n} x_{ij}^2}{\sum_{i=1}^{n} x_{ij}^2 + \lambda} \approx \frac{2p_j\left(1 - p_j\right)}{2p_j\left(1 - p_j\right) + \lambda/n},$$

against allelic frequency at λ/n = 1, 0.1, 0.01, and 0.001, respectively. As depicted in Figure 1, the extent of shrinkage is frequency and sample size dependent, with some differential shrinkage (bottom two curves) taking place at large values of λ/n, that is, at small sample sizes, but with little or no differential shrinkage otherwise, unless alleles are rare. Then, the often-made statement that BLUP or ridge regression performs a homogeneous shrinkage of marker effects is not correct. In short, shrinkage is frequency and sample size dependent but effect-size independent.
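The frequency dependence of $W_j$ is easy to reproduce directly from the Hardy–Weinberg approximation above (a minimal sketch; the allele frequencies and λ/n values used are arbitrary):

```python
import numpy as np

# Weight "assigned to the data" under the Hardy-Weinberg approximation with centered -1/0/1 codes
def W(p_allele, lam_over_n):
    h = 2.0 * p_allele * (1.0 - p_allele)       # E(x_ij^2) = 2 p_j (1 - p_j)
    return h / (h + lam_over_n)

freq = np.linspace(0.01, 0.99, 9)
for lam_over_n in (1.0, 0.1, 0.01, 0.001):
    w = W(freq, lam_over_n)
    print(f"lambda/n = {lam_over_n:>6}: W ranges from {w.min():.3f} to {w.max():.3f}")
# Large lambda/n (small n): W varies strongly with allele frequency (differential shrinkage);
# small lambda/n: W is near 1 for all but rare alleles (little differential shrinkage).
```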

Bayes A

Bayes A (Meuwissen et al. 2001) consists of a three-stage hierarchical model. The first stage is the normal regression (1); the second stage assigns a normal conditional prior to each of the marker effects, all possessing a null mean but with a variance that is specific to each marker; the third stage assigns the same scaled inverted chi-square distribution with known scale ($S^2_\beta$) and degrees of freedom (ν) parameters to each of the marker variances. The mechanistic argument for the Bayes A prior was that markers may contribute differentially to genetic variance (they do, to an extent depending on their effects, allelic frequencies, and linkage disequilibrium relationships with causal variants), so it seemed a good idea to "estimate" such variances. There are two difficulties: the first one is that the marginal prior for the marker effects, resulting from deconditioning the second stage over the third stage as done in (4), is the same for all markers. Second, there is scant Bayesian learning for marker-specific variances. This was pointed out by Gianola et al. (2009), who showed that all markers have the same prior distribution: a $t(\beta_j\,|\,0, S^2_\beta, \nu)$ process with null mean and variance $\nu S^2_\beta/(\nu - 2)$. Given that this prior is homoscedastic over markers, why is it that it behaves differently from ridge regression BLUP, where all individual marker effects are assigned the prior $N(\beta_j\,|\,0, \sigma^2_\beta)$?

In Bayes A, the marginal posterior distribution of marker effects cannot be arrived at in closed form, but insight can be obtained from inspection of the joint mode of the posterior distribution of β, assuming that the residual variance is known; recall that $S^2_\beta$ and ν are known hyperparameters in Bayes A. The hierarchical model is then

$$y_i\,|\,\beta, \sigma^2_e \sim N\!\left(y_i\,|\,x_i'\beta, \sigma^2_e\right),\ i = 1, 2, \ldots, n; \qquad \beta_j\,|\,S^2_\beta, \nu \overset{\mathrm{IID}}{\sim} t\!\left(\beta_j\,|\,0, S^2_\beta, \nu\right),\ j = 1, 2, \ldots, p,$$

where $x_i'$ is the ith row of X. Conditionally on $\sigma^2_e$, $S^2_\beta$, and ν, the joint posterior density is

$$p\!\left(\beta\,|\,S^2_\beta, \nu, \sigma^2_e, y\right) \propto \prod_{i=1}^{n}\exp\!\left[-\frac{1}{2\sigma^2_e}\left(y_i - x_i'\beta\right)^2\right]\prod_{j=1}^{p}\left[1 + \frac{\beta_j^2}{S^2_\beta\,\nu}\right]^{-(1+\nu)/2}. \qquad (7)$$

Figure 1 Approximate weight (W) assigned to the data (the weight assigned to prior information is 1 − W) as a function of allelic frequency at a marker locus. From top to bottom the lines give the trajectory for λ/n values of 0.001 (solid line), 0.01 (short dashes), 0.1 (long dashes), and 1 (dots).

Using results presented in the Appendix (Mode of the conditional posterior distribution in Bayes A), an iterative scheme for locating a mode of (7) is given by

$$\beta^{[t+1]} = \left(X'X + W_\beta^{[t]}\right)^{-1}X'y = \left(X'X + W_\beta^{[t]}\right)^{-1}X'X\beta^{(0)} \qquad (8)$$

with successive updating; here,

$$W_\beta^{[t]} = \mathrm{Diag}\left\{\frac{\sigma^2_e}{S^2_\beta}\,\frac{1 + 1/\nu}{1 + \beta_j^{2[t]}/\!\left(S^2_\beta\,\nu\right)}\right\}$$

is a diagonal matrix. If this converges, it will do so to one of perhaps many stationary points, as it is known that t-regression models may produce multimodal log-posterior surfaces, especially if ν is small (McLachlan and Krishnan 1997). Hence, iteration (8) may lead to a point receiving little posterior plausibility.

The role of $W_\beta = \{w_{jj,\beta_j}\}$ in (8) parallels that of the inverse of the genetic variance–covariance matrix (times $\sigma^2_e$) in standard BLUP (Henderson 1984), so that the larger $w_{jj,\beta_j}$ is, the stronger the shrinkage toward 0 (the mean of the prior distribution). However, while the variance ratio $\lambda = \sigma^2_e/\sigma^2_\beta$ is constant in ridge regression BLUP, here it varies over markers, as it takes the form $w_{jj,\beta_j}$. As ν → ∞ (the t-distribution approaches a normal one), $\lambda_j \to \sigma^2_e/S^2_\beta$, resembling λ of BLUP. On the other hand, if the t prior has a finite number of degrees of freedom, markers whose effects are closer to 0 are shrunk more strongly than those with larger absolute values, simply because $\lambda_j$ is larger for the former. To illustrate, let $\sigma^2_e = S^2_\beta = 1$, so that the "variance ratio" is $\lambda_j = (1 + 1/\nu)/(1 + \beta_j^2/\nu)$. Figure 2 illustrates the impact of the marker effect on the "variance ratio" for ν = 4, 6, 10, and 1000. It is seen that $\lambda_j$ becomes smaller (less shrinkage toward zero) as the absolute value of the effect of the marker increases; also, shrinkage increases as the degrees of freedom of the distribution increase, at any given marker effect. Eventually, when ν → ∞ (so that the prior is normal) the variance ratio takes the same value for all markers (thick line in Figure 2, almost horizontal, corresponding to ν = 1000). For markers with effects near zero, the t-distribution shrinks effects more strongly than the normal process, but it does not severely penalize markers having strong effects on the phenotype. Hence, in Bayes A shrinkage is marker-effect specific, with this specificity becoming milder as the value of ν increases. Note that (7) also induces frequency-specific shrinkage, due to the Bayesian compromise between the prior and $X'X$, as in the case of BLUP. Hence, apart from the effects of sample size, there are two sources of shrinkage in Bayes A, contrary to a single one in BLUP. This seems to confer Bayes A more flexibility than BLUP, but this is not necessarily good because the extra parameters ν and $S^2_\beta$ (the latter playing the role of $\sigma^2_\beta$) may affect "inference" of marker effects adversely (Lehermeier et al. 2013). Naturally, these parameters can be assigned priors and inferred from the resulting Bayesian model, but this was not suggested by Meuwissen et al. (2001).

Bayes B

A formulation of Bayes B as a mixture at the level of effects, but not of their variances as in Meuwissen et al. (2001), is in Gianola et al. (2009) and Habier et al. (2011). The hierarchical model is

$$y_i\,|\,\beta, \sigma^2_e \sim N\!\left(x_i'\beta, \sigma^2_e\right);$$
$$\beta_j\,|\,S^2_\beta, \nu, \pi \overset{\mathrm{IID}}{\sim} \begin{cases} 0 & \text{with probability } \pi \\ t\!\left(0, S^2_\beta, \nu\right) & \text{with probability } 1 - \pi \end{cases}, \qquad j = 1, 2, \ldots, p.$$

The prior is a mixture of a "0-state" (a point mass at 0) with a t-distribution, the mixing probabilities being π and 1 − π, respectively, where π is assumed known and specified arbitrarily. Recall (in informal notation) that

$$t\!\left(\beta_j\,|\,0, S^2_\beta, \nu\right) = \int N\!\left(\beta_j\,|\,0, \sigma^2_{\beta_j}\right)\chi^{-2}\!\left(\sigma^2_{\beta_j}\,|\,S^2_\beta, \nu\right)d\sigma^2_{\beta_j}, \qquad j = 1, 2, \ldots, p,$$

where $\chi^{-2}(\sigma^2_{\beta_j}\,|\,S^2_\beta, \nu)$ is a scaled inverted chi-square distribution assigned as prior to the variance of the jth marker effect, $\sigma^2_{\beta_j}$. Meuwissen et al. (2001) formulated the mixture at the level of these variances, arguing as follows: "the distribution of genetic variances across loci is that there are many loci with no genetic variance (not segregating) and a few with genetic variance." Gianola et al. (2009) were critical of this formulation, both from statistical and genetic points of view.

Figure 2 Impact of marker effect and of the degrees of freedom (d.f.) parameter on the extent of shrinkage toward the prior distribution in Bayes A: the larger the value on the y-axis, the stronger the shrinkage toward 0. d.f. = 4, solid line; d.f. = 6, long-dashed line; d.f. = 10, short-dashed line; d.f. = 1000, gray circles, almost horizontal.

The hierarchical prior is deceiving because, in fact, Bayes B ends up assigning the same marginal prior to every marker. This follows from consideration of the mean and variance of a mixture, e.g., Gianola et al. (2006). The mean of a mixture is the weighted average of the means of the components (the weights being the mixing probabilities π and 1 − π), and the variance is the weighted average of the component variances, plus a term that can be interpreted as the "variance" among component means. One has

$$E\!\left(\beta_j\,|\,\pi\right) = (1 - \pi)\,E\!\left[t\!\left(\beta_j\,|\,0, S^2_\beta, \nu\right)\right] = 0, \qquad j = 1, 2, \ldots, p,$$

where $E[t(\beta_j\,|\,0, S^2_\beta, \nu)] = 0$ is the mean of the t-distribution, and

$$\mathrm{Var}\!\left(\beta_j\,|\,\pi\right) = (1 - \pi)\,\frac{S^2_\beta\,\nu}{\nu - 2}, \qquad j = 1, 2, \ldots, p.$$

Above, $\mathrm{Var}[t(\beta_j\,|\,0, S^2_\beta, \nu)] = S^2_\beta\nu/(\nu - 2)$ is the variance of the t-distribution. It follows that Bayes B assigns, a priori, the same mean and variance to all marker effects and that it uses a prior that is even more precise than the prior in Bayes A (the prior variance is reduced by a fraction π in Bayes B, relative to that of Bayes A). This makes effective Bayesian learning even more difficult in Bayes B than in Bayes A, as it takes more information from the data to "neutralize" the prior of Bayes B than that of Bayes A. At any rate, neither of these two regression models allows for proper learning about marker effects or genetic architecture in the n ≪ p setting, as argued earlier in this article.
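The marginal prior mean and variance just derived can be checked by simulation (a minimal Monte Carlo sketch; the values of π, $S^2_\beta$, and ν used below are arbitrary choices, not taken from the article):

```python
import numpy as np

rng = np.random.default_rng(5)
pi, S2_b, nu, m = 0.9, 0.01, 6.0, 2_000_000     # assumed mixture settings and sample size

# Draw from the Bayes B prior: 0 with probability pi, scaled-t otherwise
is_zero = rng.random(m) < pi
beta = np.where(is_zero, 0.0, np.sqrt(S2_b) * rng.standard_t(nu, size=m))

print(beta.mean())                              # ~ 0, the marginal prior mean
print(beta.var())                               # Monte Carlo estimate of the marginal variance
print((1 - pi) * S2_b * nu / (nu - 2))          # analytical value (1 - pi) S2_b nu / (nu - 2)
```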

As for Bayes A, no closed forms for the marginal posterior distributions of marker effects exist for Bayes B. The posterior expectation of β is

$$E_{\text{Bayes B}}\!\left(\beta\,|\,\pi, S^2_\beta, \nu, \sigma^2_e, y\right) = (1 - \pi)\,E_{\text{Bayes A}}\!\left(\beta\,|\,S^2_\beta, \nu, \sigma^2_e, y\right). \qquad (9)$$

This indicates that shrinkage toward 0 is stronger than in Bayes A, since posterior means are smaller in Bayes B by a fraction π. Coupled with the arbitrary assignment of a value to π, the implication is that the prior is even more influential in Bayes B than in Bayes A. This could have been expected intuitively, but the point has not been made before, at least in this manner.

Wimmer et al. (2012) noted that methods such as Bayes B have yielded better predictive abilities than BLUP in many simulation studies reported in the literature, but that this has not been observed with real data (e.g., Ober et al. 2012). Wimmer et al. (2012) investigated predictive abilities of these two methods in maize and in Arabidopsis. The target populations differed in effective population size and in extent of linkage disequilibrium. Despite expected differences in genetic architecture among populations and traits, predictive abilities delivered by BLUP and Bayes B did not differ significantly for their target traits. Further, they found via simulation (personal communication) that Bayes B was effective for learning genetic architecture in the n ≪ p setting only when the number of true nonzero marker effects (s) is such that s ≪ n, given the true model. Otherwise, the error of estimation of marker effects was as poor as that of BLUP, the latter found to be more robust over a wide range of situations (this is ironic, because BLUP or G-BLUP were not tailored for learning genetic architecture). In short, they confirmed that, provided one "hits" the true model (thus fulfilling oracle property 1 of Fan and Li 2001), effective learning of "true genetic architecture" is possible only if the model is very sparse relative to sample size. The condition s ≪ n would lead to oracle property 2, as anticipated by standard asymptotic Bayesian theory under regularity conditions.

BayesSSVS was proposed by Verbyla et al. (2009), but it is not discussed here because it is similar to Bayes B. Bayes Cπ of Habier et al. (2011) provides a more sensible formulation of the mixture, but it is similar in spirit and shares the same limitations as Bayes B, since parameter identification is not attained for most of the unknowns. An interesting example of the consequences of overparameterization in Bayes Cπ is provided by Duchemin et al. (2012); these authors noted that, as sampled values of π went up in the sampling process, realizations of marker-effect variances went down. Hints about genetic architecture from Bayes B or Bayes Cπ or from other members of the alphabet should be taken very cautiously, at least when n ≪ p.

Bayes L

Lasso regression (Tibshirani 1996) inspired the Bayesian Lasso (Bayes L here) of Park and Casella (2008), a method with followers such as Vázquez et al. (2010) and Crossa et al. (2010) and with an implementation available in the software R described by Pérez et al. (2010). The linear regression model is given in (1), but the prior assigned to marker effects is a Laplace (double exponential, DE) distribution. All marker effects are assumed to be independently and identically distributed as DE, with the prior density being

$$p(\beta\,|\,\lambda) = \frac{\lambda}{2}\exp(-\lambda|\beta|). \qquad (10)$$

Here, $E(\beta\,|\,\lambda) = 0$ and $\mathrm{Var}(\beta\,|\,\lambda) = 2/\lambda^2$ for all markers; as λ increases, the variance of the DE distribution decreases and the density becomes sharper. This prior assigns the same variance or prior uncertainty to all marker effects, but it possesses thicker tails than the normal prior. A comparative discussion of the DE prior is in de los Campos et al. (2012a). Even though Bayes L bears a parallel with the Lasso, it does not "kill" or remove markers from the model, contrary to what happens in variable selection approaches. Bayes L poses a leptokurtic prior, so it is expected to shrink effects more strongly toward zero than the Gaussian prior, as opposed to inducing sparsity in the strict sense of the Lasso.

Bayes L shrinks strongly: To appraise how Bayes L shrinks marker effects, we examine the mode(s) of the joint posterior distribution of β using the DE prior (10), assuming that λ and the residual variance are known. As in Tibshirani (1996), write $|\beta_j| = \beta_j^2/|\beta_j|$; with this representation, $\sum_{j=1}^{p}|\beta_j| = \beta'W_\beta^{-1}\beta$, where $W_\beta^{-1} = \mathrm{Diag}\{1/|\beta_j|\}$. Using this, the log-posterior (apart from an additive constant) is

$$L\!\left(\beta\,|\,y, \lambda, \sigma^2_e\right) = -\frac{(y - X\beta)'(y - X\beta) + \sigma^2_e\lambda\,\beta'W_\beta^{-1}\beta}{2\sigma^2_e}. \qquad (11)$$

If, as in Tibshirani (1996), it is ignored that $W_\beta^{-1}$ is a random matrix (because it is a function of $|\beta_j|$), this takes the form of a standard BLUP representation, so the mode of the conditional posterior distribution of β satisfies

$$\tilde{\beta} = \left(X'X + \sigma^2_e\lambda\,W_\beta^{-1}\right)^{-1}X'y. \qquad (12)$$

Contrary to BLUP–ridge regression, where shrinkage factors are marker-effect independent, these factors take the form $\sigma^2_e\lambda/|\beta_j|$ in Bayes L, implying that markers with tiny effects are shrunk more strongly toward zero, as a larger number is added to the diagonal elements of the coefficient matrix leading to solution $\tilde{\beta}$. Note, however, that (12) is not an explicit system, so it would make sense to iterate; details on an iterative scheme are in the Appendix (Mode of the conditional posterior distribution in Bayes L).
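The non-explicit system (12) suggests an iterative reweighting scheme analogous to (8). The sketch below is one such illustration (the simulated data, the value of λ, the sparse "true" signal, the thresholds, and the small constant added to $|\beta_j|$ to avoid division by zero are choices made here; they are not part of the Appendix scheme referred to above):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 150
sigma2_e, lam = 1.0, 5.0                        # assumed residual variance and lambda
X = rng.choice([-1, 0, 1], size=(n, p)).astype(float)
beta_true = np.zeros(p)
beta_true[:5] = 1.0                             # a sparse "true" signal, for illustration only
y = X @ beta_true + rng.normal(0, 1.0, size=n)

XtX, Xty = X.T @ X, X.T @ y
beta = np.full(p, 0.01)                         # start away from 0 so that |beta_j| > 0
eps = 1e-8                                      # numerical guard against division by zero

for _ in range(500):
    w = sigma2_e * lam / (np.abs(beta) + eps)   # marker-specific shrinkage factors from (12)
    beta_new = np.linalg.solve(XtX + np.diag(w), Xty)
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

# Markers with tiny effects are driven essentially to zero ("effectively sparse"),
# while the few large effects are shrunk much less strongly.
print(np.sum(np.abs(beta) < 1e-3), "of", p, "effects are effectively zero")
print(np.round(beta[:5], 3))                    # the large effects are retained
```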

The preceding implies that Bayes L produces a more "effectively sparse" model. This can be seen from inspection of an "effective number of parameters" measure (e.g., Tibshirani 1996; Ruppert et al. 2003) given by

$$\mathrm{d.f.}_{\text{ridge}} = \mathrm{tr}\!\left[X\!\left(X'X + I\,\frac{\sigma^2_e}{\sigma^2_\beta}\right)^{-1}\!X'\right] = \mathrm{tr}\!\left[\left(X'X + I\,\frac{\sigma^2_e}{\sigma^2_\beta}\right)^{-1}\!X'X\right],$$

and

$$\mathrm{d.f.}_{\text{Bayes L}} = \mathrm{tr}\!\left[\left(X'X + \sigma^2_e\lambda\,W_\beta^{-1}\right)^{-1}X'X\right].$$

If X is orthonormalized, so that $X'X = I$ (with dispersion parameters scaled accordingly),

$$\mathrm{d.f.}_{\text{ridge}} = \mathrm{tr}\!\left[\left(I + I\,\frac{\sigma^2_e}{\sigma^2_\beta}\right)^{-1}\right] = p\,\frac{\sigma^2_\beta}{\sigma^2_\beta + \sigma^2_e}, \qquad (13)$$

and

$$\mathrm{d.f.}_{\text{Bayes L}} = \mathrm{tr}\!\left[\left(I + \sigma^2_e\lambda\,W_\beta^{-1}\right)^{-1}\right] = \sum_{j=1}^{p}\frac{|\beta_j|}{|\beta_j| + \sigma^2_e\lambda}. \qquad (14)$$

This enables us to see that, in ridge regression, every degree of freedom (contributor to model complexity) represented by a column of the orthonormalized marker matrix is attenuated by the same factor, $\sigma^2_\beta/(\sigma^2_\beta + \sigma^2_e)$. On the other hand, in Bayes L markers having tiny effects are effectively, but not physically, wiped out of the model. Also, markers with strong effects receive a heavier weight in this overall measure of complexity.

We simulated p = 100,000 marker effects from DE distributions with mean 0 and variances $10^{-16}$, $10^{-8}$, or $10^{-4}$; setting $\sigma^2_e = 1$, the preceding three values can be interpreted as the contribution of an individual marker to variance relative to residual variability. When a marker effect had a large variance ($10^{-4}$), the entire battery of markers, assuming a priori independence of effects, represented 10/11 of the total variance; on the other hand, when markers were assigned a variance of $10^{-16}$, markers accounted for about $10^{-11}/(10^{-11} + 1)$ of the total variability. Since the variance of the DE distribution is $2/\lambda^2$, the settings led to λ values of $\sqrt{2}\times 10^{8}$, $\sqrt{2}\times 10^{4}$, and $\sqrt{2}\times 10^{2}$, respectively; larger values of λ produce stronger shrinkage toward 0. The shrinkage factor is $\sigma^2_e\lambda/|\beta_j|$ for marker j in Bayes L vs. $\sigma^2_e/\sigma^2_\beta$ in ridge regression–BLUP. The contribution of a marker to the model was assessed as follows: from (13), in ridge regression each marker contributes the same amount, $\sigma^2_\beta/(\sigma^2_\beta + \sigma^2_e)$, to model complexity, whereas in Bayes L the corresponding metric is $|\beta_j|/(|\beta_j| + \sigma^2_e\lambda)$, as given in (14). For ridge regression, the effective number of parameters was approximately $10^{-11}$, $10^{-3}$, and 10, for $\sigma^2_\beta = 10^{-16}$, $10^{-8}$, and $10^{-4}$, respectively. For Bayes L, the corresponding effective number of parameters was $4.96\times 10^{-12}$, $4.98\times 10^{-4}$, and 4.98, respectively. Clearly, Bayes L produced a model that was more sparse than ridge regression–BLUP. Each of the markers made a tiny contribution to model complexity; for instance, when the variance of the double exponential of marker effects was $10^{-16}$, the relative contributions to the model of individual markers ranged from 0 to $10^{-16}$; when the variance was $10^{-8}$, these ranged from 0 to $10^{-8}$, while the range was 0 to $10^{-4}$ for $\sigma^2_\beta = 10^{-4}$. A plot of the density of the effective contributions to the model of each of the 100,000 markers is in Figure 3, for the case $\sigma^2_\beta = 10^{-4}$; more than 95% of the markers contributed less than $2\times 10^{-4}$ effective degrees of freedom to the model. Hence, when a marker contributes to variance in a tiny manner, shrinkage of its individual effect toward 0 is very strong. Then, if a marker effect conveys the meaning of a fraction equal to $10^{-8}$, say, of some physical parameter, what can this tell us about the state of nature (i.e., genetic architecture) in the absence of effective Bayesian learning, as argued earlier in the article? Probably not much, unless n ≫ p and the model fitted is the "true" one, the latter requiring the extraordinarily strong assumption that a complex trait is well represented by a (multiple) linear regression.
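The effective-number-of-parameters comparison described above can be reproduced approximately with a short script (an illustrative sketch of (13) and (14) under the orthonormal-X assumption; marker effects are drawn here from the stated DE priors, so the Bayes L figures will differ slightly from run to run and from the values quoted in the text):

```python
import numpy as np

rng = np.random.default_rng(7)
p, sigma2_e = 100_000, 1.0

for sigma2_b in (1e-16, 1e-8, 1e-4):
    lam = np.sqrt(2.0 / sigma2_b)               # DE variance 2/lam^2 equals sigma2_b
    beta = rng.laplace(loc=0.0, scale=1.0 / lam, size=p)

    df_ridge = p * sigma2_b / (sigma2_b + sigma2_e)                      # equation (13)
    df_bayesL = np.sum(np.abs(beta) / (np.abs(beta) + sigma2_e * lam))   # equation (14)
    print(f"sigma2_b = {sigma2_b:.0e}:  d.f. ridge = {df_ridge:.3e},  d.f. Bayes L = {df_bayesL:.3e}")
```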

Bayes L with Gamma prior for λ²: The DE density (10) is indexed by a single positive parameter λ and, if this is treated as unknown, the marginal prior density of a marker effect is

$$p(\beta) = \int_0^{\infty} p(\beta\,|\,\lambda)\,p(\lambda)\,d\lambda,$$

where p(λ) is the prior density of λ. Clearly $E(\beta) = E_\lambda E(\beta\,|\,\lambda) = 0$, but the prior variance of β will depend on the distribution assigned to λ. Typically, a G(r, δ) prior is placed on λ², with the density being

$$p\!\left(\lambda^2\,|\,r, \delta\right) = \frac{\delta^r}{\Gamma(r)}\left(\lambda^2\right)^{r-1}\exp\!\left(-\delta\lambda^2\right), \qquad (15)$$

with $E(\lambda^2\,|\,r, \delta) = r/\delta$ and $\mathrm{Var}(\lambda^2\,|\,r, \delta) = r/\delta^2$. Since λ is positive, $p(\beta\,|\,\lambda) = p(\beta\,|\,\lambda^2)$, so that

$$p(\beta\,|\,r, \delta) = \int_0^{\infty} p\!\left(\beta\,|\,\lambda^2\right)p\!\left(\lambda^2\,|\,r, \delta\right)d\lambda^2 \propto \int_0^{\infty}\left(\lambda^2\right)^{r+\frac{1}{2}-1}\exp\!\left[-\left(|\beta|\sqrt{\lambda^2} + \delta\lambda^2\right)\right]d\lambda^2. \qquad (16)$$

Using in the above expression an approximation given in the Appendix (Approximation of an integral in Bayes L), Equation 16 gives

$$p(\beta\,|\,r, \delta) \propto_{\text{approx.}} e^{-|\beta|\sqrt{r/\delta}}\left\{1 - \frac{1}{2}\sqrt{\frac{\delta}{r}}\,|\beta|\,\frac{1}{2\delta} + \left(\frac{\delta}{8r}|\beta|^2 + \sqrt{\frac{\delta}{r}}\,|\beta|\right)\frac{4r + 3}{4\delta^2}\right\}, \qquad (17)$$

where $\propto_{\text{approx.}}$ means "approximately proportional to." If only the first term of the approximation is used, after normalization one gets

$$p_1(\beta\,|\,r, \delta) \approx \frac{e^{-|\beta|\sqrt{r/\delta}}}{\int_{-\infty}^{\infty} e^{-|\beta|\sqrt{r/\delta}}\,d\beta}, \qquad (18)$$

and this is a DE density with parameter $\lambda = \sqrt{r/\delta}$. If both the first and second terms of the approximation are employed, one gets

$$p_2(\beta\,|\,r, \delta) \approx \frac{e^{-|\beta|\sqrt{r/\delta}}\left[1 - \frac{1}{2}\sqrt{\delta/r}\,|\beta|\,\frac{1}{2\delta}\right]}{\int_{-\infty}^{\infty} e^{-|\beta|\sqrt{r/\delta}}\left[1 - \frac{1}{2}\sqrt{\delta/r}\,|\beta|\,\frac{1}{2\delta}\right]d\beta}. \qquad (19)$$

Next, we examine the shape of the unnormalized density (19) for two different G(r, δ) prior distributions of λ². Setting r = δ gives Gamma distributions with expected value 1 and variance 1/δ; use of r = δ = 4 and r = δ = 16 produces prior distributions with variances 1/4 and 1/16, respectively, and the corresponding densities are shown in Figure 4, top left. Taking into account that the prior distributions of marker effects have null means, the variance of approximation (19) to the marginal prior of β was evaluated by numerical integration between −9 and 9 as

$$\mathrm{Var}_2(\beta\,|\,r = \delta) = \frac{\int_{-9}^{9}\beta^2 e^{-|\beta|}\left[1 - \frac{1}{2}|\beta|\,\frac{1}{2\delta}\right]d\beta}{\int_{-9}^{9} e^{-|\beta|}\left[1 - \frac{1}{2}|\beta|\,\frac{1}{2\delta}\right]d\beta},$$

yielding 1.73 (δ = 4) and 1.93 (δ = 16). This produces a seemingly paradoxical situation, where the more uncertain prior for λ² (δ = 4) gives a marginal prior for the marker effect that is more precise (as measured by the variance) than that for δ = 16. The densities, shown in Figure 4, top right, seem indistinguishable. However, if the plot is zoomed in on the middle and right tails of the distribution (bottom left and bottom right, respectively), the prior with δ = 16 turns out to be less sharp and with thicker tails, thus explaining its larger variance. Also, the prior probability that a marker has an effect ranging from −0.3 to 0.3 is 0.274 for δ = 4 (the more variable prior for λ²) and 0.263 for δ = 16; the probabilities that a marker has an effect from 2 to 7 are 0.058 (δ = 4) and 0.065 (δ = 16), respectively.
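The numerical integration just described is straightforward to reproduce (a sketch using a plain Riemann sum on the interval (−9, 9); the grid size is an arbitrary choice, and the grid spacing cancels in the ratio):

```python
import numpy as np

def var2(d, lo=-9.0, hi=9.0, m=200_001):
    """Variance of the approximate marginal prior (19) with r = d, by numerical integration."""
    b = np.linspace(lo, hi, m)
    dens = np.exp(-np.abs(b)) * (1.0 - 0.5 * np.abs(b) * (1.0 / (2.0 * d)))  # unnormalized
    return np.sum(b**2 * dens) / np.sum(dens)

print(round(var2(4.0), 2), round(var2(16.0), 2))   # approximately 1.73 and 1.93
```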

Bayes L with uniform prior on λ: In an attempt to make the prior in a Bayesian analysis less aggressive, one may naively think that Bayes's "principle of insufficient reason" (the uniform prior) may render the analysis objective. Let the uniform prior on λ be $\lambda\,|\,L, U \sim \mathrm{Uniform}(L, U)$, where L and U are the lower and upper bounds, respectively, of the prior distribution. Mixing the DE distribution with parameter λ over this prior gives as marginal density

$$p(\beta\,|\,L, U) = \frac{1}{U - L}\int_L^U \frac{\lambda}{2}\exp(-\lambda|\beta|)\,d\lambda.$$

As before, we employ a Taylor series to approximate $\exp(-\lambda|\beta|)$, but now around the expectation $\mu = (U + L)/2$ of the uniform distribution, giving

$$\exp(-\lambda|\beta|) \approx e^{-\mu|\beta|}\left[1 - |\beta|(\lambda - \mu) + \frac{1}{2}|\beta|^2(\lambda - \mu)^2\right].$$

Then

$$p(\beta\,|\,L, U) \propto_{\text{approx.}} \frac{e^{-\mu|\beta|}}{U - L}\int_L^U\left[1 - |\beta|(\lambda - \mu) + \frac{1}{2}|\beta|^2(\lambda - \mu)^2\right]\frac{\lambda}{2}\,d\lambda. \qquad (20)$$

Figure 3 Density (over 100,000 markers) of the "effective degrees of freedom" contributed by a marker to the model, for a double exponential prior distribution with variance $10^{-4}$.

If the constant and the linear terms of the expansion are retained, this produces

$$p_{\text{unif},1}(\beta\,|\,L, U) \propto_{\text{approx.}} \frac{e^{-\mu|\beta|}}{U - L}\int_L^U\left[1 - |\beta|(\lambda - \mu)\right]\frac{\lambda}{2}\,d\lambda.$$

Since λ is positive, one can set L = 0 and $\mu = U/2$, yielding

$$p_{\text{unif},1}(\beta\,|\,L = 0, U) \propto_{\text{approx.}} \frac{e^{-\mu|\beta|}}{U}\left[\frac{1}{4}U^2 + |\beta|\frac{U^2}{2}\left(\frac{\mu}{2} - \frac{U}{3}\right)\right] = \frac{U}{4}\,e^{-U|\beta|/2}\left(1 - \frac{|\beta|U}{6}\right).$$

A plot of $p_{\text{unif},1}(\beta_j\,|\,L = 0, U)$ is shown in Figure 5. As U increases, the prior distribution of the marker effect gets increasingly concentrated near 0, reaching a point mass in the limit. This implies that the regression model becomes effectively very simple if U is assigned large values, as most regression coefficients take values close to 0. In theory, this should produce underfitting and out-of-sample predictions that do not generalize well. It is thus intriguing why Legarra et al. (2011) obtained reasonable predictive accuracies when placing a uniform prior on λ, with L = 0 and $U = 10^6$. This theoretical excursion suggests that a big warning should be inserted in the documentation of software implementing DE regression models with a flat prior on the regularization parameter λ.
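The concentration of this approximate prior near zero as U grows can be visualized numerically (an illustrative sketch; the grid of β values, the truncation of that grid, and the set of U values are arbitrary choices made here):

```python
import numpy as np

def p_unif1(b, U):
    """Unnormalized approximate marginal prior of a marker effect under lambda ~ Uniform(0, U)."""
    return (U / 4.0) * np.exp(-U * np.abs(b) / 2.0) * (1.0 - np.abs(b) * U / 6.0)

b = np.linspace(-1.0, 1.0, 2001)
for U in (1.0, 5.0, 10.0):
    dens = np.clip(p_unif1(b, U), 0.0, None)    # the truncated expansion can go negative in the tails
    dens /= np.sum(dens) * (b[1] - b[0])        # normalize numerically on the grid
    mass_near_zero = np.sum(dens[np.abs(b) < 0.1]) * (b[1] - b[0])
    print(f"U = {U:>4}: Pr(|beta| < 0.1) ~ {mass_near_zero:.3f}")
# The probability mass near 0 increases with U: the prior concentrates at zero.
```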

On parameterizations of Bayes L: How any Bayesian or "classical" model is parameterized depends on mechanistic (e.g., interpretation with respect to some theory) or computing considerations, but alternative parameterizations must be equivalent in terms of the inference attained. For example, a parameterization of the classical infinitesimal model (e.g., Hill 2012) in terms of additive genetic and environmental variances $(V_A, V_E)$ must be equivalent to the parameterization $(V - V_E, V_E)$, where V is the phenotypic variance, or to the parameterization $(Vh^2, (1 - h^2)V)$, where $h^2$ is heritability. The second and third parameterizations do not imply causally that the genetic variance depends on the environmental variance or that the environmental variance depends on heritability. In likelihood-based inference there is invariance of parameters under transformation. However, care must be exercised in Bayesian analysis because parameters are random, so any rotation of coordinates (some transformations involve nonlinear rotations) requires intervention of the Jacobian of the transformation. One can go back and forth between parameterizations, provided that probability volumes are preserved properly. For instance, if one assigns independent priors to $h^2$ and V in a $(Vh^2, (1 - h^2)V)$ parameterization, those used in a $(V_A, V_E)$ parameterization should be probabilistically consistent with the preceding, such that samples from the joint posterior of $h^2$ and V produce the same distribution as that obtained by sampling from the joint posterior of $(V_A, V_E)$. Further, conditioning and deconditioning may be necessary due to computing issues, e.g., the Gibbs sampler works with conditional distributions, but the algorithm automates the deconditioning. It is precisely in this context that Legarra et al. (2011) misinterpreted the parameterization of Bayes L in Park and Casella (2008), de los Campos et al. (2009b), Weigel et al. (2009), and Vázquez et al. (2010) who, instead of working directly with prior (10), adopted a conditional prior discussed further below. All these authors have applied this parameterization successfully using data from animals and plants.

Figure 4 (A) Gamma prior density of λ² for r = δ = 4 (dot–dash) and r = δ = 16 (solid). (B) Marginal prior densities of marker effects for r = δ = 4 (dot–dash) and r = δ = 16 (solid). (C) and (D) focus on the middle and right tails of the densities displayed in B, respectively.

For reasons related to the behavior of Markov chain Monte Carlo algorithms for Bayes L, Park and Casella (2008) introduced a conditional DE distribution, with density

$$f\!\left(\beta\,|\,\lambda, \sigma^2_e\right) = \frac{\lambda}{2\sqrt{\sigma^2_e}}\exp\!\left(-\frac{\lambda}{\sqrt{\sigma^2_e}}|\beta|\right).$$

This distribution has mean $E(\beta\,|\,\lambda, \sigma^2_e) = 0$ and variance $\mathrm{Var}(\beta\,|\,\lambda, \sigma^2_e) = 2(\sigma^2_e/\lambda^2)$; this, of course, is not the variance of $\beta_j$. Legarra et al. (2011) incorrectly wrote $\mathrm{Var}(\beta) = 2(\sigma^2_e/\lambda^2)$ and made the statement: "we do expect the distribution of SNP effects not to be related to unobservable, unaccounted (residual) effects that can, for example, vary from site to site for the same individuals." It is fairly obvious that $\mathrm{Var}(\beta\,|\,\lambda, \sigma^2_e)$ cannot be $\mathrm{Var}(\beta_j)$, since

$$\mathrm{Var}(\beta\,|\,\lambda) = E_{\sigma^2_e}\!\left[\mathrm{Var}\!\left(\beta\,|\,\lambda, \sigma^2_e\right)\right] + \mathrm{Var}_{\sigma^2_e}\!\left[E\!\left(\beta\,|\,\lambda, \sigma^2_e\right)\right] = E_{\sigma^2_e}\!\left(\frac{2\sigma^2_e}{\lambda^2}\right),$$

with the term $\mathrm{Var}_{\sigma^2_e}[E(\beta\,|\,\lambda, \sigma^2_e)]$ dropping because it is null. Hence, $\mathrm{Var}(\beta\,|\,\lambda)$ depends on the prior adopted for $\sigma^2_e$. If $\sigma^2_e$ is assigned a scaled inverted chi-square distribution on $\nu_e$ degrees of freedom and with scale $S^2_e$, with density as in (29),

$$\mathrm{Var}\!\left(\beta\,|\,\lambda, \nu_e, S^2_e\right) = E_{\sigma^2_e}\!\left(\frac{2\sigma^2_e}{\lambda^2}\,\Big|\,\lambda\right) = \frac{2}{\lambda^2}\int_0^{\infty}\sigma^2_e\,p\!\left(\sigma^2_e\,|\,\nu_e, S^2_e\right)d\sigma^2_e = \frac{2\nu_e S^2_e}{\lambda^2(\nu_e - 2)}, \qquad \nu_e > 2. \qquad (21)$$

Therefore, the variance of the prior distribution of marker effects does not depend on $\sigma^2_e$ but, rather, on $\lambda^2$ and on the parameters of the prior distribution of $\sigma^2_e$. There is the additional complication that (21) does not take into account uncertainty associated with λ, and this is examined next.

Since λ must be positive, conditioning on λ is equivalent to conditioning on λ², so that E(β_j) = E_{λ²}E(β_j|λ²) = 0, and

\mathrm{Var}(\beta) = E_{\lambda^2}\mathrm{Var}\left(\beta\mid\lambda^2\right) + \mathrm{Var}_{\lambda^2}E\left(\beta\mid\lambda^2\right) = E_{\lambda^2}\mathrm{Var}\left(\beta\mid\lambda^2\right).

Hence, unconditionally, use of (21) in E_{λ²}Var(β|λ²) produces

\mathrm{Var}\left(\beta\mid\nu_e,S^2_e\right) = \frac{2\nu_e S^2_e}{\nu_e-2}\int\frac{1}{\lambda^2}\,p\left(\lambda^2\right)d\lambda^2.

If a G(r, δ) prior is placed on λ², with density (13),

\mathrm{Var}\left(\beta\mid\nu_e,S^2_e,r,\delta\right) = \frac{2\nu_e S^2_e}{\nu_e-2}\int\frac{1}{\lambda^2}\,\frac{\delta^{r}}{\Gamma(r)}\left(\lambda^2\right)^{r-1}\exp\!\left(-\delta\lambda^2\right)d\lambda^2.

Changing variables to u = 1/λ² gives

\mathrm{Var}\left(\beta\mid\nu_e,S^2_e,r,\delta\right) = \frac{2\nu_e S^2_e}{\nu_e-2}\int u\,\frac{\delta^{r}}{\Gamma(r)}\,u^{-r+1}\exp\!\left(-\frac{\delta}{u}\right)\frac{1}{u^2}\,du = \frac{2\nu_e S^2_e}{\nu_e-2}\int u\,\frac{\delta^{r}}{\Gamma(r)}\,u^{-r-1}\exp\!\left(-\frac{\delta}{u}\right)du.

The integral is the expected value of a random variable (u) following an inverted Gamma distribution with parameters r and δ, which is δ/(r − 1) (r > 1), so

\mathrm{Var}\left(\beta\mid\nu_e,S^2_e,r,\delta\right) = \frac{2\nu_e S^2_e\,\delta}{\left(\nu_e-2\right)\left(r-1\right)}.   (22)

As argued in Gianola et al. (2009), the connection between the variance of the prior distribution of marker effects and additive genetic variance is subtle and elusive. If Var(β|ν_e, S²_e, r, δ) were to be viewed as the variance of an additive effect in some infinitesimal model, how are its different components interpreted? If the standard infinitesimal model is parameterized in terms of (V_E, h²) one can write

Figure 5 Prior density of a marker effect when a uniform (0, U) prior is adopted for λ, at varying values of the upper bound U of the uniform distribution: 1 (solid line), 5 (dashes), 10 (dots–dashes).



V_A = \frac{V_E\,h^2}{1-h^2}.

In (22), ν_eS²_e/(ν_e − 2) is the counterpart of V_E, since this is the expected value of the prior distribution assigned to the residual variance, σ²_e. Then, 2δ/(r − 1) plays the role of h²/(1 − h²); since δ/(r − 1) is the prior expectation of 1/λ², it would turn out that λ²/2 would be the counterpart of (1 − h²)/h².
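The algebra leading to (22) can be checked by simulation. The sketch below (Python with NumPy; the hyperparameter values are arbitrary assumptions, chosen only so that the moments in (22) exist) draws σ²_e from its scaled inverted chi-square prior, λ² from its Gamma prior, β from the conditional DE distribution, and compares the Monte Carlo variance with the closed form:

import numpy as np

rng = np.random.default_rng(2)
nu_e, S2_e, r, delta = 5.0, 0.8, 4.0, 4.0               # arbitrary hyperparameters with nu_e > 2, r > 1
n = 2_000_000
sigma2_e = nu_e * S2_e / rng.chisquare(nu_e, size=n)     # scaled inverted chi-square draws
lam2 = rng.gamma(shape=r, scale=1.0 / delta, size=n)     # lambda^2 ~ G(r, delta)
beta = rng.laplace(0.0, np.sqrt(sigma2_e / lam2))        # conditional DE draw; scale = sqrt(sigma2_e)/lambda
print(beta.var())                                        # Monte Carlo variance of the marginal prior
print(2.0 * nu_e * S2_e * delta / ((nu_e - 2.0) * (r - 1.0)))  # Equation 22

The two numbers should agree up to Monte Carlo error, confirming that the marginal prior variance of a marker effect depends on (ν_e, S²_e, r, δ) rather than on σ²_e itself.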

The statements made in Legarra et al. (2011) are misleading due to an incorrect interpretation of the parameterization of Bayes L proposed by Park and Casella (2008), used to address a multimodality problem that seems to arise in nonhierarchical implementations of Bayes L in the sense of Kärkkäinen and Sillanpää (2012). These authors reported that hierarchical and nonhierarchical versions of the Bayesian Lasso led to different posterior inferences, but could not find clear reasons for this discrepancy. It might be related to lack of convergence of the Markov chain Monte Carlo scheme in the nonhierarchical parameterization or perhaps to some impropriety. Additional basic research is needed to explain this paradox, but Kärkkäinen and Sillanpää (2012) recommended the hierarchical implementation, possibly because of easier computation.

Bayes R

Erbe et al. (2012) presented this method as follows. Bayes R starts the hierarchical model with (1) and poses a mixture of four zero-mean normal distributions as a conditional prior for a specific SNP effect:

p\left(\beta \mid \sigma^2_{\beta_1}=0,\ \sigma^2_{\beta_2}=10^{-4}\sigma^2_g,\ \sigma^2_{\beta_3}=10^{-3}\sigma^2_g,\ \sigma^2_{\beta_4}=10^{-2}\sigma^2_g,\ \pi_1,\pi_2,\pi_3,\pi_4\right) = \pi_1\cdot 0 + \pi_2 N\!\left(\beta\mid 0,10^{-4}\sigma^2_g\right) + \pi_3 N\!\left(\beta\mid 0,10^{-3}\sigma^2_g\right) + \pi_4 N\!\left(\beta\mid 0,10^{-2}\sigma^2_g\right).   (23)

Here, if the SNP effect is generated from the first component of the mixture (with probability π₁) it will be 0 with complete certainty; if drawn from the second component it will have a normal distribution with null mean and variance σ²_{β2} = 10⁻⁴σ²_g; and so on. In Bayes R, σ²_g = r²σ² is the assumed genetic variance, r² is the assumed reliability, and σ² is the variance of the target trait. Presumably, the assumption about r² is either model derived or based on prior cross-validation information, which is good Bayesian behavior, normatively. Makowsky et al. (2011) gave evidence that what one assumes about genetic variance from inference in training data is not recovered in cross-validation.

The mean of the mixture is obviously 0. Since the four components of the mixture have null means, the variance, given π = (π₁, π₂, π₃, π₄), is

\mathrm{Var}(\beta\mid\pi) = \left(\pi_2\cdot 10^{-4} + \pi_3\cdot 10^{-3} + \pi_4\cdot 10^{-2}\right)\sigma^2_g.

Further,

\mathrm{Var}(\beta) = E_{\pi}\!\left[\mathrm{Var}(\beta\mid\pi)\right] + \mathrm{Var}_{\pi}\!\left[E(\beta\mid\pi)\right] = E_{\pi}\!\left[\mathrm{Var}(\beta\mid\pi)\right].

Erbe et al. (2012) used a Dirichlet distribution with parameter vector α = (α₁, α₂, α₃, α₄)′ as prior for the elements of π, so that

\mathrm{Var}(\beta\mid\alpha) = E_{\pi}\!\left[\mathrm{Var}(\beta\mid\pi)\right] = \frac{10^{-4}\alpha_2 + 10^{-3}\alpha_3 + 10^{-2}\alpha_4}{\alpha_1+\alpha_2+\alpha_3+\alpha_4}\,\sigma^2_g.   (24)

In particular, Erbe et al. (2012) took α₁ = α₂ = α₃ = α₄ = 1, producing a uniform distribution on π. It follows that all SNPs have the same marginal prior distribution, with null mean and variance

\mathrm{Var}(\beta\mid\alpha) = \frac{r^2\sigma^2}{400}\left(1 + \frac{1}{10} + \frac{1}{100}\right) = \frac{111}{4\cdot 10^{4}}\,r^2\sigma^2.

This suggests that a simple ridge regression–BLUP obtained by solving

\left[X'X + \frac{\sigma^2_e\left(\alpha_1+\alpha_2+\alpha_3+\alpha_4\right)}{r^2\sigma^2\left(10^{-4}\alpha_2 + 10^{-3}\alpha_3 + 10^{-2}\alpha_4\right)}\,I\right]\hat{\beta} = X'y

may deliver predictive abilities that are similar to those of Bayes R, except that it would differ with respect to Bayes R on how marker effects are shrunk.
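A minimal sketch of this equivalence is given below (Python with NumPy; the genotypes, phenotypes, assumed reliability r², residual variance, and Dirichlet parameters are simulated or assumed purely for illustration). It computes the prior variance implied by (24), converts it into a single ridge shrinkage factor, and solves the system displayed above:

import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 1000
X = rng.binomial(2, 0.3, size=(n, p)).astype(float) - 1.0            # toy SNP codes
y = X @ rng.normal(0.0, 0.05, size=p) + rng.normal(0.0, 1.0, size=n)
r2, sigma2, sigma2_e = 0.5, y.var(), 1.0                              # assumed reliability and variances
alpha = np.ones(4)                                                    # Dirichlet parameters as in Erbe et al. (2012)
var_b = (1e-4 * alpha[1] + 1e-3 * alpha[2] + 1e-2 * alpha[3]) / alpha.sum() * r2 * sigma2   # Equation 24
lam = sigma2_e / var_b                                                # common shrinkage factor
beta_rr = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)         # ridge regression-BLUP
print(var_b, lam, float(np.abs(beta_rr).max()))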

Insight on how shrinkage takes place in Bayes R is gained by inspecting the joint posterior density of all marker effects, given r², σ², and π. Here

p\left(\beta\mid y,\pi,r^2,\sigma^2\right) \propto \exp\!\left(-\frac{(y-X\beta)'(y-X\beta)}{2\sigma^2_e}\right)\prod_{j=1}^{p}\left[\pi_1\cdot 0 + \pi_2 N\!\left(\beta_j\mid 0,\sigma^2_2\right) + \pi_3 N\!\left(\beta_j\mid 0,\sigma^2_3\right) + \pi_4 N\!\left(\beta_j\mid 0,\sigma^2_4\right)\right],   (25)

where σ²_2 = r²σ²·10⁻⁴, σ²_3 = r²σ²·10⁻³, and σ²_4 = r²σ²·10⁻² (these values can be modified a piacere). Taking derivatives of the log-posterior with respect to β gives (apart from an additive constant)

\frac{\partial}{\partial\beta}\log p\left(\beta\mid y,\pi,r^2,\sigma^2\right) = \frac{1}{\sigma^2_e}\left(X'y - X'X\beta\right) + \left\{\frac{\sum_{i=2}^{4}\pi_i\,\frac{d}{d\beta_j}f_i\left(\beta_j\mid 0,\sigma^2_i\right)}{\pi_2 f_2\left(\beta_j\mid 0,\sigma^2_2\right)+\pi_3 f_3\left(\beta_j\mid 0,\sigma^2_3\right)+\pi_4 f_4\left(\beta_j\mid 0,\sigma^2_4\right)}\right\},   (26)

where {·} denotes a p × 1 vector. Above, f_i(β_j|0, σ²_i) (i = 2, 3, 4) is the density of β_j under the normal distribution corresponding to component i of the mixture, with



\frac{d}{d\beta_j}f_i\left(\beta_j\mid 0,\sigma^2_i\right) = -\frac{f_i\left(\beta_j\mid 0,\sigma^2_i\right)}{\sigma^2_i}\,\beta_j.

Employing the preceding expression in Equation 26 yields

\frac{\partial}{\partial\beta}\log p\left(\beta\mid y,\pi,r^2,\sigma^2\right) = \frac{1}{\sigma^2_e}\left[\left(X'y - X'X\beta\right) - \left\{\frac{\sigma^2_e\sum_{i=2}^{4}\pi_i f_i\left(\beta_j\mid 0,\sigma^2_i\right)/\sigma^2_i}{\sum_{i=2}^{4}\pi_i f_i\left(\beta_j\mid 0,\sigma^2_i\right)}\,\beta_j\right\}\right].   (27)

Setting this to zero and rearranging leads to the iteration

\beta^{[t+1]} = \left(X'X + V_{\beta}^{[t]}\right)^{-1}X'y,

where V_β^{[t]} is a p × p diagonal matrix with typical element

V^{[t]}_{jj,\beta} = \frac{\sigma^2_e\sum_{i=2}^{4}\pi_i f_i\left(\beta_j^{[t]}\mid 0,\sigma^2_i\right)\left(1/\sigma^2_i\right)}{\sum_{i=2}^{4}\pi_i f_i\left(\beta_j^{[t]}\mid 0,\sigma^2_i\right)} = \sum_{i=2}^{4}\pi'^{[t]}_{ij}\,\frac{\sigma^2_e}{\sigma^2_i},   (28)

where

\pi'^{[t]}_{ij}\left(\beta_j\right) = \frac{\pi_i f_i\left(\beta_j^{[t]}\mid 0,\sigma^2_i\right)}{\sum_{i=2}^{4}\pi_i f_i\left(\beta_j^{[t]}\mid 0,\sigma^2_i\right)},\qquad i = 2, 3, 4\ \text{and}\ j = 1, 2, \ldots, p.

This is interpretable as the probability that a value β_j in the course of iteration comes from the ith component of the mixture, as the value of β_j changes iteratively. Note that V_{jj,β} is a weighted average of the shrinkage factors σ²_e/σ²_i corresponding to those that would be employed if the variance parameter of the ith component of the mixture were to be used in ridge regression–BLUP. If σ²_i is taken as constant over the three "slab" components, Bayes R reduces to BLUP. On the other hand, when σ²_i varies over components, the ratio σ²_e/σ²_i will be larger for components having the smallest variance. Observe that π₁ does not play a role in this posterior-mode interpretation of how Bayes R effects shrinkage.
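The posterior-mode iteration in (26)–(28) can be sketched as follows (Python with NumPy/SciPy; the data, slab probabilities, and component variances are simulated or assumed only for illustration, and π₁ is omitted because, as noted above, it plays no role in this interpretation):

import numpy as np
from scipy.stats import norm

def bayes_r_mode(X, y, pi_slab, sig2_comp, sigma2_e, n_iter=50):
    # functional iteration beta[t+1] = (X'X + V_beta[t])^(-1) X'y, Equations 26-28
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        dens = np.array([pi_slab[i] * norm.pdf(beta, 0.0, np.sqrt(sig2_comp[i])) for i in range(3)])
        w = dens / dens.sum(axis=0)                                          # pi'_ij of Equation 28
        v_diag = sigma2_e * (w / np.array(sig2_comp)[:, None]).sum(axis=0)   # V[t]_jj,beta
        beta = np.linalg.solve(XtX + np.diag(v_diag), Xty)
    return beta

rng = np.random.default_rng(4)
n, p = 150, 400
X = rng.normal(size=(n, p))
y = X @ rng.normal(0.0, 0.05, size=p) + rng.normal(size=n)
r2, sigma2 = 0.5, y.var()                                                    # assumed reliability and trait variance
sig2_comp = [1e-4 * r2 * sigma2, 1e-3 * r2 * sigma2, 1e-2 * r2 * sigma2]
print(bayes_r_mode(X, y, pi_slab=[0.3, 0.3, 0.4], sig2_comp=sig2_comp, sigma2_e=1.0)[:5])

Markers sitting near zero receive weights concentrated on the smallest-variance component and are therefore shrunk most strongly, which is the behavior described above.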

In summary, Bayes R assigns the same prior distribution to all markers in the battery of SNPs, one with null mean and variance (for a mixture of K components)

\mathrm{Var}(\beta\mid\alpha) = \sum_{k=1}^{K}\frac{\alpha_k}{\sum_{k=1}^{K}\alpha_k}\,\sigma^2_k,

where the α's are the parameters of the prior distribution of the mixing probabilities π; Bayes R takes σ²_1 = 0. The superior performance of Bayes R over other methods found by Erbe et al. (2012) probably results from using prior empirical knowledge about r², the assumed reliability. Bayes R has been extended to Bayes RS (Brondum et al. 2012). This is a minor variant of Bayes R in which the mixture (23) is expanded by a factor S, so that there are now S mixtures of four normal distributions each. The letter S denotes a number of chromosome segments constructed in some manner that reflects prior knowledge that some such segments contribute more variance than others. Using the arguments outlined above, it is easy to see that Bayes RS leads to a shrinkage that, instead of being component specific, is now region-component specific.

An incorrect prior often used in the Bayesian alphabet

The following statement is found at high frequency in the genomic selection literature: "The prior distribution of the residual variance is χ⁻²(σ²_e|ν_e = −2, S²_e = 0), meaning that the degrees of freedom of the prior is −2 and that the scale parameter is null." Examples are Meuwissen et al. (2001) and Jia and Jannink (2012), respectively. Note that Bayes's theorem returns with null posterior density or probability any parameter value that is assigned 0 density or mass a priori. If the prior density (or probability) of parameter θ is such that p(θ|hyperparameters) = 0, it must be that

p\left(\theta\mid \text{hyperparameters}, y\right) = \frac{p\left(y\mid\theta,\text{hyperparameters}\right)\cdot 0}{p\left(y\mid\text{hyperparameters}\right)} = 0

as well. The prior χ⁻²(σ²_e|ν_e = −2, S²_e = 0) is absurd for two reasons. First, a scaled inverted chi-square distribution exists only if both ν_e and S²_e are > 0. To see the second reason, we write the prior density explicitly, that is,

p\left(\sigma^2_e\mid\nu_e,S^2_e\right) = \frac{\left(\nu_e S^2_e/2\right)^{\nu_e/2}}{\Gamma\left(\nu_e/2\right)}\left(\sigma^2_e\right)^{-\left(\frac{\nu_e}{2}+1\right)}\exp\!\left(-\frac{\nu_e S^2_e}{2\sigma^2_e}\right),   (29)

so for S²_e = 0 and any "legal" value of ν_e, p(σ²_e|ν_e, S²_e = 0) = 0 for all σ²_e. Then, it must be that p(σ²_e|ν_e, S²_e, y) = 0 for all σ²_e as well. Hence, a scaled inverted chi-square with a null scale parameter is not a probability model at all, as it does not assign appreciable density to any value of the unknown residual variance. It does not convey uncertainty whatsoever: any value of the residual variance is assigned a density of zero, prior and posterior to observing data.

A possible reason for this mistake is as follows: Sorensen and Gianola (2002), as many other Bayesians often do, write the prior as being proportional to the kernel of the scaled inverted chi-square density, that is, as

p\left(\sigma^2_e\mid\nu_e,S^2_e\right) \propto \left(\sigma^2_e\right)^{-\left(\frac{\nu_e}{2}+1\right)}\exp\!\left(-\frac{\nu_e S^2_e}{2\sigma^2_e}\right),

and note that this kernel reduces to that of a uniform distribution by taking ν_e = −2 and S²_e = 0, yielding p(σ²_e|ν_e, S²_e) ∝ 1. However, it takes more than a kernel to make a density, as multiplication of 1 times the integration constant (ν_eS²_e/2)^{ν_e/2}/Γ(ν_e/2) produces zero.
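A short numerical check makes the point (Python with NumPy/SciPy; the function below simply evaluates Equation 29): with a positive scale the density is positive, whereas with S²_e = 0 the normalizing constant (ν_eS²_e/2)^{ν_e/2} is zero and so is the density, everywhere:

import numpy as np
from scipy.special import gamma as gamma_fn

def scaled_inv_chi2_pdf(s2, nu, S2):
    # Equation 29; a proper density only when nu > 0 and S2 > 0
    const = (nu * S2 / 2.0) ** (nu / 2.0) / gamma_fn(nu / 2.0)
    return const * s2 ** (-(nu / 2.0 + 1.0)) * np.exp(-nu * S2 / (2.0 * s2))

s2 = np.linspace(0.1, 5.0, 5)
print(scaled_inv_chi2_pdf(s2, nu=4.0, S2=1.0))   # positive values: a legitimate prior
print(scaled_inv_chi2_pdf(s2, nu=4.0, S2=0.0))   # all zeros: this "prior" assigns no density anywhere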

Discussion

The main message from this article is that it is not clear how one can learn about genetic architecture from data in n ≪ p situations. This is because individual marker effects are not estimable from the likelihood, apart from the fact that it is unlikely that a multiple linear regression provides a sensible description of biological complexity. On the other hand, it is feasible to learn about the signal Xβ because there is information about this unknown vector in the data, although equivalent to that conveyed by a sample of size 1. Unfortunately, the Bayesian alphabet (Gianola et al. 2009) continues to grow under the incorrect perception that different specifications stemming from various choices of prior inform about genetic architecture; Erbe et al. (2012) and Brondum et al. (2012) provide good examples of this. It is difficult to defend such claims unless n ≫ p and provided that the model is "true" and effectively sparse (Wimmer et al. 2012). Otherwise, the prior always matters whenever n ≪ p, and different priors lead to different claims about the state of nature, merely because their shrinkage behavior in finite samples varies. All members of the alphabet produce unique point and interval Bayesian estimates of marker effects, but the driver is the prior and not the data.

Mixtures of Gaussian distributions are widely used in nonparametric density estimation (Wasserman 2010) because most distributions can be approximated well. Mixtures can capture vagaries from cryptic distributions but at the expense of parsimony, thus posing the risk of copying noise, as opposed to signal, especially if the mixture model has too many parameters. McLachlan and Peel (2000) give a warning: estimation of the parameters of a mixture (C) on the basis of data is meaningful only if C is likelihood identifiable. In Bayes RS (apart from nuisance effects and the residual variance) the number of unknown parameters is 2p + 4S. Here, 2p comes from the fact that each marker is assigned a distinct variance; the 4S comes from the fact that there are S segments, each having four segment-specific mixing probabilities π_s (s = 1, 2, ..., S). Unfortunately, n ≪ 2p + 4S, and this creates a huge identification deficit relative to the information content in a sample of size n. In a Bayesian context, there is the additional issue (occurring even when n > p) called label switching, leading Celeux et al. (2000) to write: "Although somewhat presumptuous, we consider that almost the entirety of Markov chain Monte Carlo samplers for mixture models has failed to converge!" In view of these pitfalls, one wonders what meaningful mechanistic sense can be extracted from these richly parameterized specifications intended to inform about genetic architecture.

Although their inferential outcomes may be misleading, one should not dismiss the potential value of Bayes B, C, R, RS, or of any of the mixture models proposed so far as prediction machines. Predictive distributions stemming from the various members of the alphabet may be analytically distinct from each other, but such differences are seldom revealed in cross-validation (e.g., Heslot et al. 2012); an exception is Lehermeier et al. (2013). Below we review how the alphabet can be interpreted from a predictive perspective.

A pioneer of Bayesian predictive inference (Geisser 1993) wrote:

Clearly hypothesis testing and estimation as stressed in almost all statistics books involve parameters . . . this presumes the truth of the model and imparts an inappropriate existential meaning to an index or parameter . . . inferring about observables is more pertinent since they can occur and be validated to a degree that is not possible for parameters.

Bayesian methods play an important role in machine learning (e.g., Bishop 2006; Barber 2012; Dehmer and Basak 2012; Rogers and Girolami 2012). A reason is that Bayes's theorem provides a predictive distribution automatically, something that has not yet been fully appreciated in the whole-genome prediction literature.

The problem of prediction can be cast as one of making statements about future data y_f, given past data y. A model M (e.g., Bayes L) with parameter vector θ is fitted (trained) to y, leading to the posterior distribution p(θ|y, H, M), where H denotes hyperparameters. If y_f is treated as an unknown, the prior becomes p(θ, y_f|H, M) = p(y_f|θ, M)p(θ|H, M), so that

p\left(\theta, y_f\mid y, H, M\right) \propto p\left(y\mid\theta, y_f, M\right)p\left(y_f\mid\theta, M\right)p\left(\theta\mid H, M\right).

Since past observations do not depend on future observations, given the parameters, p(y|θ, y_f, M) = p(y|θ, M), so that

p\left(y_f\mid y, H, M\right) \propto \int p\left(y_f\mid\theta, M\right)p\left(\theta\mid y, H, M\right)d\theta.   (30)

This is the predictive distribution, where parameters θ do not necessarily play an "existential role" in the sense of Geisser (1993); rather, they are tools enabling one to go from past to future observations. Note that

p\left(y_f\mid y, H, M\right) = E_{\theta\mid y,H,M}\left[p\left(y_f\mid\theta, M\right)\right],

meaning that the predictive distribution weights an infinite number of predictions made at specific values of θ, with the averaging distribution being p(θ|y, H, M); this posterior conveys the plausibility assigned to a specific value of θ, posterior to the observed data y. For example, for ridge regression–BLUP, the posterior distribution of β is β|y, variances ~ N(β̃, (X'X + Iλ)⁻¹σ²_e), where β̃ is the solution to Equation 5. It follows that the posterior distribution of the signal is Xβ|y, variances ~ N(Xβ̃, X(X'X + Iλ)⁻¹X'σ²_e). This implies that the predictive distribution of a future vector of data y_f = X_fβ + e_f would also be normal:

y_f\mid y,\text{variances} \sim N\!\left(X_f\tilde{\beta},\ X_f\left(X'X + I\lambda\right)^{-1}X_f'\,\sigma^2_e + I_f\,\sigma^2_{e_f}\right).

Here, the strong assumption is made that the stochastic processes generating current and future data are the same; typically, it is assumed that σ²_{e_f} = σ²_e, but this may not be realistic. While the different priors of the alphabet lead to different predictive distributions, it is to be expected that at least the point predictions will be fairly similar. This is because Xβ is identified in the likelihood, so some Bayesian learning about the signal will take place, especially when y is a vector of preprocessed means (e.g., means of daughter yield deviations for a battery of dairy cattle bulls with a large number of progeny records). In the latter case, the various members of the alphabet are expected to differ minimally in predicting ability.
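As an illustration, the sketch below (Python with NumPy; genotypes, the variance ratio λ, and the residual variances are simulated or assumed for the example only) computes the point predictions and predictive variances of the normal predictive distribution displayed above for ridge regression–BLUP:

import numpy as np

rng = np.random.default_rng(6)
n, n_f, p = 120, 30, 300
X, X_f = rng.normal(size=(n, p)), rng.normal(size=(n_f, p))    # training and future genotypes
y = X @ rng.normal(0.0, 0.05, size=p) + rng.normal(size=n)
sigma2_e = sigma2_ef = 1.0                                      # assumes equal residual variances
lam = 10.0                                                      # assumed variance ratio sigma2_e / sigma2_beta
A = X.T @ X + lam * np.eye(p)
beta_tilde = np.linalg.solve(A, X.T @ y)                        # ridge regression-BLUP of beta
mean_f = X_f @ beta_tilde                                       # predictive mean
cov_f = X_f @ np.linalg.solve(A, X_f.T) * sigma2_e + sigma2_ef * np.eye(n_f)   # predictive covariance
print(mean_f[:3], np.sqrt(np.diag(cov_f))[:3])                  # point predictions and predictive SDs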

The predictive distribution can be used to check whether observed data are consistent with what a model would lead one to expect. Sorensen and Waagepetersen (2003) used this idea to examine goodness of fit of a model for litter size in pigs. However, the predictive approach outlined above does not take uncertainty about the model into account, and this may understate variability seriously. Bayesians address this via model averaging, where the predictive distribution is averaged over models, that is,

p\left(y_f\mid y\right) = \int p\left(y_f\mid y, H_M, M\right)d\mu\left(M\mid y\right).

This integral covers both the situation where the number of models is finite or countable and that where it is infinite. In the first case the integral is a sum and the measure μ(M|y) is the posterior probability of model M. In the second case the number of possible models may be huge, e.g., in variable selection approaches for linear models aiming to include or exclude p markers, there are 2^p possible specifications. If p is very large, the number of models is practically infinite, so the measure μ(M|y) is the posterior density assigned to a specific model.

Although p(y_f|y) provides a more sensible assessment of predictive uncertainty, in practice one proceeds by constructing cross-validation distributions with respect to one or several competing models. Each prediction generates an error, and this error has a cross-validation distribution. The relevance of cross-validation is another important contribution of Meuwissen et al. (2001) to whole-genome prediction. Here, hyperparameters of genomic selection models (e.g., π in Bayes Cπ) can be viewed as "tuning knobs" and evaluated over a grid. Unfortunately, the reality is that Manhattan plots tend to overwhelm cross-validation graphs in genome-wide association studies.

Also, differences in predictive ability are often masked by the variation conveyed by a properly constructed cross-validation distribution (e.g., González-Camacho et al. 2012). On the other hand, the various Bayesian predictive machines resulting from different priors may possess differential robustness in finite samples. For instance, some priors may be less sensitive than others to differences in true genetic architecture (Wimmer et al. 2012).

Given that the data do not contain information about individual marker effects, variation in inference is an artifact caused by the various priors. This leads to the question: How much does one prior differ from another one? Information on this can be obtained by use of some notion of statistical distance between distributions, such as the Kullback–Leibler (KL) metric. For example, Gianola et al. (2009) used KL for debunking the notion that marker-specific-effect variances in Bayes A tell us something about genetic variability of chromosomal regions. Recently, Lehermeier et al. (2013) used a metric that is easier to interpret than KL, the Hellinger distance or HD (e.g., Roos and Held 2011). They found that Bayesian learning in Bayes A and Bayes B was more limited than with Bayes L or Bayesian ridge regression. In our context, the HD between prior N(β|0, σ²_β) assigned to a marker effect in ridge regression and prior t(β|0, S²_β, ν) of Bayes A is

HD(N, t) = \sqrt{1 - \int\sqrt{N\!\left(\beta\mid 0,\sigma^2_\beta\right)\,t\!\left(\beta\mid 0,S^2_\beta,\nu\right)}\,d\beta}.

HD takes values between 0 and 1, with 1 corresponding to the situation where, say, any realization from t(β|0, S²_β, ν) is assigned 0 density under N(β|0, σ²_β), and vice versa. Similar expressions hold for HD(N, DE), where DE(β|0, λ) is the zero-mean double-exponential distribution with parameter λ that is used in Bayes L, and for HD(t, DE). To compare these three priors, we took σ²_β = 1, S²_βν/(ν − 2) = 1, and 2/λ² = 1, so that the three priors had the same variance; for the t-distribution we assigned ν = 6, to produce sufficiently thick tails. With these assignments S²_β = 2/3 and λ = √2, so that, using numerical integration between −10 and 10, HD(N, t) = 0.069. Further, HD(N, DE) = 0.122, and

HD(t, DE) = \sqrt{1 - \int\sqrt{\frac{\Gamma(3.5)\left(1+\beta^2/4\right)^{-3.5}}{\Gamma(3)\sqrt{4\pi}}\cdot\frac{\exp\!\left(-\sqrt{2}\,|\beta|\right)}{\sqrt{2}}}\,d\beta} = 0.06.

This shows, at least when variances are matched, that these three priors are not too different from each other, so differences in inference would stem from differences in the type and extent of shrinkage effected. However, if priors are not matched, these distances would be expected to increase. Since ridge regression–BLUP, Bayes A, and Bayes L postulate the same sampling model, whenever n ≪ p differences in posterior inferences between these three members of the Bayesian alphabet must be due to the fact that the priors are different and influential.
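The Hellinger-distance calculations can be reproduced numerically with a short sketch (Python with NumPy/SciPy; the settings follow the variance-matched choices above, and the printed values should be close to the 0.069, 0.122, and 0.06 reported in the text):

import numpy as np
from scipy import integrate, stats

nu, S2b, lam = 6.0, 2.0 / 3.0, np.sqrt(2.0)          # variance-matched settings used in the text
densities = {
    "N": lambda b: stats.norm.pdf(b, 0.0, 1.0),
    "t": lambda b: stats.t.pdf(b, df=nu, scale=np.sqrt(S2b)),
    "DE": lambda b: stats.laplace.pdf(b, 0.0, 1.0 / lam),
}

def hellinger(f, g, lo=-10.0, hi=10.0):
    bc, _ = integrate.quad(lambda b: np.sqrt(f(b) * g(b)), lo, hi)   # Bhattacharyya coefficient
    return np.sqrt(max(1.0 - bc, 0.0))

for a, b in [("N", "t"), ("N", "DE"), ("t", "DE")]:
    print(a, b, round(hellinger(densities[a], densities[b]), 3))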

To conclude, whole-genome prediction can be useful for providing locally valid predictions of complex traits. However, the additive regression models employed therein should not be taken at face value from an inferential perspective unless an additive model with many 0 coefficients turns out to hold as approximately true (oracle property 1 met), and n ≫ p₀, where p₀ is the number of nonzero coefficients (oracle property 2 met). If these two conditions are (ever) fulfilled, it may be that the genetic architecture of the very elusive additive QTL (on whose existence the statistical abstraction of marker-assisted inference is based) will be unraveled by statistical means.

The question of the extent to which an additive genetic model is a good representation of complexity is another issue yet to be sorted out. The Bayesian alphabet may expand further on this matter, e.g., Bayes A may grow into Bayes AAA if additive × additive × additive epistasis is included in a model. Additional expansions of the Bayesian alphabet to accommodate epistatic interactions will further exacerbate the inferential problems, because of a vast increase in the number of regression coefficients. It is far from obvious how genetic architecture of complex traits can be learned via highly dimensional statistical models.

Acknowledgments

A big note of thanks goes to Christos Dadousis, Christina Lehermeier, Valentin Wimmer, and Chris-Carolin Schön (Technische Universität München, TUM, Germany) and William G. Hill (University of Edinburgh) for providing a thorough external review of the manuscript. Eduardo Manfredi (Institut National de la Recherche Agronomique, Toulouse, France) is acknowledged for pointing out the article of Duchemin et al. (2012), who detected overparameterization problems of Bayes Cπ. Heather Adams, Juan Manuel González Camacho, Gota Morota, and Francisco Peñagaricano, all from Wisconsin, and Brad Carlin (Minnesota) are thanked for their comments on an earlier draft of this article. The author is indebted to Chiara Sabatti, the Associate Editor handling the review, and to two anonymous reviewers for their constructive criticism leading to a more succinct manuscript, albeit a much less humorous one than the original submission. Work was partially supported by the Wisconsin Agriculture Experiment Station.

Literature Cited

Barber, D., 2012 Bayesian Reasoning and Machine Learning. Cambridge University Press, Cambridge, UK.
Bernardo, J. M., and A. F. M. Smith, 1994 Bayesian Theory. Wiley, Chichester, UK.
Bishop, C. M., 2006 Pattern Recognition and Machine Learning. Springer, New York.
Brondum, R. F., G. Su, M. S. Lund, P. J. Bowman, M. E. Goddard et al., 2012 Genome specific priors for genomic prediction. BMC Genomics 10.1186/1471-2164-13-543.
Carlin, B. P., and T. A. Louis, 1996 Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall, London.
Celeux, G., M. Hurn, and C. Robert, 2000 Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc. 95: 957–979.
Crossa, J., G. de los Campos, P. Pérez, D. Gianola, J. Burgueño et al., 2010 Prediction of genetic value of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186: 713–724.
Dawid, A. P., 1979 Conditional independence in statistical theory (with discussion). J. R. Stat. Soc. B 41: 1–31.
Dehmer, M., and S. C. Basak, 2012 Statistical and Machine Learning Approaches for Network Analysis. Wiley, Hoboken, NJ.
de los Campos, G., D. Gianola, and G. J. M. Rosa, 2009a Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. J. Anim. Sci. 87: 1883–1887.
de los Campos, G., H. Naya, D. Gianola, J. Crossa, A. Legarra et al., 2009b Predicting quantitative traits with regression models for dense molecular markers and pedigrees. Genetics 182: 375–385.
de los Campos, G., D. Gianola, and D. B. Allison, 2010 Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat. Rev. Genet. 11: 880–886.
de los Campos, G., J. M. Hickey, R. Pong-Wong, H. D. Daetwyler, and M. P. L. Calus, 2012a Whole genome regression and prediction methods applied to plant and animal breeding. Genetics 193: 327–345.
de los Campos, G., Y. C. Klimentidis, A. I. Vázquez, and D. B. Allison, 2012b Prediction of expected years of life using whole-genome markers. PLoS ONE 7: 1–7.
Duchemin, S. I., C. Colombani, A. Legarra, G. Baloche, H. Larroque et al., 2012 Genomic selection in the French Lacaune dairy sheep breed. J. Dairy Sci. 95: 2723–2733.
Erbe, M., B. J. Hayes, L. K. Matukumali, S. Goswami, P. J. Bowman et al., 2012 Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J. Dairy Sci. 95: 4114–4129.
Falconer, D. S., and T. F. C. Mackay, 1996 Introduction to Quantitative Genetics, Ed. 4. Longmans Green, Harlow, UK.
Fan, J., and R. Li, 2001 Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96: 1348–1360.
Geisser, S., 1993 Predictive Inference: An Introduction. Chapman & Hall, New York.
Gelfand, A. E., and S. K. Sahu, 1999 Identifiability, improper priors, and Gibbs sampling for generalized linear models. J. Am. Stat. Assoc. 94: 247–253.
Gianola, D., and R. L. Fernando, 1986 Bayesian methods in animal breeding theory. J. Anim. Sci. 63: 217–244.
Gianola, D., B. Heringstad, and J. Ødegård, 2006 On the quantitative genetics of mixture characters. Genetics 173: 2247–2255.
Gianola, D., G. de los Campos, W. G. Hill, E. Manfredi, and R. L. Fernando, 2009 Additive genetic variability and the Bayesian alphabet. Genetics 183: 347–363.
González-Camacho, J. M., G. de los Campos, P. Pérez, D. Gianola, J. E. Cairns et al., 2012 Genome-enabled prediction of genetic values using radial basis function neural networks. Theor. Appl. Genet. 125: 759–771.
Habier, D., R. L. Fernando, K. Kizilkaya, and D. J. Garrick, 2011 Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics. Available at: http://www.biomedcentral.com/1471-2105/12/186
Hastie, T., R. Tibshirani, and J. Friedman, 2009 The Elements of Statistical Learning, Ed. 2. Springer, New York.
Heffner, E. L., M. E. Sorrells, and J. L. Jannink, 2009 Genomic selection for crop improvement. Crop Sci. 49: 1–12.
Henderson, C. R., 1977 Best linear unbiased prediction of breeding values not in the model for records. J. Dairy Sci. 60: 783–787.
Henderson, C. R., 1984 Applications of Linear Models in Animal Breeding. University of Guelph, Ontario, Canada.
Heslot, N., M. E. Sorrells, J. L. Jannink, and H. P. Yang, 2012 Genomic selection in plant breeding: a comparison of models. Crop Sci. 52: 146–160.
Hill, W. G., 2012 Quantitative genetics in the genomics era. Curr. Genomics 13: 196–206.



Janss, L., G. de los Campos, N. Sheehan, and D. Sorensen, 2012 Inferences from genomic models in stratified populations. Genetics 192: 693–704.
Jia, Y., and J.-L. Jannink, 2012 Multiple trait genomic selection methods increase genetic value prediction accuracy. Genetics 192: 1513–1522.
Kärkkäinen, H. P., and M. K. Sillanpää, 2012 Back to basics for Bayesian model building in genomic selection. Genetics 191: 969–987.
Legarra, A., C. Robert-Granié, P. Croiseau, F. Guillaume, and S. Fritz, 2011 Improved Lasso for genomic selection. Genet. Res. 93: 77–87.
Lehermeier, C., V. Wimmer, T. Albrecht, H. Auinger, D. Gianola et al., 2013 Sensitivity to prior specification in Bayesian genome-based prediction models. Stat. Appl. Genet. Mol. Biol. DOI: 10.1515/sagmb-2012-0042.
Lorenz, A. J., S. Chao, F. G. Asoro, E. L. Heffner, T. Hayashi et al., 2011 Genomic selection in plant breeding: knowledge and prospects. Adv. Agron. 110: 77–123.
Makowsky, R., N. M. Pajewski, Y. C. Klimentidis, A. I. Vázquez, C. W. Duarte et al., 2011 Beyond missing heritability: prediction of complex traits. PLoS Genet. 7(4): 10.1371/journal.pgen.100205.
McLachlan, G., and T. Krishnan, 1997 The EM Algorithm and Extensions. Wiley, New York.
McLachlan, G., and D. Peel, 2000 Finite Mixture Models. Wiley, New York.
Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
Meuwissen, T. H. E., T. R. Solberg, R. Shepherd, and J. A. Woolliams, 2009 A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet. Sel. Evol. 41(2): 1–10.
Mrode, R., 2005 Linear Models for the Prediction of Animal Breeding Values, Ed. 2. CABI, Wallingford, UK.
Mutshinda, C. M., and M. J. Sillanpää, 2010 Extended Bayesian LASSO for multiple quantitative trait loci mapping and unobserved phenotype prediction. Genetics 186: 1067–1075.
Ober, U., J. F. Ayroles, E. A. Stone, S. Richards, D. Zhu et al., 2012 Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genet. 8(5): e1002685. 10.1371/journal.pgen.1002685.
O'Hagan, A., 1994 The Advanced Theory of Statistics: Vol. 2B. Bayesian Inference. Arnold, Cambridge, UK.
Park, T., and G. Casella, 2008 The Bayesian Lasso. J. Am. Stat. Assoc. 103: 681–686.
Pérez, P., G. de los Campos, J. Crossa, and D. Gianola, 2010 Genomic-enabled prediction based on molecular markers and pedigree using the Bayesian Linear Regression package in R. Plant Genome 3: 106–116.
Robertson, A., 1955 Prediction equations in quantitative genetics. Biometrics 11: 95–98.
Robinson, G. K., 1991 That BLUP is a good thing: the estimation of random effects. Stat. Sci. 6: 15–32.
Rogers, S., and M. Girolami, 2012 A First Course in Machine Learning. CRC Press, Boca Raton, FL.
Roos, M., and L. Held, 2011 Sensitivity analysis in Bayesian generalized linear mixed models for binary data. Bayesian Anal. 6: 259–278.
Ruppert, D., M. P. Wand, and R. J. Carroll, 2003 Semiparametric Regression. Cambridge University Press, New York.
Searle, S. R., 1966 Matrix Algebra for the Statistical Sciences. Wiley, New York.
Searle, S. R., 1971 Linear Models. Wiley, New York.
Sillanpää, M., 2012 Bayesian Lasso-related methods for genomic predictions and QTL analysis using SNP data, p. 20. Eucarpia, Programme, Information, Abstracts, T4, Hohenheim University, Stuttgart, Germany.
Sorensen, D., and D. Gianola, 2002 Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer, New York.
Sorensen, D., and R. Waagepetersen, 2003 Normal linear models with genetically structured residual variance heterogeneity: a case study. Genet. Res. 82: 207–222.
Sun, X., L. Qu, D. J. Garrick, J. C. M. Dekkers, and R. L. Fernando, 2012 A fast EM algorithm for Bayes A-like prediction of genomic breeding values. PLoS ONE 7(11): e49157. 10.1371/journal.pone.0049157.
Tibshirani, R., 1996 Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58: 267–288.
Van Raden, P. M., 2008 Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.
Vázquez, A. I., G. J. M. Rosa, K. A. Weigel, G. de los Campos, D. Gianola et al., 2010 Predictive ability of subsets of single nucleotide polymorphisms with and without parent average in US Holsteins. J. Dairy Sci. 93: 5942–5949.
Vázquez, A. I., G. de los Campos, Y. C. Klimentidis, G. J. M. Rosa, D. Gianola et al., 2012 A comprehensive genetic approach for improving prediction of skin cancer risk in humans. Genetics 192: 1493–1502.
Verbyla, K. L., P. J. Bowman, B. J. Hayes, and M. E. Goddard, 2009 Sensitivity of genomic selection to using different prior distributions. BMC Proc. 4(Suppl. 1): S5 (doi: 10.1186/1753-6561-4-S1-S5).
Wang, C.-L., X.-D. Ding, J.-Y. Wang, J.-F. Liu, W.-X. Fu et al., 2013 Bayesian methods for estimating GEBVs of threshold traits. Heredity 110: 213–219.
Wasserman, L., 2010 All of Nonparametric Statistics. Springer, New York.
Weigel, K. A., G. de los Campos, O. González-Recio, H. Naya, X. L. Wu et al., 2009 Predictive ability of direct genomic values for lifetime net merit of Holstein sires using selected subsets of single nucleotide polymorphism markers. J. Dairy Sci. 92: 5248–5257.
Wellmann, R., and J. Bennewitz, 2012 Bayesian models with dominance effects for genomic evaluation of quantitative traits. Genet. Res. 94: 21–37.
Wimmer, V., T. Albrecht, C. Lehermeier, H.-J. Auinger, Y. Wang et al., 2012 Eucarpia: Programme, Information, Abstracts, T7, p. 30. Hohenheim University, Stuttgart, Germany.

Communicating editor: C. Sabatti



Appendix

Bias of BLUP with Respect to Marker Effects

As a toy example, let n = 3 and p = 4. The model includes an intercept plus the effects of 3 markers, and the incidence matrix is

X = \begin{bmatrix} 1 & 0 & -1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & -1 \end{bmatrix}.

The first column contains the dummy variable for the intercept, and the remaining columns are the genotype codes for the markers at each of three loci. The first observation (row 1 of X) pertains to an individual that is Aa (coded as 0), bb (coded as −1), CC (coded as 1), and so on. This matrix has rank 3, and a generalized inverse of X'X is

\left(X'X\right)^{-} = \begin{bmatrix} 3 & -4 & 2 & 0 \\ -4 & 6 & -3 & 0 \\ 2 & -3 & 2 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.

If the true values of the intercept and of the marker effects are denoted as a, b, c, and d, respectively, the expected value of the maximum-likelihood estimator of the four parameters is

E\left(\beta^{(0)}\mid\beta\right) = \begin{bmatrix} a \\ b \\ c-d \\ 0 \end{bmatrix},

with the expected value of the effect of the third marker being 0 instead of d because of the rank deficiency. Now, we use BLUP with V_β = Iσ²_β and variance ratio λ = σ²_e/σ²_β (σ²_β is the variance of marker effects) and calculate it (Henderson 1984) as

\mathrm{BLUP}(\beta) = \left(X'X + I\lambda\right)^{-1}X'y.

For this example,

\left(X'X + I\lambda\right)^{-1} = \begin{bmatrix}
\dfrac{\lambda^2+6\lambda+6}{k} & -\dfrac{2\lambda+8}{k} & \dfrac{2}{k} & -\dfrac{2}{k} \\[6pt]
-\dfrac{2\lambda+8}{k} & \dfrac{\lambda^2+7\lambda+12}{k} & -\dfrac{\lambda+3}{k} & \dfrac{\lambda+3}{k} \\[6pt]
\dfrac{2}{k} & -\dfrac{\lambda+3}{k} & \dfrac{\lambda^3+7\lambda^2+11\lambda+1}{k\lambda} & \dfrac{2\lambda^2+9\lambda+1}{k\lambda} \\[6pt]
-\dfrac{2}{k} & \dfrac{\lambda+3}{k} & \dfrac{2\lambda^2+9\lambda+1}{k\lambda} & \dfrac{\lambda^3+7\lambda^2+11\lambda+1}{k\lambda}
\end{bmatrix},

where k = λ³ + 9λ² + 20λ + 2. Then

E\left(\mathrm{BLUP}(\beta)\mid\beta\right) = \left(X'X + I\lambda\right)^{-1}X'X\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}.



After tedious algebra, one arrives at

E\left(\mathrm{BLUP}(\beta)\mid\beta\right) = \begin{bmatrix}
a\,\dfrac{3q_4-2q_6}{k} + b\,\dfrac{2q_4-2q_6+4}{k} - c\,\dfrac{q_6-8}{k} + d\,\dfrac{q_6-8}{k} \\[8pt]
a\,\dfrac{2q_2-3q_6}{k} - b\,\dfrac{2q_6-2q_2+2q_5}{k} + c\,\dfrac{q_2-4q_5}{k} - d\,\dfrac{q_2-4q_5}{k} \\[8pt]
a\,\dfrac{6-2q_5}{k} + b\left(\dfrac{4-2q_5}{k}+\dfrac{q_1-q_3}{k\lambda}\right) - c\left(\dfrac{q_5}{k}-\dfrac{2\left(q_1-q_3\right)}{k\lambda}\right) + d\left(\dfrac{q_5}{k}-\dfrac{2\left(q_1-q_3\right)}{k\lambda}\right) \\[8pt]
-a\,\dfrac{6-2q_5}{k} - b\left(\dfrac{4-2q_5}{k}+\dfrac{q_1-q_3}{k\lambda}\right) + c\left(\dfrac{q_5}{k}-\dfrac{2\left(q_1-q_3\right)}{k\lambda}\right) - d\left(\dfrac{q_5}{k}-\dfrac{2\left(q_1-q_3\right)}{k\lambda}\right)
\end{bmatrix},

where

q_1 = \lambda^3 + 7\lambda^2 + 11\lambda + 1,\quad q_2 = \lambda^2 + 7\lambda + 12,\quad q_3 = 2\lambda^2 + 9\lambda + 1,\quad q_4 = \lambda^2 + 6\lambda + 6,\quad q_5 = \lambda + 3,\quad q_6 = 2\lambda + 8.

Conditionally on β, all marker effects are estimated with a bias that involves all other markers (and the intercept as well). Since inferences on genetic architecture are primarily based on point estimates (it should be noted that the biased estimator is more precise), it is quite clear that such inferences are not "clean."
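The mixing of effects can be displayed directly by evaluating the matrix (X'X + Iλ)⁻¹X'X for the toy incidence matrix above (Python with NumPy sketch; the value of λ is arbitrary):

import numpy as np

X = np.array([[1.0, 0.0, -1.0, 1.0],
              [1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, -1.0]])
lam = 1.0                                          # arbitrary variance ratio for illustration
XtX = X.T @ X
M = np.linalg.solve(XtX + lam * np.eye(4), XtX)    # E[BLUP(beta) | beta] = M @ (a, b, c, d)'
np.set_printoptions(precision=3, suppress=True)
print(M)   # nonzero off-diagonal entries: each expectation blends all true effects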

Marker Effects Are Not Identified from a Bayesian Perspective in the n < p Setting

Let a Bayesian linear model consist of location parameters θ_A and θ_B (this partition has a different meaning from the one given above), with likelihood p(y|θ_A, θ_B). If the conditional posterior density of θ_B is such that

p\left(\theta_B\mid\theta_A, y\right) = p\left(\theta_B\mid\theta_A\right),

then θ_B is not identifiable, meaning that observation of data does not increase knowledge about θ_B beyond what is conveyed by the conditional prior p(θ_B|θ_A) (Dawid 1979; Gelfand and Sahu 1999). For the model in (1), in the n < p situation matrix X_{n×p} has rank n, and one can reorganize its columns into

X\beta = \begin{bmatrix} X_1 & X_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix},

where X₁ is n × n with rank n, and X₂ is n × (p − n), with the vector of marker effects β partitioned accordingly. Changing variables as

\begin{bmatrix} \theta_A \\ \theta_B \end{bmatrix} = \begin{bmatrix} X_1 & X_2 \\ 0 & I_{(p-n)\times(p-n)} \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}

produces the inverse transformations β₂ = θ_B and β₁ = X₁⁻¹(θ_A − X₂θ_B); because the transformation is linear, the Jacobian does not involve the parameters. Using the new parameterization, model (1) can now be written as

y = \theta_A + e,

implying that the data contain information about θ_A but not about θ_B (the latter can represent any marker effect, by construction). Then, irrespective of the joint prior distribution assigned to θ_A and θ_B, the posterior is

p\left(\theta_A, \theta_B\mid y\right) \propto p\left(y\mid\theta_A,\theta_B\right)p\left(\theta_B\mid\theta_A\right)p\left(\theta_A\right) \propto p\left(y\mid\theta_A\right)p\left(\theta_B\mid\theta_A\right)p\left(\theta_A\right),

so

p\left(\theta_B\mid\theta_A, y\right) = p\left(\theta_B\mid\theta_A\right),



verifying that the p − n marker effects are not likelihood identified. As pointed out by Gelfand and Sahu (1999), this does not mean that there is no Bayesian learning about θ_B. It means, however, that data "speak" about θ_A and that what can be said about θ_B depends on what has been spoken about θ_A, with the pipelining of knowledge done through the prior distribution. This can be seen more clearly by writing the posterior of θ_B as

p\left(\theta_B\mid y\right) = \int p\left(\theta_B\mid\theta_A, y\right)p\left(\theta_A\mid y\right)d\theta_A = \int p\left(\theta_B\mid\theta_A\right)p\left(\theta_A\mid y\right)d\theta_A = E_{p\left(\theta_A\mid y\right)}\left[p\left(\theta_B\mid\theta_A\right)\right].

This representation enables one to see that marginal inferences about individual marker effects are the weighted average of an infinite number of inferences made from the conditional prior p(θ_B|θ_A), where the averaging distribution is the posterior of the signal p(θ_A|y). If θ_B is any marker effect, say β_j, the preceding becomes

p\left(\beta_j\mid y\right) = \int p\left(\beta_j\mid X_1\beta_1\right)p\left(X_1\beta_1\mid y\right)d\left(X_1\beta_1\right).

In conclusion, for any letter of the alphabet and for any prior distribution adopted, any inference made about genetic architecture always depends on the form of p(β_j|X₁β₁) or, more generally, of p(θ_B|θ_A), and these densities depend on the prior adopted, but not on the data. Proper Bayesian learning takes place for X₁β₁ only.
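A tiny numerical sketch makes the lack of likelihood identification tangible (Python with NumPy; dimensions and data are arbitrary): two different marker-effect vectors with the same signal Xβ produce exactly the same likelihood, so only the prior can separate them:

import numpy as np

rng = np.random.default_rng(9)
n, p = 5, 12
X = rng.normal(size=(n, p))
beta1 = rng.normal(size=p)
null_dir = np.linalg.svd(X)[2][-1]            # a direction in the null space of X (possible because p > n)
beta2 = beta1 + 3.0 * null_dir                # different marker effects, identical signal
y = X @ beta1 + rng.normal(size=n)

def log_lik(beta, sigma2_e=1.0):
    r = y - X @ beta
    return -0.5 * float(r @ r) / sigma2_e

print(np.allclose(X @ beta1, X @ beta2))      # True: theta_A = X beta is unchanged
print(log_lik(beta1), log_lik(beta2))         # identical log-likelihoods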

Inferences in a Linear Model with Unidentified Parameters

In the context of model (1), the likelihood function (assuming known σ²_e) is

l\left(\beta\mid y,\sigma^2_e\right) \propto \exp\!\left(-\frac{(y-X\beta)'(y-X\beta)}{2\sigma^2_e}\right).

For the n < p situation, and with β^(0) being a solution to the normal equations corresponding to generalized inverse (X'X)⁻, the (singular) likelihood is expressible as

l\left(\beta\mid y,\sigma^2_e\right) \propto \exp\!\left(-\frac{\left(\beta-\beta^{(0)}\right)'X'X\left(\beta-\beta^{(0)}\right)}{2\sigma^2_e}\right).

Letting r = rank(X) and using results from linear model theory (if n ≤ p, then r ≤ n), it follows that

X_{n\times p}\beta_{p\times 1} = \left(X_{n\times p}Q_{1,\,p\times r}\right)\left(L_{r\times p}\beta_{p\times 1}\right) + \left(X_{n\times p}Q_{2,\,p\times(p-r)}\right)\left(H_{(p-r)\times p}\beta_{p\times 1}\right) = K_1\alpha_1 + K_2\alpha_2,

where Q₁ and Q₂ are partitions of a p × p matrix of rank-preserving elementary operators (Searle 1966); α₁ = Lβ is an r × 1 vector of likelihood-identified estimable functions and α₂ = Hβ is a (p − r) × 1 vector of pseudoparameters; K₁ = XQ₁ and K₂ = XQ₂ are incidence matrices, with K₂ = 0 (α₂ is a pseudoparameter, because it is effectively wiped out of the model). The genetic signal is given by K₁α₁, but we include α₂ as well, to see what Bayesian inference does for something on which the data lack information.

If β is assigned the normal prior N(β|0, V_β),

\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}\Big|\,V_\beta \sim N\!\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} LV_\beta L' & LV_\beta H' \\ HV_\beta L' & HV_\beta H' \end{bmatrix}\right).   (A.1)

The model is now y = K₁α₁ + K₂α₂ + e, and the likelihood under the new parameterization becomes

l\left(\alpha_1,\alpha_2\mid y,\sigma^2_e\right) \propto \exp\!\left(-\frac{\begin{bmatrix}\left(\alpha_1-\alpha_1^{(0)}\right)' & \left(\alpha_2-\alpha_2^{(0)}\right)'\end{bmatrix}\begin{bmatrix} K_1'K_1 & K_1'K_2 \\ K_2'K_1 & K_2'K_2 \end{bmatrix}\begin{bmatrix} \alpha_1-\alpha_1^{(0)} \\ \alpha_2-\alpha_2^{(0)} \end{bmatrix}}{2\sigma^2_e}\right)   (A.2)



= \exp\!\left(-\frac{\left(\alpha_1-\alpha_1^{(0)}\right)'K_1'K_1\left(\alpha_1-\alpha_1^{(0)}\right)}{2\sigma^2_e}\right).   (A.3)

Expression (A.3) indicates that, at most, r parameters are likelihood identified, but (A.2) is retained to illustrate what the prior does. It is well known (e.g., Gianola and Fernando 1986; Sorensen and Gianola 2002) that combining (A.1) with (A.2) leads to the posterior distribution

\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}\Big|\,y,V_\beta,\sigma^2_e \sim N\!\left(\begin{bmatrix} \tilde{\alpha}_1 \\ \tilde{\alpha}_2 \end{bmatrix},\ \begin{bmatrix} K_1'K_1+V^{11} & K_1'K_2+V^{12} \\ K_2'K_1+V^{21} & K_2'K_2+V^{22} \end{bmatrix}^{-1}\sigma^2_e\right),   (A.4)

where

\begin{bmatrix} \tilde{\alpha}_1 \\ \tilde{\alpha}_2 \end{bmatrix} = \begin{bmatrix} K_1'K_1+\sigma^2_eV^{11} & K_1'K_2+\sigma^2_eV^{12} \\ K_2'K_1+\sigma^2_eV^{21} & K_2'K_2+\sigma^2_eV^{22} \end{bmatrix}^{-1}\begin{bmatrix} K_1'y \\ K_2'y \end{bmatrix} = \begin{bmatrix} K_1'K_1+\sigma^2_eV^{11} & \sigma^2_eV^{12} \\ \sigma^2_eV^{21} & \sigma^2_eV^{22} \end{bmatrix}^{-1}\begin{bmatrix} K_1'y \\ 0 \end{bmatrix},

since K₂ = 0, and where

\begin{bmatrix} V^{11} & V^{12} \\ V^{21} & V^{22} \end{bmatrix} = \begin{bmatrix} LV_\beta L' & LV_\beta H' \\ HV_\beta L' & HV_\beta H' \end{bmatrix}^{-1}.

The p-dimensional distribution (A.4) is nonsingular, but it is based on a likelihood that is defined in r dimensions only! Note that the posterior mean satisfies

\begin{bmatrix} K_1'K_1+\sigma^2_eV^{11} & \sigma^2_eV^{12} \\ \sigma^2_eV^{21} & \sigma^2_eV^{22} \end{bmatrix}\begin{bmatrix} \tilde{\alpha}_1 \\ \tilde{\alpha}_2 \end{bmatrix} = \begin{bmatrix} K_1'y \\ 0 \end{bmatrix}.   (A.5)

The coefficient matrix in (A.5) is the counterpart of X'X and is proportional to the negative of the matrix of second derivatives of the log-posterior with respect to α₁ and α₂. This shows that proper Bayesian learning takes place only for α₁, as the information about α₂ and the co-information about α₁ and α₂ come from the prior only. Further, note the relationship

\tilde{\alpha}_2 = -\left(V^{22}\right)^{-1}V^{21}\tilde{\alpha}_1,   (A.6)

indicating that what is learned about α₂ is solely a function of what is learned about α₁. This is verified by inserting relationship (A.6) in Equations A.5 above, giving

\left(K_1'K_1 + \sigma^2_eV^{11} - \sigma^2_eV^{12}\left(V^{22}\right)^{-1}V^{21}\right)\tilde{\alpha}_1 = K_1'y.

Using properties of inverses of partitioned matrices, V_{11}^{-1} = V^{11} − V^{12}(V^{22})^{-1}V^{21}, so that

\tilde{\alpha}_1 = \left(K_1'K_1 + \sigma^2_eV_{11}^{-1}\right)^{-1}K_1'y.   (A.7)

The preceding confirms that the data inform about α₁ but not about α₂; what is learned about the latter from phenotypes is done indirectly, through α₁. Such an "indirect" inference parallels the concept of "prediction of breeding values of individuals without phenotypes" (Henderson 1977). In the molecular markers setting, n linear combinations of markers are learned from the data, but p − n remain at the mercy of the prior. In other words, one does not clearly know what marker effects are being learned from the data, unless the model is parameterized deliberately. This is clearly shown in the first section of the Appendix.

An Example of Proper Bayesian Learning

To illustrate a case of proper Bayesian learning where a connection with genomic BLUP arises, consider inferring the signal g = Xβ. This is likelihood identified (estimable) because E(y|Xβ) = g, and the likelihood is



l\left(g\mid y\right) \propto \exp\!\left(-\frac{(g-y)'(g-y)}{2\sigma^2_e}\right),

with a maximum at ĝ = y. The information matrix is

E_{y\mid g,\sigma^2_e}\!\left[\frac{\partial^2}{\partial g\,\partial g'}\,\frac{(g-y)'(g-y)}{2\sigma^2_e}\right] = \frac{1}{\sigma^2_e}\,I_{n\times n},

meaning that, for each individual signal, the information content is proportional to what is conveyed by a sample of size n = 1 (if the response variates are means of preprocessed data, the information content will be higher). For the prior N(β|0, V_β), the resulting prior for the signal is g|V_β ~ N(0, XV_βX'), and standard results for Bayesian inference give g|y, V_β, σ²_e ~ N(g̃, V_g) as the posterior distribution, where

\tilde{g} = \left[\frac{1}{\sigma^2_e}I + \left(XV_\beta X'\right)^{-1}\right]^{-1}\frac{y}{\sigma^2_e}

and

V_g = \left[\frac{1}{\sigma^2_e}I + \left(XV_\beta X'\right)^{-1}\right]^{-1}.

A special case is when V_β = I_pσ²_β, so that for λ = σ²_e/σ²_β being the variance ratio, g̃ = [I + (XX')⁻¹λ]⁻¹y and V_g = [I + (XX')⁻¹λ]⁻¹σ²_e. Using well-established results known from prediction of random variables dating back to Henderson (1977) but rediscovered recently (e.g., Janss et al. 2012), one can easily find the posterior distribution of β from that of g, and vice versa. Here, take α₁ = Xβ, so that α₁ ~ N_n(0, XX'σ²_β) and α̃₁ = X̃β = g̃, which is known as genomic BLUP. Any marker effect can be learned indirectly from g̃ using standard BLUP theory as β̃ = X'(XX')⁻¹g̃.
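These equivalences are easy to verify numerically; the sketch below (Python with NumPy; dimensions, data, and λ are arbitrary) computes g̃ from the n × n system, the ridge regression–BLUP of β from the p × p system, and checks that both routes agree:

import numpy as np

rng = np.random.default_rng(10)
n, p, lam = 20, 80, 5.0                                    # n < p; lam = sigma2_e / sigma2_beta (assumed)
X, y = rng.normal(size=(n, p)), rng.normal(size=n)
XXt = X @ X.T
g_tilde = np.linalg.solve(np.eye(n) + lam * np.linalg.inv(XXt), y)    # genomic BLUP of the signal
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)      # ridge regression-BLUP of beta
beta_from_g = X.T @ np.linalg.solve(XXt, g_tilde)                     # beta_tilde = X'(XX')^(-1) g_tilde
print(np.allclose(g_tilde, X @ beta_ridge), np.allclose(beta_from_g, beta_ridge))   # True, True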

Mode of the Conditional Posterior Distribution in Bayes A

Taking logs of (7) yields

L = \log\left[p\left(\beta\mid S^2_\beta,\nu,\sigma^2_e,y\right)\right] = -\frac{1}{2\sigma^2_e}\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2 - \frac{1+\nu}{2}\sum_{j=1}^{p}\log\!\left[1 + \frac{\beta_j^2}{S^2_\beta\nu}\right].   (A.8)

The gradient vector is

\frac{\partial L}{\partial\beta} = \frac{1}{\sigma^2_e}\sum_{i=1}^{n}x_i\left(y_i - x_i'\beta\right) - \frac{1+\nu}{S^2_\beta\nu}\left\{\frac{\beta_j}{1 + \beta_j^2/\left(S^2_\beta\nu\right)}\right\}_{j=1,\ldots,p} = \frac{1}{\sigma^2_e}X'y - \frac{1}{\sigma^2_e}\left(X'X + W_\beta\right)\beta,   (A.9)

where X'y = \left\{\sum_{i=1}^{n}x_iy_i\right\}, X'X = \sum_{i=1}^{n}x_ix_i', and



W_\beta = \mathrm{Diag}\!\left\{\frac{\sigma^2_e}{S^2_\beta}\,\frac{1 + 1/\nu}{1 + \beta_j^2/\left(S^2_\beta\nu\right)}\right\}.

Setting the gradient to zero, to satisfy the first-order condition, leads to

\left(X'X + W_\beta\right)\beta = X'y.

This system is not explicit in β (because marker effects appear nonlinearly in W_β), but a functional iteration can be developed to locate stationary points.
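One such functional iteration is sketched below (Python with NumPy; data and hyperparameters are simulated or assumed for illustration only): W_β is recomputed from the current β and the linear system is re-solved until the solution stabilizes:

import numpy as np

rng = np.random.default_rng(11)
n, p = 100, 300
X = rng.normal(size=(n, p))
y = X @ rng.normal(0.0, 0.1, size=p) + rng.normal(size=n)
sigma2_e, S2b, nu = 1.0, 0.01, 4.0                        # assumed hyperparameters
XtX, Xty = X.T @ X, X.T @ y
beta = np.zeros(p)
for _ in range(200):
    w = (sigma2_e / S2b) * (1.0 + 1.0 / nu) / (1.0 + beta ** 2 / (S2b * nu))   # diagonal of W_beta
    beta_new = np.linalg.solve(XtX + np.diag(w), Xty)
    if np.max(np.abs(beta_new - beta)) < 1e-8:            # stop when the fixed point is reached
        beta = beta_new
        break
    beta = beta_new
print(beta[:5])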

Mode of the Conditional Posterior Distribution in Bayes L

As a side note, consider what happens if it is not ignored that W_β⁻¹ = Diag{1/|β_j|} is a random matrix, contrary to what was done by Tibshirani (1996) in a modal representation of Bayes L. Recalling that |β_j| = β_j²/|β_j| and that d|x|/dx = sign(x),

\frac{\partial}{\partial\beta_j}|\beta_j| = \frac{\partial}{\partial\beta_j}\!\left(\frac{\beta_j^2}{|\beta_j|}\right) = \frac{2\beta_j}{|\beta_j|} - \frac{\beta_j^2}{|\beta_j|^2}\,\mathrm{sign}\left(\beta_j\right) = \frac{2\beta_j}{|\beta_j|} - \mathrm{sign}\left(\beta_j\right).

Differentiating (11) with respect to β,

\frac{\partial L\left(\beta\mid y,\lambda,\sigma^2_e\right)}{\partial\beta} = -\frac{\dfrac{\partial}{\partial\beta}(y-X\beta)'(y-X\beta) + \sigma^2_e\lambda\,\dfrac{\partial}{\partial\beta}\sum_{j=1}^{p}|\beta_j|}{2\sigma^2_e} = -\frac{1}{2\sigma^2_e}\left[-2X'(y-X\beta) + 2\sigma^2_e\lambda W_\beta^{-1}\beta - \sigma^2_e\lambda s_\beta\right],

where s_β is a vector containing the signs of the elements of β. Here, the first-order condition would lead to the iteration

\left(X'X + \sigma^2_e\lambda W_{\beta[t]}^{-1}\right)\beta^{[t+1]} = X'y + \frac{\sigma^2_e\lambda}{2}\,s_\beta^{[t]}.

Approximation of an Integral in Bayes L

The integral in (16) can be approximated using a second-order expansion around λ² = r/δ such that (ignoring the subscript in β_j)

\exp\!\left[-|\beta|\sqrt{\lambda^2}\right] \approx e^{-|\beta|\sqrt{r/\delta}}\left[1 - \frac{1}{2}\sqrt{\frac{\delta}{r}}\,|\beta|\left(\lambda^2 - \frac{r}{\delta}\right) + \frac{\delta}{8r}\left(|\beta|^2 + \sqrt{\frac{\delta}{r}}\,|\beta|\right)\left(\lambda^2 - \frac{r}{\delta}\right)^2\right].

Use of this in (16) produces

\int_0^{\infty}\left(\lambda^2\right)^{r+\frac{1}{2}-1}\exp\!\left[-\left(|\beta|\sqrt{\lambda^2} + \delta\lambda^2\right)\right]d\lambda^2 = \int_0^{\infty}\exp\!\left[-|\beta|\sqrt{\lambda^2}\right]\left(\lambda^2\right)^{r+\frac{1}{2}-1}\exp\!\left(-\delta\lambda^2\right)d\lambda^2

\approx e^{-|\beta|\sqrt{r/\delta}}\int_0^{\infty}\left(\lambda^2\right)^{r+\frac{1}{2}-1}\exp\!\left(-\delta\lambda^2\right)d\lambda^2 - e^{-|\beta|\sqrt{r/\delta}}\,\frac{1}{2}\sqrt{\frac{\delta}{r}}\,|\beta|\int_0^{\infty}\left(\lambda^2 - \frac{r}{\delta}\right)\left(\lambda^2\right)^{r+\frac{1}{2}-1}\exp\!\left(-\delta\lambda^2\right)d\lambda^2

+ e^{-|\beta|\sqrt{r/\delta}}\,\frac{\delta}{8r}\left(|\beta|^2 + \sqrt{\frac{\delta}{r}}\,|\beta|\right)\int_0^{\infty}\left(\lambda^2 - \frac{r}{\delta}\right)^2\left(\lambda^2\right)^{r+\frac{1}{2}-1}\exp\!\left(-\delta\lambda^2\right)d\lambda^2.

Note that (λ²)^{r+1/2−1}exp(−δλ²) is the kernel of a G(r + 1/2, δ) distribution, so that



\int_0^{\infty}\left(\lambda^2\right)^{r+\frac{1}{2}-1}\exp\!\left(-\delta\lambda^2\right)d\lambda^2 = \left[\frac{\delta^{r+\frac{1}{2}}}{\Gamma\!\left(r+\frac{1}{2}\right)}\right]^{-1},

\int_0^{\infty}\left(\lambda^2 - \frac{r}{\delta}\right)\left(\lambda^2\right)^{r+\frac{1}{2}-1}\exp\!\left(-\delta\lambda^2\right)d\lambda^2 = \left[\frac{\delta^{r+\frac{1}{2}}}{\Gamma\!\left(r+\frac{1}{2}\right)}\right]^{-1}\frac{1}{2\delta},

and

\int_0^{\infty}\left(\lambda^2 - \frac{r}{\delta}\right)^2\left(\lambda^2\right)^{r+\frac{1}{2}-1}\exp\!\left(-\delta\lambda^2\right)d\lambda^2 = \left[\frac{\delta^{r+\frac{1}{2}}}{\Gamma\!\left(r+\frac{1}{2}\right)}\right]^{-1}\frac{4r+3}{4\delta^2}.
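These expectations can be used to assess the quality of the second-order approximation numerically; the sketch below (Python with NumPy/SciPy; r, δ, and |β| are arbitrary test values) compares the exact integral with the approximation assembled from the three results above, and the two numbers should be close for moderate |β|:

import numpy as np
from scipy import integrate
from scipy.special import gamma as G

r, d, b = 4.0, 4.0, 0.3                                    # test values for r, delta and |beta|
exact, _ = integrate.quad(lambda t: t ** (r - 0.5) * np.exp(-(b * np.sqrt(t) + d * t)), 0.0, np.inf)
C = G(r + 0.5) / d ** (r + 0.5)                            # integral of the Gamma(r + 1/2, delta) kernel
approx = np.exp(-b * np.sqrt(r / d)) * C * (
    1.0
    - 0.5 * np.sqrt(d / r) * b / (2.0 * d)
    + (d / (8.0 * r)) * (b ** 2 + np.sqrt(d / r) * b) * (4.0 * r + 3.0) / (4.0 * d ** 2)
)
print(exact, approx)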


