Page 1: Statistical Methods for Genome-wide Association Studies and Personalized Medicinepage/liu_thesis.pdf · 2014-08-12 · Statistical Methods for Genome-wide Association Studies and

Statistical Methods for Genome-wide Association Studies and Personalized Medicine

by

Jie Liu

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy (Computer Sciences)

at the UNIVERSITY OF WISCONSIN-MADISON

2014

Date of final oral examination: 05/16/14 (9am)
Room for final oral examination: CS 4310

Committee in charge:

C. David Page Jr., Professor, Biostatistics and Medical Informatics
Xiaojin Zhu, Associate Professor, Computer Sciences
Jude Shavlik, Professor, Computer Sciences
Elizabeth Burnside, Associate Professor, Radiology
Chunming Zhang, Professor, Statistics


Abstract

In genome-wide association studies (GWAS), researchers analyze the genetic variation across the entire human genome, searching for variations that are associated with observable traits or certain diseases. GWAS pose several inference challenges, including the huge number of genetic markers to test, the weak association between the truly associated markers and the traits, and the correlation structure among the genetic markers. This thesis develops statistical methods suitable for genome-wide association studies and their clinical translation for personalized medicine.

After introducing background and related work in Chapters 1 and 2, we discuss the problem of high-dimensional statistical inference, especially capturing the dependence among multiple hypotheses, which has been under-utilized in classical multiple testing procedures. Chapter 3 proposes a feature selection approach based on a unique graphical model which can leverage the correlation structure among the markers. This graphical model-based feature selection approach significantly outperforms the conventional feature selection methods used in GWAS. Chapter 4 reformulates this feature selection approach as a multiple testing procedure with many elegant properties, including controlling the false discovery rate at a specified level and significantly improving the power of the tests by leveraging dependence. To relax the parametric assumption within the graphical model, Chapter 5 proposes a semiparametric graphical model for multiple testing under dependence, which estimates f1 adaptively. This semiparametric approach still effectively captures the dependence among multiple hypotheses, yet no longer requires us to specify the parametric form of f1. It exactly generalizes the local FDR procedure [38] and connects with the BH procedure [12].

These statistical inference methods are based on graphical models, whose parameter learning is difficult due to the intractable normalization constant. Capturing the hidden patterns and heterogeneity within the parameters is even harder. Chapters 6 and 7 discuss the problem of learning large-scale graphical models, especially dealing with heterogeneous parameters and latently-grouped parameters. Chapter 6 proposes a nonparametric approach which can adaptively integrate, during parameter learning, background knowledge about how the different parts of the graph can vary. For learning latently-grouped parameters in undirected graphical models, Chapter 7 imposes Dirichlet process priors over the parameters and estimates the parameters in a Bayesian framework. The estimated model generalizes significantly better than standard maximum likelihood estimation.

Chapter 8 explores the potential translation of GWAS discoveries to clinical breast cancer diagnosis. With support from the Wisconsin Genomics Initiative, we genotyped a breast cancer cohort at Marshfield Clinic and collected the corresponding diagnostic mammograms. We discovered that, using SNPs known to be associated with breast cancer, we can better stratify patients and thereby significantly reduce false positives during breast cancer diagnosis, alleviating the risk of overdiagnosis. This result suggests that when radiologists make medical decisions from mammograms (such as recommending follow-up biopsies), they can consider these risk SNPs for more accurate decisions if the patients' genotype data are available.


Contents

Abstract i

1 Introduction 1

1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Related Work 7

2.1 Hypothesis Testing for Case-control Association Studies . . . . . . . . . . . . . 7

2.1.1 Single-marker Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.2 Parametric Multiple-marker Methods . . . . . . . . . . . . . . . . . . 15

2.1.3 Nonparametric Multiple-marker Methods . . . . . . . . . . . . . . . 16

2.2 Multiple Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2.1 Error Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2.2 P-value Thresholding Methods . . . . . . . . . . . . . . . . . . . . . 19

2.2.3 Local False Discovery Rate Methods . . . . . . . . . . . . . . . . . . . 20

2.2.4 Local Significance Index Methods . . . . . . . . . . . . . . . . . . . . . 21

2.3 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.3.1 Maximum Likelihood Parameter Learning . . . . . . . . . . . . . . . . . 23

2.3.2 Bayesian Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . 29



2.3.3 Inference Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.4 Feature and Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3 High-Dimensional Structured Feature Screening Using Markov Random Fields 35

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.2.1 Feature Relevance Network . . . . . . . . . . . . . . . . . . . . . . . . 38

3.2.2 The Construction Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2.3 The Inference Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.2.4 Related Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.3 Simulation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4 Real-world Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.4.2 Experiments on CGEMS Data . . . . . . . . . . . . . . . . . . . . . . . 49

3.4.3 Validating Findings on Marshfield Data . . . . . . . . . . . . . . . . . . 51

3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4 Multiple Testing under Dependence via Parametric Graphical Models 55

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.2.1 Terminology and Previous Work . . . . . . . . . . . . . . . . . . . . . . 57

4.2.2 The Multiple Testing Procedure . . . . . . . . . . . . . . . . . . . . . . 58

4.2.3 Posterior Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.2.4 Parameters and Parameter Learning . . . . . . . . . . . . . . . . . . . . 60

4.3 Basic Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.4 Simulations on Genetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.5 Real-world Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74


5 Multiple Testing under Dependence via Semiparametric Graphical Models 76

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.3.1 Graphical models for Multiple Testing . . . . . . . . . . . . . . . . . . . 80

5.3.2 Nonparametric Estimation of f1 . . . . . . . . . . . . . . . . . . . . . . 81

5.3.3 Parametric Estimation of φ and π . . . . . . . . . . . . . . . . . . . . . 82

5.3.4 Inference of θ and FDR Control . . . . . . . . . . . . . . . . . . . . . . 83

5.4 Connections with Classical Multiple Testing Procedures . . . . . . . . . . . . . 84

5.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.6 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6 Learning Heterogeneous Hidden Markov Random Fields 94

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6.2.1 HMRFs And Homogeneity Assumption . . . . . . . . . . . . . . . . . . 96

6.2.2 Heterogeneous HMRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6.3 Parameter Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6.3.1 Contrastive Divergence for MRFs . . . . . . . . . . . . . . . . . . . . . 98

6.3.2 Expectation-Maximization for Learning Conventional HMRFs . . . . . . 99

6.3.3 Learning Heterogeneous HMRFs . . . . . . . . . . . . . . . . . . . . . 102

6.3.4 Geometric Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . 103

6.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.5 Real-world Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113


7 Bayesian Estimation of Latently-grouped Parameters in Graphical Models 115

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

7.2 Maximum Likelihood Estimation and Bayesian Estimation for MRFs . . . . . . 117

7.3 Bayesian Parameter Estimation for MRFs with Dirichlet Process Prior . . . . . . 118

7.3.1 Metropolis-Hastings (MH) with Auxiliary Variables . . . . . . . . . . . 119

7.3.2 Gibbs Sampling with Stripped Beta Approximation . . . . . . . . . . . . 123

7.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

7.4.1 Simulations on Tree-structure MRFs . . . . . . . . . . . . . . . . . . . . 128

7.4.2 Simulations on Small Grid-MRFs . . . . . . . . . . . . . . . . . . . . . 128

7.4.3 Simulations on Large Grid-MRFs . . . . . . . . . . . . . . . . . . . . . 132

7.5 Real-world Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

7.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

8 Genetic Variants Improve Personalized Breast Cancer Diagnosis 138

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

8.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

8.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

8.2.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

8.3.1 Performance of Combined Models . . . . . . . . . . . . . . . . . . . . . 145

8.3.2 Performance of Genetic Models . . . . . . . . . . . . . . . . . . . . . . 147

8.3.3 Comparing Breast Imaging Model and Genetic Model . . . . . . . . . . 147

8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

9 Future Work 151


Chapter 1

Introduction

1.1 Background

The human genome project, which was completed in 2003, made it possible for us, for the first time, to read the complete genetic blueprint of human beings. Since then, researchers have looked into the germline genetic variants that are associated with heritable diseases and traits in humans, an effort known as genome-wide association studies (GWAS). GWAS analyze the genetic variation across the entire human genome, searching for variations that are associated with observable traits or certain diseases. In machine learning terminology, an example in GWAS is typically a human, the response variable is a disease such as breast cancer, and the features (or variables) are the single positions in the genome where individuals can vary, known as single-nucleotide polymorphisms (SNPs). The primary goal in GWAS is to identify all the SNPs that are relevant to the diseases or the observable traits.

GWAS are characterized by high dimension. The human genome has roughly 3 billion positions, roughly 3 million of which are SNPs. State-of-the-art technology enables measurement of a million SNPs in one experiment for a cost of hundreds of US dollars. Although this means the full set of known SNPs cannot be measured in one experiment at present, SNPs that are close together on the genome are often highly correlated. Hence the omission of some SNPs is not as much of a problem as one might first think. Instead, we have the problem of strong correlation among our features: most SNPs are very highly correlated with one or more nearby SNPs, with squared Pearson correlation coefficients well above 0.8.

Another problem making GWAS especially challenging is weak association: the truly relevant markers are very rare and only weakly associated with the response variable. The first reason is that most diseases have both a genetic and an environmental component. Because of the environmental component, we cannot expect to achieve anywhere near 100% accuracy in GWAS. For example, it is estimated that genetics accounts for only about 27% of breast cancer risk [102]. Therefore, given equal numbers of breast cancer patients and controls without breast cancer, the highest predictive accuracy we can reasonably expect from genetic features alone is about 63.5%, obtainable by correctly predicting the controls and correctly recognizing 27% of the cancer cases based on genetics. Furthermore, breast cancer and many other diseases are polygenic, so the genetic component is spread over multiple genes. Based on these two observations, we expect the contribution from any one feature (SNP) toward predicting disease to be quite small.¹ Indeed, one published study [82] identified only 4 SNPs associated with breast cancer. When the most strongly associated SNP (rs1219648) is tested for its predictive accuracy on the same training set from which it was identified (almost certainly yielding an overly optimistic accuracy estimate), the model based on this SNP is only 53% accurate, whereas majority-class or uniform random guessing is 50% accurate. Lending further credibility, another published study [33] on breast cancer identified 11 SNPs from a different dataset. The individual odds ratios reported for the 11 SNPs are around 0.95 to 1.26, and most of them were not identified as significant in the former study [82]. Therefore, for breast cancer and other diseases, we expect the signal from each relevant feature to be very weak.
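As a quick check of the 63.5% figure above (a sketch; the 27% heritable fraction is the estimate cited from [102]):

```python
def max_genetic_accuracy(genetic_fraction):
    """Upper bound on balanced case/control accuracy from genetics alone:
    all controls predicted correctly (0.5) plus the genetically
    recognizable share of the cases (0.5 * genetic_fraction)."""
    return 0.5 + 0.5 * genetic_fraction

print(max_genetic_accuracy(0.27))  # about 0.635, i.e. 63.5%
```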

The combination of high dimension and weak association makes it extremely difficult to detect the truly associated genetic markers. Suppose a truly relevant genetic marker is weakly associated with the class variable. If its odds ratio is around 1.2, given one thousand cancer cases and one thousand controls, this marker will not look significantly different between cases and controls, that is, among examples of different classes. At the same time, if we have an extremely large number of features and relatively little data, many irrelevant markers may look better than this relevant marker by chance alone, especially given even a modest level of noise as occurs in GWAS. Related work [187] provides a formula to assess the false positive report probability (FPRP), the probability of no true association between a genetic variant and disease given a statistically significant finding. If we assume there are around 1,000 truly associated SNPs out of the total 500,000 and keep the significance level at 0.05, the FPRP will be around 99%. This means almost all the selected features in this case are false positives.

¹ Rare alleles for a few SNPs, such as those in the BRCA1 and BRCA2 genes, have a large effect but are very rare. Others that are common have only a weak effect.
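The FPRP figure can be reproduced with a short sketch. The formula below is the standard one (the expected share of significant results that come from null SNPs); since the power assumed by [187] is not restated here, the 50% value is an illustrative assumption, and lower power pushes the FPRP closer to the quoted 99%.

```python
def fprp(alpha, power, prior):
    """False positive report probability: P(no true association | significant)
    = alpha*(1-prior) / (alpha*(1-prior) + power*prior)."""
    significant_nulls = alpha * (1.0 - prior)    # significant results from null SNPs
    significant_signals = power * prior          # significant results from true signals
    return significant_nulls / (significant_nulls + significant_signals)

# 1,000 truly associated SNPs out of 500,000 at significance level 0.05;
# the 50% power value is an illustrative assumption.
print(fprp(alpha=0.05, power=0.5, prior=1000 / 500000))  # roughly 0.98
```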

Hypothesis testing is one important statistical inference method for genetic association analysis, since one can simply test the significance of association between one genetic marker and the response variable. However, in GWAS there are usually hundreds of thousands of genetic markers to test at the same time. Suppose that we have genotyped a total of m SNPs and performed m tests simultaneously, with each test applying to one genetic marker. In such a multiple testing situation, we can categorize the results from the m tests as in Table 1.1. One important criterion, the false discovery rate (FDR), defined as E(N10/R | R > 0)P(R > 0), depicts the expected proportion of incorrectly rejected null hypotheses (type I errors) among the rejections. Another criterion, the false non-discovery rate (FNR), defined as E(N01/S | S > 0)P(S > 0), depicts the expected proportion of incorrectly retained null hypotheses (type II errors) among the non-rejections.

             H0 not rejected    H0 rejected    Total
H0 true      N00                N10            m0
H0 false     N01                N11            m1
Total        S                  R              m

Table 1.1: The classification of tested hypotheses.
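For concreteness, the classical Benjamini-Hochberg step-up procedure [12], referenced above, controls FDR at level α for independent p-values. A minimal stdlib-Python sketch (the p-values in the usage line are made up for illustration):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: reject H0 for every p-value at
    or below the largest sorted p_(k) satisfying p_(k) <= (k/m) * alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
print(benjamini_hochberg(pvals, alpha=0.05))
# -> [True, True, False, False, False, False, False, False]
```

Note the step-up character: every p-value up to the largest qualifying rank is rejected, even if some intermediate p-value misses its own threshold.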

A multiple testing procedure is termed valid if it controls FDR at the prespecified level α, and is termed optimal if it also has the smallest FNR among all valid procedures at level α. Most FDR-controlling procedures focus on the validity issue and assume the tests are independent. However, in GWAS the tests for highly correlated SNPs are dependent due to the linkage disequilibrium between them.

On the clinical translation side, high hopes for using genetic profiling for personalized medicine have been driven, in part, by the rapid progress of genome-wide association studies, which continue to identify more common genetic variants associated with diseases of high population prevalence. At the same time, large multi-relational databases containing variables that are informative of disease risk are increasingly available, providing the opportunity for informatics tools to better stratify individuals for appropriate healthcare decisions and to explore disease mechanism and behavior. Coincident with this, policy-makers have recommended that interventions, like breast cancer screening with mammography, be increasingly based on individualized risk and shared decision-making [132, 158]. The opportunity to use these data to interpret genetic/phenotype associations, explain familial aggregation of heritable diseases, and shed light on disease mechanism or natural history is just becoming possible.

1.2 Contributions

The first contribution of this thesis is in the area of high-dimensional statistical inference, especially dealing with the dependence among multiple hypotheses, which has been ignored or under-utilized in classical multiple testing procedures. This line of work is motivated by a real-world genome-wide association study (GWAS) on breast cancer. With NCI's CGEMS dataset [82], which contains 528,173 genetic markers (single-nucleotide polymorphisms, or SNPs) for 1,145 patients and 1,142 controls, the goal is to identify the genetic markers that are associated with breast cancer. We propose a feature selection approach based on a unique graphical model which can leverage the correlation structure among the markers. This graphical model-based feature selection approach significantly outperforms the conventional feature selection methods used in GWAS. The method can be further reformulated as a multiple testing procedure with many elegant properties, including controlling the false discovery rate at a specified level, significantly improving the power of the tests by leveraging dependence, and generalizing classical multiple testing procedures such as the Benjamini-Hochberg procedure [12] and the local FDR procedure [38].

The second contribution of this thesis is in the area of learning large-scale graphical models, especially dealing with issues of latently-grouped parameters and heterogeneous parameters. This contribution is motivated by the need for efficient, effective parameter learning in our aforementioned graphical model-based inference approaches. Parameter learning of undirected graphical models is difficult due to the intractable normalization constant, and capturing the hidden patterns and heterogeneity within the parameters is even harder. For learning latently-grouped parameters in undirected graphical models, we impose Dirichlet process priors over the parameters and estimate the parameters in a Bayesian framework. The estimated model generalizes significantly better than standard maximum likelihood estimation. We also propose a nonparametric approach which can adaptively integrate, during parameter learning, background knowledge about how the different parts of the graph can vary.

Last but not least, the thesis also explores the potential translation of GWAS discoveries to clinical breast cancer diagnosis. With support from the Wisconsin Genomics Initiative, we genotyped a breast cancer cohort at Marshfield Clinic and collected the corresponding diagnostic mammograms. We discovered that, using SNPs known to be associated with breast cancer, we can better stratify patients and thereby significantly reduce false positives during breast cancer diagnosis, alleviating the risk of overdiagnosis. This result suggests that when radiologists make medical decisions from mammograms (such as recommending follow-up biopsies), they can consider these risk SNPs for more accurate decisions if the patients' genotype data are available.

1.3 Thesis Statement

The dependence in multiple testing can be effectively captured by a Markov-random-field-coupled mixture model (a.k.a. hidden Markov random field), with FDR controlled at a nominal level and FNR reduced significantly. The hidden pattern among the Markov random fields can be recovered during parameter learning with a Bayesian estimation approach. The heterogeneity in the hidden Markov random fields can also be captured by a nonparametric method. Using SNPs known to be associated with breast cancer, we can stratify breast cancer patients at the time of mammography, and thereby significantly reduce false positives during breast cancer diagnosis, alleviating the risk of overdiagnosis.


Chapter 2

Related Work

This thesis covers many topics that are related to multiple hypothesis testing, graphical models and variable selection. This chapter summarizes the related work as follows. Section 2.1 reviews a variety of hypothesis testing procedures used in the GWAS community, including the single-marker methods in Subsection 2.1.1, the parametric multiple-marker methods in Subsection 2.1.2, and the nonparametric multiple-marker methods in Subsection 2.1.3. Section 2.2 further summarizes many aspects of multiple testing procedures, including the evaluation criteria in Subsection 2.2.1 and different types of procedures in Subsections 2.2.2, 2.2.3, and 2.2.4. Section 2.3 summarizes related work on graphical models, including maximum likelihood estimation in Subsection 2.3.1, Bayesian estimation in Subsection 2.3.2 and inference algorithms in Subsection 2.3.3. Since the proposed method is also related to variable selection, relevant approaches are summarized in Section 2.4.

2.1 Hypothesis Testing for Case-control Association Studies

2.1.1 Single-marker Methods

In a case-control genetic association study, single-marker analysis, which tests the association between the response variable and an individual SNP, is often used. In such a hypothesis test, the null hypothesis is that the SNP is not associated with the response variable. The alternative hypothesis is that the SNP is associated. Assume that there are r cases and s controls in a case-control genetic association study, and that there are two alleles, G and g, at a given SNP locus with three possible genotypes, namely gg, Gg and GG. Further assume that there are no missing values and we can observe the genotype counts as in Table 2.1. Approaches that perform hypothesis testing on genotype counts are called genotype-based methods. The following subsections discuss several typical genotype-based methods and the connections between them. Those genotype-based methods include

• Genotype-based Pearson's χ2 test

• Cochran-Armitage's trend test

• Likelihood-ratio test, Wald test, and score test with logistic regression

Genotypes    gg    Gg    GG    Total
Case         r0    r1    r2    r
Control      s0    s1    s2    s
Total        n0    n1    n2    n

Table 2.1: Genotype counts at a given SNP in a case-control genetic association study.

From Table 2.1, we can easily get the counts for the two alleles at the given SNP locus, as shown in Table 2.2. Therefore, we can carry out the hypothesis test on the allele level. Hypothesis test methods on the allele level are referred to as allele-based methods. The next subsections discuss several typical allele-based methods and the connections between them. Those allele-based methods include

• Two-proportion z-test

• Allele-based Pearson's χ2 test


Alleles    g                  G                  Total
Case       u0 (= 2r0 + r1)    u1 (= 2r2 + r1)    u (= 2r)
Control    v0 (= 2s0 + s1)    v1 (= 2s2 + s1)    v (= 2s)
Total      m0 (= 2n0 + n1)    m1 (= 2n2 + n1)    m (= 2n)

Table 2.2: Allele counts at a given SNP in a case-control genetic association study.
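The genotype-to-allele conversion in Table 2.2 is mechanical and can be sketched as follows (the genotype counts in the usage lines are hypothetical):

```python
def allele_counts(hom_g, het, hom_G):
    """Collapse one group's genotype counts (gg, Gg, GG) into allele counts
    (g, G): each gg carries two g alleles, each Gg one of each, each GG two G."""
    return 2 * hom_g + het, 2 * hom_G + het

# Hypothetical genotype counts (r0, r1, r2) for cases and (s0, s1, s2) for controls.
u0, u1 = allele_counts(300, 500, 200)   # cases:    u0 = 1100, u1 = 900
v0, v1 = allele_counts(350, 480, 170)   # controls: v0 = 1180, v1 = 820
assert u0 + u1 == 2 * (300 + 500 + 200)  # u = 2r, as in the Total column
```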

Two-proportion z-test

We assume that the alleles are Bernoulli distributed. Further, we assume that we have r i.i.d. cases and s i.i.d. controls with the allele counts in Table 2.2. F_A^+ denotes the random variable of the alleles in the positive samples; F_A^- denotes the random variable of the alleles in the negative samples:

F_A^+ \sim \mathrm{Bernoulli}(p_A^+), \quad F_A^- \sim \mathrm{Bernoulli}(p_A^-). (2.1)

p_A^+ and p_A^- are the population probabilities that F_A is 1 (corresponding to the allele G) in the positive and negative populations, respectively. p_A is the population probability that F_A is 1 in the whole population. Accordingly, \hat{p}_A^+, \hat{p}_A^- and \hat{p}_A are the sample-based versions of p_A^+, p_A^- and p_A. We can calculate \hat{p}_A^+, \hat{p}_A^- and \hat{p}_A from Table 2.2 as

\hat{p}_A^+ = u_1/u, \quad \hat{p}_A^- = v_1/v, \quad \hat{p}_A = m_1/m. (2.2)

Following [40], if we approximate

p_A^+(1 - p_A^+)/u + p_A^-(1 - p_A^-)/v \approx \hat{p}_A(1 - \hat{p}_A)(u + v)/(uv), (2.3)

then the test statistic for F_A is

S_A = \frac{\hat{p}_A^+ - \hat{p}_A^-}{\sqrt{\frac{u+v}{uv}} \sqrt{\hat{p}_A(1 - \hat{p}_A)}}. (2.4)

S_A is approximately normally distributed with variance 1 and mean \lambda_A \sqrt{\frac{2uv}{u+v}}, where

\lambda_A = \frac{p_A^+ - p_A^-}{\sqrt{2} \sqrt{p_A(1 - p_A)}}. (2.5)

\lambda_A \sqrt{\frac{2uv}{u+v}} is termed the non-centrality parameter. Under H_0, S_A is approximately standard normally distributed. Under H_1, S_A is approximately normally distributed with variance 1 and mean \lambda_A \sqrt{\frac{2uv}{u+v}}. The power of the test, the probability of identifying the associated feature F_A at some significance level \alpha, is

P\left(\alpha, \lambda_A \sqrt{\tfrac{2uv}{u+v}}\right) = 1 - \frac{1}{\sqrt{2\pi}} \int_{\Phi^{-1}(\alpha/2) + \lambda_A \sqrt{2uv/(u+v)}}^{\Phi^{-1}(1-\alpha/2) + \lambda_A \sqrt{2uv/(u+v)}} e^{-x^2/2} \, dx, (2.6)

where \Phi^{-1} is the quantile function of the standard normal distribution [40]. For any given significance level \alpha, the power of the test is entirely determined by the non-centrality parameter: for a given sample set, the larger \lambda_A is, the larger the power of the test.
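The statistic (2.4) and the power formula (2.6) can be sketched in a few lines of stdlib Python; the allele counts in the sanity checks are hypothetical:

```python
from statistics import NormalDist

N = NormalDist()  # standard normal

def two_proportion_z(u1, u, v1, v):
    """S_A of Eq. (2.4): u1 of u case alleles and v1 of v control alleles
    carry allele G; the pooled estimate plays the role of approximation (2.3)."""
    p_case, p_ctrl = u1 / u, v1 / v
    p_all = (u1 + v1) / (u + v)
    se = ((u + v) / (u * v) * p_all * (1 - p_all)) ** 0.5
    return (p_case - p_ctrl) / se

def z_test_power(alpha, ncp):
    """Power of Eq. (2.6): 1 - [Phi(z_{1-a/2} + ncp) - Phi(z_{a/2} + ncp)],
    where ncp is the non-centrality parameter lambda_A * sqrt(2uv/(u+v))."""
    lo = N.inv_cdf(alpha / 2) + ncp
    hi = N.inv_cdf(1 - alpha / 2) + ncp
    return 1 - (N.cdf(hi) - N.cdf(lo))

# Sanity checks: with no signal (ncp = 0) the power reduces to alpha,
# and power grows with the non-centrality parameter.
print(z_test_power(0.05, 0.0))   # approximately 0.05
print(z_test_power(0.05, 3.0))   # approximately 0.85
```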

Allele-based Pearson’s χ2 test

Pearson’s χ2 test can be used to test whether or not an observed frequency distribution differs

from a theoretical distribution. In the context of allele-based association analysis, it tests whether

or not the observed frequency distribution of the minor allele in cases differs from that in controls.

Based on the counts from Table 2.2, the test statistic is

S_{ABP} = \sum_{i=0}^{1} \left[ \frac{(u_i - u m_i/m)^2}{u m_i/m} + \frac{(v_i - v m_i/m)^2}{v m_i/m} \right]. \quad (2.7)

Under the null hypothesis, S_{ABP} has an asymptotic χ^2 distribution with 1 degree of freedom. Under the alternative hypothesis, S_{ABP} has an asymptotic non-central χ^2 distribution with 1 degree of freedom, and the non-centrality parameter δ_{ABP} is

\delta_{ABP} = uv(p_A^+ - p_A^-)^2 \left[ \frac{1}{u p_A^+ + v p_A^-} + \frac{1}{u(1-p_A^+) + v(1-p_A^-)} \right]. \quad (2.8)

The power of the test is determined only by the non-centrality parameter δ_{ABP}. In fact, δ_{ABP} is the square of the non-centrality parameter λ_A√(2uv/(u+v)) of the two-proportion z-test under the approximation (2.3).
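The statistic (2.7) sums, over the two allele categories, the squared deviations of the observed counts from the counts expected under pooling. A minimal sketch (hypothetical counts; not code from the thesis):

```python
def allele_chi2(case_counts, ctrl_counts):
    """Allele-based Pearson chi-square statistic S_ABP, equation (2.7).
    case_counts = (u0, u1) and ctrl_counts = (v0, v1) are allele counts."""
    u, v = sum(case_counts), sum(ctrl_counts)
    m = u + v
    stat = 0.0
    for ui, vi in zip(case_counts, ctrl_counts):
        mi = ui + vi  # pooled count of this allele
        stat += (ui - u * mi / m) ** 2 / (u * mi / m)
        stat += (vi - v * mi / m) ** 2 / (v * mi / m)
    return stat
```

With the hypothetical counts (40, 60) in cases and (60, 40) in controls this gives 8.0, which is the square of the corresponding two-proportion z statistic computed with the pooled-variance approximation (2.3).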

Genotype-based Pearson’s χ2 test

Pearson’s χ2 test can also be used in the context of genotype-based association analysis. It tests

whether or not the observed frequency distribution of the three genotypes in cases differs from

that in controls. Based on the counts from Table 2.1, the test statistic is

S_{GBP} = \sum_{i=0}^{2} \left[ \frac{(r_i - r n_i/n)^2}{r n_i/n} + \frac{(s_i - s n_i/n)^2}{s n_i/n} \right]. \quad (2.9)

Under the null hypothesis, S_{GBP} has an asymptotic χ^2 distribution with 2 degrees of freedom. Under the alternative hypothesis, S_{GBP} has an asymptotic non-central χ^2 distribution with 2 degrees of freedom, with the non-centrality parameter δ_{GBP} [121],

\delta_{GBP} = rs \left[ \frac{(p_{gg}^+ - p_{gg}^-)^2}{r p_{gg}^+ + s p_{gg}^-} + \frac{(p_{gG}^+ - p_{gG}^-)^2}{r p_{gG}^+ + s p_{gG}^-} + \frac{(p_{GG}^+ - p_{GG}^-)^2}{r p_{GG}^+ + s p_{GG}^-} \right], \quad (2.10)

where the distribution of the genotypes is multinomial with the parameter vector (r; p_{gg}^+, p_{gG}^+, p_{GG}^+) for cases and with the parameter vector (s; p_{gg}^-, p_{gG}^-, p_{GG}^-) for controls. The power of the test is determined only by the non-centrality parameter δ_{GBP}.


Cochran-Armitage’s trend test

The Cochran-Armitage trend test [31, 3] is usually used in categorical data analysis to test the

presence of an association between a binary variable and a variable with k categories where k is

usually greater than 2. It modifies the Pearson’s χ2 test to incorporate a suspected trend in the

effects of the k levels of the second variable. Therefore, in genotype-based association analysis, we need to associate a vector of scores (x_0, x_1, x_2) with the three genotypes to specify the trend we

want to test. The scores (x0, x1, x2) are equivalent to scores (0, x, 1) by a linear transformation

(x = (x1 − x0)/(x2 − x0)).

For the scores, researchers usually use the penetrance of the genotypes or equivalent scores via

a linear transformation. Denote the penetrance of gg, gG and GG by f0, f1 and f2, respectively.

Then the relative risks are defined to be γ_i = f_i/f_0 for i = 0, 1, 2. Similarly, we define δ_i = (1 − f_i)/(1 − f_0), which can be regarded as the relative resistance to the disease. Further denote the population genotype probabilities as g_0 = Pr(gg), g_1 = Pr(gG) and g_2 = Pr(GG). Then, by Bayes' rule, we can express the genotype probabilities in cases as p_i = γ_i g_i / Σ_i γ_i g_i and the genotype probabilities in controls as q_i = δ_i g_i / Σ_i δ_i g_i. The null hypothesis is p_i = q_i for i = 0, 1, 2, which is equivalent to γ_1 = γ_2 = 1. The alternative hypothesis can be either γ_2 > γ_1 ≥ 1 or γ_2 ≥ γ_1 > 1.

When the scores are (0, 1/2, 1), the trend we test is from an additive model with γ_1 = (1 + γ_2)/2. When the scores are (0, 0, 1), the trend we test is from a recessive model with γ_1 = 1. When the scores are (0, 1, 1), the trend we test is from a dominant model with γ_1 = γ_2. The multiplicative model (γ_2 = γ_1^2) and the additive model are asymptotically equivalent as (γ_1, γ_2) approaches the null value (1, 1).

With the scores (x_0, x_1, x_2), the Cochran-Armitage test statistic [163, 52] is

Z_{CATT} = \frac{U}{\sqrt{\mathrm{Var}(U)}}, \quad (2.11)

where

U = \frac{1}{n} \sum_{i=0}^{2} x_i (s r_i - r s_i). \quad (2.12)

Under the null hypothesis, the expectation of U is 0, and the variance of U is

\mathrm{Var}_{H_0}(U) = n\sigma_0^2 = \frac{rs}{n} \left[ \sum_{i=0}^{2} x_i^2 q_i - \left( \sum_{i=0}^{2} x_i q_i \right)^2 \right]. \quad (2.13)

Therefore, Z_{CATT} is asymptotically normally distributed and Z_{CATT}^2 has an asymptotic χ_1^2 distribution. In applications, one may use \hat{q}_i = n_i/n to estimate σ_0^2 when q_i is unknown. This gives

\widehat{\mathrm{Var}}_{H_0}(U) = n\hat{\sigma}_0^2 = \frac{rs}{n^3} \left[ n \sum_{i=0}^{2} x_i^2 n_i - \left( \sum_{i=0}^{2} x_i n_i \right)^2 \right]. \quad (2.14)

Under the alternative hypothesis, the expectation of U is E_{H_1}(U) = nμ_1, and the variance is Var_{H_1}(U) = nσ_1^2, where

\mu_1 = \frac{rs}{n^2} \sum_{i=0}^{2} x_i (p_i - q_i), \quad (2.15)

and

\sigma_1^2 = \frac{rs^2}{n^3} \left[ \sum_{i=0}^{2} x_i^2 p_i - \left( \sum_{i=0}^{2} x_i p_i \right)^2 \right] + \frac{r^2 s}{n^3} \left[ \sum_{i=0}^{2} x_i^2 q_i - \left( \sum_{i=0}^{2} x_i q_i \right)^2 \right]. \quad (2.16)

Therefore, under the alternative hypothesis, Z_{CATT} is asymptotically normally distributed with variance 1 and mean λ_{CATT},

\lambda_{CATT} = \frac{n\mu_1}{\sqrt{n\sigma_1^2}} = \frac{rs \sum_{i=0}^{2} x_i (p_i - q_i)}{\sqrt{rs^2 \left[ \sum_{i=0}^{2} x_i^2 p_i - \left( \sum_{i=0}^{2} x_i p_i \right)^2 \right] + r^2 s \left[ \sum_{i=0}^{2} x_i^2 q_i - \left( \sum_{i=0}^{2} x_i q_i \right)^2 \right]}}. \quad (2.17)

In other words, under the alternative hypothesis, Z_{CATT}^2 has an asymptotic non-central χ^2 distribution with 1 degree of freedom and the non-centrality parameter δ_{CATT},

\delta_{CATT} = \frac{rs \left[ \sum_{i=0}^{2} x_i (p_i - q_i) \right]^2}{s \left[ \sum_{i=0}^{2} x_i^2 p_i - \left( \sum_{i=0}^{2} x_i p_i \right)^2 \right] + r \left[ \sum_{i=0}^{2} x_i^2 q_i - \left( \sum_{i=0}^{2} x_i q_i \right)^2 \right]}. \quad (2.18)

The trend test has higher power than the Pearson’s χ2 test when the suspected trend is correct,

but the ability to detect unsuspected trends is sacrificed.
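Putting (2.11), (2.12) and the estimated variance (2.14) together, the trend statistic can be computed directly from genotype counts; a sketch with additive scores (0, 1/2, 1) by default and hypothetical counts in the example (not the thesis's code):

```python
import math

def catt(case_counts, ctrl_counts, scores=(0, 0.5, 1)):
    """Cochran-Armitage trend statistic Z_CATT from genotype counts
    (r0, r1, r2) in cases and (s0, s1, s2) in controls, using the
    variance estimate of equation (2.14)."""
    r, s = sum(case_counts), sum(ctrl_counts)
    n = r + s
    n_i = [ri + si for ri, si in zip(case_counts, ctrl_counts)]
    # U from equation (2.12)
    u = sum(x * (s * ri - r * si)
            for x, ri, si in zip(scores, case_counts, ctrl_counts)) / n
    # estimated Var_{H0}(U) from equation (2.14)
    var = (r * s / n ** 3) * (n * sum(x * x * ni for x, ni in zip(scores, n_i))
                              - sum(x * ni for x, ni in zip(scores, n_i)) ** 2)
    return u / math.sqrt(var)
```

For example, with hypothetical counts (10, 20, 30) in cases and (30, 20, 10) in controls, the statistic is about 4.47.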

Tests with logistic regression

Many GWAS applications such as [82] employ logistic regression followed by a hypothesis test

to identify associated SNPs. A first step builds a logistic regression model in formula (2.19) to

predict disease from each SNP individually; in such a model the SNP is coded by two indicator

variables, one for heterozygous carrier of the minor allele (X1) and one for homozygous carrier

of the minor allele (X2). In other words, we convert AA into “X1=0, X2=0”, AB into “X1=1,

X2=0”, and BB into “X1=0, X2=1” where A stands for the common allele at this locus and B

stands for the minor allele. The dichotomous response variable Y is coded as 1 for cases and 0 for

controls.

\log \frac{P(Y=1 \mid X_1, X_2)}{1 - P(Y=1 \mid X_1, X_2)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2. \quad (2.19)

In the second step, a hypothesis test is performed to test the fit of each logistic model and to return a P-value for each SNP. In the test, the null hypothesis H_0 is that the SNP is not associated, namely β_1 and β_2 are zero. The alternative hypothesis H_1 is that the SNP is associated, namely either β_1 or β_2 is nonzero. The likelihood ratio test is the most commonly used method, and the test statistic is

S = 2(\log L_1 - \log L_0), \quad (2.20)


where log L_1 and log L_0 are the maximized log-likelihoods under H_1 and H_0, respectively. Under H_0, the test

statistic has an asymptotic χ2 distribution with 2 degrees of freedom. Under H1, the test statistic

has an asymptotic non-central χ2 distribution with 2 degrees of freedom. The score test and Wald

test are similar test procedures that are sometimes used.
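Because the two indicator variables in (2.19) saturate the three genotype groups, log L_1 is attained at the observed per-genotype case proportions, so the statistic (2.20) has a closed form in the cell counts. A sketch under that observation (hypothetical counts; assumes every cell is nonzero):

```python
import math

def logistic_lrt(case_counts, ctrl_counts):
    """Likelihood-ratio statistic S = 2(log L1 - log L0) for model (2.19).
    case_counts and ctrl_counts are genotype counts (AA, AB, BB);
    all cells are assumed nonzero in this sketch."""
    r, s = sum(case_counts), sum(ctrl_counts)
    n = r + s
    # null model: a single intercept, fitted case probability r/n
    log_l0 = r * math.log(r / n) + s * math.log(s / n)
    # alternative model: one fitted case probability per genotype group
    log_l1 = sum(ri * math.log(ri / (ri + si)) + si * math.log(si / (ri + si))
                 for ri, si in zip(case_counts, ctrl_counts))
    return 2 * (log_l1 - log_l0)
```

The result is referred to a χ^2 distribution with 2 degrees of freedom; identical case and control count vectors give S = 0.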

2.1.2 Parametric Multiple-marker Methods

Analyzing the genetic association of disease with one individual marker at a time (as in the single-marker methods of Section 2.1.1) can have limited power, because individual genetic effects are relatively small and the interactions between SNPs are ignored. Therefore, it is of interest to test multiple SNPs (e.g., all the SNPs in a gene or a pathway) at a time, namely to test whether any of the SNPs in the set are associated with the disease. The first class of multiple-marker methods is based on individual-marker methods. In particular, a common procedure is to apply an individual-marker method first to test the individual significance of each SNP, and then to correct for multiple testing via Bonferroni correction, via Monte Carlo [103], or via estimating

the effective number of tests [30, 134, 124]. For instance, one possible test for joint association

of multiple SNPs in a gene is the maximum of the single SNP χ2 statistic, which is known as

“max-single” [157]. The max-single test is likely to be powerful if there is only a single marker

strongly associated with disease [157]. Still, this class of multiple-marker methods relies heav-

ily on single-marker methods, and cannot accommodate complex genetic effects and interaction

effects, resulting in limited power in certain circumstances.

Another class of multiple-marker methods is based on multivariate regression, which allows for simultaneous analysis of multiple markers. One well-known procedure is the multivariate Hotelling's T^2 test [45]. However, those methods often offer little benefit over multiple-marker methods based

on individual-marker methods [27, 149] because of the large number of degrees of freedom.


2.1.3 Nonparametric Multiple-marker Methods

To improve power over the above standard (parametric) multiple-marker methods (either based on

individual-marker methods or on multivariate methods), a class of nonparametric or semiparametric methods has been proposed. Those methods include the Zglobal test [157], the pseudo F

test [199], kernel-based association test (KBAT) [125] and kernel-machine test [201, 202].

The Zglobal test [157] is based on the motivation that the genetic similarity measured on asso-

ciated SNPs should yield higher similarity scores for cases than for controls, whereas the genetic

similarity measured on non-associated SNPs should yield comparable similarity scores for cases

and controls. Therefore, the Zglobal test essentially measures the average genetic score for all pairs

of cases and compares this to the average genetic score for all pairs of controls. This approach

uses the U -statistic to measure genetic similarity within a group. The key steps of deriving the test

statistic are as follows. First, we need to calculate the contrast vector

\boldsymbol{\delta} = U_d - U_c, \quad (2.21)

where U_d and U_c are the similarity vectors for the diseased cases and the controls, respectively. The lengths of δ, U_d and U_c equal the number of markers considered in the test. Finally, the test statistic is

Z_{global} = \frac{\mathbf{w}' \boldsymbol{\delta}}{\sqrt{\mathbf{w}' V_o \mathbf{w}}}, \quad (2.22)

where w′ is the optimal weight vector and Vo is the covariance matrix of δ under the null hypoth-

esis. Zglobal is asymptotically standard normally distributed under the null hypothesis.

Wessel and Schork [199] summarize seven different measures for evaluating the genetic simi-

larity (or distance) between any pair of people based on a prespecified number of genetic markers.

Suppose we have complete data for L genetic markers and M phenotype variables (such as dis-

ease status, age, blood pressure) for a group of N people. We use an N × N matrix D to denote the distance matrix for the group once one of the measures is chosen. Let H = X(X'X)^{-1}X' be the projection matrix (essentially a similarity of phenotypes), where X is the N × M covariate matrix for the phenotypes. Compute the matrix A = (a_{ij}) = (−d_{ij}^2/2) and its centered matrix G. The test statistic of the pseudo F test is

F = \frac{\mathrm{tr}(HGH)}{\mathrm{tr}[(I - H)G(I - H)]}. \quad (2.23)

Generally, permutation tests can be used to determine the significance of the test.

Wu and his colleagues [201] propose the kernel-machine test to test the relevance of a SNP set

under the semiparametric logistic kernel-machine regression model. The test statistic Q, which involves the similarity matrix K, follows a scaled χ^2 distribution under the null hypothesis, with a scale parameter κ and ν degrees of freedom. Procedures for estimating κ and ν are also provided in the

paper [201].

2.2 Multiple Testing

2.2.1 Error Criteria

When performing multiple tests in many applications, researchers tend to focus on the most sig-

nificant results and use them to support their conclusions. Such unguarded selection inflates the number of false rejections of null hypotheses. Many classical multiple testing (correction) procedures (a.k.a. multiple-comparison procedures, or MCPs), such as the well-known Bonferroni correction, have been proposed to control the probability of any Type I error existing among the multiple tests, also known as the familywise error rate (FWER). Suppose we carry out m tests whose results can be categorized as in Table 2.3. The familywise error rate is defined

as P (N10 ≥ 1). By contrast, the uncorrected error rate, namely the per comparison error rate

(PCER), is E(N10/m) [12].

In large-scale multiple testing problems (when m is large), the existence of false rejections

is quite common and the FWER is no longer very informative. Under such circumstances, we

may fail to reject many false null hypotheses if we still want to control the FWER at a certain level.

              H0 not rejected    H0 rejected    Total
H0 true            N00               N10          m0
H0 false           N01               N11          m1
Total               S                 R            m

Table 2.3: The classification of the m tested hypotheses

Furthermore, we need not only consider whether any error is made, but also the number

of incorrect rejections [12]. Accordingly, one important criterion is false discovery rate (FDR),

which depicts the expected proportion of incorrectly rejected null hypotheses (or type I errors)

[12]. In terms of random variables, FDR is defined as E(N10/R|R > 0)P (R > 0) which is

the expected value of false discovery proportion (FDP). False discovery proportion is defined as

N10/R.

Another criterion, false non-discovery rate (FNR), defined as E(N01/S|S > 0)P (S > 0),

depicts the expected proportion of false null hypotheses that are incorrectly not rejected (or type II errors)

[62]. The marginal versions of FDR and FNR were also proposed [62]. The marginal false

discovery rate (mFDR), defined as E(N10)/E(R), is asymptotically equivalent to the FDR [62],

namely

\mathrm{mFDR} = \mathrm{FDR} + O(m^{-1/2}). \quad (2.24)

Similarly, marginal false non-discovery rate (mFNR) is defined to be E(N01)/E(S). Note

that FNR and mFNR are related to the efficiency of multiple testing procedures, whereas the other criteria mentioned above are related to the validity of the procedures.

In different settings, there exist other versions of FDR such as the weighted FDR in weighted

multiple testing [13] and FDR for clusters (FDRcluster) which can be used when the m tests can

be further partitioned into homogeneous clusters [11].

Page 26: Statistical Methods for Genome-wide Association Studies and Personalized Medicinepage/liu_thesis.pdf · 2014-08-12 · Statistical Methods for Genome-wide Association Studies and

19

2.2.2 P -value Thresholding Methods

A common class of multiple testing procedures is based on P-value thresholding. The Benjamini-Hochberg procedure (usually referred to as the BH procedure) [12] rejects individual null hypotheses by thresholding the P-values, with the objective of maximizing the number of true positives

while controlling the proportion of false positives in all the rejections. The BH procedure is a

distribution-free and finite-sample procedure. Let P(1) < ... < P(m) be the ordered P -values

from the m tests and P(0) = 0. The BH procedure rejects any null hypothesis whose P -value

satisfies P ≤ T with

T = \max\left\{ P_{(i)} \,\middle|\, P_{(i)} \leq \frac{\alpha i}{m} \right\}, \quad (2.25)

which controls the false discovery rate at the level αm_0/m. The threshold T for the BH procedure can also be written as

T = \sup\left\{ t \,\middle|\, \frac{t}{G_m(t)} \leq \alpha \right\}, \quad (2.26)

where Gm(t) is the empirical cumulative distribution function of Pi [12, 166, 62].

Note that the FDR is controlled by the threshold T at the level αm_0/m, which is more stringent than the nominal level α. Therefore, the efficiency of the BH procedure can be improved slightly by setting the threshold at

T' = \sup\left\{ t \,\middle|\, \frac{m_0 t}{m G_m(t)} \leq \alpha \right\}, \quad (2.27)

at the cost of accurately estimating the number of true null hypotheses m0 [14]. Other P -value

thresholding methods were also proposed, including the adaptive procedures [166, 15], the plug-in

procedure [63] and the augmentation procedure [183].
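The BH threshold (2.25) can be implemented in a few lines; a sketch (toy P-values in the example, not from any real study):

```python
def bh_procedure(pvals, alpha=0.05):
    """Benjamini-Hochberg procedure: return the indices of the rejected
    null hypotheses, using the threshold T of equation (2.25)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    threshold = 0.0  # P_(0) = 0: reject nothing if no P-value qualifies
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / m:
            threshold = pvals[idx]
    return [i for i in range(m) if pvals[i] <= threshold]
```

For example, `bh_procedure([0.01, 0.02, 0.03, 0.5])` rejects the first three hypotheses, while the fourth P-value fails its threshold 0.05 · 4/4.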


2.2.3 Local False Discovery Rate Methods

Local false discovery rate methods were first introduced in [38]. Suppose that we perform m

tests at the same time with the null hypotheses H1, ..., Hm, and the corresponding test statistics

s1, ..., sm. Consider the Bayesian two-class model, namely the m hypotheses are divided into

two classes (null or nonnull) occurring with probabilities p0 = P (null) and p1 = P (nonnull).

Further assume that the density functions of the test statistics are f0 if the corresponding hypothesis

is null and f1 if nonnull. Therefore, the Bayes posterior probability that a hypothesis is null given

its test statistic s is defined to be the local false discovery rate, namely

\mathrm{fdr}(s) = P(\mathrm{null} \mid s) = \frac{p_0 f_0(s)}{p_0 f_0(s) + p_1 f_1(s)}. \quad (2.28)

If we use tail areas (such as P -values) to derive the Bayes posterior probability, we end up

with the Benjamini-Hochberg false discovery rate. Let us use F0 and F1 to denote the cumulative

distribution functions (cdf) corresponding to f0 and f1. The posterior probability of a hypothesis

being null given that its test statistic S is less than some value s is

\mathrm{FDR}(s) = P(\mathrm{null} \mid S \leq s) = \frac{p_0 F_0(s)}{p_0 F_0(s) + p_1 F_1(s)}. \quad (2.29)

[166] and [37] give the connection of the frequentist FDR control [12] and the Bayesian FDR

in formula (2.29). The key difference between false discovery rate and local false discovery rate

is that FDR is based on the tail distributions whereas local false discovery rate is based on the

densities. For example, suppose we use any P-value thresholding method to reject the 10 null

hypotheses with the most extreme test statistics s(1), ...,s(10). FDR(s(10)) depicts the probability

of false rejection among the 10 rejected hypotheses, whereas fdr(s(i)) (for i = 1, ..., 10) tells

us the probability of false rejection for each of the 10 rejected hypotheses. FDR is a conditional

expectation of fdr; namely FDR(s) is the average of fdr(S) for all S ≤ s [37]. Therefore, we

can regard FDR as one error criterion for P -value thresholding methods, and local false discovery

rate works more like P -values, with which we can make inference decisions. Generally, the local


false discovery rate methods are expected to work better than the P -value thresholding methods.

The reason is that when determining the level of significance of a single hypothesis, the P -value

thresholding methods only consider the hypotheses separately whereas the local false discovery

rate methods can consider the m hypotheses simultaneously and incorporate the distributional

information of the m test statistics.
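As an illustration of (2.28), suppose the null statistics follow N(0,1), the non-null statistics follow N(2,1), and p_0 = 0.9; these numbers are purely hypothetical choices for the two-class model:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def local_fdr(s, p0=0.9, mu1=2.0):
    """Local false discovery rate (2.28) under an assumed two-class model:
    f0 = N(0,1) and f1 = N(mu1,1); p0 and mu1 are illustrative values."""
    f0 = normal_pdf(s, 0.0, 1.0)
    f1 = normal_pdf(s, mu1, 1.0)
    return p0 * f0 / (p0 * f0 + (1 - p0) * f1)
```

Under these choices, fdr(0) ≈ 0.985 while fdr(4) ≈ 0.022, so only hypotheses with large statistics would be declared non-null.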

2.2.4 Local Significance Index Methods

Although local false discovery rate methods can incorporate the distributional information of the

m test statistics, they still use the individual test statistic si to determine the significance level of

the null hypothesisHi. Local significance index methods [172] generalize the local false discovery

rate methods by considering all the m test statistics (especially informative ones) when determin-

ing the significance level of single hypotheses. This makes them extremely useful if there exists a certain dependency structure among the m hypotheses, such as when null hypotheses or

nonnull hypotheses exist in clumps, chains, graphs and hierarchies. Formally, the local index of

significance (LIS) for hypothesis i is defined as

\mathrm{LIS}_i = P_{\vartheta}(H_i \text{ is null} \mid \text{all the observations at the } m \text{ hypotheses}), \quad (2.30)

where ϑ are the parameters specifying the dependency structure of the m hypotheses. In [172], Sun and Cai studied the situation in which the dependency structure comes from a chain, and used hidden Markov models to parameterize the conditional independence. Then Sun and Cai used the forward-backward procedure, an inference algorithm specialized for hidden

Markov models, to calculate the local significance indices for all the m hypotheses. Finally, their

procedure employed a decision rule of the form δ = [I(LIS_i < λ) : i = 1, ..., m] as the final output, where λ is the cut-off point. In their paper, they also gave an adaptive procedure to determine λ for a given FDR level. Let R_λ = Σ_{i=1}^m I(LIS_i < λ), V_λ = Σ_{i=1}^m I(LIS_i < λ, H_i is null), and Q(λ) = E(V_λ)/E(R_λ) be the number of rejections, the number of false rejections and the marginal false discovery rate yielded by the decision rule δ = [I(LIS_i < λ) : i = 1, ..., m]. It can


be shown that, if k hypotheses are rejected, the marginal false discovery rate can be approximated

by

Q(k) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{LIS}_{(i)}. \quad (2.31)

Note that the approximated mFDR is the average of the LIS for rejected hypotheses, which is

similar to the relation between FDR and local false discovery rate.
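The adaptive cut-off can therefore be chosen by sorting the LIS values and taking the largest k whose running average (2.31) stays below the target mFDR level; a sketch with made-up LIS values in the example:

```python
def lis_rejections(lis, alpha=0.1):
    """Reject the k hypotheses with the smallest LIS, where k is the
    largest count whose average sorted LIS is at most alpha, as in (2.31).
    Returns the indices of the rejected hypotheses."""
    order = sorted(range(len(lis)), key=lambda i: lis[i])
    total, k = 0.0, 0
    for rank, idx in enumerate(order, start=1):
        total += lis[idx]
        if total / rank <= alpha:
            k = rank
    return order[:k]
```

With LIS values [0.01, 0.05, 0.3, 0.9] and α = 0.1, the first two hypotheses are rejected: their average LIS is 0.03, while adding the third would push the average to 0.12.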

However, Sun and Cai only studied the situation in which the dependency takes the form of an HMM. We continue with their local-index-of-significance framework, but generalize the dependency to more general forms such as pairwise Markov random fields.

2.3 Graphical Models

Graphical models [192] are probabilistic models representing the conditional independence be-

tween random variables via a variety of graphs. General graphical models consist of Bayesian

networks, which are directed [138, 88], Markov random fields (a.k.a. Markov networks) [91],

which are undirected, and factor graphs [109], which emphasize the factorization of the distribution they depict. Essentially, graphical models represent the joint probability of all the variables

compactly, with the conditional independence expressed by graphs. For a Bayesian network on a

set of d variables x = (x1, ..., xd) with conditional independence specified by a directed acyclic

graph, the joint probability is

P(\mathbf{x}) = \prod_{i=1}^{d} P(x_i \mid \mathbf{x}_{\pi_i}), \quad (2.32)

where P (xi|xπi) is the local conditional probability for xi given its parents, and πi is the set of

indices for the parent nodes of xi. For a Markov random field on a set of variables xi’s with whose

conditional independences specified by a undirected graph, the joint probability is

Page 30: Statistical Methods for Genome-wide Association Studies and Personalized Medicinepage/liu_thesis.pdf · 2014-08-12 · Statistical Methods for Genome-wide Association Studies and

23

P (x) =1

Z

∏φC(xC), (2.33)

where φC(xC) is a potential function for all variables in a clique xC and 1/Z is a normalization

constant. A Markov random field is said to be pairwise if the potentials between variables are

only for pairs of variables. Pairwise Markov random fields are related to Potts models [200].

In addition, if every variable in a pairwise Markov random field only has two states (possible

values), it is also an Ising model [91]. One part of the work on graphical models is learning, such as parameter learning and structure learning. Another part is inference, such as calculating the marginal probabilities of variables and finding the most probable states of variables.

2.3.1 Maximum Likelihood Parameter Learning

Undirected graphical models (a.k.a. Markov random fields or Markov networks) are useful models

in many applications, but parameter learning of undirected graphical models is difficult due to the

global normalizing constant (partition function). Suppose for simplicity that we have a pairwise

Markov random field on a random vector X ∈ X d described by an undirected graphG(V,E) with

the node set V and the edge set E. X = {0, 1, ...,m− 1} is a discrete space. The probability of a

sample x given a known parameter vector θ = {θα|α ∈ I} (I is some index set) is

P(\mathbf{x}; \boldsymbol{\theta}) = \exp\left\{ \boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x}) - A(\boldsymbol{\theta}) \right\}, \quad (2.34)

where φ = {φα|α ∈ I} is a vector of sufficient statistics, and A(θ) is the log partition function as

follows,

A(\boldsymbol{\theta}) = \log \sum_{\mathbf{x} \in \mathcal{X}^d} \exp\left\{ \boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x}) \right\}. \quad (2.35)

There are nice properties of the log partition function A(θ) as follows. First, for any index α ∈ I,

\frac{\partial A(\boldsymbol{\theta})}{\partial \theta_\alpha} = E_{\boldsymbol{\theta}} \phi_\alpha = \sum_{\mathbf{x} \in \mathcal{X}^d} P(\mathbf{x}; \boldsymbol{\theta}) \phi_\alpha(\mathbf{x}). \quad (2.36)

Second, for any indices α, β ∈ I,

\frac{\partial^2 A(\boldsymbol{\theta})}{\partial \theta_\alpha \partial \theta_\beta} = E_{\boldsymbol{\theta}}[\phi_\alpha \phi_\beta] - E_{\boldsymbol{\theta}}[\phi_\alpha] E_{\boldsymbol{\theta}}[\phi_\beta]. \quad (2.37)

Assume that we have n independent samples X = {x1,x2, ...,xn} generated from (2.34), and

we want to estimate the parameters θ. The maximum likelihood estimate (MLE) is the common

method which maximizes the log-likelihood function given as

L(\boldsymbol{\theta} \mid X) \propto \frac{1}{n} \sum_{j=1}^{n} \boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x}_j) - A(\boldsymbol{\theta}). \quad (2.38)

The partial derivative of L(θ|X) with respect to θα is

\frac{\partial L(\boldsymbol{\theta} \mid X)}{\partial \theta_\alpha} = \frac{1}{n} \sum_{j=1}^{n} \phi_\alpha(\mathbf{x}_j) - E_{\boldsymbol{\theta}} \phi_\alpha. \quad (2.39)

From (2.37), the Hessian matrix is positive semidefinite because it is essentially the covariance

matrix of φ. Therefore, A(θ) is convex and L(θ|X) is concave, so we can use gradient ascent to find the global maximum of the likelihood function, i.e., the MLE of θ. However,

the problem is that A(θ) is usually intractable according to (2.35).

The partial derivative of L(θ|X) with respect to θ_α can be rewritten as

\frac{\partial L(\boldsymbol{\theta} \mid X)}{\partial \theta_\alpha} = \frac{1}{n} \sum_{j=1}^{n} \phi_\alpha(\mathbf{x}_j) - E_{\boldsymbol{\theta}} \phi_\alpha = E_X \phi_\alpha - E_{\boldsymbol{\theta}} \phi_\alpha. \quad (2.40)

When the partial derivatives reach 0, we arrive at a global maximizer of L(θ|X). (There might be more than one global maximizer if (2.34) is an over-complete representation.) From (2.40),


we are looking for the estimate of parameters θ which can match the empirical moments from

observed samples, and the method is called moment matching. Therefore, the key question now is

to calculate the moments of statistics for a specific parameter vector θ. If we can do that, we can

use gradient ascent to search for the global maximizer of the log-likelihood. However, except for simple models such as tree-structured graphs, exact maximum likelihood learning is intractable,

because exact computation of Eθφα takes time that is exponential in the treewidth of the graph

[153]. Another type of Markov random field with a simple closed-form MLE of the parameters (given complete data) is the chordal Markov network [94].
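For graphs small enough to enumerate all states, the moment-matching updates can be carried out exactly, which makes the mechanics concrete. The sketch below fits a tiny binary pairwise model with φ consisting of one product statistic per edge (toy data; exact enumeration is exponential in d, which is the point of the sampling and variational methods that follow):

```python
import math
from itertools import product

def fit_pairwise_mrf(samples, edges, d, lr=0.5, iters=200):
    """Maximum likelihood by moment matching on a binary pairwise MRF,
    computing E_theta[phi] by exact enumeration of all 2^d states.
    phi(x) has one entry x_i * x_j per edge (i, j)."""
    def phi(x):
        return [x[i] * x[j] for i, j in edges]
    # empirical moments E_X[phi] from the samples
    emp = [sum(col) / len(samples) for col in zip(*(phi(x) for x in samples))]
    states = list(product((0, 1), repeat=d))
    theta = [0.0] * len(edges)
    model = [0.0] * len(edges)
    for _ in range(iters):
        weights = [math.exp(sum(t * f for t, f in zip(theta, phi(x))))
                   for x in states]
        z = sum(weights)  # partition function by brute force
        model = [sum(w * phi(x)[k] for w, x in zip(weights, states)) / z
                 for k in range(len(edges))]
        # gradient ascent step (2.41): match empirical and model moments
        theta = [t + lr * (e - mo) for t, e, mo in zip(theta, emp, model)]
    return theta, model
```

At convergence the model moments match the empirical moments, as required by setting (2.40) to zero.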

Sampling Based Methods

There have been a few methods proposed to solve the problem of calculating moments for a

specific θ and use gradient ascent to find the MLE of the parameters. They are MCMC-MLE

[66, 218], contrastive divergence [80] and particle-filtered MCMC-MLE [4]. Essentially, all these

methods use iterative gradient ascent search to find the MLE of the parameters, namely in the

iterations the update of parameters is

\boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_i + \eta \nabla L(\boldsymbol{\theta}_i \mid X) = \boldsymbol{\theta}_i + \eta (E_X \boldsymbol{\phi} - E_{\boldsymbol{\theta}_i} \boldsymbol{\phi}), \quad (2.41)

where η is the learning rate.

The key difference among these methods is how the particles are sampled and how E_{θ_i}φ is computed from

the samples. MCMC-MLE uses importance sampling to generate particles and compute Eθiφ as

follows,

E_{\boldsymbol{\theta}_i} \boldsymbol{\phi} \approx \frac{1}{s} \sum_{j=1}^{s} w_i^j \boldsymbol{\phi}(\mathbf{x}_0^j), \quad (2.42)

where s is the number of particles, and w_i^j is the weight for particle x_0^j in iteration i. It can be shown that

w_i^j = \frac{\exp\{(\boldsymbol{\theta}_i - \boldsymbol{\theta}_0)^T \boldsymbol{\phi}(\mathbf{x}_0^j)\}}{\frac{1}{s} \sum_{k=1}^{s} \exp\{(\boldsymbol{\theta}_i - \boldsymbol{\theta}_0)^T \boldsymbol{\phi}(\mathbf{x}_0^k)\}}, \quad (2.43)

where θ_0 is the parameter vector under which the particles x_0^j (j = 1, ..., s) are generated. Note that over the iterations of (2.41), the particles x_0^j (j = 1, ..., s) stay the same; however, the weights change according to (2.43) as we update θ_i. The use of importance sampling allows us to reuse the particles, but the weights of the particles might suffer from degeneracy when θ_i is far away from θ_0.
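The weight computation (2.43) itself is straightforward; a sketch with toy sufficient-statistic vectors (any real use would plug in the particles' actual statistics):

```python
import math

def importance_weights(particle_stats, theta_i, theta_0):
    """Importance weights (2.43) for MCMC-MLE: particle_stats holds the
    sufficient-statistic vectors phi(x_0^j) of the particles, which were
    generated under theta_0 and are reweighted toward theta_i.
    The weights are normalized to average 1."""
    s = len(particle_stats)
    raw = [math.exp(sum((ti - t0) * f
                        for ti, t0, f in zip(theta_i, theta_0, stats)))
           for stats in particle_stats]
    norm = sum(raw) / s
    return [w / norm for w in raw]
```

When theta_i equals theta_0, every weight is exactly 1; as theta_i moves away, the weights spread out, which is precisely the degeneracy that the effective sample size monitors.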

In contrast, contrastive divergence (CD) methods generate samples (particles) according to

θ_i using a Markov chain. Usually, the chain needs to reach equilibrium to generate an accurate sample, but CD's rationale is that only a rough estimate of the gradient is necessary to determine the direction in which to update the parameters. Accordingly, two versions of CD methods have been proposed. One is CD-n, which generates a sample by running a Markov chain for n steps under the parameters θ_i (starting from a training sample) in iteration i. The other is persistent contrastive divergence, or PCD-n [178], which advances the particles (from the last iteration, under parameters θ_{i−1}) for n steps under the new parameters θ_i. Since n is usually chosen to be 1 in CD-n, the Markov

chains for generating particles are usually far from equilibrium. Because θi is close to θi+1 when

the learning rate is small, persistent Markov chains are attractive. [179] discussed the interaction

between the learning rate of the parameters and mixing rate of the Markov chains, and therefore

proposed to use a set of fast weights to speed up the mixing of the persistent Markov chains.

Particle filtered MCMC-MLE essentially strikes a balance between MCMC-MLE and con-

trastive divergence. It uses sampling-importance-resampling and a rejuvenation step to overcome

the degeneracy of particles. In more detail, it uses the effective sample size (ESS) to monitor the quality of the particles. The ESS can be calculated as follows:

\mathrm{ESS}(\{w^1, \ldots, w^s\}) = \frac{(\sum_j w^j)^2}{\sum_j (w^j)^2}. \quad (2.44)


When the ESS drops below a certain threshold, particle-filtered MCMC-MLE invokes sampling-importance-resampling followed by a rejuvenation step. Note that this does not happen in every iteration of the parameter update (2.41), which can potentially save computation in generating particles, a step that can be very costly in high-dimensional models.
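Equation (2.44) is simple to compute and interpret: equal weights give ESS = s, and a single dominating weight drives the ESS toward 1. A minimal sketch:

```python
def effective_sample_size(weights):
    """Effective sample size (2.44) of a set of importance weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq
```

`effective_sample_size([1.0] * 10)` is 10.0, while `effective_sample_size([5.0, 0.0, 0.0])` collapses to 1.0; a resampling and rejuvenation step would be triggered once this falls below the chosen threshold.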

When the original probability distribution is multimodal, tempered transitions can be used to help the Markov chain jump among the multiple modes [153]. There has been other work on improving the efficiency of sampling so as to improve the calculation of E_{θ_i}φ, such as considering a mixture of proposal distributions [218], which extends the gradient ascent algorithm by taking the Hessian matrix into account.

To sum up, all these methods use sampling to compute the population moments so as to calculate the gradient. The differences among these methods are (1) how frequently to generate particles, (2) where to start a Markov chain, (3) what parameters to use when running the Markov chain, and (4) how many Markov chain steps to run for generating a particle (CD-n and PCD-n).

Variational Methods

Fenchel-Legendre Duality: Let f : R^k → R. The function f* : R^k → R defined as

$$f^*(\mathbf{y}) = \sup_{\mathbf{x}}\; \mathbf{y}^T\mathbf{x} - f(\mathbf{x}) \quad (2.45)$$

is the conjugate of the function f. The domain of the conjugate function f* consists of all y ∈ R^k for which the supremum is finite, namely for which the difference yᵀx − f(x) is bounded above on the domain of f.

If f is differentiable and convex, then the supremum can be calculated by taking the derivative with respect to x, namely setting ∇f(x) = y. Denote the solution by x*; since f is convex, x* is a maximum. If the solution is unique, then the pair (x*, y) = (x*, ∇f(x*)) is a Legendre conjugate pair.

The right-hand side of formula (2.38) can be rewritten as


$$\frac{1}{n}\sum_{j=1}^{n} \theta^T \phi(\mathbf{x}_j) - A(\theta) = \mu^T\theta - A(\theta), \quad (2.46)$$

where

$$\mu = \frac{1}{n}\sum_{j=1}^{n} \phi(\mathbf{x}_j) = \mathrm{E}_X\,\phi. \quad (2.47)$$

Denote by μ(θ) the expectation of φ under the parameters θ, namely

$$\mu(\theta) = \mathrm{E}_{\theta}\,\phi. \quad (2.48)$$

When (2.34) is the minimal representation, A(θ) is strictly convex and θ is identifiable. Therefore, (θ, μ(θ)) is a Legendre conjugate pair. Denote the conjugate function of A(θ) by A*(μ). The dual parameterization of the model in terms of μ and A*(μ) is the mean value parameterization. The domain of A*(μ), namely the set {μ(θ) | θ is valid}, is called the marginal polytope of the exponential family model. It also turns out that the function A*(μ) is the negative entropy.

We have

$$A^*(\mu) = \sup_{\theta}\; \mu^T\theta - A(\theta). \quad (2.49)$$

Therefore, the conjugate dual problem is

$$A(\theta) = \sup_{\mu}\; \theta^T\mu - A^*(\mu). \quad (2.50)$$
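As a concrete illustration (not in the original text), the conjugate pair can be computed in closed form for the Bernoulli family with sufficient statistic φ(x) = x:

```latex
A(\theta) = \log\!\left(1 + e^{\theta}\right), \qquad
\mu(\theta) = \nabla A(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}.
% Solving \nabla A(\theta) = \mu gives \theta = \log\frac{\mu}{1-\mu}, hence
A^*(\mu) = \mu\theta - A(\theta) = \mu\log\mu + (1 - \mu)\log(1 - \mu),
```

which is indeed the negative entropy, finite exactly on the marginal polytope [0, 1].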

There are other methods which bound the log partition function. [188] introduce a new class of upper bounds on the log partition function, based on convex combinations of distributions in the exponential domain, that is applicable to an arbitrary undirected graphical model. They show that when the convex combination is taken with respect to tree-structured distributions, the variational problems are convex and have a unique global minimum, which gives an upper bound on the log partition function.

Theoretical Aspects

There has been theoretical work on the convergence of MLE learning of the parameters in Markov random fields [208, 153]. Denote by θ* the maximizer of the log-likelihood function L(θ|X) in

(2.38). If we use MCMC to generate s particles and approximate the log-likelihood function by

(2.42) and (2.43), the approximated log-likelihood is denoted as Ls(θ|X). If the Markov chain is

ergodic, then Ls(θ|X)→ L(θ|X) for all θ [153]. It can also be shown that under mild conditions,

if θs is the maximizer of the approximated log-likelihood function Ls(θ|X), then θs → θ∗ almost

surely [153]. The convergence properties of contrastive divergence are discussed by [211, 25,

173].

2.3.2 Bayesian Parameter Learning

Suppose that a Markov random field (MRF) on X (X ∈ 𝒳^d where 𝒳 is a discrete space) is parameterized by θ, and its probability mass function is P(X;θ) = P̃(X;θ)/Z(θ), where P̃(X;θ) is some unnormalized probability measure and Z(θ) is the normalizing constant or partition function. Given a prior on θ and n i.i.d. observed data points X = {x1, ..., xn}, Bayesian parameter estimation provides the posterior distribution of θ, denoted by P(θ|X). This posterior distribution is very informative, not only because its first moment E(θ|X) (a.k.a. the Bayesian estimate) is optimal in many problems, but also because its standard deviation depicts the variability of θ, which is useful for statistical inference. However, Bayesian parameter estimation for general MRFs is known to be doubly-intractable [128]. With the prior π(θ), the posterior is P(θ|X) ∝ π(θ)P̃(X;θ)/Z(θ). If we use the Metropolis-Hastings (MH) algorithm to generate posterior samples of θ, then in each MH step we have to calculate the MH ratio for the move from θ to θ*


$$a(\theta^* \mid \theta) = \frac{\pi(\theta^*)\,P(\mathbf{X};\theta^*)\,Q(\theta \mid \theta^*)}{\pi(\theta)\,P(\mathbf{X};\theta)\,Q(\theta^* \mid \theta)} = \frac{\pi(\theta^*)\,\tilde{P}(\mathbf{X};\theta^*)\,Q(\theta \mid \theta^*)\,Z(\theta)}{\pi(\theta)\,\tilde{P}(\mathbf{X};\theta)\,Q(\theta^* \mid \theta)\,Z(\theta^*)}, \quad (2.51)$$

where Q(θ∗|θ) is some proposal distribution from θ to θ∗, and with probability min{1, a(θ∗|θ)}

we accept the move from θ to θ∗.

The real hurdle in Bayesian parameter estimation for general MRFs is the intractable MH

ratio in (2.51). There are three methods of calculating it in the literature. The first one is to use

importance sampling to estimate r = Z(θ)/Z(θ∗) [118] by

$$r_{IS} = \frac{s_2^{-1}\sum_{t=1}^{s_2} \tilde{P}(\mathbf{x}_2^{(t)};\theta)\,\alpha(\mathbf{x}_2^{(t)})}{s_1^{-1}\sum_{t=1}^{s_1} \tilde{P}(\mathbf{x}_1^{(t)};\theta^*)\,\alpha(\mathbf{x}_1^{(t)})}, \quad (2.52)$$

where $\mathbf{x}_1^{(1)}, \ldots, \mathbf{x}_1^{(s_1)}$ are sampled from P(X;θ) and $\mathbf{x}_2^{(1)}, \ldots, \mathbf{x}_2^{(s_2)}$ are sampled from P(X;θ*), and α(X) is an arbitrary function defined on the same support as P. Theoretically, r_IS is a consistent estimator of Z(θ)/Z(θ*) as long as the sample averages in (2.52) converge to their corresponding population averages, which is satisfied by Markov chain Monte Carlo under regularity conditions. However, the optimal choice of α depends on the ground truth of r, and [118] provided several options for α, such as the geometric function α(X) = (P̃(X;θ)P̃(X;θ*))^{−1/2}, which is included here as a baseline.

The second method is to introduce auxiliary variables and cancel Z(θ)/Z(θ*) in (2.51). [122] introduces one auxiliary variable Y on the same space as X, and the state variable is extended to (θ, Y). They set the new proposal distribution for the extended state to

$$Q(\theta, \mathbf{Y} \mid \theta^*, \mathbf{Y}^*) = Q(\theta \mid \theta^*)\,\tilde{P}(\mathbf{Y};\theta)/Z(\theta) \quad (2.53)$$

to cancel Z(θ)/Z(θ*) in (2.51). Therefore, by ignoring Y, we can generate the posterior samples of θ via Metropolis-Hastings. Technically, this auxiliary variable approach requires perfect sampling [143], but [122] pointed out that other, simpler Markov chain methods also work, with the proviso that the chain converges adequately to the equilibrium distribution. [128] extended the single auxiliary variable method to multiple auxiliary variables for improved efficiency, and also pointed out that the single auxiliary variable method can be simplified into a single-variable exchange algorithm. Both the single auxiliary variable algorithm and the single-variable exchange algorithm can be interpreted as importance sampling. In the auxiliary variable algorithm, r = Z(θ)/Z(θ*) is estimated by

$$r_{aux} = \frac{s_2^{-1}\sum_{t=1}^{s_2} \tilde{P}(\mathbf{x}_2^{(t)};\hat{\theta})\,/\,\tilde{P}(\mathbf{x}_2^{(t)};\theta^*)}{s_1^{-1}\sum_{t=1}^{s_1} \tilde{P}(\mathbf{x}_1^{(t)};\hat{\theta})\,/\,\tilde{P}(\mathbf{x}_1^{(t)};\theta)}, \quad (2.54)$$

where $\mathbf{x}_1^{(1)}, \ldots, \mathbf{x}_1^{(s_1)}$ are sampled from P(X;θ) and $\mathbf{x}_2^{(1)}, \ldots, \mathbf{x}_2^{(s_2)}$ are sampled from P(X;θ*), and θ̂ is some estimate of θ. In the single-variable exchange algorithm, r = Z(θ)/Z(θ*) is estimated by

$$r_{exch} = s^{-1}\sum_{t=1}^{s} \frac{\tilde{P}(\mathbf{x}^{(t)};\theta)}{\tilde{P}(\mathbf{x}^{(t)};\theta^*)}, \quad (2.55)$$

where $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(s)}$ are sampled from P(X;θ*). Both importance sampling and the auxiliary variable method are computationally intensive and do not perform well for large-scale models or high-dimensional parameter spaces, because in each MH step they require generating samples from P(X;θ) for a given θ via the computationally expensive perfect sampling [143] or standard Gibbs sampling with long runs.
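As an illustration (not from the thesis), the estimator (2.55) can be checked on a toy model whose partition function is known in closed form: p(x; θ) ∝ exp(θx) with x ∈ {0, 1}, so Z(θ) = 1 + e^θ:

```python
import math
import random

def r_exch_toy(theta, theta_star, s=200000, seed=0):
    """Estimate Z(theta)/Z(theta_star) as in (2.55) for the toy model
    p(x; th) ∝ exp(th * x), x in {0, 1}."""
    rng = random.Random(seed)
    p1 = math.exp(theta_star) / (1.0 + math.exp(theta_star))
    total = 0.0
    for _ in range(s):
        x = 1 if rng.random() < p1 else 0       # x ~ P(.; theta_star)
        total += math.exp(theta * x - theta_star * x)
    return total / s
```

The exact ratio here is (1 + e^θ)/(1 + e^(θ*)), so the estimate can be verified directly; for a real MRF the exact ratio is unavailable and the samples would come from a perfect sampler or a long Gibbs run.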

The third method is to use the pseudolikelihood [18] to approximate P(X;θ*) and P(X;θ) in (2.51). The pseudolikelihood approximation requires less computation, but its approximate nature means the Markov chain is no longer in detailed balance, and it may yield unsatisfactory performance.


2.3.3 Inference Algorithms

So far, many inference algorithms have been studied, including variable elimination, belief prop-

agation [206], junction trees [99], sampling methods [60], and variational methods [89]. For

undirected graphs without cycles or for tree-structured directed graphs, message-passing algo-

rithms provide exact inference results with a computational cost linear in the number of variables,

namely the sum-product algorithm for computing the marginal probabilities and the max-product

algorithm for computing the most probable states. For graphical models with cycles, loopy be-

lief propagation [127, 197] and the tree-reweighted algorithm [189] can be used for approximate

inference.

2.4 Feature and Variable Selection

The dimensionality of real-world machine learning problems keeps increasing, and feature se-

lection becomes a necessary procedure in many applications, resulting in improved performance,

greater efficiency and better interpretability [73]. Features can be selected with different goals,

typically either finding all features relevant to the target class variable (termed all-relevant) or

finding a minimal feature subset optimal for classification (termed minimal-optimal) [133]. Fea-

ture selection algorithms can be categorized into three types: simple filters, filters with redundancy

removal, and wrappers (Figure 2.1). Simple filters assume independence between features and

rank them individually according to some relevance criterion. They address all-relevant problems

and are efficient. Filters with redundancy removal typically first try to identify all the relevant

features, similar to simple filters, and then remove redundant features in a second step [209]. They

address the minimal-optimal problem and require more computation. Wrapper methods iteratively

generate a candidate feature subset and test it by a specific learning algorithm’s performance, until

some criterion is satisfied [93]. They target the minimal-optimal problem and are computationally intensive. When necessary, simple filters can first reduce the dimension by filtering out non-relevant features before wrappers are used.


[Figure: workflow diagrams of the three approaches. (a) Simple filters: Original Set → Relevance Analysis → Selected Subset. (b) Filters with redundancy removal: Original Set → Relevance Analysis → Relevant Subset → Redundancy Analysis → Selected Subset. (c) Wrappers: Original Set or Processed Set → Subset Generation → Candidate Subset → Subset Evaluation → Stopping Criterion; if not satisfied, return to Subset Generation with the current best subset, otherwise output the Selected Subset.]

Figure 2.1: The workflow of the three different feature selection approaches.


Recently, a variety of feature and variable selection algorithms have appeared in both the statistics and machine learning communities, such as FCBF [209], Relief [92], DISR [119], and MRMR [139]. With the rapid increase in the number of features, some approaches focus on high-dimensional or ultrahigh-dimensional feature selection [194, 44]. One particularly popular family of approaches is based on penalized least squares or penalized pseudo-likelihood. Specific algorithms include but are not restricted to LASSO [176], SCAD [43], Lars [36], the Dantzig selector [23] and the elastic net [219]. Several recent algorithms also take into account the structure of the covariate space, such as the group lasso [210], the fused lasso with a chain structure [177], the overlapping group lasso [87, 86] and the graph lasso [86]. However, almost all the penalized least squares or penalized pseudo-likelihood feature selection methods (except the elastic net) aim to find a minimal feature subset optimal for regression or classification, which is a minimal-optimal problem. In contrast, if we regard genome-wide association studies as variable selection problems, the goal of feature selection is to identify all the features relevant to the response variable, which is an all-relevant problem.


Chapter 3

High-Dimensional Structured Feature Screening Using Markov Random Fields

Feature screening is a useful feature selection approach for high-dimensional data when the goal is to identify all the features relevant to the response variable. However, common feature screening methods do not take into account the correlation structure of the covariate space. We propose the concept of a feature relevance network, a binary Markov random field that represents the relevance of each individual feature by potentials on the nodes and the correlation structure by potentials on the edges. By performing inference on the feature relevance network, we can accordingly select relevant features. The procedure does not yield sparsity, which distinguishes it from the popular family of feature selection approaches based on penalized least squares or penalized pseudo-likelihood. We give one concrete algorithm under this framework and show its superior performance over common feature selection methods in terms of prediction error and recovery of the truly relevant features on real-world data and synthetic data.


3.1 Introduction

The dimensionality of machine learning problems keeps increasing, and feature selection becomes

a necessary procedure in many applications, resulting in improved performance, greater efficiency

and better interpretability [73]. However, feature selection in many applications becomes more

and more challenging due to both the increasing number of features and the complex correla-

tion structure among the features. For instance, in genome-wide association studies (GWAS),

researchers are interested in identifying all relevant genetic markers (single-nucleotide polymorphisms, or SNPs) among millions of candidates with hundreds or thousands of samples. Usually the truly relevant markers are rare and only weakly associated with the response variable. A screening feature selection procedure is usually the only computationally feasible method because of the high dimension, but it is typically unreliable and suffers from a high false positive rate. On

the other hand, the features are usually correlated with one another. For example in GWAS, most

SNPs are highly correlated with one or more nearby SNPs, with squared Pearson correlation co-

efficients well above 0.8. In the next paragraph, we give a toy example showing that taking into

account the correlation between features can be beneficial.

Suppose that our measured features are correlated because they are all influenced by some

hidden variable. This is often the case in GWAS, where our features are markers that are easy

to measure, but the actual underlying causal genetic variation is not measured. Suppose that our

data are generated from the Bayesian network in Figure 3.1(a). All variables are binary. Hidden

variables are denoted by H1 and H2. H1 is weakly associated with the class variable. H2 is not

associated. Both H1 and H2 have a probability of 0.5 of being 1. Observed variables A and B

are associated with H1. Observed variables C and D are associated with H2. We label the arc

from H1 to A with a 0.8 to denote that A is 0 with probability 0.8 when H1 is 0, and A is 1 with

probability 0.8 when H1 is 1. Under the distribution, the probability that A and the class variable

take the same value is 0.8 × 0.6 + (1 − 0.8) × (1 − 0.6) = 0.56, and it is the same for B. The

probability that H2 takes the same value as the class variable is 0.5, and C and D each take the same value as the class variable with probability 0.5. The probability that A and B take


[Figure: (a) the Bayesian network Class → H1 → {A, B} and H2 → {C, D}, with arc label 0.6 on Class → H1 and 0.8 on each of H1 → A, H1 → B, H2 → C, H2 → D; (b) the same network annotated with sample-based probabilities of agreement with the class variable (0.56 for A, B and C, 0.52 for D, 0.68 between correlated pairs).]

Figure 3.1: One Bayesian network example.

the same value is 0.68, and it is the same for C and D. Suppose that there are more nonassociated

hidden variables than associated ones and we generate a small sample set from this distribution

specified by the Bayesian network. There will be some nonassociated variables (i.e. C) that appear

to be as promising as associated features (i.e. B) if we only look at the sample-based probability

of agreement with the class variable. Suppose that C appears as promising as A and B, with a

probability of 0.56 agreement with the class variable. In Figure 3.1(b), the number on the dotted

edges stands for the sample-based probability of agreement with the class variable. Since D is

expected to show agreement with C with probability 0.68, we expect the sample-based probability

of agreement betweenD and the class variable to be 0.56×0.68+(1−0.56)×(1−0.68) = 0.52.

If we are using any screening method to evaluate the features, it will rank A, B and C equally

high. However in this case, we should make use of the information that C is more likely to be

a false positive because its highly correlated feature D does not appear as relevant as does A’s

(B’s) highly-correlated feature. Therefore, we seek a way of taking into account the correlation

structure in this manner during the procedure of feature selection.

Markov random fields provide a natural way of representing the relevance of each feature

and the correlation structure among the features. The relevance of each feature is represented as a

node that takes the values in {0, 1}. The correlation structure among the features is captured as the

potentials on the edges. We can regard the feature selection problem in the original covariate space

Page 45: Statistical Methods for Genome-wide Association Studies and Personalized Medicinepage/liu_thesis.pdf · 2014-08-12 · Statistical Methods for Genome-wide Association Studies and

38

as an inference problem on this binary Markov random field which is called a feature relevance

network. Section 3.2 gives a precise description of the feature relevance network and introduces

one feature selection algorithm. Sections 3.3 and 3.4 evaluate the algorithm on synthetic data and

real-world data respectively. We finally conclude in Section 3.5.

3.2 Method

3.2.1 Feature Relevance Network

Suppose that we have a supervised learning problem with d features and n samples (d ≫ n). A feature relevance network (FRN) is a binary Markov random field on a random vector X = (X_1, ..., X_d) ∈ {0, 1}^d described by an undirected graph G(V, E) with node set V and edge set E. The relevance of feature_i is represented by the state of node_i in V: X_i = 1 represents that feature_i is relevant to the response variable, whereas X_i = 0 represents that feature_i is not relevant. Correlation between X_i and X_j is denoted by an edge connecting node_i and node_j in E. The potential on node_i, φ(X_i), depicts the relative probability that feature_i is relevant to the response variable when feature_i is analyzed individually. The potential on the edge connecting node_i and node_j, ψ(X_i, X_j), depicts the relative joint probability that feature_i and feature_j are jointly relevant to the response variable. For a given FRN, the probability of a given relevance state x = (x_1, ..., x_d) is

$$P(\mathbf{x}) = \frac{1}{Z}\prod_{i=1}^{|V|}\phi(x_i)\prod_{(i,j)\in E}\psi(x_i, x_j) = \frac{1}{Z}\exp\Big(\sum_{i=1}^{|V|}\log\phi(x_i) + \sum_{(i,j)\in E}\log\psi(x_i, x_j)\Big), \quad (3.1)$$

where Z is a normalization constant and |V| = d.

Performing feature selection with an FRN involves a construction step and an inference step.


          feature_i = 0    feature_i = 1    Total
Y = 1     u_0              u_1              u
Y = 0     v_0              v_1              v
Total     n_0              n_1              n

Table 3.1: Empirical counts at feature_i with a binary response variable Y.

To construct an FRN, one needs to set φ(X_i) for i = 1, ..., |V| and ψ(X_i, X_j) for (i, j) ∈ E. Section 3.2.2 discusses the construction step in detail. In the second step, one has to find the most probable state (maximum a posteriori, or MAP) of the FRN, and the features can be selected according to this MAP state. For a binary pairwise Markov random field, finding the MAP state is equivalent to an energy function minimization problem [21], which can be exactly solved by a graph cut algorithm [95]. Section 3.2.3 discusses the inference step in detail.

3.2.2 The Construction Step

In the construction step, we set the potential functions φ(Xi) and ψ(Xi, Xj). Suppose that we are

using hypothesis testing to evaluate the relevance of each individual feature, and we observe the

test statistic S = (S1, ..., Sd). We assume that Si’s are independent given X . Suppose that the

probability density function of Si given Xi = 0 is f0, and the density of Si given Xi = 1 is f1. If

f0 and f1 are Gaussian, the model is essentially a coupled mixture of Gaussians model[191]. Here

we give one concrete example. Suppose that we are trying to identify whether a binary featurei is

relevant to the binary response variable Y ∈ {0, 1} with the empirical counts from data shown in

Table 3.1.

If we use a two-proportion z-test to test the relevance of feature_i with Y, the test statistic is

$$S_i = \frac{u_1/u - v_1/v}{\sqrt{u_0 u_1 / u^3 + v_0 v_1 / v^3}}. \quad (3.2)$$

S_i | X_i = 0 is approximately standard normally distributed, and S_i | X_i = 1 is approximately normally distributed with variance 1 and some nonzero mean δ_i. Many GWAS applications employ logistic regression followed by a likelihood ratio test to identify associated SNPs; we call this testing procedure LRLR. In this situation, S_i | X_i = 0 has an asymptotic χ² distribution with 2 degrees of freedom and S_i | X_i = 1 has an asymptotic non-central χ² distribution with 2 degrees of freedom.
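For reference, the statistic in (3.2) can be computed directly from the counts in Table 3.1; a minimal sketch (the function name is ours):

```python
import math

def two_prop_z(u0, u1, v0, v1):
    """Two-proportion z statistic of formula (3.2), where
    u = u0 + u1 counts the Y = 1 samples and v = v0 + v1
    counts the Y = 0 samples at feature_i."""
    u, v = u0 + u1, v0 + v1
    num = u1 / u - v1 / v
    den = math.sqrt(u0 * u1 / u**3 + v0 * v1 / v**3)
    return num / den
```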

In the FRN, we only connect a pair of nodes if their corresponding features are correlated.

After specifying the structure of the FRN, we have a parameter learning problem in the Markov

random field. The parameters include φ(Xi) for i = 1, ..., |V | and ψ(Xi, Xj) for (i, j) ∈ E.

We claim that learning all these parameters is extremely difficult and practically unrealistic for three reasons. First, parameter learning is inherently difficult in undirected graphical models due to the global normalization constant Z [190, 198]. Second, there are too many parameters to estimate. Last but not least, X is latent and we only have one training sample, which is S. Therefore, we propose a compromise solution as follows. Although this solution looks arbitrary, it can be easily applied in practice and has an interpretation given in formula (3.9).

The way of setting ψ(X_i, X_j) comes from the observation that the chance that X_i and X_j agree increases as the magnitude of the correlation between feature_i and feature_j increases. Therefore, if we can estimate the Pearson correlation coefficient r_ij between feature_i and feature_j, we set

$$\psi(X_i, X_j) = e^{\lambda |r_{ij}|\, I(X_i = X_j)}, \quad (3.3)$$

where λ (λ > 0) is a tradeoff parameter and I(Xi = Xj) is an indicator variable that indicates

whether Xi and Xj take the same value.

The way of setting φ(X_i) is as follows. We set

$$\phi(X_i) = e^{|X_i - q_i|}, \quad (3.4)$$

where q_i = 1 − p_i and p_i = P(feature_i is relevant). With hypothesis testing in (3.2), we

usually set pi to be 1 if the absolute value of the test statistic is greater than or equal to some


threshold ξ and 0 otherwise. We call the p_i from such a "hard" thresholding method p_i^H, namely

$$p_i^H = \begin{cases} 1, & \text{if } |S_i| \geq \xi, \\ 0, & \text{otherwise.} \end{cases}$$

We can also set p_i by Bayes' rule if we know f_1 and f_0; we call it p_i^B:

$$p_i^B = \frac{1}{\alpha f_0(s_i) + 1}, \quad (3.5)$$

where

$$\alpha = \frac{P(X_i = 0)}{f_1(s_i)\,P(X_i = 1)}. \quad (3.6)$$

However, in most cases the parameter δ_i in f_1 is unknown to us. In the two-proportion z-test in (3.2), δ_i refers to the mean parameter of f_1, which is Gaussian; in LRLR, δ_i refers to the non-centrality parameter of f_1, which is non-central χ². We can use its data-driven version δ_i^*. This step has a flattening effect on calculating p_i because it assumes the values of the test statistic for relevant features are uniformly distributed. Therefore, we introduce an adaptive procedure for calculating p_i by

$$p_i = \gamma p_i^H + (1 - \gamma) p_i^B, \quad (3.7)$$

where 0 ≤ γ ≤ 1. We choose ξ in p_i^H to be the test statistic value that makes p_i^B equal 0.5 in (3.5). Eventually, we have three parameters in the construction step, namely λ, γ and α. In practice, one can tune the three parameters via cross-validation.
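A sketch (function names and default values are ours) of how p_i could be computed in the two-proportion z-test setting, where f0 is the standard normal density and f1 is normal with mean δ_i and unit variance:

```python
import math

def normal_pdf(x, mean):
    """Normal density with unit variance."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def p_relevant(s_i, delta_i, prior_relevant=0.01, gamma=0.0, xi=None):
    """p_i from (3.5)-(3.7): f0 = N(0, 1), f1 = N(delta_i, 1).
    gamma blends the hard-thresholded p_i^H with the Bayes p_i^B."""
    f0 = normal_pdf(s_i, 0.0)
    f1 = normal_pdf(s_i, delta_i)
    alpha = (1.0 - prior_relevant) / (f1 * prior_relevant)   # formula (3.6)
    p_b = 1.0 / (alpha * f0 + 1.0)                           # formula (3.5)
    if xi is None:
        return p_b
    p_h = 1.0 if abs(s_i) >= xi else 0.0
    return gamma * p_h + (1.0 - gamma) * p_b                 # formula (3.7)
```

Setting γ = 0 recovers the purely Bayesian p_i^B, the choice used in the simulation experiments of Section 3.3.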


3.2.3 The Inference Step

For a given FRN, we need to find the most probable state which maximizes the posterior proba-

bility of (3.1) so as to select the relevant features. Finding the MAP state of the Markov random

field specified by (3.1) is equivalent to minimizing its corresponding energy function E, which is

defined as

$$E(\mathbf{x}) = -\sum_{i=1}^{|V|}\log\phi(x_i) - \sum_{(i,j)\in E}\log\psi(x_i, x_j). \quad (3.8)$$

As long as −log ψ(X_i, X_j) is submodular, the energy minimization problem can be exactly solved by the graph-cut algorithm on a weighted directed graph F(V′, E′) [95] in polynomial time. If φ(X_i) and ψ(X_i, X_j) are set as in formula (3.4) and formula (3.3), the optimization problem is

$$\min_{\mathbf{x}}\; \Big\{ \sum_{i=1}^{|V|} |x_i - p_i| + \lambda \sum_{i,j=1}^{|V|} I(x_i \neq x_j)\,|r_{ij}| \Big\}, \quad (3.9)$$

which can be interpreted as seeking a state of the FRN with two different goals. The first goal

is that the MAP state is close to the relevance of the features when evaluated individually, which

is implied by the first term. The second goal is that strongly correlated features arrive at the same

state, which is implied by the second term. We can run a max-flow-min-cut algorithm, such as the

push-relabel algorithm [69] or the augmenting path algorithm [51], to find the minimum-weight

cut of this directed graph; a cut is a set of edges whose removal eliminates all paths between the

source and sink nodes. Finally, after we cut the graph, every feature node is either connected to

the source node or connected to the sink node. We select the features that are connected with the

source node.
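For intuition about the objective, (3.9) can be minimized by exhaustive enumeration on a tiny example (a sketch of ours; at realistic scale one uses the graph cut just described, and here each pair (i, j) is counted once):

```python
from itertools import product

def frn_map_bruteforce(p, r, lam=1.0):
    """Minimize the objective (3.9) by enumeration for a small FRN.

    p[i] is the individual relevance probability p_i; r[i][j] is the
    correlation r_ij (zero for non-edges)."""
    d = len(p)
    best_x, best_e = None, float("inf")
    for x in product((0, 1), repeat=d):
        e = sum(abs(x[i] - p[i]) for i in range(d))
        e += lam * sum(abs(r[i][j])
                       for i in range(d) for j in range(i + 1, d)
                       if x[i] != x[j])
        if e < best_e:
            best_x, best_e = x, e
    return best_x
```

With a strong correlation r_01, a borderline feature (say p_1 = 0.4) is pulled into the relevant set by its highly relevant neighbor; with no correlation it is dropped, which is exactly the behavior motivated by the toy example of Section 3.1.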


3.2.4 Related Methods

A variety of feature selection algorithms appear in both the statistics and machine learning com-

munities, such as FCBF [209], Relief [92], DISR [119], MRMR [139], “cat” score [221] and CAR

score [222]. Variables can also be selected within SVMs [74, 213, 205]. With the rapid increase in the number of features, some approaches focus on high-dimensional or ultrahigh-dimensional feature selection

[194, 44]. One particular popular family of approaches is based on penalized least squares or

penalized pseudo-likelihood. Specific algorithms include but are not restricted to LASSO [176],

SCAD [43], Lars [36], Dantzig selector[23], elastic net [219], adaptive elastic net [220], Bayesian

lasso [78], pairwise elastic net [110], exclusive Lasso [215] and regularization for nonlinear vari-

able selection [151]. Several recent algorithms also take into account the structure in the covariate

space, such as group lasso [210], fused lasso with a chain structure [177], overlapping group lasso

[87, 86], graph lasso [86], group Dantzig selector [104] and EigenNet [144]. However, most of the penalized least squares or penalized pseudo-likelihood feature selection methods aim to find a minimal feature subset optimal for regression or classification, which is termed the minimal-optimal problem [133]. In contrast, the goal of feature screening in this chapter is to identify all the features relevant to the response variable, which is termed the all-relevant problem [133]. The hidden Markov random field model in our FRN has also been used in other problems, such as image segmentation [26] and gene clustering [185].

3.3 Simulation Experiments

In this section, we generate synthetic data and compare the FRN-based feature selection algorithm

with other feature selection algorithms. We generate binary classification samples with an equal

number (n) of positive samples and negative samples. In order to generate correlated features,

we introduce h hidden Bernoulli random variables H1,...,Hh. For each hidden variable Hi, we

generate m observable Bernoulli random variables Xij (j = 1, ...,m), where Xij takes the same

value as Hi with a probability ti. We set the first πh hidden variables to be the true associated


hidden variables and accordingly we have πhm true associated observable features, where π is

the prior probability of association. For associated hidden variable Hi, we set P (Hi = 1) to be

uniformly distributed on the interval [0.01,0.5]. We also set the relative risk, defined as follows,

$$rr = \frac{P(\text{positive} \mid H_i = 1)}{P(\text{positive} \mid H_i = 0)}. \quad (3.10)$$

For each nonassociated hidden variable Hi we also set P (Hi = 1) to be uniformly distributed

on the interval [0.01,0.5]; this stays the same for the positive samples and negative samples.
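The generative scheme above can be sketched as follows (our own approximation: in particular, the relative risk of (3.10) is implemented here by simply scaling the hidden variable's frequency in positive samples, which is a simplification of the thesis's exact scheme):

```python
import random

def generate(n, h, m, pi, rr, seed=0):
    """Sketch of the Section 3.3 simulation: h hidden Bernoulli
    variables, m observed copies of each that agree with their hidden
    parent with probability t ~ U(0.8, 1.0); the first round(pi * h)
    hidden variables are associated with the class label."""
    rng = random.Random(seed)
    n_assoc = round(pi * h)
    freq = [rng.uniform(0.01, 0.5) for _ in range(h)]
    t = [[rng.uniform(0.8, 1.0) for _ in range(m)] for _ in range(h)]
    X, y = [], []
    for s in range(2 * n):
        label = 1 if s < n else 0
        row = []
        for i in range(h):
            f = freq[i]
            if i < n_assoc and label == 1:
                f = min(rr * f, 1.0)   # simplified relative-risk effect
            hi = 1 if rng.random() < f else 0
            for j in range(m):
                row.append(hi if rng.random() < t[i][j] else 1 - hi)
        X.append(row)
        y.append(label)
    return X, y
```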

[Figure: a 2×3 grid of ROC plots (TPR vs. FPR), with panels for relative risk 1.1, 1.2, 1.3 and prior probability of association 0.025, 0.05, comparing the two-proportion z-test, the feature relevance network, and the elastic net.]

Figure 3.2: ROC curves of two-proportion z-test, FRN and elastic net for different prior probabilities and different relative risks.


One baseline feature screening method is the two-proportion z-test, which is given in formula (3.2); we rank the features by the P-values from the tests. The other baseline feature selection

method is the elastic net (in the R package “glmnet”). Unlike other penalized least squares or

penalized pseudo-likelihood feature selection methods, the elastic net approach does not select

a sparse subset of features and is usually good at recovery of all the relevant features. For the

elastic net penalty, we set α to 0.5, and we use a series of 20 values for λ. For our FRN energy-minimizing algorithm, we exactly follow formulas (3.5), (3.6) and (3.7). We choose a series of 20 values for α, set γ to 0, and set λ to 1. Since we have the ground truth of which features are relevant to the response variable, we can compare the ROC curves and the precision-recall curves for feature capture (i.e., we treat associated features as positives).
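The feature-capture ROC points can be computed directly from a feature ranking; a minimal sketch (the function name is ours):

```python
def roc_points(scores, is_assoc):
    """ROC for feature recovery as in Section 3.3: rank features by score
    (higher = more relevant), sweep the rank threshold, and treat the
    truly associated features as positives."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(is_assoc)
    neg = len(is_assoc) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]                      # (FPR, TPR) pairs
    for i in order:
        if is_assoc[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

pts = roc_points([0.9, 0.2, 0.8, 0.1], [True, False, True, False])
# -> [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```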

For the first set of experiments, we set n = 500, h = 1000, m = 5, ti uniformly distributed

on the interval (0.8, 1.0), π = {0.025, 0.05}, and rr = {1.1, 1.2, 1.3}. Because we have 2 values

for π and 3 values for the relative risk rr, we run the simulation a total of 6 times for different

combinations of the two parameters. The results are shown in Figure 3.2 and Figure 3.3. When

the relative risk is 1.1, it is difficult for all three algorithms to recover the relevant features. When

the relative risk is 1.2 or 1.3, our FRN algorithm outperforms the two baseline algorithms. The

prior of association π does not make much difference to the ROC curves. For the precision-recall curves, however, a larger π yields higher precision at the same recall under the same parameter configuration.

For the second set of experiments, we set n = 500, h = 1000, π = 0.05, rr uniformly

distributed on the interval (1.1, 1.3), m = {2, 5, 10}, and ti uniformly distributed on the interval

(τ, 1.0) where τ = {0.5, 0.8, 0.9}. Because we have 3 values for m and 3 choices for ti, we run

the simulation a total of 9 times for different combinations of the two parameters. The results are

shown in Figure 3.4 and Figure 3.5. When the features have many highly correlated neighbors, the FRN approach shows an advantage over the ordinary screening method and the elastic net. However, when the features have few neighbors or the neighbors are only weakly correlated, FRN helps little.



Figure 3.3: Precision-recall curves of the two-proportion z-test, FRN and elastic net for different prior probabilities and different relative risks.

3.4 Real-world Application

3.4.1 Background

A genome-wide association study analyzes genetic variation across the entire human genome,

searching for variations that are associated with a given heritable disease or trait. The GWAS

dataset on breast cancer for our experiment comes from NCI’s Cancer Genetics Markers of Susceptibility website (http://cgems.cancer.gov/data/). We name this dataset the CGEMS data. It includes



Figure 3.4: ROC curves of the two-proportion z-test, FRN and elastic net when we choose different correlation structures of covariates.

528,173 SNPs as features for 1,145 patients and 1,142 controls. Details about the data can be found in the original study [82]. This GWAS also exhibits weak association: the relative risks of the several identified SNPs are between 1.07 and 1.26 [142]. The reasons for the weak association are that (i) it is estimated that genetics accounts for only about 27% of breast cancer risk and the



Figure 3.5: Precision-recall curves of the two-proportion z-test, FRN and elastic net when we choose different correlation structures of covariates.

rest is caused by environment [102] and (ii) breast cancer and many other diseases are polygenic,

namely the genetic component is spread over multiple genes. Therefore, given equal numbers of

breast cancer patients and controls without breast cancer, the highest predictive accuracy we might

reasonably expect from genetic features alone is about 63.5%, obtainable by correctly predicting


the controls and correctly recognizing 27% of the cancer cases based on genetics. If we select

SNPs already identified to be associated with breast cancer by other studies (for example, one study [142] uses a much larger dataset, which includes 4,398 cases and 4,316 controls, and confirms results on 21,860 cases and 22,578 controls), we get a set of 19 SNPs (the closest thing to ground truth we have for this task). Using these 19 SNPs as input to leading classification algorithms, such as support vector machines, results in at most 55% predictive accuracy.

3.4.2 Experiments on CGEMS Data

Since we do not know which SNPs are truly associated, we are unable to evaluate the recovery of the truly relevant features as we do in Section 3.3. Instead, we compare the performance of supervised learning when coupled with the feature selection algorithms. The baseline feature selection methods include (i) logistic regression with likelihood ratio test (LRLR), (ii) FCBF [209], (iii) Relief [92] and (iv) lasso penalized logistic regression (LassoLR) [203]. Because SVMs have been shown to perform particularly well on high-dimensional data such as genetic data [196], we employ the SVM as our machine learning algorithm to test the performance of the feature selection methods.

All the experiments are run in a stratified 10-fold cross-validation fashion, using the same folds

for each approach, and each feature selection method is paired with a linear SVM. For running

the SVM, we convert the SNP value AA into 1, AB into 0, and BB into −1 where A stands for

the common allele at this locus and B stands for the rare allele. For each fold, the entire training

process (feature selection and supervised learning) is repeated using only the training data in that

fold before predictions are made on the test set of that fold, to ensure a fair evaluation. For all

feature selection approaches, we tune the parameters in a nested cross-validation fashion. In each

training-testing experiment of the 10-fold cross-validation, we have 9 folds for training and 1 fold

for testing. On the 9 folds of training data, we carry out a 9-fold cross-validation (8 folds for

training and 1 fold for tuning) to select the best parameters. Since we have almost equal numbers

of cases and controls, we use accuracy to measure the classification performance for both inner

and outer cross-validation.
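The genotype encoding described above can be sketched directly; the function name is ours, and mapping the heterozygous order "BA" to 0 is our assumption (the text only mentions AB):

```python
def encode_genotype(gt: str) -> int:
    """Numeric coding used for the linear SVM in Section 3.4.2:
    AA (homozygous common allele) -> 1, AB (heterozygous) -> 0,
    BB (homozygous rare allele) -> -1."""
    return {"AA": 1, "AB": 0, "BA": 0, "BB": -1}[gt]

row = [encode_genotype(g) for g in ("AA", "AB", "BB")]  # -> [1, 0, -1]
```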


We build the FRN based on LRLR. Namely, we follow the calculation of pi in Section 3.2.2.

Then we exactly use formula (3.4) and formula (3.3) to set the φ(Xi) and ψ(Xi, Xj). α in (3.5)

and (3.6) essentially determines the threshold of the mapping function which maps the test statistic

to the association probability pi. Our tuning considers 5 values of α, namely 500, 1000, 1500,

2500, and 5000. γ in (3.7) determines the slope of the mapping function. We consider 5 values

of γ, namely 0.0, 0.25, 0.5, 0.75, and 1.0. λ in (3.9) is the tradeoff parameter between fitness

and smoothness. Our tuning considers 4 values of λ, namely 0.25, 0.5, 0.75, and 1.0. Usually if

there are multiple parameters to tune in supervised learning, one might use grid search. However,

since a full grid search would involve 100 parameter configurations in total, it risks overfitting the parameters. Instead, we tune the parameters one by one. We first tune α based on the average

performance over the different γ and λ values. With the best α value, we then tune γ based

on the average performance over different λ values. Finally we tune λ with the selected α and

γ configuration. The computation for correlation between features can result in high run-time

and space requirements if the number of features is large. General push-relabel algorithms and

augmenting-path algorithms both have O(|V |2|E|) time complexity. Owing to these two reasons,

it is necessary to remove a portion of irrelevant SNPs in the first step to reduce the complexity

when applying the FRN-based feature selection algorithm to this GWAS data. Therefore, in the

experiments on the GWAS data we only keep the top k SNPs based on the individual relevance

measurements. Tuning k may lead to better performance. Since we already have three parameters

to tune for the energy-minimizing algorithm, we fix k at 50,000. For the baseline algorithms, there

is one parameter f , the number of features to select for supervised learning. We tune it with 20

values, namely 50, 100, 150, ..., and 1000.
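The one-by-one tuning scheme can be sketched as below; `score` is a hypothetical stand-in for the inner-cross-validation accuracy of one (α, γ, λ) configuration.

```python
ALPHAS = [500, 1000, 1500, 2500, 5000]
GAMMAS = [0.0, 0.25, 0.5, 0.75, 1.0]
LAMBDAS = [0.25, 0.5, 0.75, 1.0]

def tune_one_by_one(score):
    """Sequential tuning from Section 3.4.2: choose alpha by its average
    score over all (gamma, lambda) pairs, then gamma given the chosen
    alpha (averaging over lambda), then lambda given both."""
    avg = lambda vals: sum(vals) / len(vals)
    a = max(ALPHAS, key=lambda a_: avg([score(a_, g, l)
                                        for g in GAMMAS for l in LAMBDAS]))
    g = max(GAMMAS, key=lambda g_: avg([score(a, g_, l) for l in LAMBDAS]))
    l = max(LAMBDAS, key=lambda l_: score(a, g, l_))
    return a, g, l

# Toy score peaking at the values actually selected in Section 3.4.3.
toy = lambda a, g, l: -((a - 1000) / 5000) ** 2 - (g - 0.5) ** 2 - (l - 0.75) ** 2
best = tune_one_by_one(toy)  # -> (1000, 0.5, 0.75)
```

Compared with a full grid, this evaluates each parameter against averaged settings of the others, which is the design the text motivates to reduce overfitting of the tuned parameters.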

As listed in Table 3.2, linear SVM’s average accuracy is 53.08% when the FRN algorithm is

used. When LRLR, FCBF, Relief and LassoLR are used, linear SVM’s average accuracies are

50.64%, 51.68%, 50.90% and 48.75% respectively. We perform a significance test on the 10

accuracies from the 10-fold cross-validation using a two-sided paired t-test. The FRN algorithm

significantly outperforms the logistic regression with likelihood ratio test algorithm and the lasso


Alg   LRLR    FCBF    Relief   LassoLR   FRN
Acc   50.64   51.68   50.90    48.75     53.08
P     0.021   0.367   0.069    0.007     –

Table 3.2: The classification accuracy (%) of linear SVM coupled with different feature selection methods: logistic regression with likelihood ratio test (LRLR), FCBF, Relief, lasso penalized logistic regression (LassoLR) and feature relevance network (FRN), followed by the P-values from the significance test (two-sided paired t-test) comparing the baseline algorithms with FRN.

penalized logistic regression algorithm at the 0.05 level.

3.4.3 Validating Findings on Marshfield Data

The Personalized Medicine Research Project [115], sponsored by Marshfield Clinic, was used

as the sampling frame to identify 162 breast cancer cases and 162 controls. The project was

reviewed and approved by the Marshfield Clinic IRB. Subjects were selected using clinical data

from the Marshfield Clinic Cancer Registry and Data Warehouse. Cases were defined as women

having a confirmed diagnosis of breast cancer. Both the cases and controls had to have at least

one mammogram within 12 months prior to having a biopsy. The subjects also had DNA samples

that were genotyped using the Illumina HumanHap660 array, as part of the eMERGE (electronic MEdical Records and GEnomics) network [116]. In total, 522,204 SNPs have been genotyped after the quality assurance step. Despite the difference in genotyping chips and the different quality assurance processes, 493,932 of these SNPs also appear in the CGEMS breast cancer data. Due to the

small sample size, it is undesirable to repeat the same experiment procedure in Section 3.4.2

on Marshfield data. However, we can use it to validate the results from the experiment on the

CGEMS data. We apply FRN and LRLR on CGEMS data, and compare the log odds-ratio of

the selected SNPs by the two approaches on Marshfield data. The CGEMS dataset was also used

by another study [201]. They proposed a novel multi-SNP test approach, the logistic kernel-machine test (LKM-test), and demonstrated that it outperformed the individual-SNP analysis method and other


state-of-the-art multi-SNP test approaches such as the genomic-similarity-based test [199] and the

kernel-based test [126]. Based on the CGEMS data, LKM-test identified 10 SNP sets (genes) to be

associated with breast cancer. The 10 SNP sets include 195 SNPs. We set FRN to select the same

number of relevant SNPs on the CGEMS data, and we compare the SNPs identified by LKM-test

and the SNPs identified by FRN on a different real-world GWAS dataset on breast cancer so as to

compare the performance of LKM-test and FRN.

We run FRN and LRLR on the entire CGEMS dataset and validate the selected SNPs on

Marshfield data. For FRN, we tune the parameters from the 10-fold cross validation similarly.

The selected parameters for FRN are α = 1000, γ = 0.5, and λ = 0.75. In total, FRN selected

428 SNPs from the CGEMS data; 393 of them appear in the Marshfield data. We pick the top

423 SNPs selected by LRLR, which also results in 393 SNPs overlapping with Marshfield data. On Marshfield data we compare the log odds-ratios of the 393 SNPs selected by FRN and the 393 SNPs selected by LRLR via the quantile-quantile plot (Q-Q plot) given in Figure 3.6(a).

On the CGEMS data the LKM-test selected 195 SNPs, 178 of which appear in Marshfield data. To

ensure a fair comparison, we pick the top 194 of the 428 SNPs selected by FRN using their individual P-values, which also yields 178 SNPs in Marshfield data. We also compare the log odds-ratios of the 178 SNPs selected by FRN and the 178 SNPs selected by LKM-test via the Q-Q plot given in Figure 3.6(b). If the log odds-ratios of the SNPs selected by two different methods come from the same distribution, the points should lie on the 45-degree line (the red straight lines in the plots) in the Q-Q plot. However, in both plots we observe obvious discrepancies at the tails.

When comparing the log odds-ratio on a different cohort, the top SNPs picked up by FRN appear

to be much more relevant to the disease than the top SNPs selected by either LRLR or LKM-test.
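The per-SNP log odds-ratio compared in the Q-Q plots can be computed from a 2×2 case/control count table; a minimal sketch, where the +0.5 Haldane continuity correction for zero cells is our addition:

```python
import math

def log_odds_ratio(case_carrier, case_noncarrier, ctrl_carrier, ctrl_noncarrier):
    """Log odds-ratio of carrying the risk variant given disease status,
    with a +0.5 continuity correction so zero cells stay finite."""
    a, b, c, d = (v + 0.5 for v in
                  (case_carrier, case_noncarrier, ctrl_carrier, ctrl_noncarrier))
    return math.log((a * d) / (b * c))

balanced = log_odds_ratio(10, 10, 10, 10)  # -> 0.0 (no enrichment in cases)
```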

3.5 Discussion

We propose the feature relevance network as a further step for feature screening which takes

into account the correlation structure among features. For simulations in Section 3.3, it took a

few hours to finish all runs on a single CPU. For results in Section 3.4, we finished, including



Figure 3.6: Q-Q plots for (a) comparing the log odds-ratios of the SNPs selected by FRN and the SNPs selected by LRLR and (b) comparing the log odds-ratios of the SNPs selected by FRN and the SNPs selected by LKM-test. The selection of SNPs is done on CGEMS data; the log odds-ratios are calculated on Marshfield data.

tuning parameters, in two weeks in a parallel computing environment (∼20 CPUs). Besides the computational burden, another drawback is that our algorithm only returns the variables selected according to the MAP state; it does not provide P-values or other measures for each variable. In this

chapter, the correlation structure among the features is pairwise, which is represented as edges in

an undirected graph. However, there are also other types of correlation structure which one might

want to provide as prior knowledge, such as the features coming from groups (may or may not

overlap), chain structures or tree structures. Representing all these types of correlation structure

with the help of Markov random fields will be one important direction for future research.

In this chapter, the goal of feature screening is to identify all the features relevant to the response variable, which is termed the all-relevant problem [133], although we also compare the

prediction performance of supervised learning due to the lack of the ground truth in the real-

world GWAS application in Section 3.4. In some other applications, the goal of feature selection

is to find a minimal feature subset optimal for classification or regression, which is termed the

minimal-optimal problem [133]. We do not address the minimal-optimal problem at all in the


present chapter. For solving the minimal-optimal problem in a high-dimensional structured covariate space, many approaches have been well studied under the lasso framework [176]. Specific

algorithms include but are not restricted to group lasso [210], fused lasso with a chain structure

[177], overlapping group lasso [87, 86], graph lasso [86] and group Dantzig selector [104].

The material in this chapter first appeared in the 15th International Conference on Artificial

Intelligence and Statistics (AISTATS’2012) as follows:

Jie Liu, Chunming Zhang, Catherine McCarty, Peggy Peissig, Elizabeth Burnside and David

Page. High-Dimensional Structured Feature Screening Using Binary Markov Random Fields. The

15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

The next chapter reformulates this feature selection approach as a multiple testing procedure

that has many elegant properties, including controlling false discovery rate at a specified level and

significantly improving the power of the tests by leveraging dependence.


Chapter 4

Multiple Testing under Dependence via

Parametric Graphical Models

Large-scale multiple testing tasks often exhibit dependence, and leveraging the dependence between individual tests remains a challenging and important problem in statistics. With recent

advances in graphical models, it is feasible to use them to perform multiple testing under depen-

dence. We propose a multiple testing procedure which is based on a Markov-random-field-coupled

mixture model. The ground truth of hypotheses is represented by a latent binary Markov random

field, and the observed test statistics appear as the coupled mixture variables. The parameters in

our model can be automatically learned by a novel EM algorithm. We use an MCMC algorithm

to infer the posterior probability that each hypothesis is null (termed local index of significance),

and the false discovery rate can be controlled accordingly. Simulations show that the numerical

performance of multiple testing can be improved substantially by using our procedure. We apply

the procedure to a real-world genome-wide association study on breast cancer, and we identify

several SNPs with strong association evidence.


4.1 Introduction

Observations from large-scale multiple testing problems often exhibit dependence. For instance,

in genome-wide association studies, researchers collect hundreds of thousands of highly corre-

lated genetic markers (single-nucleotide polymorphisms, or SNPs) with the purpose of identifying

the subset of markers associated with a heritable disease or trait. In functional magnetic reso-

nance imaging studies of the brain, thousands of spatially correlated voxels are collected while

subjects are performing certain tasks, with the purpose of detecting the relevant voxels. The most

popular family of large-scale multiple testing procedures is the false discovery rate analysis, such

as the p-value thresholding procedures [12, 14, 63], the local false discovery rate procedure [38],

and the positive false discovery rate procedure [166, 167]. However, all these classical multiple

testing procedures ignore the correlation structure among the individual factors, and the question

is whether we can reduce the false non-discovery rate by leveraging the dependence, while still

controlling the false discovery rate in multiple testing.

Graphical models provide an elegant way of representing dependence. With recent advances

in graphical models, especially more efficient algorithms for inference and parameter learning, it

is feasible to use these models to leverage the dependence between individual tests in multiple

testing problems. One influential paper [172] in the statistics community uses a hidden Markov

model to represent the dependence structure, and has shown its optimality under certain conditions

and its strong empirical performance. It is the first graphical model (and the only one so far) used

in multiple testing problems. However, their procedure can only deal with a sequential dependence

structure, and the dependence parameters are homogenous. In this chapter, we propose a multiple

testing procedure based on a Markov-random-field-coupled mixture model which allows arbitrary

dependence structures and heterogeneous dependence parameters. This extension requires more

sophisticated algorithms for parameter learning and inference. For parameter learning, we design

an EM algorithm with MCMC in the E-step and persistent contrastive divergence algorithm [178]

in the M-step. We use the MCMC algorithm to infer the posterior probability that each hypothesis

is null (termed local index of significance or LIS). Finally, the false discovery rate can be controlled


by thresholding the LIS. Section 4.2 introduces related work and our procedure. Sections 4.3 and

4.4 evaluate our procedure on a variety of simulations, and the empirical results show that the

numerical performance can be improved substantially by using our procedure. In Section 4.5, we

apply the procedure to a real-world genome-wide association study (GWAS) on breast cancer, and

we identify several SNPs with strong association evidence. We finally conclude in Section 4.6.

4.2 Method

4.2.1 Terminology and Previous Work

              Not rejected   Rejected   Total
Null              N00           N10       m0
Non-null          N01           N11       m1
Total              S             R        m

Table 4.1: Classification of tested hypotheses

Suppose that we carry out m tests whose results can be categorized as in Table 4.1. False

discovery rate (FDR), defined as E(N10/R | R > 0)P(R > 0), depicts the expected proportion of incorrectly rejected null hypotheses [12]. False non-discovery rate (FNR), defined as

E(N01/S|S > 0)P (S > 0), depicts the expected proportion of false non-rejections in those tests

whose null hypotheses are not rejected [62]. An FDR procedure is valid if it controls FDR at a

nominal level, and optimal if it has the smallest FNR among all the valid FDR procedures [172].
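For a single realization of Table 4.1, the false discovery and non-discovery proportions (whose expectations are FDR and FNR) can be computed directly; a minimal sketch with our own function name:

```python
def fdp_fnp(truth, rejected):
    """Empirical counterparts of FDR and FNR for one realization of
    Table 4.1: truth[i] = 1 means H_i is non-null, rejected[i] = 1
    means H_i was rejected."""
    n10 = sum(t == 0 and r == 1 for t, r in zip(truth, rejected))
    n01 = sum(t == 1 and r == 0 for t, r in zip(truth, rejected))
    R = sum(rejected)
    S = len(rejected) - R
    fdp = n10 / R if R > 0 else 0.0   # share of rejections that were null
    fnp = n01 / S if S > 0 else 0.0   # share of non-rejections that were non-null
    return fdp, fnp

fdp, fnp = fdp_fnp(truth=[0, 1, 1, 0], rejected=[1, 1, 0, 0])  # -> (0.5, 0.5)
```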

The effects of correlation on multiple testing have been discussed, under different assumptions,

with a focus on the validity issue [16, 49, 136, 155, 34, 46, 150, 204, 19]. The efficiency issue has

also been investigated [207, 64, 11, 212], indicating FNR could be decreased by considering de-

pendence in multiple testing. Several approaches have been proposed, such as dependence kernels

[100], factor models [53] and principal factor approximation [42]. [172] explicitly use a hidden

Markov model (HMM) to represent the dependence structure and analyze the optimality under the



Figure 4.1: The MRF-coupled mixture model for three dependent hypotheses Hi, Hj and Hk with observed test statistics (xi, xj and xk) and latent ground truth (θi, θj and θk). The dependence is captured by potential functions parameterized by φij, φjk and φik, and the coupled mixtures are parameterized by ψ.

compound decision framework [171]. However, their procedure can only deal with sequential dependence, and it uses only a single dependence parameter throughout. In this chapter, we replace the HMM with a Markov-random-field-coupled mixture model, which allows richer and more flexible dependence structures. Markov-random-field-coupled mixture models are related to the hidden Markov random field models used in many image segmentation problems [214, 26, 28].

4.2.2 The Multiple Testing Procedure

Let x = (x1, ..., xm) be a vector of test statistics from a set of hypotheses (H1, ...,Hm). The

ground truth of these hypotheses is denoted by a latent Bernoulli vector θ = (θ1, ..., θm) ∈

{0, 1}m, with θi = 0 denoting that the hypothesis Hi is null and θi = 1 denoting that the hypothesis Hi is non-null. The dependence among these hypotheses is represented as a binary Markov

random field (MRF) on θ. The structure of the MRF can be described by an undirected graph


G(V, E) with node set V and edge set E. The dependence between Hi and Hj is denoted by an edge connecting node i and node j in E, and the strength of the dependence is parameterized by the potential function on that edge. The degree of prior belief that Hi is null is captured by the node potential function (parameterized by πi, 0 < πi < 1). Suppose that the probability density function

of the test statistic xi given θi = 0 is f0, and the density of xi given θi = 1 is f1. Then, x is an

MRF-coupled mixture. The mixture model is parameterized by a parameter set ϑ = (π, φ, ψ), where π and φ parameterize the binary MRF and ψ parameterizes f0 and f1. For example, if f0 is the standard normal N(0, 1) and f1 is the noncentered normal N(µ, 1), then ψ only contains the parameter µ. Figure 4.1 shows the MRF-coupled mixture model for three dependent hypotheses Hi, Hj and Hk.

In our MRF-coupled mixture model, x is observable, and θ is hidden. With the parameter set

ϑ = (π,φ,ψ), the joint probability density over x and θ is

P(x, θ | ϑ) = P(θ; π, φ) ∏_{i=1}^{m} P(xi | θi; ψ).    (4.1)

Define the marginal probability that Hi is null given all the observed statistics x under the

parameters in ϑ, Pϑ(θi = 0|x), to be the local index of significance (LIS) for Hi [172]. If we can

accurately calculate the posterior marginal probabilities of θ (or LIS), then we can use a step-up

procedure to control FDR at the nominal level α as follows [172]. We first sort LIS from the

smallest value to the largest value. Suppose LIS(1), LIS(2), ..., and LIS(m) are the ordered LIS,

and the corresponding hypotheses are H(1), H(2), ..., and H(m). Let

k = max{ i : (1/i) ∑_{j=1}^{i} LIS(j) ≤ α }.    (4.2)

Then we reject H(i) for i = 1, ..., k.
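The step-up rule in Eq. (4.2) is straightforward to implement; a minimal sketch (the function name is ours):

```python
def lis_threshold(lis, alpha):
    """Step-up rule of Eq. (4.2): sort the LIS values ascending and find
    the largest k whose running mean of sorted LIS stays <= alpha.
    Returns the indices (into the original array) of rejected hypotheses."""
    order = sorted(range(len(lis)), key=lambda i: lis[i])
    total, k = 0.0, 0
    for i, idx in enumerate(order, start=1):
        total += lis[idx]
        if total / i <= alpha:     # mean of the i smallest LIS values
            k = i
    return order[:k]

rejected = lis_threshold([0.01, 0.6, 0.02, 0.30, 0.9], alpha=0.1)  # -> [0, 2]
```

Here the running means of the sorted LIS are 0.01, 0.015, 0.11, ..., so k = 2 and the first and third hypotheses are rejected.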

Therefore, the key inferential problem that we need to solve is that of computing the posterior

marginal distribution of the hidden variables θi given the test statistics x, namely Pϑ(θi = 0|x),

for i = 1, ...,m. It is a typical inference problem if the parameters in ϑ are known. Section 4.2.3


provides possible inference algorithms for calculating Pϑ(θi = 0|x) for given ϑ. However, ϑ is

usually unknown in real-world applications, and we need to estimate it. Section 4.2.4 provides a

novel EM algorithm for parameter learning in our MRF-coupled mixture model.

4.2.3 Posterior Inference

Now we are interested in calculating Pϑ(θi = 0|x) for a given parameter set ϑ. One popular family of inference algorithms is the sum-product family [97], also known as belief propagation [206].

For loop-free graphs, belief propagation algorithms provide exact inference results with a computational cost linear in the number of variables. In our MRF-coupled mixture model, the structure of the latent MRF is described by a graph G(V, E). When G is chain structured, the instantiation of belief propagation is the forward-backward algorithm [8]. When G is tree structured, the instantiation of belief propagation is the upward-downward algorithm [32]. For graphical models with

cycles, loopy belief propagation [127, 197] and the tree-reweighted algorithm [189] can be used

for approximate inference. Other inference algorithms for graphical models include junction trees

[99], sampling methods [60], and variational methods [89]. Recent papers [160, 159] discuss exact

inference algorithms on binary Markov random fields which allow loops. In our simulations, we

use belief propagation when the graph G has no loops. When G has loops (e.g. in the simulations

on genetic data and the real-world application), we use a Markov chain Monte Carlo (MCMC)

algorithm to perform inference for Pϑ(θi = 0|x).

4.2.4 Parameters and Parameter Learning

In our procedure, the dependence among these hypotheses is represented by a graphical model on

the latent vector θ parameterized by π and φ, and observed test statistics x are represented by

the coupled mixture parameterized by ψ. In Sun and Cai’s work on HMMs, φ is the transition

parameter and ψ is the emission parameter. One implicit assumption in their work is that the

transition parameter and the emission parameter stay the same for all i (i = 1, ...,m). Our extension

to MRFs also allows us to untie these parameters. In the second set of basic simulations in Section


4.3, we make φ and ψ heterogeneous and investigate how this affects the numerical performance.

In the simulations on genetic data in Section 4.4 and the real-world GWAS application in Section

4.5, we have different parameters for SNP pairs with different levels of correlation.

In our model, learning (π, φ, ψ) is difficult for two reasons. First, learning parameters is difficult by nature in undirected graphical models due to the global normalization constant [190, 198]. State-of-the-art MRF parameter learning methods include MCMC-MLE [66], contrastive divergence [80] and variational methods [58]. Several new sampling methods with higher efficiency have recently been proposed, such as persistent contrastive divergence [178], fast-weight contrastive divergence [179], tempered transitions [153], and particle-filtered MCMC-MLE [4]. In

our procedure, we use the persistent contrastive divergence algorithm to estimate parameters π

and φ. Another difficulty is that θ is latent and we only have one observed training sample x.

We use an EM algorithm to solve this problem. In the E-step, we run our MCMC algorithm in

Section 4.2.3 to infer the latent θ based on the currently estimated parameters ϑ = (π,φ,ψ).

In the M-step, we run the persistent contrastive divergence (PCD) algorithm [178] to estimate π

and φ from the currently inferred θ. Note that PCD is also an iterative algorithm, and we run it

until it converges in each M-step. In the M-step, we also do a maximum likelihood estimation

of ψ from the currently inferred θ and observed x. We run the EM algorithm until π, φ and ψ

converge. Although this EM algorithm involves intensive computation in both E-step and M-step,

it converges very quickly in our experiments.
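As a simplified illustration of the E- and M-steps, the following toy EM handles the independent special case (no MRF coupling), where the E-step posteriors come from Bayes' rule instead of MCMC; in the full procedure the π and φ updates would additionally run PCD. Names are illustrative:

```python
import numpy as np

def em_mixture(x, n_iter=50):
    """Toy EM for the independent special case:
    x_i ~ (1 - pi1) N(0, 1) + pi1 N(mu, 1), both variances known.

    Illustrates the psi update only: in the full procedure the E-step
    marginals come from MRF inference (MCMC), and the M-step for pi and
    phi runs persistent contrastive divergence.
    """
    pi1, mu = 0.1, 1.0                      # crude initial values
    for _ in range(n_iter):
        # E-step: posterior probability that each hypothesis is non-null
        f0 = np.exp(-0.5 * x ** 2)
        f1 = np.exp(-0.5 * (x - mu) ** 2)
        gamma = pi1 * f1 / ((1 - pi1) * f0 + pi1 * f1)
        # M-step: closed-form MLE given the soft assignments
        pi1 = gamma.mean()
        mu = (gamma * x).sum() / gamma.sum()
    return pi1, mu
```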

4.3 Basic Simulations

In the basic simulations, we investigate the numerical performance of our multiple testing ap-

proach on different fabricated dependence structures where we can control the ground truth pa-

rameters. We first simulate θ from P (θ;π,φ) and then simulate x from P (x|θ;ψ) under a variety

of settings of ϑ = (π,φ,ψ). Because we have the ground truth parameters, we have two versions

of our multiple testing approach, namely the oracle procedure (OR) and the data-driven procedure

(LIS). The oracle procedure knows the true parameters ϑ in the graphical models, whereas the


data-driven procedure does not and has to estimate ϑ. The baseline procedures include the BH

procedure [12] and the adaptive p-value procedure (AP) [14, 63], which were compared in [172].

We include another baseline procedure, the local false discovery rate procedure (localFDR) [38].

The adaptive p-value procedure requires a consistent estimate of the proportion of the true null

hypotheses. The localFDR procedure requires a consistent estimate of the proportion of the true

null hypotheses and the knowledge of the distribution of the test statistics under the null and under

the alternative. In our simulations, we endow AP and localFDR with the ground truth values of these quantities so that these baseline procedures achieve their best performance.
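For reference, the BH step-up rule used as a baseline can be sketched as follows (a minimal sketch of the standard procedure, with illustrative names):

```python
import numpy as np

def bh_reject(pvals, alpha=0.10):
    """Benjamini-Hochberg step-up rule: reject the k smallest p-values,
    where k is the largest index with p_(k) <= alpha * k / m."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[: k + 1]] = True
    return reject
```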

In the simulations, we assume that the observed xi under the null hypothesis (namely θi = 0)

is standard-normally distributed and that xi under the alternative hypothesis (namely θi = 1) is

normally distributed with mean µ and standard deviation 1.0. We choose the setup and parameters

to be consistent with [172] when possible. In total, we consider three MRF models, namely a

chain-structured MRF, tree-structured MRF and grid-structured MRF. For chain-MRF, we choose

the number of hypotheses m = 3,000. For tree-MRF, we choose perfect binary trees of height 12, which yields a total of 8,191 hypotheses. For grid-MRF, we choose the number of rows and the number of columns to be 100, which yields a total of 10,000 hypotheses. In all the experiments, we choose the number of replications N = 500, which is also the same as in [172].

In total, we have three sets of simulations with different goals as follows.

Basic simulation 1: We stay consistent with [172] in the simulations except that we use the three

MRF models. In all three structures, (θ1, ..., θm) is generated from the MRFs whose potentials on the edges are the 2×2 matrix

    ( φ        1 − φ )
    ( 1 − φ    φ     ).

Therefore, the parameter set φ contains only the single parameter φ, and ψ includes only the single parameter µ.
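For the chain structure, this setup is straightforward to simulate: with uniform node potentials, the chain-MRF with the edge potentials above is equivalent to a symmetric Markov chain that keeps its previous state with probability φ. A sketch (illustrative names):

```python
import numpy as np

def simulate_chain(m, phi, mu, seed=0):
    """Simulate (theta, x) for the chain-MRF of basic simulation 1.

    With uniform node potentials, the chain-MRF with edge potentials
    [[phi, 1-phi], [1-phi, phi]] is equivalent to a symmetric Markov
    chain that keeps its previous state with probability phi.
    """
    rng = np.random.default_rng(seed)
    theta = np.empty(m, dtype=int)
    theta[0] = rng.integers(0, 2)
    for i in range(1, m):
        theta[i] = theta[i - 1] if rng.random() < phi else 1 - theta[i - 1]
    # x_i ~ N(0, 1) under the null, N(mu, 1) under the alternative
    x = rng.normal(mu * theta, 1.0)
    return theta, x
```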

Basic simulation 2: One assumption in basic simulation 1 is that the parameters φ and µ are

homogeneous in the sense that they stay the same for all i (i = 1, ..., m). This assumption is car-

ried down from [172]. However in many real-world applications, the transition parameters can

be different across the multiple hypotheses. Similarly, the test statistics for the non-null hy-

potheses, although normally distributed and standardized, could have different µ values. There-


fore, we investigate the situation where the parameters can vary in different hypotheses. The

simulations are carried out for all three different dependence structures aforementioned. In the

first set of simulations, instead of fixing φ, we choose φ’s uniformly distributed on the interval

(0.8 −∆(φ)/2, 0.8 + ∆(φ)/2). In the second set of simulations, instead of fixing µ, we choose

µ’s uniformly distributed on the interval (2.0 − ∆(µ)/2, 2.0 + ∆(µ)/2). The oracle procedure

knows the true parameters. The data-driven procedure does not know the parameters, and assumes

the parameters are homogeneous.
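A sketch of the heterogeneous parameter draws (illustrative names; the centers 0.8 and 2.0 follow the setup above):

```python
import numpy as np

def draw_heterogeneous(m, delta_phi, delta_mu, seed=0):
    """Draw per-edge phi_i on (0.8 - delta_phi/2, 0.8 + delta_phi/2) and
    per-hypothesis mu_i on (2.0 - delta_mu/2, 2.0 + delta_mu/2)."""
    rng = np.random.default_rng(seed)
    phis = rng.uniform(0.8 - delta_phi / 2, 0.8 + delta_phi / 2, size=m - 1)
    mus = rng.uniform(2.0 - delta_mu / 2, 2.0 + delta_mu / 2, size=m)
    return phis, mus
```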

Basic simulation 3: Another implicit assumption in basic simulation 1 is that each individual test

in the multiple testing problem is exact. Many widely used hypothesis tests, such as Pearson’s

χ2 test and the likelihood ratio test, are asymptotic in the sense that we only know the limiting

distribution of the test statistics for large samples. As an example, we simulate the two-proportion

z-test in this section and show how the sample size affects the performance of the procedures

when the individual test is asymptotic. Suppose that we have n samples (half of them are positive

samples and half of them are negative samples). For each sample, we havem Bernoulli distributed

attributes. A fraction of the attributes are relevant. If the attribute A is relevant, then the probability of “heads” in the positive samples (p_A^+) is different from that in the negative samples (p_A^−); p_A^+ and p_A^− are the same if A is nonrelevant. For each individual test, the null hypothesis is that the attribute is not relevant, and the alternative hypothesis is otherwise. The two-proportion z-test can be used to test whether p_A^+ − p_A^− is zero, which yields an asymptotic N(0, 1)

under the alternative (µ is nonzero). In the simulations, we fix µ, but vary the sample size n, and

apply the aforementioned tree-MRF structure (m = 8, 191). The oracle procedure and localFDR

only know the limiting distribution of the test statistics and assume the test statistics exactly follow

the limiting distributions even when the sample size is small.
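The pooled two-proportion z statistic used here can be sketched as (illustrative names):

```python
import math

def two_prop_z(k_pos, n_pos, k_neg, n_neg):
    """Pooled two-proportion z statistic for H0: p_A^+ = p_A^-.

    Asymptotically N(0, 1) under the null; for small n the true
    distribution of the statistic can deviate from this limit.
    """
    p_pos = k_pos / n_pos
    p_neg = k_neg / n_neg
    p_pool = (k_pos + k_neg) / (n_pos + n_neg)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_pos + 1 / n_neg))
    return (p_pos - p_neg) / se
```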

Figure 4.2 shows the numerical results in basic simulation 1. Figures (1a)-(1f) are for the

chain structure. Figures (2a)-(2f) are for tree structure. Figures (3a)-(3f) are for the grid structure.

In Figures (1a)-(1c), (2a)-(2c) and (3a)-(3c), we set µ = 2 and plot FDR, FNR and the average

number of true positives (ATP) when we vary φ between 0.2 and 0.8. In Figures (1d)-(1f), (2d)-(2f) and (3d)-(3f), we set φ = 0.8 and plot FDR, FNR and ATP when we vary µ between 1.0 and 4.0.

[Figure 4.2: Comparison of BH, AP, localFDR, OR, and LIS in basic simulation 1: (1) chain-MRF, (2) tree-MRF, (3) grid-MRF; (a) FDR vs φ, (b) FNR vs φ, (c) ATP vs φ, (d) FDR vs µ, (e) FNR vs µ, (f) ATP vs µ.]

The nominal FDR level is set to be 0.10. From Figure 4.2, we can observe comparable numerical

results between the chain structure and tree structure. The FDR levels of all five procedures are

controlled at 0.10 and BH is conservative. From the plots for FNR and ATP, we can observe that

the data-driven procedure performs almost the same as the oracle procedure, and they dominate the

p-value thresholding procedures BH and AP. The oracle procedure and the data-driven procedure

also dominate localFDR except when φ = 0.5, when they perform comparably. This is to be

expected because the dependence structure is no longer informative when φ is 0.5. In this situation

when the hypotheses are independent, our procedure reduces to the localFDR procedure. As φ

departs from 0.5 and approaches either 0 or 1.0, the difference between OR/LIS and the baselines

gets larger. When the individual hypotheses are easy to test (large µ values), the differences

between them are not substantial. When we turn to the grid structure, the numerical performance

is similar to that in the chain structure and the tree structure except for two observations. First, the

data-driven procedure does not appear to control the FDR at 0.1 when µ is small (e.g. µ = 1.0),

although the oracle procedure does, which indicates the parameter estimation in the EM algorithm

is difficult when µ is small. In other words, with a limited number of hypotheses, it is difficult to

estimate the pairwise potential parameters if the test statistics of the non-nulls do not look much

different from the test statistics of the nulls. The second observation is that the slopes of the FNR

curve and ATP curve for the grid structure are different from those in the chain and tree structures.

The reason is that the connectivity in the grid structure is higher than that in the chain and tree.

Therefore we can observe that even when the individual hypotheses are difficult to test (small µ

values), the FNR is still low because each individual hypothesis has more neighbors in the grid

than in the chain or tree, and the neighbors are informative.

Figure 4.3 shows the numerical performance in basic simulation 2. Figures (1a)-(1f), (2a)-(2f),

and (3a)-(3f) correspond to the chain structure, the tree structure and the grid structure respectively.

In Figures (1a)-(1c), (2a)-(2c), and (3a)-(3c), we set µ = 2 and vary ∆(φ) between 0 and 0.4. In

Figures (1d)-(1f), (2d)-(2f), and (3d)-(3f), we set φ = 0.8 and vary ∆(µ) between 0 and 4.0.

[Figure 4.3: Comparison of BH, AP, localFDR, OR, and LIS in basic simulation 2: (1) chain-MRF, (2) tree-MRF, (3) grid-MRF; (a) FDR vs ∆(φ), (b) FNR vs ∆(φ), (c) ATP vs ∆(φ), (d) FDR vs ∆(µ), (e) FNR vs ∆(µ), (f) ATP vs ∆(µ).]

Again, the nominal FDR level is set to be 0.10. From Figure 4.3, we observe that all five proce-

dures control FDR at the nominal level and BH is conservative when the transition parameter φ is

heterogeneous. However, the data-driven procedure becomes more and more conservative as we

increase the variance of φ in the grid structure. Nevertheless, the data-driven procedure does not

lose much efficiency compared with the oracle procedure based on FNR and ATP. Both the data-

driven procedure and the oracle procedure dominate the three baselines. When the µ parameter

is heterogeneous, all five procedures are still valid, but the data-driven procedure becomes more

and more conservative as we increase the variance of µ. The data-driven procedure can be more

conservative than the BH procedure when ∆(µ) is large enough. The conservativeness appears

most severe in the grid structure. However, when we look at the FNR and ATP, the data-driven

procedure still dominates BH, AP and localFDR substantially in all the situations, although the

data-driven procedure loses a certain amount of efficiency compared with the oracle procedure

when the variance of µ gets large.

 

[Figure 4.4: Comparing BH, AP, localFDR, OR, and LIS in basic simulation 3: (a) FDR vs n, (b) FNR vs n, (c) ATP vs n.]

Figure 4.4 shows the results from basic simulation 3. The oracle procedure and localFDR are

liberal when the sample size is small. This is because when the sample size is small, there exists

a discrepancy between the true distribution of the test statistic and the limiting distribution. Quite

surprisingly, the data-driven procedure stays valid. The reason is that the data-driven procedure

can estimate the parameters from data. The data-driven procedure and the oracle procedure still


have comparable performance and enjoy a much lower level of FNR compared with the baselines.

For all the basic simulations, we set the nominal FDR level to be 0.10. We have also replicated the

basic simulations with the nominal level set to 0.05, and similar conclusions hold.

4.4 Simulations on Genetic Data

Unlike the fabricated dependence structures in the basic simulations in Section 4.3, the depen-

dence structure in the simulations on genetic data in this section is real. We simulate the linkage

disequilibrium structure of a segment on human chromosome 22, and treat a test of whether a SNP

is associated as one individual test. We follow the simulation settings in [201]. We use HAPGEN2

[170] and the CEU sample of HapMap [175] (Release 22) to generate SNP genotype data at each

of the 2,420 loci between bp 14431347 and bp 17999745 on Chromosome 22. A total of 685 out of the 2,420 SNPs can be genotyped with the Affymetrix 6.0 array. These are the typed SNPs that we

use for our simulations. Among the 2,420 SNPs, we randomly select 10 SNPs to be the causal SNPs. All the SNPs on the Affymetrix 6.0 array whose r² values with any of the causal SNPs, according to HapMap, are above t are set to be the associated SNPs. In the simulations, we report results for three different t values, namely 0.8, 0.5 and 0.25. We also simulate three different genetic models (additive, dominant, and recessive) with different levels of relative risk (1.2 and 1.3). In total, we simulate 250 cases and 250 controls. The experiment is replicated 100 times and the average result is reported. With the simulated data, we apply

our multiple testing procedure (LIS) and three baseline procedures: the BH procedure, the adap-

tive p-value procedure (AP), and the local false discovery rate procedure (localFDR). Because the

dependence structure is real and the ground truth parameters are unknown to us, we do not have

the oracle procedure in the simulations on genetic data.

With the simulated genetic data, we use two commonly used tests in genetic association stud-

ies, namely the two-proportion z-test and the Cochran-Armitage trend test (CATT) [31, 3, 163, 52], as

the individual tests for the association of each SNP. CATT also yields an asymptotic N (0, 1)

under the null and N (µ, 1) under the alternative (µ is nonzero). Therefore, we parameterize


ψ = (µ1, σ1²), where µ1 and σ1² are the mean and variance of the test statistics under the alternative.

The graph structure is built as follows. Each SNP becomes a node in the graph. For each SNP,

we connect it to the SNP that has the highest r² value with it. There are in total 490 edges in the

graph. We further categorize the edges into a high correlation edge set Eh (r2 above 0.8), medium

correlation edge set Em (r2 between 0.5 and 0.8) and low correlation edge set El (r2 between 0.25

and 0.5). We have three different parameters (φh, φm, and φl) for the three sets of edges. Then

the density of θ in formula (4.1) takes the form

P(θ; φ) ∝ exp{ ∑_{(i,j)∈Eh} φh I(θi = θj) + ∑_{(i,j)∈Em} φm I(θi = θj) + ∑_{(i,j)∈El} φl I(θi = θj) },    (4.3)

where I(θi = θj) is an indicator variable that indicates whether θi and θj take the same value.
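A sketch of the graph construction and the unnormalized log-density in (4.3) (illustrative names; the r² thresholds follow the edge-set definitions above):

```python
import numpy as np

def build_edge_sets(r2):
    """For each SNP, add an edge to the SNP with the highest r^2 with it,
    then bucket edges: high (r^2 > 0.8), medium (0.5-0.8), low (0.25-0.5).
    Note: r2 is modified in place (diagonal cleared)."""
    m = r2.shape[0]
    sets = {"high": [], "med": [], "low": []}
    np.fill_diagonal(r2, -1.0)
    for i in range(m):
        j = int(np.argmax(r2[i]))
        v = r2[i, j]
        e = (min(i, j), max(i, j))
        if v > 0.8:
            sets["high"].append(e)
        elif v > 0.5:
            sets["med"].append(e)
        elif v > 0.25:
            sets["low"].append(e)
    for k in sets:                       # drop duplicate edges picked
        sets[k] = sorted(set(sets[k]))   # from both endpoints
    return sets

def log_potential(theta, edge_sets, phis):
    """Unnormalized log P(theta; phi) of formula (4.3): for each edge set,
    phi_set times the number of edges with theta_i == theta_j."""
    total = 0.0
    for name, edges in edge_sets.items():
        total += phis[name] * sum(theta[i] == theta[j] for i, j in edges)
    return total
```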

In the MCMC algorithm, we run the Markov chain for 20,000 iterations with a burn-in of 100

iterations. In the PCD algorithm, we generate 100 particles. In each iteration of PCD learning,

the particles move forward for 5 iterations (the n parameter in PCD-n). The learning rate in PCD

gradually decreases as suggested by [178]. The EM algorithm converges after about 10 to 20

iterations, which usually take less than 10 minutes on a 3.00GHz CPU.

Figure 4.5 shows the performance of the procedures in the additive models with the homozy-

gous relative risk set to 1.2 and 1.3. The test statistics are from a two-proportion z-test. We

have also replicated the simulations on Cochran-Armitage’s trend test, and the results are almost

the same. In Figure 4.5, table (1) summarizes the empirical FDR and the total number of true

positives (#TP) of our LIS procedure, BH, AP and localFDR (lfdr), in the additive models with

different (homozygous) relative risk levels, when we vary t and when we vary the nominal FDR

level α. We regard a SNP having r2 above t with any causal SNP as an associated SNP, and we

regard a rejection of the null hypothesis for an associated SNP as a true positive. Our LIS pro-

cedure and localFDR are valid while being conservative. BH and AP appear liberal in some of

the configurations.

[Figure 4.5: Comparison of BH, AP, localFDR and LIS in the additive models when we vary the relative risk rr, t, and the nominal FDR level α. Table (1) summarizes the empirical FDR and #TP. Subfigures (2a)-(2f) show ROC and PR curves of LIS (solid red lines) and individual p-values (dashed green lines) with rr = 1.2. Subfigures (3a)-(3f) show ROC and PR curves of LIS (solid red lines) and individual p-values (dashed green lines) with rr = 1.3.]

In any of the circumstances, our LIS procedure can identify more associated

SNPs than the baselines. The results in Figure 4.3 offer a clue to why our LIS procedure is conservative. In basic simulation 2, we observe that when the parameters µ and φ are

heterogeneous and we carry out the data-driven procedure under the homogeneous parameter as-

sumption, the data-driven procedure is conservative. The discrepancy between the nominal FDR

level and the empirical FDR level increases as the parameters move further away from homogene-

ity. Although we assign three different parameters φh, φm, and φl to Eh, Em and El respectively,

the edges within the same set (e.g. El) may still be heterogeneous. The fact that the LIS procedure

recaptures more true positives than the baselines while remaining more conservative in many con-

figurations indicates that the local indices of significance provide a ranking more efficient than the

ranking provided by the p-values from the individual tests. Therefore, we further plot the ROC

curves and precision-recall (PR) curves when we rank SNPs by LIS and by the p-values from the

two-proportion z-test. The ROC and PR curves are vertically averaged over the 100 replications.

Subfigures (2a)-(2f) are for the additive model with homozygous relative risk level set to be 1.2.

Subfigures (3a)-(3f) are for the additive model with homozygous relative risk level set to be 1.3.

It is observed that the curves from LIS dominate those from the p-values from individual tests in

most places, which further suggests that LIS provides a more efficient ranking of the SNPs than

the individual tests.
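For completeness, the step-up rule that turns ranked LIS values into rejections (a minimal sketch of the Sun and Cai style rule, with illustrative names):

```python
import numpy as np

def lis_threshold(lis, alpha=0.10):
    """Step-up rule on LIS values: reject the k hypotheses with the
    smallest LIS, with k the largest index at which the running mean of
    the sorted LIS values is still <= alpha."""
    lis = np.asarray(lis)
    order = np.argsort(lis)
    running_mean = np.cumsum(lis[order]) / np.arange(1, len(lis) + 1)
    k = int(np.sum(running_mean <= alpha))  # running mean is nondecreasing
    reject = np.zeros(len(lis), dtype=bool)
    reject[order[:k]] = True
    return reject
```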

Figure 4.6 shows the performance of the procedures in the dominant model and the recessive

model with the homozygous relative risk set to be 1.2. The test statistics are from a two-proportion

z-test. In Figure 4.6, table (1) summarizes the empirical FDR and the total number of true pos-

itives (#TP) of our LIS procedure, BH, AP and localFDR (lfdr) in the dominant model and the

recessive model when we vary t and when we vary the nominal FDR level α. Our LIS proce-

dure and localFDR are valid while being conservative in all configurations, and they appear more

conservative in the recessive model than in the dominant model. On the other hand, BH and AP

appear liberal in the recessive model. Our LIS procedure still confers an advantage over the base-

lines in the dominant model. The LIS procedure also recaptures almost the same number of true


positives as BH and AP while maintaining a much lower FDR in the recessive model. Again,

we further plot the ROC curves and precision-recall curves when we rank SNPs by LIS and by

the p-values from individual tests. Subfigures (2a)-(2f) are for the dominant model. Subfigures

(3a)-(3f) are for the recessive model. It is also observed that the curves from LIS dominate those

from the p-values from individual tests in most places, which also suggests that LIS provides a

more efficient ranking.

4.5 Real-world Application

Our primary GWAS dataset on breast cancer is the CGEMS dataset; details about the CGEMS dataset are provided in Subsection 3.4.1. Our secondary GWAS dataset comes from the Marshfield Clinic; details about the Marshfield dataset are provided in Subsection 3.4.3.

We apply our multiple testing procedure on the CGEMS data. The settings of the procedure are

the same as in the simulations on genetic data in Section 4.4. The individual test is the two-proportion z-test. Our procedure reports 32 SNPs with an LIS value of 0.0 (an estimated probability 1.0 of

being associated). We further calculate the per-allele odds-ratio of these SNPs on the Marshfield

data, and 14 of them have an odds-ratio around 1.2 or above. There are two clusters among

them. First, rs3870371, rs7830137 and rs920455 (on chromosome 8) are located near each other

and near the gene hyaluronan synthase 2 (HAS2) which has been shown to be associated with

invasive breast cancer by many studies [182, 101, 17]. The other cluster includes rs11200014,

rs2981579, rs1219648, and rs2420946 on chromosome 10. They are exactly the 4 SNPs reported

by [82]. Their associated gene FGFR2 is also well known to be associated with breast cancer.

SNP rs4866929 on chromosome 5 is also very likely to be associated because it is highly correlated

(r2=0.957) with SNP rs981782 (not included in our data) which was identified from a much larger

dataset (4,398 cases and 4,316 controls, with a follow-up confirmation stage on 21,860 cases and 22,578 controls) by [33].

[Figure 4.6: Comparison of BH, AP, localFDR and LIS in the dominant model and the recessive model with different t values and different nominal FDR α values. Table (1) summarizes the empirical FDR and #TP. Subfigures (2a)-(2f) show ROC and PR curves of LIS (solid red lines) and individual p-values (dashed green lines) in the dominant model. Subfigures (3a)-(3f) show ROC and PR curves of LIS and individual p-values in the recessive model.]

4.6 Discussion

In this chapter, we use an MRF-coupled mixture model to leverage the dependence in multiple

testing problems, and show the improved numerical performance on a variety of simulations and

its applicability in a real-world GWAS problem. A theoretical question of interest is whether this

graphical model based procedure is optimal in the sense that it has the smallest FNR among all

the valid procedures. The optimality of the oracle procedure can be proved under the compound

decision framework [171, 172], as long as an exact inference algorithm exists or an approximate

inference algorithm can be guaranteed to converge to the correct marginal probabilities. The

asymptotic optimality of the data-driven procedure (the FNR yielded by the data-driven procedure

approaches the FNR yielded by the oracle procedure as the number of tests m → ∞) requires

consistent estimates of the unknown parameters in the graphical models. Parameter learning in

undirected models is more complicated than in directed models due to the normalization constant.

To the best of our knowledge, asymptotic properties of parameter learning for hidden MRFs and

MRF-coupled mixture models have not been investigated. Therefore, we cannot prove the asymp-

totic optimality of the data-driven procedure so far, although we can observe its close-to-oracle

performance in the basic simulations.

The material in this chapter first appeared in the 28th Conference on Uncertainty in Artificial

Intelligence (UAI’2012) as follows:

Jie Liu, Chunming Zhang, Catherine McCarty, Peggy Peissig, Elizabeth Burnside and David

Page. Graphical-model Based Multiple Testing under Dependence, with Applications to Genome-

wide Association Studies. The 28th Conference on Uncertainty in Artificial Intelligence (UAI),

2012.

The graphical model used in this chapter is fully parametric. However in practice, f1 is often

heterogeneous, and cannot be estimated with a simple parametric distribution. The next chapter

proposes a semiparametric graphical model for multiple testing under dependence, which esti-

mates f1 adaptively. This semiparametric approach is still effective to capture the dependence

among multiple hypotheses, and it exactly generalizes the local FDR procedure [38] and connects


with the BH procedure [12].


Chapter 5

Multiple Testing under Dependence via

Semiparametric Graphical Models

By extending earlier work of Sun and Cai [172], the previous chapter shows that graphical models

can be used to leverage the dependence in large-scale multiple testing problems, with significantly

improved performance [107]. These graphical models are fully parametric and require that we

know the parameterization of f1 — the density function of the test statistic under the alternative

hypothesis. However in practice, f1 is often heterogeneous, and cannot be estimated with a sim-

ple parametric distribution. This chapter proposes a novel semiparametric approach for multiple

testing under dependence, which estimates f1 adaptively. This semiparametric approach exactly

generalizes the local FDR procedure [38] and connects with the BH procedure [12]. A variety

of simulations show that our semiparametric approach outperforms classical procedures which

assume independence and the parametric approaches which capture dependence.

5.1 Introduction

High-throughput computational biology studies, such as gene expression analysis and genome-

wide association studies, often involve large-scale multiple testing problems which exhibit de-


pendence in the sense that whether the null hypothesis of one test is true or not depends on the

ground truth of other tests. Recently, new multiple testing procedures have been proposed with

such dependence explicitly captured by graphical models such as hidden Markov models [172]

and Markov-random-field-coupled mixture models [107]. These graphical models are fully para-

metric, and they assume that we know not only the parameterization form of f0, but also the

parameterization form of f1. 1 Eventually, a fully parametric graphical model is learned, and

the multiple testing problem becomes an inference problem on the graphical model. This para-

metric approach is effective in some simple situations, but the assumptions for f1 often make it

impractical, as discussed next.

A long tradition in hypothesis testing is to derive test statistics and calculate P -values all under

the null hypothesis H0. When testing multiple hypotheses, we control familywise error rate via

Bonferroni correction, or we control false discovery rate via the Benjamini-Hochberg (BH) pro-

cedure [12], both of which are P -value thresholding procedures, and all calculation is done under

H0. Statisticians avoid making assumptions about f1 because the distribution of the test statistic

under H1 sometimes can be difficult to derive. Take for instance a two-proportion z-test, which

tests whether two Bernoulli variables have the same parameter (i.e., P(head) in coin flips);

the two-proportion z-test is widely used in case-control studies (e.g. comparing the minor allele

frequencies in cases and controls). Under H0 (the two proportions are the same), the test statistic

X asymptotically follows a standard normal N(0, 1). Under H1 (the two proportions are different), X asymptotically follows a standardized non-centered normal N(µ, 1) (µ ≠ 0) where

µ depends on the odds-ratio of this genetic marker. When there are multiple genetic markers to

be tested, f0 remains N (0, 1), but f1 becomes a mixture of Gaussians because these associated

markers can have different odds-ratios and therefore different µ values (i.e. different effect sizes).

In this situation, f1 is no longer a simple parametric distribution. In a real-world genome-wide association study on breast cancer, we plot the estimated f1 in Figure 5.1; it is clearly inappropriate to estimate f1 with a simple parametric distribution. Note that this is not a problem for classical multiple testing procedures such as the BH procedure, whose calculations of P-values are done under H0, but it is a serious problem for the graphical-model-based procedures, which require f1 to be estimated parametrically. Therefore, the key question is whether we can still make use of graphical models to leverage the dependence among the hypotheses without making assumptions about f1.

Figure 5.1: Estimated f1 in a real-world genome-wide association study on breast cancer.

1 f0 and f1 are the probability density functions of the test statistic under the null hypothesis H0 and the alternative hypothesis H1, respectively. In the HMM model [172] and the MRF-coupled mixture model [107], f0 and f1 are the emitting probabilities for state 0 and state 1, respectively.
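The two-proportion z-test discussed above can be made concrete with a short sketch. The allele counts and the function name below are hypothetical illustrations, not data from the study.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z statistic, e.g. comparing minor allele
    counts k1/n1 in cases against k2/n2 in controls."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                       # pooled proportion under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Under H0 the statistic is approximately N(0, 1); under H1 it is
# approximately N(mu, 1), with mu growing with the marker's effect size.
z = two_proportion_z(300, 1000, 250, 1000)          # hypothetical counts
```

With many associated markers of different effect sizes, the non-null statistics follow normals with different means, which is exactly why f1 becomes a mixture.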

In this paper, we propose a semiparametric graphical model to leverage the dependence among

the hypotheses. In our model, f1 is estimated nonparametrically and the remaining parts are esti-

mated parametrically. More algorithmic details are introduced in Section 5.3 after we summarize

the terminology in Section 5.2. Section 5.4 shows that the two widely-used multiple testing pro-

cedures, the BH procedure [12] and the local FDR procedure [38], estimate their parameters in the

same semiparametric way to avoid assumptions about f1. This unification demonstrates that the

most appropriate way of using graphical models to capture the dependence is the semiparametric

model in our paper rather than the fully parametric models [172, 107]. Simulations in Section

5.5 show that our semiparametric approach controls false discovery rate and reduces false non-

discovery rate, compared with the baseline procedures. We apply the procedure to a real-world

genome-wide association study on breast cancer in Section 5.6 and identify a number of genetic

variants.


              RETAINED   REJECTED   TOTAL
H0 IS TRUE    N00        N10        m0
H0 IS FALSE   N01        N11        m1
TOTAL         S          R          m

Table 5.1: Classification of tested hypotheses

5.2 Preliminaries

FDR, FNR, Validity and Efficiency: When we test m hypotheses simultaneously, various out-

comes can be described by Table 5.1 based on their ground truth and whether the hypotheses are re-

jected. False discovery rate (FDR),E(N10/R|R>0)P (R>0), is the expected proportion of incor-

rectly rejected null hypotheses [12]. False non-discovery rate (FNR), E(N01/S|S>0)P (S>0),

is the expected proportion of false non-rejections in those tests whose null hypotheses are not

rejected [62]. An FDR procedure is valid if it controls FDR at a nominal level α. One valid proce-

dure is more efficient than another if it has a smaller FNR. In multiple testing problems, we would

like to control FDR at the nominal level and reduce FNR as much as possible.
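To make Table 5.1 concrete, the following sketch (a hypothetical helper, not from the thesis) computes the realized false discovery and false non-discovery proportions for one replication; FDR and FNR are the expectations of these quantities over replications.

```python
def fdp_fnp(theta, rejected):
    """Realized N10/R and N01/S from ground truth theta (1 = H0 false)
    and rejection decisions, following Table 5.1."""
    n10 = sum(1 for t, r in zip(theta, rejected) if t == 0 and r)      # false rejections
    n01 = sum(1 for t, r in zip(theta, rejected) if t == 1 and not r)  # false non-rejections
    R = sum(rejected)
    S = len(theta) - R
    return (n10 / R if R else 0.0, n01 / S if S else 0.0)

theta = [0, 0, 1, 1, 0]
rejected = [True, False, True, False, False]
fdp, fnp = fdp_fnp(theta, rejected)   # one true null rejected, one non-null retained
```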

Dependence in Multiple Testing: Classical multiple testing procedures usually assume inde-

pendence among the hypotheses. The effects of dependence on multiple testing have been inves-

tigated with a focus on the validity issue, namely how to control FDR at the nominal level when

dependence exists [16, 49, 147, 136, 155, 34, 46, 150, 169, 204, 19]. Despite FDR-control challenges, dependence also brings opportunities for decreasing FNR. This efficiency issue has been

investigated [207, 64, 11, 212], indicating FNR could be decreased by leveraging the dependence

among hypotheses. Several approaches have been proposed, such as dependence kernels [100],

factor models [53] and principal factor approximation [42]. [172] use a hidden Markov model to

explicitly leverage chain dependence structures. [107] extend such graphical-model-based

approaches to general dependence structures via a Markov-random-field-coupled mixture model.

Capturing the dependence in multiple testing in such an explicit manner is innovative, but it relies

on the strong assumption that we know the parameterization of f1, which is unrealistic in all but

the simplest situations. Improper assumption of f1 may make the testing procedure too liberal,


e.g. Figure 4 of [172], or conservative, e.g. Figure 3 of [107]. In this paper, we build on the

approach of [107] and take the major step of relaxing this assumption by estimating f1 adaptively.

5.3 Methods

5.3.1 Graphical models for Multiple Testing

Let x = (x1, ..., xm) be a vector of test statistics from hypotheses (H1, ...,Hm) with their ground

truth denoted by a latent Bernoulli vector θ = (θ1, ..., θm) ∈ {0, 1}m, with θi = 0 denoting

that the hypothesis Hi is null and θi = 1 denoting that the hypothesis Hi is non-null. In [107],

the dependence among these hypotheses is represented as a binary Markov random field (MRF)

on θ. The structure of the MRF is assumed to be known, and described by an undirected graph

G(V, E) with the node set V and the edge set E . The dependence between Hi and Hj is denoted

by an edge connecting node i and node j. The strength of dependence is captured by a potential

function (parametrized by φij , 0<φij<1) on this edge. The degree of prior belief that Hi is

null is captured by the node potential function (parametrized by πi, 0<πi<1). Suppose that the

probability density function of the test statistic xi|θi=0 is f0, and the density of xi|θi=1 is f1.

Then (x,θ;π,φ, f0, f1) forms an MRF-coupled mixture model where π and φ are node potential

functions and edge potential functions in the MRF. In the MRF-coupled mixture model, x is

observed, and θ is hidden. We also need to estimate π, φ and f1. 2

For the reasons discussed in Section 5.1, it is often difficult to estimate f1 with a simple para-

metric distribution. In order to avoid the f1 assumption, we estimate f1 adaptively via an indirect,

nonparametric way, as introduced in Section 5.3.2. Then we estimate π and φ via a contrastive di-

vergence style algorithm, as introduced in Section 5.3.3. Therefore the graphical model is learned

semiparametrically — f1 is learned nonparametrically and the MRF part is learned by estimating

parameters φ and π. Finally, we perform marginal inference of θ|x with the learned model and

reject hypotheses with a step-up procedure to control FDR, as introduced in Section 5.3.4. Figure

2 f0 is usually known to us in hypothesis testing.


Figure 5.2: The semiparametric graphical model for hypotheses Hi, Hj and Hk with observed test statistics (xi, xj, xk) and latent ground truth (θi, θj, θk).

5.2 shows the semiparametric MRF-coupled mixture model for the three dependent hypotheses

Hi, Hj and Hk.

5.3.2 Nonparametric Estimation of f1

We cannot directly estimate f1 from observed x because the ground truth θ is hidden. However,

we can estimate f from observed x nonparametrically via kernel density estimation. Therefore,

we can estimate f1 indirectly using the rule of total probability

f(x) = p0 f0(x) + (1 − p0) f1(x), (5.1)

where p0 is the proportion of null hypotheses. Since we know f0 in advance (e.g. N (0, 1)),

we only need to estimate f and p0 so as to estimate f1.

Estimating p0: We can estimate p0 with the method in [166], namely


p̂0(λ) = W(λ) / ((1 − λ) m), (5.2)

where λ ∈ [0, 1) is a tuning parameter, and W (λ) is the total number of hypotheses whose

P -values are above λ. The motivation of this estimation is that the P -values of null hypotheses are

uniformly distributed on the interval (0, 1). If we assume all the hypotheses with P -values greater

than λ are from null hypotheses, then W (λ)/(1 − λ) is the total number of null hypotheses.

Therefore the right hand side of (5.2) is an estimate of p0. Obviously, p̂0(λ) over-estimates p0

because there might be nonnull hypotheses whose P -values are greater than λ, especially when

λ is small. Therefore, a bias-variance trade-off is present in the choice of λ — a larger λ value

yields less bias but brings in more variance. [168] showed that the BH procedure coupled with

p̂0(λ) maintains strong control of FDR under mild conditions. In simulations, we test different λ

values, and the results show that the performance of our multiple testing procedure is insensitive

to different choices of λ.
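The estimator in (5.2) is a few lines of code. This is a generic sketch with a hypothetical function name; the clipping at 1 is a common convention not spelled out in the text.

```python
def estimate_p0(pvalues, lam=0.8):
    """Estimate of p0 from Eq. (5.2): W(lam) / ((1 - lam) * m),
    where W(lam) counts P-values above lam; clipped at 1."""
    m = len(pvalues)
    W = sum(1 for p in pvalues if p > lam)
    return min(1.0, W / ((1 - lam) * m))

# Null P-values are Uniform(0, 1), so roughly (1 - lam) * m of them land
# above lam; dividing the observed count by that quantity estimates p0.
```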

Estimating f : Since we can observe all the test statistics x, we can estimate f directly via

kernel density estimation [152]. One may choose any kernel function and bandwidth parameter

as long as they provide a reasonable estimate. A Gaussian kernel would be a natural choice.

Nevertheless, in our experiments we use the Epanechnikov kernel because its computational burden is low and it is optimal in a minimum variance sense [39]. Finally we obtain f̂, the nonparametric estimate of f.

Estimating f1: With the estimated p̂0 and f̂, we estimate f1 as

f̂1(x) = (f̂(x) − p̂0 f0(x)) / (1 − p̂0). (5.3)
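The two estimation steps above can be sketched with NumPy. The Epanechnikov kernel weight, the default bandwidth, and the clipping of negative values to zero are the only choices made here beyond the text; the function names are illustrative.

```python
import numpy as np

def epanechnikov_kde(x, grid, h):
    """Kernel density estimate of f on a grid, using the Epanechnikov
    kernel 0.75 * (1 - u^2) on |u| <= 1."""
    u = (grid[:, None] - x[None, :]) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return k.sum(axis=1) / (len(x) * h)

def estimate_f1(x, grid, p0, f0, h=0.3):
    """Plug-in estimate from Eq. (5.3): (f_hat - p0 * f0) / (1 - p0),
    clipped at zero since small negative values are estimation noise."""
    f_hat = epanechnikov_kde(np.asarray(x), np.asarray(grid), h)
    return np.clip((f_hat - p0 * f0(grid)) / (1.0 - p0), 0.0, None)
```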

5.3.3 Parametric Estimation of φ and π

The pairwise potential functions φ and the node potential functions π parametrize the Markov

random field part of the model. In the simulations, we tie all the pairwise potential functions

together, i.e. φ={φ}. In the real-world application in Section 5.6, we assume there are three types


of edges (high correlation edges, medium correlation edges and low correlation edges), and there

are three parameters, φ={φh, φm, φl}, corresponding to the three levels of correlation. We also

tie all the node potentials in both the simulations and the real-world application, i.e. π={π}.

Parameter learning for MRFs is generally difficult due to the partition function. So far, the

state-of-the-art parameter learning algorithms are based on contrastive divergence [80], such as the

persistent contrastive divergence (PCD) algorithm [178]. Contrastive divergence algorithms are

iterative algorithms which gradually update parameters by generating particles based on current

estimates of parameters and then comparing the moments from the particles with the moments

from the data. Contrastive divergence is related to pseudo-likelihood [18] and ratio matching

[83, 84]. However, contrastive divergence algorithms cannot be directly applied to our model

because θ is hidden. Therefore, we modify the PCD algorithm as follows. Suppose we have already generated particles for θ as in the standard PCD algorithm. We further generate the particles for x

using f0 and f1 conditional on the generated particles for θ. Then we update the parameters by

comparing the moments from the particles for x with the moments from the observed x.

5.3.4 Inference of θ and FDR Control

After we estimate f1, φ and π, the MRF-coupled mixture model is fully specified, and the next

important step is to calculate the posterior probability that Hi is null given all the observed statistics x, namely P(θi=0|x) for i = 1, ..., m. This quantity is termed the local index of significance

(LIS) [172], which reduces to local false discovery rate P (θi=0|xi) when the hypotheses are in-

dependent. In our simulations and the real-world application, we use a Markov chain Monte Carlo

(MCMC) algorithm to perform posterior inference for P (θi=0|x).
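As one concrete possibility (the text does not specify which MCMC sampler is used), a single-site Gibbs sampler for a chain-structured model with tied potentials can estimate LIS by averaging indicator samples of θi = 0. The parameterization follows Section 5.3.1; the densities, data, and function names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(theta, pi, phi, lik_ratio):
    """One Gibbs sweep over a chain MRF: node potential pi for theta_i = 0,
    edge potential phi for agreeing neighbours, lik_ratio[i] = f1(x_i)/f0(x_i)."""
    m = len(theta)
    for i in range(m):
        w0, w1 = pi, (1.0 - pi) * lik_ratio[i]
        for j in (i - 1, i + 1):                    # chain neighbours
            if 0 <= j < m:
                w0 *= phi if theta[j] == 0 else 1.0 - phi
                w1 *= phi if theta[j] == 1 else 1.0 - phi
        theta[i] = int(rng.random() < w1 / (w0 + w1))

def estimate_lis(x, pi, phi, f0, f1, sweeps=400, burn=100):
    """Monte Carlo estimate of LIS_i = P(theta_i = 0 | x)."""
    lr = f1(x) / f0(x)
    theta = np.zeros(len(x), dtype=int)
    hits = np.zeros(len(x))
    for t in range(sweeps):
        gibbs_sweep(theta, pi, phi, lr)
        if t >= burn:
            hits += (theta == 0)
    return hits / (sweeps - burn)

# Toy example: unnormalised N(0,1) vs N(3,1) densities (only the ratio matters).
f0 = lambda z: np.exp(-z ** 2 / 2)
f1 = lambda z: np.exp(-(z - 3) ** 2 / 2)
x = np.array([0.0, 0.0, 5.0, 5.0, 0.0])
lis = estimate_lis(x, pi=0.6, phi=0.8, f0=f0, f1=f1)
```

Statistics far in the alternative's range (here x = 5) receive LIS near zero, while statistics near zero keep LIS near one.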

After we calculate the posterior marginal probabilities of θ (i.e. LIS), we use a step-up pro-

cedure [172] to decide which of the hypotheses should be rejected so as to control FDR at the

nominal level α. We first sort LIS from the smallest value to the largest value. Suppose LIS(1),

LIS(2), ..., and LIS(m) are the ordered LIS, and the corresponding hypotheses are H(1), H(2),...,

and H(m). Let


k = max{ i : (1/i) ∑_{j=1}^{i} LIS(j) ≤ α }. (5.4)

Then we reject H(i) for i = 1, ..., k.
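Given the LIS values, the step-up rule of Eq. (5.4) is a few lines of code (hypothetical helper name):

```python
def lis_step_up(lis, alpha=0.10):
    """Reject the k hypotheses with smallest LIS, where k is the largest i
    whose running mean of sorted LIS values is at most alpha (Eq. 5.4)."""
    order = sorted(range(len(lis)), key=lambda i: lis[i])
    total, k = 0.0, 0
    for i, idx in enumerate(order, start=1):
        total += lis[idx]
        if total / i <= alpha:
            k = i
    return set(order[:k])            # indices of rejected hypotheses

rejected = lis_step_up([0.01, 0.02, 0.20, 0.50], alpha=0.10)
```

The running mean of the sorted LIS values estimates the FDR incurred by rejecting the i most significant hypotheses, so the rule rejects as many as possible while keeping that estimate below α.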

5.4 Connections with Classical Multiple Testing Procedures

We show that both the local FDR procedure [38] and the BH procedure [14, 63] can be regarded

as semiparametric graphical models which do not consider dependence among the hypotheses.

The local FDR procedure uses Bayes' theorem to calculate the posterior probability that Hi is null

given its observed test statistic xi, namely

P(Hi is null | Xi = xi) = p0 f0(xi) / (p0 f0(xi) + p1 f1(xi)). (5.5)

This posterior probability is termed the local false discovery rate [37]. Note that our LIS

reduces to local false discovery rate under the assumption of independence. [37] recommend

using empirical Bayes inference [148] to calculate local false discovery rate as

P(Hi is null | Xi = xi) = p̂0 f0(xi) / f̂(xi), (5.6)

where f̂ is the empirical density estimate of the test statistic and p̂0 is an estimate of p0. If we use θi

to denote the ground truth of Hi, its local false discovery rate is P (θi = 0|Xi=xi). Therefore,

we can use the graphical model in Figure 5.3(a) to denote it. Obviously, this model is exactly

our semiparametric model in Figure 5.2, except that there are no pairwise potentials capturing

the dependence because the local FDR procedure assumes independence among the hypotheses.

The model for the local FDR procedure is also semiparametric because f1 is nonparametrically

estimated. Also note that the parameter π in our model reduces to the prior parameter p0 in this

simplified model.
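Under independence, Eq. (5.6) is directly computable once p̂0 and the density estimate f̂ are in hand. The cap at 1 is a standard convention, and the names and values below are illustrative.

```python
def local_fdr(x, p0_hat, f0, f_hat):
    """Empirical-Bayes local false discovery rate of Eq. (5.6):
    lfdr(x_i) = p0_hat * f0(x_i) / f_hat(x_i), capped at 1."""
    return [min(1.0, p0_hat * f0(xi) / f_hat(xi)) for xi in x]

# With p0_hat = 0.9, null density 0.4 and empirical density 0.5 at a point,
# the local fdr there is 0.72.
lfdr = local_fdr([0.0], 0.9, lambda z: 0.4, lambda z: 0.5)
```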

The following shows that the BH procedure is also a semiparametric model, but the observed


Figure 5.3: The plate presentation of the semiparametric graphical models for (a) the local FDR procedure and (b) the BH procedure.

statistic is modeled by a cumulative distribution function (CDF). Let P(1)<...<P(m) be the ordered

P-values from the m tests and P(0) = 0. The BH procedure rejects any hypothesis whose P-value

satisfies P ≤ P ∗ with

P* = max{ P(i) : P(i) ≤ (i/m)(α/p̂0) }, (5.7)

which controls FDR at the level α [12, 166, 62]. The inequality in (5.7) can be rewritten as

p̂0 P(i) / (i/m) ≤ α. (5.8)

Because a P-value is the CDF of f0 at the value of its test statistic x, and i/m is the empirical CDF of f at the test statistic of H(i), (5.8) is further rewritten as

p̂0 F0(x) / F̂(x) ≤ α, (5.9)

where F0 and F are the CDFs of f0 and f respectively, and F̂ is an empirical version of F. Note that the left hand side of (5.9) is also an empirical Bayes inference, similar to (5.6).

Therefore, both the BH procedure and the local FDR procedure can be interpreted as empirical


Figure 5.4: Performance of the procedures under Model 1 when (1) φ = 0.8 and (2) φ = 0.6 in terms of (a) FDR, (b) FNR and (c) ATP when the dependence structure is chain.

Bayes inference, and the difference is that the BH procedure uses the CDFs whereas the local

FDR procedure uses the density functions. Thus, we can present the BH procedure as the graph-

ical model in Figure 5.3(b). This model is also semiparametric because F1 is nonparametrically

estimated. Therefore, both the local FDR procedure and the BH procedure are semiparametric

graphical models which do not consider dependence among the hypotheses.
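The p̂0-adjusted BH rule of Eqs. (5.7)-(5.8) can be sketched as follows; with p0 = 1 it is the classical BH procedure. The function name is hypothetical.

```python
def bh_procedure(pvalues, alpha=0.10, p0=1.0):
    """Step-up BH: find P* = max{P_(i) : P_(i) <= (i/m) * alpha / p0}
    and reject every hypothesis with P-value <= P* (Eqs. 5.7-5.8)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    p_star = -1.0                    # nothing rejected if no P-value qualifies
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha / p0:
            p_star = pvalues[idx]
    return {i for i in range(m) if pvalues[i] <= p_star}

rejected = bh_procedure([0.001, 0.02, 0.04, 0.9], alpha=0.10)
```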

5.5 Simulations

We explore the empirical performance of our multiple testing procedure and three baseline pro-

cedures, including the local FDR procedure [38], the BH procedure [14, 63] and the procedure

based on a parametric graphical model [107]. Because we have the ground truth parameters, we

have two versions of our multiple testing approach, namely an oracle procedure and a data-driven

procedure. The oracle procedure knows the true parameters in the graphical model (including φ,

π and f1), whereas the data-driven procedure does not and has to estimate the graphical model in

the semiparametric way introduced in Sections 5.3.2 and 5.3.3. Both the BH procedure and the

local FDR procedure need an estimate of p0; we use the same estimating method in Section 5.3.2


Figure 5.5: Performance of the procedures under Model 2 when (1) φ = 0.8 and (2) φ = 0.6 in terms of (a) FDR, (b) FNR and (c) ATP when the dependence structure is chain.

for a fair comparison. The local FDR procedure also needs an estimate of f , and we estimate it in

the same way as in our data-driven procedure.

We choose the setup to be consistent with previous work [172, 107] when possible. We con-

sider two dependence structures, namely a chain structure and a grid structure. For the chain

structure, we choose the number of hypotheses m=10,000. For the grid structure, we choose a

100×100 grid, which also yields 10,000 hypotheses. We test two levels of dependence strength,

i.e. φ=0.8 and φ=0.6. We set π to be 0.4. We first simulate the ground truth of the hypotheses θ

from P (θ;φ,π) and then simulate the test statistics x from P (x|θ; f0, f1). We assume that the

observed xi under the null hypothesis (namely θi=0) is from a standard normal N (0, 1). We test

two different models for xi under the alternative hypothesis (namely θi=1) as follows.

Model 1: xi|θi=1 comes from a mixture of Gaussians

(1/3) N(1, 1) + (1/3) N(µ, 1) + (1/3) N(5, 1). (5.10)

In total, we test nine values for µ, namely 1.4, 1.8, 2.2, 2.6, 3.0, 3.4, 3.8, 4.2 and 4.6. Different

µ values yield different f1 with different shapes.


Model 2: xi|θi=1 comes from a Gaussian N (µ, 1) and µ has a prior of Gamma(2.0, β)

where β is the scale parameter. We test six different values for β, namely 1.0, 1.2, 1.4, 1.6, 1.8

and 2.0. This model is designed to mimic the common situation in GWAS that common genetic

variants have small effect sizes and rare genetic variants have large effect sizes [113].
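Both alternative models can be sampled with the standard library; the sketch below is illustrative (function names hypothetical), with β and µ as in the text.

```python
import random

random.seed(0)

def sample_model1(theta_i, mu=3.0):
    """Model 1: N(0,1) under the null; an equal mixture of N(1,1),
    N(mu,1) and N(5,1) under the alternative (Eq. 5.10)."""
    if theta_i == 0:
        return random.gauss(0.0, 1.0)
    return random.gauss(random.choice([1.0, mu, 5.0]), 1.0)

def sample_model2(theta_i, beta=1.4):
    """Model 2: under the alternative, draw mu ~ Gamma(2.0, beta)
    (scale parameterization) and then x ~ N(mu, 1)."""
    if theta_i == 0:
        return random.gauss(0.0, 1.0)
    return random.gauss(random.gammavariate(2.0, beta), 1.0)
```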

We compare three measures from these procedures. First, we check whether these procedures

are valid, namely whether the FDR yielded from these procedures is controlled at the nominal level

α. The nominal FDR level α is 0.10, which is consistent with the multiple testing literature [35].

Second, we compare the FNR yielded from these procedures. The third measure is the average

number of true positives (ATP) of these procedures. Valid procedures with a lower FNR and a

higher ATP are considered to be more efficient (or powerful). In the simulations, each experiment

is replicated 500 times and the average results are reported.

Performance under chain structure: The performance of the five procedures under the chain

dependence structure is shown in Figures 5.4 and 5.5, which correspond to Model 1 and Model

2, respectively. It is observed that all five procedures are valid. The parametric procedure [107]

is conservative, which agrees with the observations in Figure 3(1d) of [107]. Our semiparametric

data-driven procedure, the BH procedure and the local FDR procedure are slightly conservative.

The oracle procedure slightly outperforms the semiparametric data-driven procedure based on the

plots for FNR and ATP. These two completely dominate the three baselines, which indicates the

benefit of leveraging dependence among the hypotheses via the semiparametric graphical model.

We also observe that the advantage of the oracle procedure and our semiparametric data-driven

procedure over the local FDR procedure is larger when φ = 0.8 than when φ = 0.6. The reason

is that as φ decreases from 0.8 to 0.6, the dependence strength among the hypotheses decreases,

and we benefit less from leveraging the dependence. When φ = 0.5, the edge potentials in our

graphical model are no longer informative, and the node potentials become the priors in the local

FDR procedure, and our procedure exactly reduces to the local FDR procedure.

Performance under grid structure: The performance of the five procedures under the grid

dependence structure is shown in Figures 5.6 and 5.7, which correspond to Model 1 and Model


Figure 5.6: Performance of the procedures under Model 1 when (1) φ = 0.8 and (2) φ = 0.6 in terms of (a) FDR, (b) FNR and (c) ATP when the dependence structure is grid.

Figure 5.7: Performance of the procedures under Model 2 when (1) φ = 0.8 and (2) φ = 0.6 in terms of (a) FDR, (b) FNR and (c) ATP when the dependence structure is grid.

2, respectively. All five procedures are valid. The parametric procedure is considerably conser-

vative, which agrees with the observations in Figure 3(3d) of [107]. Again, our semiparamet-

ric data-driven procedure significantly outperforms the three baselines in all the configurations,

demonstrating the benefit of leveraging dependence among the hypotheses via the semiparametric

Figure 5.8: Performance of our procedure when λ = 0.2 (dotted lines), 0.5 (dashed lines) and 0.8 (solid lines).

graphical model. The difference between our semiparametric data-driven procedure and the base-

lines is even larger compared with simulations under the chain structure. The reason is that in the

grid structure, each hypothesis has more neighbors than in the chain structure, and we can benefit

more from leveraging the dependence among the hypotheses.

Robustness of λ: In the previous simulations, λ is fixed at 0.8. We test another two values for

λ, namely 0.2 and 0.5, and repeat previous simulations. The performance of our semiparametric

procedure under the chain dependence structure and Model 1 with φ = 0.8 is provided in Figure

5.8. Quite surprisingly, our data-driven semiparametric procedure is valid for the three values of

λ and is slightly conservative for most of the configurations. However, the FNR and ATP of our

data-driven procedure for the three different values of λ are almost the same. Therefore although

our approach needs to pick a λ parameter, its performance is robust for different choices of λ. The

robustness of λ was also observed in [166]. The sensitivity analysis of λ in other configurations

yields similar observations.

Efficiency of Ranking: Although ranking the hypotheses by the probability that H0 is false is

a secondary goal in multiple testing, readers may wonder how well our semiparametric procedure

performs in terms of ranking the hypotheses. For the oracle procedure, the parametric procedure

[107] and our semiparametric procedure, we rank the hypotheses by the posterior probability that

H0 is false, namely 1 − LIS. For BH, we use 1−P -value. For local FDR procedure, we use

1 − lfdr. Here we plot the ROC curves and PR curves yielded by the five procedures in Figure

5.9 for µ = 1.4 and φ = 0.8 in the chain structure under Model 1. We observe that the oracle


[Figure 5.9 appears here: an ROC curve (TPR vs. FPR) and a PR curve (precision vs. recall) for the Oracle, Semiparametric, Parametric, Local FDR and BH procedures.]

Figure 5.9: ROC/PR curves from these procedures.

procedure produces the most efficient ranking, followed by the semiparametric procedure and the

parametric procedure. The rankings yielded by the local FDR and BH procedures are less efficient. The

ROC curves and PR curves of these procedures under other configurations show similar behavior.
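For readers who want to see how such a ranking comparison is computed, here is a minimal sketch (toy scores and labels, not the simulation data) of turning per-hypothesis scores into ROC points:

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep a cutoff over the scores (descending) and return FPR/TPR arrays.

    scores: higher = more likely non-null; e.g. 1 - LIS for the graphical-model
            procedures, 1 - p-value for BH, 1 - lfdr for local FDR.
    labels: 1 where H0 is truly false, 0 where H0 is true.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores)            # rank hypotheses, best score first
    ranked = labels[order]
    tp = np.cumsum(ranked)                 # signals recovered at each cutoff
    fp = np.cumsum(1 - ranked)             # nulls rejected at each cutoff
    return fp / (1 - labels).sum(), tp / labels.sum()

# Toy example: 6 hypotheses, 3 true signals; a perfect ranking would put
# all the 1-labels first.
fpr, tpr = roc_points([0.95, 0.80, 0.60, 0.40, 0.30, 0.10],
                      [1, 1, 0, 1, 0, 0])
```

This naive sweep ignores tied scores, which is adequate for continuous scores such as 1 − LIS.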

Run Time: In the chain-structure simulations, it took our data-driven procedure about 10

hours to finish the 500 replications sequentially (for one µ value in (5.10)) on one 3GHz CPU. In

the grid-structure simulations, it took our procedure around 30 hours to finish the 500 replications

sequentially (for one µ value in (5.10)) on one 3GHz CPU.

5.6 Application

We apply our procedure to a real-world GWAS on breast cancer [82] which involves 528,173 SNPs

for 1,145 cases and 1,142 controls. In total, we test 528,173 hypotheses, and they are dependent

because nearby SNPs tend to be highly correlated. We query the squared correlation coefficients

(r2 values) among the SNPs from HapMap [175], and build the dependence structure as follows.

Each SNP becomes a node in the graph. For each SNP, we connect it with the SNP having the

highest r2 value with it. We further categorize the edges into a high correlation edge set Eh (r2


above 0.8), a medium correlation edge set Em (r2 between 0.5 and 0.8) and a low correlation edge

set El (r2 between 0.25 and 0.5). We have three parameters (φh, φm, and φl) for the three sets of

edges.
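A sketch (not the thesis code) of the edge construction described above, using made-up r² values in place of HapMap queries: each SNP is connected to the SNP with the highest r² with it, and each resulting edge is assigned to Eh, Em or El by its r² value. Dropping edges with r² ≤ 0.25 is an assumption on top of the text.

```python
def build_edge_sets(r2):
    """r2: dict mapping an unordered SNP pair (a, b) -> squared correlation."""
    best = {}                                   # SNP -> (partner, best r^2)
    for (a, b), v in r2.items():
        for s, t in ((a, b), (b, a)):
            if s not in best or v > best[s][1]:
                best[s] = (t, v)
    # Deduplicate: the same edge may be chosen from both endpoints.
    edges = {tuple(sorted((s, t))): v for s, (t, v) in best.items()}
    e_h = {e for e, v in edges.items() if v > 0.8}            # high correlation
    e_m = {e for e, v in edges.items() if 0.5 < v <= 0.8}     # medium
    e_l = {e for e, v in edges.items() if 0.25 < v <= 0.5}    # low
    return e_h, e_m, e_l

e_h, e_m, e_l = build_edge_sets({("rs1", "rs2"): 0.92,
                                 ("rs2", "rs3"): 0.60,
                                 ("rs3", "rs4"): 0.30})
```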

When we apply our procedure to the dataset, the individual test is a two-proportion z-test. We set λ = 0.8, and the value of p0 is estimated to be 0.978, which means that about 2.2% of the SNPs are associated with breast cancer. The estimated f1 in this study is plotted in Figure 5.1. The

whole experiment takes around 30 hours on a single processor. Our procedure reports 20 SNPs

with LIS value below 0.01. There are five clusters covering 18 of them. All 18 SNPs have very

small P-values from the two-proportion z-test and are located near one another within the same cluster.

The first cluster on Chr2, the cluster on Chr4, the cluster on Chr9 and the cluster on Chr10 are

identified in [82] and [156]. The second cluster on Chr2 is associated with a telomere, and telomeres

are known to be related to breast cancer [174]. We further use a second cohort to validate the 18

SNPs, and 16 of them show a moderate level of association on the second cohort. We also would

like to mention that there is some work on estimating less conservative significance thresholds for

controlling family-wise error rate in GWAS [154, 76].
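The two-proportion z-test used above can be sketched as follows; the allele counts here are made up for illustration (not the study's data), and the pooled-variance form is one standard choice.

```python
from math import erfc, sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with a pooled variance estimate.

    x1, n1: e.g. minor-allele count and total allele count in cases;
    x2, n2: the same in controls.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p_value

z, p = two_proportion_z(300, 1000, 240, 1000)   # toy counts
```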

5.7 Discussion

We propose a novel semiparametric graphical model to leverage the dependence in multiple testing

problems. Although our semiparametric approach may seem incremental over the previous fully parametric approaches [172, 107] from the viewpoint of graphical models, such a modification is nontrivial

to the multiple testing area, for both methodological and application reasons. From the

methodological standpoint, our semiparametric approach naturally generalizes the local FDR pro-

cedure and connects with the BH procedure — we show that both the BH procedure and the local

FDR procedure estimate their parameters in the same semiparametric way to avoid assumptions

about f1. The methodological unification demonstrates that such a modification is necessary for

multiple testing. From the application aspect, our semiparametric approach no longer requires the

investigators to know the parameterization of f1, which is generally unknown in practical prob-


lems. Improper parameterization assumptions for f1 can make the fully parametric approach either

too liberal, which makes the procedure invalid, or too conservative, which makes the procedure lose

power, as illustrated by both our simulations and previous work [172, 107]. Our semiparametric

approach better controls FDR and is more powerful. For these reasons, we suggest that investi-

gators choose the semiparametric approach for their large-scale multiple testing problems if (i)

they speculate that there exists dependence among the hypotheses, and (ii) there is no suitable

parametric distribution for f1.

The material in this chapter first appeared in the 31st International Conference on Machine

Learning (ICML’2014) as follows:

Jie Liu, Chunming Zhang, Elizabeth Burnside and David Page. Multiple Testing under De-

pendence via Semiparametric Graphical Models. The 31st International Conference on Machine

Learning (ICML), 2014.

In both the semiparametric graphical model in this chapter and the fully parametric graphical

model in last chapter, we assume that φi’s in the pairwise potential functions are homogeneous.

However, in large-scale graphical models, φi’s can be heterogeneous. There are two situations.

First, there is some background knowledge about how these parameters may change; for example

if the HapMap [175] resource shows an r2 of 0.99 between two SNPs, this background knowledge

provides some evidence that the parameterization should make them more likely to take the same

value than if the r2 were, say, 0.8. Second, these parameters are latently tied; for example, pairs

of SNPs, and consequently the parameters on the pairs, might naturally cluster into four groups (highly correlated, intermediately correlated, weakly correlated, and uncorrelated) or some other number of groups based on correlation. Chapter 6 deals with the first situation, namely capturing

the heterogeneity within parameter learning in hidden Markov random fields with the help of

background knowledge. Chapter 7 deals with the second situation, namely estimating latently-

grouped parameters in undirected graphical models.


Chapter 6

Learning Heterogeneous Hidden

Markov Random Fields

Hidden Markov random fields (HMRFs) are conventionally assumed to be homogeneous in the

sense that the potential functions are invariant across different sites. However in some biological

applications, it is desirable to make HMRFs heterogeneous, especially when there exists some

background knowledge about how the potential functions vary. We formally define heterogeneous

HMRFs and propose an EM algorithm whose M-step combines a contrastive divergence learner

with a kernel smoothing step to incorporate the background knowledge. Simulations show that

our algorithm is effective for learning heterogeneous HMRFs and outperforms alternative binning

methods. We learn a heterogeneous HMRF in a real-world study.

6.1 Introduction

Hidden Markov models (HMMs) and hidden Markov random fields (HMRFs) are useful ap-

proaches for modelling structured data such as speech, text, vision and biological data. HMMs

and HMRFs have been extended in many ways, such as the infinite models [9, 54, 29], the factorial

models [67, 90], the high-order models [98] and the nonparametric models [81, 164]. HMMs are


homogeneous in the sense that the transition matrix stays the same across different sites. HM-

RFs, intensively used in image segmentation tasks [214, 26, 28], are also homogeneous. The

homogeneity assumption for HMRFs in image segmentation tasks is legitimate, because people

usually assume that the neighborhood system on an image is invariant across different regions.

However, it is necessary to bring heterogeneity to HMMs and HMRFs in some biological applica-

tions where the correlation structure can change over different sites. For example, a heterogeneous

HMM is used for segmenting array CGH data [114], and the transition matrix depends on some

background knowledge, i.e. some distance measurement which changes over the sites. A hetero-

geneous HMRF is used to filter SNPs in genome-wide association studies [108], and the pairwise

potential functions depend on some background knowledge, i.e. some correlation measure be-

tween the SNPs which can be different between different pairs. In both of these applications,

the transition matrix and the pairwise potential functions are heterogeneous and are parameter-

ized as monotone parametric functions of the background knowledge. Although the algorithms

tune the parameters in the monotone functions, there is no justification that the parameterization

of the monotone functions is correct. Can we adopt the background knowledge about these het-

erogeneous parameters adaptively during HMRF learning, and recover the relation between the

parameters and the background knowledge nonparametrically?

This chapter is the first to learn HMRFs with heterogeneous parameters by adaptively incor-

porating the background knowledge. It is an EM algorithm whose M-step combines a contrastive

divergence style learner with a kernel smoothing step to incorporate the background knowledge.

Details about our EM-kernel-PCD algorithm are given in Section 6.3 after we formally define

heterogeneous HMRFs in Section 6.2. Simulations in Section 6.4 show that our EM-kernel-PCD

algorithm is effective for learning heterogeneous HMRFs and outperforms alternative methods. In

Section 6.5, we learn a heterogeneous HMRF in a real-world genome-wide association study. We

conclude in Section 6.6.


6.2 Models

6.2.1 HMRFs And Homogeneity Assumption

Suppose that X = {0, 1, ...,m−1} is a discrete space, and we have a Markov random field (MRF)

defined on a random vector X ∈ Xd. The conditional independence is described by an undirected

graph G(V,E). The node set V consists of d nodes. The edge set E consists of r edges. The

probability of x from the MRF with parameters θ is

P(x; θ) = Q(x; θ)/Z(θ) = (1/Z(θ)) ∏_{c∈C(G)} φc(x; θc),   (6.1)

where Z(θ) is the normalizing constant. Q(x;θ) is some unnormalized measure with C(G) being

some subset of the cliques in G. The potential function φc is defined on the clique c and is

parameterized by θc. For simplicity in this chapter, we consider pairwise MRFs, whose potential

functions are defined on the edges, namely |C(G)| = r. We further assume that each pairwise

potential function is parameterized by a single parameter, i.e. θc = {θc}.

A hidden Markov random field [214, 26, 28] consists of a hidden random field X ∈ Xd and

an observable random field Y ∈ Yd where Y is another space (either continuous or discrete).

The random field X is a Markov random field with density P (x;θ), as defined in Formula (6.1),

and its instantiation x cannot be measured directly. Instead, we can observe the emitted random

field Y with its individual dimension Yi depending on Xi for i = 1, ..., d, namely P(y|x; ϕ) = ∏_{i=1}^{d} P(yi|xi; ϕ), where ϕ = {ϕ0, ..., ϕm−1} and ϕxi parameterizes the emitting distribution of

Yi under the state xi. Therefore, the joint probability of x and y is

P(x, y; θ, ϕ) = P(x; θ) P(y|x; ϕ) = (1/Z(θ)) ∏_{c∈C(G)} φc(x; θc) ∏_{i=1}^{d} P(yi|xi; ϕ).   (6.2)

Example 1: One pairwise HMRF model with three latent variables (X1, X2, X3) and three

observable variables (Y1, Y2, Y3) is given in Figure 6.1. Let X = {0, 1}. X1, X2 and X3 are


Figure 6.1: The pairwise HMRF model with three latent nodes (X1, X2, X3) and observable nodes (Y1, Y2, Y3) with parameters θ = {θ1, θ2, θ3} and ϕ = {ϕ0, ϕ1}.

connected by three edges. The pairwise potential function φi on edge i (connecting Xu and Xv), parameterized by θi (0 < θi < 1), is φi(X; θi) = θi^{I(Xu=Xv)} (1 − θi)^{I(Xu≠Xv)} for i = 1, 2, 3, where I is an indicator variable. Let Y = ℝ. For i = 1, 2, 3, Yi|Xi=0 ∼ N(µ0, σ0²) and Yi|Xi=1 ∼ N(µ1, σ1²), namely ϕ0 = {µ0, σ0} and ϕ1 = {µ1, σ1}.
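A small sketch of Example 1's unnormalized measure Q(x; θ); the assignment of θ1, θ2, θ3 to particular edges of the triangle is our assumption for illustration.

```python
from math import prod

def phi(x_u, x_v, theta_i):
    """Pairwise potential from Example 1: theta_i when the endpoints agree,
    1 - theta_i when they disagree."""
    return theta_i if x_u == x_v else 1.0 - theta_i

def q_unnormalized(x, theta):
    """Q(x; theta) for the triangle of Figure 6.1.  Mapping theta_1, theta_2,
    theta_3 to the edges (X1,X2), (X2,X3), (X1,X3) is assumed for this sketch."""
    edges = ((0, 1), (1, 2), (0, 2))
    return prod(phi(x[u], x[v], t) for (u, v), t in zip(edges, theta))

theta = (0.9, 0.8, 0.7)
q_all_agree = q_unnormalized((1, 1, 1), theta)   # 0.9 * 0.8 * 0.7
q_disagree = q_unnormalized((1, 0, 1), theta)    # 0.1 * 0.2 * 0.7
```

Normalizing by Z(θ), the sum of Q over all 2³ configurations, would give the probabilities in Formula (6.1).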

In common applications of HMRFs, we observe only one instantiation y which is emitted

according to the hidden state vector x, and the task is to infer the most probable state configuration

of X, or to compute the marginal probabilities of X. In both tasks, we need to estimate the

parameters θ = {θ1, ..., θr} and ϕ = {ϕ0, ..., ϕm−1}. Usually, we seek maximum likelihood

estimates of θ and ϕ which maximize the log likelihood

L(θ, ϕ) = log P(y; θ, ϕ) = log ∑_{x∈X^d} P(x, y; θ, ϕ).   (6.3)

Since we only have one instantiation (x,y), we usually have to assume that θi’s are the same

for i = 1, ..., r for effective parameter learning. This homogeneity assumption is widely used in

computer vision problems because people usually assume that the neighborhood system on an im-

age is invariant across its different regions. Therefore, conventional HMRFs refer to homogeneous

HMRFs, similar to conventional HMMs whose transition matrix is invariant across different sites.


6.2.2 Heterogeneous HMRFs

In a heterogeneous HMRF, the potential functions on different cliques can be different. Taking

the model in Figure 6.1 as an example, θ1, θ2 and θ3 can be different if the HMRF is heteroge-

neous. As with conventional HMRFs, we want to be able to address applications that have one

instantiation (x,y) where y is observable and x is hidden. Therefore, learning an HMRF from

one instantiation y is infeasible if we free all θ’s. To partially free the parameters, we assume that

there is some background knowledge k = {k1, ..., kr} about the parameters θ = {θ1, ..., θr} in the

form of some unknown smooth mapping function which maps θi to ki for i = 1, ..., r. The back-

ground knowledge describes how these potential functions are different across different cliques.

Taking pairwise HMRFs for example, the potentials on the edges with similar background knowl-

edge should have similar parameters. We can regard the homogeneity assumption in conventional

HMRFs as an extreme type of background knowledge that k1 = k2 = ... = kr. The problem we

solve in this chapter is to estimate θ and ϕ which maximize the log likelihood L(θ,ϕ) in Formula

(6.3), subject to the condition that the estimate of θ is smooth with respect to k.

6.3 Parameter Learning Methods

Learning heterogeneous HMRFs in above manner involves three difficulties, (i) the intractable

Z(θ), (ii) the latent x, and (iii) the heterogeneous θ. The way we handle the intractable Z(θ)

is similar to using contrastive divergence [80] to learn MRFs. We review contrastive divergence

and its variations in Section 6.3.1. To handle the latent x in HMRF learning, we introduce an

EM algorithm in Section 6.3.2, which is applicable to conventional HMRFs. In Section 6.3.3, we

further address the heterogeneity of θ in the M-step of the EM algorithm.

6.3.1 Contrastive Divergence for MRFs

Assume that we observe s independent samples X = {x1,x2, ...,xs} from (6.1), and we want to

estimate θ. The log likelihood L(θ|X) is concave w.r.t. θ, and we can use gradient ascent to find

the MLE of θ. The partial derivative of L(θ|X) with respect to θi is


∂L(θ|X)/∂θi = (1/s) ∑_{j=1}^{s} ψi(xj) − Eθψi = EXψi − Eθψi,   (6.4)

where ψi is the sufficient statistic corresponding to θi, and Eθψi is the expectation of ψi with

respect to the distribution specified by θ. In the i-th iteration of gradient ascent, the parameter

update is

θ(i+1) = θ(i) + η∇L(θ(i)|X) = θ(i) + η(EXψ − Eθ(i)ψ),

where η is the learning rate. However, the exact computation of Eθψi takes time exponential in the

treewidth of G. A few sampling-based methods have been proposed to solve this problem. The key

differences among these methods are how to draw particles and how to compute Eθψ from the

particles. MCMC-MLE [66, 218] uses importance sampling, but might suffer from degeneracy

when θ(i) is far away from θ(1). Contrastive divergence [80] generates new particles in each

iteration according to the current θ(i) and does not require the particles to reach equilibrium, so

as to save computation. Variations of contrastive divergence include particle-filtered MCMC-

MLE [5], persistent contrastive divergence (PCD) [178] and fast PCD [179]. Because PCD is

efficient and easy to implement, we employ it in this chapter. Its pseudo-code is provided in

Algorithm 1. Other than contrastive divergence, MRFs can also be learned via ratio matching [85], non-local contrastive objectives [184], noise-contrastive estimation [72] and minimum KL contraction [112].

6.3.2 Expectation-Maximization for Learning Conventional HMRFs

We begin with a lower bound of the log likelihood function, and then introduce the EM algorithm

which handles the latent variables in HMRFs. Let qx(x) be any distribution on x∈Xd. It is well

known that there exists a lower bound of the log likelihood L(θ,ϕ) in (6.3), which is provided by

an auxiliary function F(qx(x), {θ,ϕ}) defined as follows,


Algorithm 1 PCD-n Algorithm [178]

1: Input: independent samples X = {x1, x2, ..., xs} from P(x; θ), maximum iteration number T
2: Output: θ from the last iteration
3: Procedure:
4: Initialize θ^{(1)} and initialize particles
5: Calculate EXψ from X
6: for i = 1 to T do
7:   Advance particles n steps under θ^{(i)}
8:   Calculate E_{θ^{(i)}}ψ from the particles
9:   θ^{(i+1)} = θ^{(i)} + η(EXψ − E_{θ^{(i)}}ψ)
10:  Adjust η
11: end for
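Algorithm 1 can be sketched as follows on a toy two-node binary MRF. Note this sketch uses a log-linear parameterization P(x) ∝ exp(θ·I[x1 = x2]) rather than the chapter's θ^I (1 − θ)^I potentials (the two are related by a reparameterization), so it illustrates the structure of the PCD update loop, not the thesis implementation.

```python
import math
import random

def pcd(samples, T=300, n_steps=1, n_particles=100, eta=0.1, seed=0):
    """PCD-n sketch for a toy two-node binary MRF P(x) ∝ exp(theta * I[x1 = x2]).

    samples: observed pairs (x1, x2); the sufficient statistic is I[x1 = x2].
    """
    rng = random.Random(seed)
    data_stat = sum(a == b for a, b in samples) / len(samples)   # E_X psi
    theta = 0.0
    particles = [[rng.randint(0, 1), rng.randint(0, 1)] for _ in range(n_particles)]
    for _ in range(T):
        for p in particles:                      # advance each particle n steps
            for _ in range(n_steps):
                for i in (0, 1):                 # one Gibbs sweep
                    # P(x_i agrees with the other node) = sigmoid(theta)
                    p_agree = 1.0 / (1.0 + math.exp(-theta))
                    p[i] = p[1 - i] if rng.random() < p_agree else 1 - p[1 - i]
        model_stat = sum(a == b for a, b in particles) / n_particles  # E_theta psi
        theta += eta * (data_stat - model_stat)  # gradient step of Algorithm 1
    return theta

# 80% of the observed pairs agree; the MLE satisfies sigmoid(theta) = 0.8,
# i.e. theta = log 4 ≈ 1.386, so the estimate should settle near there.
theta_hat = pcd([(0, 0)] * 4 + [(0, 1)] * 1)
```

The particles persist across gradient steps rather than being re-initialized, which is what distinguishes PCD from plain contrastive divergence.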

F(qx(x), {θ, ϕ}) = ∑_{x∈X^d} qx(x) log[P(x, y; θ, ϕ)/qx(x)] = L(θ, ϕ) − KL[qx(x) | P(x|y, θ, ϕ)],   (6.5)

where KL[qx(x) | P(x|y, θ, ϕ)] is the Kullback-Leibler divergence between qx(x) and P(x|y, θ, ϕ), the posterior distribution of the hidden variables. This Kullback-Leibler divergence is the gap between L(θ, ϕ) and F(qx(x), {θ, ϕ}).

Expectation-Maximization: We maximize L(θ,ϕ) with an expectation-maximization (EM)

algorithm which iteratively maximizes its lower bound F(qx(x), {θ,ϕ}). We first initialize θ(0)

and ϕ(0). In the t-th iteration, the updates in the expectation (E) step and the maximization (M)

step are

q_x^{(t)} = argmax_{qx} F(qx(x), {θ^{(t−1)}, ϕ^{(t−1)}})   (E),
θ^{(t)}, ϕ^{(t)} = argmax_{θ,ϕ} F(q_x^{(t)}, {θ, ϕ})   (M).

In the E-step, we maximize F(qx(x), {θ(t−1),ϕ(t−1)}) with respect to qx(x). Because the


difference between F(qx(x), {θ,ϕ}) and L(θ,ϕ) is KL[qx(x)|P (x|y,θ,ϕ)], the maximizer in

the E-step q(t)x is P (x|y,θ(t−1),ϕ(t−1)), namely the posterior distribution of x|y under the current

estimated parameters θ(t−1) and ϕ(t−1). This posterior distribution can be calculated by Markov

chain Monte Carlo for general graphs.

In the M-step, we maximize F(q_x^{(t)}(x), {θ, ϕ}) with respect to {θ, ϕ}, which can be rewritten as

argmax_{θ,ϕ} F(q_x^{(t)}(x), {θ, ϕ}) = argmax_{θ,ϕ} ∑_{x∈X^d} q_x^{(t)}(x) log P(x, y; θ, ϕ) = argmax_{θ,ϕ} ∑_{x∈X^d} q_x^{(t)}(x) {log P(x; θ) + log P(y|x; ϕ)}.

It is obvious that this function can be maximized with respect to ϕ and θ separately as

θ^{(t)} = argmax_θ ∑_{x∈X^d} q_x^{(t)}(x) log P(x; θ),
ϕ^{(t)} = argmax_ϕ ∑_{x∈X^d} q_x^{(t)}(x) log P(y|x; ϕ).   (6.6)

Estimating ϕ: Estimating ϕ in this maximum likelihood manner is straightforward, because

the maximization can be rewritten as follows,

argmax_ϕ ∑_{x∈X^d} q_x^{(t)}(x) log P(y|x; ϕ) = argmax_ϕ ∑_{i=1}^{d} ∑_{xi∈X} q_{xi}^{(t)}(xi) log P(yi|xi; ϕ),

where q_x^{(t)}(x) = ∏_{i=1}^{d} q_{xi}^{(t)}(xi).


Estimating θ: Estimating θ in Formula (6.6) is difficult due to the intractable Z(θ). Some approaches [214, 26] use pseudo-likelihood [18] to estimate θ in the M-step. It can be shown that ∑_{x∈X^d} q_x^{(t)}(x) log P(x; θ) is concave with respect to θ. Therefore, we can use gradient ascent to find the MLE of θ, which is similar to using contrastive divergence [80] to learn MRFs in Section 6.3.1.

Denote ∑_{x∈X^d} q_x^{(t)}(x) log P(x; θ) by LM(θ|q_x^{(t)}). The partial derivative of LM(θ|q_x^{(t)}) with respect to θi is

∂LM(θ|q_x^{(t)})/∂θi = ∑_{x∈X^d} q_x^{(t)}(x) {ψi(x) − Eθψi}.

Therefore, the derivative here is similar to the derivative in contrastive divergence in Formula (6.4), except that we reweight it by q_x^{(t)}. We run the EM algorithm until both θ and ϕ converge.

Note that when learning homogeneous HMRFs with this algorithm, we tie all θ’s all the time,

namely θ = {θ}. Therefore, we name this parameter learning algorithm for conventional HMRFs

the EM-homo-PCD algorithm.

6.3.3 Learning Heterogeneous HMRFs

Learning heterogeneous HMRFs is different from learning conventional homogeneous HMRFs in

two ways. First, we need to free the θ’s in heterogeneous HMRFs. Second, there is some back-

ground knowledge k about how the θ’s are different, as introduced in Section 6.2. Therefore, we

make two modifications to the EM-homo-PCD algorithm in order to learn heterogeneous HMRFs

with background knowledge. First, we estimate the θ’s separately, which obviously brings more

variance in estimation. Second, within each iteration of the contrastive divergence update, we ap-

ply a kernel regression to smooth the estimate of the θ’s with respect to the background knowledge

k. Specifically, in the i-th iteration of PCD update, we advance the particles under θ(i) for n steps,

and calculate the moments Eθ(i)ψ from the particles. Therefore, we can update the estimate as

θ^{(i+1)} = θ^{(i)} + η∇LM(θ|q_x^{(t)}).


Then we regress θ^{(i+1)} with respect to k via Nadaraya-Watson kernel regression [129, 195], and set θ^{(i+1)} to be the fitted values. For ease of notation, we drop the iteration index (i + 1). Suppose that θ̃ = {θ̃1, ..., θ̃r} is the estimate before kernel smoothing; we set the smoothed estimate θ = {θ1, ..., θr} as

θj = ∑_{i=1}^{r} γij θ̃i,  ∀ j = 1, ..., r,

where

γij = K((ki − kj)/h) / ∑_{m=1}^{r} K((km − kj)/h).

For the kernel function K, we use the Epanechnikov kernel, which is usually computationally

more efficient than a Gaussian kernel. We tune the bandwidth h through cross-validation, namely

we select the bandwidth which minimizes the leave-one-out score

(1/r) ∑_{i=1}^{r} ((θ̃i − θi)/(1 − γii))².
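The kernel smoothing step and its leave-one-out bandwidth selection can be sketched as follows, on synthetic pre-smoothing estimates and background knowledge (not values produced by the algorithm); the candidate bandwidth grid mirrors the 0.005, 0.01, ..., 0.5 grid used in Section 6.4.

```python
import numpy as np

def epanechnikov(u):
    # K(u) = 0.75 * (1 - u^2) for |u| <= 1, and 0 otherwise.
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def nw_smooth(theta_tilde, k, h):
    """Nadaraya-Watson fit: smoothed theta_j = sum_i gamma_ij * theta_tilde_i,
    with gamma_ij = K((k_i - k_j)/h) / sum_m K((k_m - k_j)/h)."""
    u = (k[:, None] - k[None, :]) / h          # u[i, j] = (k_i - k_j) / h
    w = epanechnikov(u)
    gamma = w / w.sum(axis=0)                  # each column j sums to 1
    return gamma.T @ theta_tilde, np.diag(gamma)

def loo_bandwidth(theta_tilde, k, hs):
    """Pick h minimizing (1/r) sum_i ((theta_tilde_i - fit_i) / (1 - gamma_ii))^2."""
    best_h, best_score = None, np.inf
    for h in hs:
        fit, g_ii = nw_smooth(theta_tilde, k, h)
        if np.any(g_ii >= 1.0):                # a point that only smooths itself
            continue
        score = np.mean(((theta_tilde - fit) / (1.0 - g_ii)) ** 2)
        if score < best_score:
            best_h, best_score = h, score
    return best_h

rng = np.random.default_rng(0)
k = np.linspace(0.0, 1.0, 200)                          # background knowledge
theta_tilde = np.sin(2 * k) + rng.normal(0, 0.05, 200)  # noisy pre-smoothing estimates
h = loo_bandwidth(theta_tilde, k, np.arange(0.005, 0.5, 0.005))
smoothed, _ = nw_smooth(theta_tilde, k, h)
```

Because the smoother is linear in θ̃, the leave-one-out residual has the closed form (θ̃i − θi)/(1 − γii), which is why no explicit refitting loop is needed.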

Tuning the bandwidth is usually computation-intensive, so we tune it every t0 iterations to

save computation. We name our parameter learning algorithm for heterogeneous HMRFs the

EM-kernel-PCD algorithm. Its pseudo-code is given in Algorithm 2.

Another intuitive way of handling background knowledge about these heterogeneous param-

eters is to create bins according to the background knowledge and tie the θ’s that are in the

same bin. Suppose that we have b bins after we carefully select the binwidth, namely we have

θ = {θ1, ..., θb}. The rest of the algorithm is the same as the EM-homo-PCD algorithm in Section

6.3.2. We name this parameter learning algorithm via binning the EM-binning-PCD algorithm.

We can also regard our EM-kernel-PCD algorithm as a soft-binning version of EM-binning-PCD.

6.3.4 Geometric Interpretation

Before providing empirical evaluations of the algorithms, we first present an example showing

why adopting the background knowledge helps when we are learning the heterogeneous parame-


Algorithm 2 EM-kernel-PCD Algorithm

1: Input: sample y, background knowledge k, max iteration number T, initial bandwidth h
2: Output: θ from the last iteration
3: Procedure:
4: Initialize θ, ϕ and particles
5: while not converged do
6:   E-step: infer x from y
7:   Calculate Exψ from x
8:   for i = 1 to T do
9:     Advance particles for n steps under θ^{(i)}
10:    Calculate E_{θ^{(i)}}ψ from the particles
11:    θ^{(i+1)} = θ^{(i)} + η∇LM(θ|q_x^{(t)})
12:    θ^{(i+1)} = kernelRegFit(θ^{(i+1)}, k, h)
13:    Adjust η and tune bandwidth h
14:  end for
15:  MLE ϕ from x and y
16: end while

X1 X2 X3
1  0  0
1  1  1
1  1  1
0  1  1
1  0  0
0  0  0
0  1  1
0  0  1
0  0  1
1  1  0

Figure 6.2: Geometric interpretation of the parameter learning algorithms with a small Markov random field model on {X1, X2, X3} parameterized by {θ1, θ2}; we observe ten samples X. The plot on the right is the log likelihood of the parameters L(θ1, θ2|X).

ters via gradient ascent. Suppose that we have a Markov random field on (X1, X2, X3) parame-

terized by θ1 and θ2 (0 < θ1 < 1, 0 < θ2 < 1), and we observe ten samples X generated from the

model, as shown in Figure 6.2. The ground truth is θ1 = 0.65 and θ2 = 0.55, and we have the


background knowledge θ1 > θ2. The plot on the right is the log likelihood L(θ1, θ2|X) which is

concave with respect to (θ1, θ2). If the parameterization of the model is minimal, the global max-

imum is unique. The global maximum is at the point (0.6, 0.7) for our observed data X because

X1 agrees with X2 six times, and X2 agrees with X3 seven times among the ten samples. We can

run the standard contrastive divergence algorithm, namely gradient ascent in the feasible region

{(θ1, θ2)|0 < θ1 < 1, 0 < θ2 < 1} to reach the global maximum point (0.6, 0.7), although it is

far from the ground truth point (0.65, 0.55) due to the small sample size (recall that in HMRFs we

only have one example). If we make the homogeneity assumption θ1 = θ2, we actually perform

the gradient ascent on the blue curve, which is the intersection of the log-likelihood surface and

the hyperplane θ1 = θ2. There is one maximum point on the blue curve and we can also achieve

it; we usually get better gradient information and reach the maximum point faster because we

pool the data. However, this maximum point on the blue curve can be far from the ground truth

if the strong assumption θ1 = θ2 is far from correct (see the performance of the EM-homo-PCD

algorithm in Figure 6.3). Now if we adopt the background knowledge θ1 > θ2, we are oper-

ating in the region {(θ1, θ2)|0 < θ2 < 1, θ2 < θ1 < 1} which is smaller than the full region

{(θ1, θ2)|0 < θ1 < 1, 0 < θ2 < 1}. If the background knowledge only contains order constraints

such as θ1 > θ2, the feasible region is still convex and we are guaranteed to find a global maxi-

mum in that region. In the meantime, if the background knowledge is correct, we are guaranteed

to find a better solution than searching in the full region, and therefore we can get a more accurate

estimate.

6.4 Simulations

We investigate the performance of our EM-kernel-PCD algorithm on heterogeneous HMRFs with

different structures, namely a tree-structure HMRF and a grid-structure HMRF. In the simulations,

we first set the ground truth of the parameters, and then set the background knowledge. We then

generate one example x and then generate one example y|x. With the observable y, we apply

EM-kernel-PCD, EM-binning-PCD and EM-homo-PCD to learn the parameters θ. We eventually


compare the three algorithms by their average absolute estimate error (1/r) ∑_{i=1}^{r} |θ̂i − θi|, where θ̂i is the estimate of θi.

For the HMRFs, each dimension of X takes values in {0, 1}. The pairwise potential function φi on edge i (connecting Xu and Xv) parameterized by θi (0 < θi < 1) is φi(X; θi) = θi^{I(Xu=Xv)} (1 − θi)^{I(Xu≠Xv)}, where I is an indicator variable. For the tree structure, we choose a

perfect binary tree of height 12, which yields a total number of 8,191 nodes and 8,190 parameters,

i.e. d = 8,191 and r = 8,190. For the grid-structure HMRFs, we choose a grid of 100 rows and

100 columns, which yields a total number of 10,000 nodes and 19,800 parameters, i.e. d = 10,000

and r = 19,800. For both of the two models, we generate θi ∼ U(0.5, 1) independently and then

generate the background knowledge ki. We have two types of background knowledge. In the

first type of background knowledge, we set ki = sin θi + ε. In the second type of background

knowledge, we set ki = θ2i + ε, where ε is some random Gaussian noise from N(0, σ2

ε ). We try

three values for σε, namely 0.0, 0.01 and 0.02. Then we generate one instantiation x. Finally,

we generate one observable y from a d dimensional multivariate normal distribution N(µx, σ2I)

where µ = 2 is the strength of signal, and σ2 = 1.0 is the variance of the manifestation, and I is

the identity matrix of dimension d. For our EM-kernel-PCD algorithm, we use an Epanechnikov

kernel with α = β = 5. For tuning bandwidth h, we try 100 values in total, namely 0.005, 0.01,

0.015, ..., 0.5. For the EM-binning-PCD algorithm, we set the binwidth to be 0.005. The rest of

the parameter settings for the three algorithms are the same, including the n parameter in PCD

which is set to be 1 and the number of particles which is set to be 100. We also replicate each

experiment 20 times, and the averaged results are reported.
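The data-generating setup above can be sketched as follows. The hidden field x is replaced here by an i.i.d. placeholder draw, since sampling the tree- or grid-structured MRF exactly is beside the point of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8191, 8190                       # perfect binary tree of height 12
theta = rng.uniform(0.5, 1.0, size=r)   # ground-truth edge parameters
sigma_eps = 0.01                        # one of the three noise levels
k_type1 = np.sin(theta) + rng.normal(0.0, sigma_eps, size=r)   # k_i = sin(theta_i) + eps
k_type2 = theta ** 2 + rng.normal(0.0, sigma_eps, size=r)      # k_i = theta_i^2 + eps

mu, sigma = 2.0, 1.0                    # signal strength and emission std. dev.
x = rng.integers(0, 2, size=d)          # placeholder hidden states (not an MRF draw)
y = rng.normal(mu * x, sigma)           # y_i ~ N(mu * x_i, sigma^2)
```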

Performance of the algorithms: The results from the tree-structure HMRFs and the grid-

structure HMRFs are reported in Figure 6.3. We plot the average absolute error of the estimate of

the three algorithms against the number of iterations of PCD update. We have separate plots for

background knowledge ki = sin θi + ε, and background knowledge ki = θi² + ε. Since there are

three noise levels for background knowledge, both the EM-kernel-PCD algorithm and the EM-

binning-PCD algorithm have three variations. All the three algorithms converge as they iterate.


[Figure 6.3 appears here: four panels plotting absolute estimate error against PCD iteration number, for (1a) tree structure with ki = sin(θi) + ε, (1b) tree structure with ki = θi² + ε, (2a) grid structure with ki = sin(θi) + ε, and (2b) grid structure with ki = θi² + ε; each panel compares EM-homo-PCD with EM-binning-PCD and EM-kernel-PCD at σε = 0, 0.01 and 0.02.]

Figure 6.3: Performance of EM-homo-PCD, EM-binning-PCD and EM-kernel-PCD in tree-HMRFs and grid-HMRFs for two types of background knowledge: (a) ki = sin θi + ε, and (b) ki = θi² + ε.


It is observed that the absolute estimate error of the EM-homo-PCD algorithm reduces to 0.125

as it converges. Since the parameters θi’s are drawn independently from the uniform distribution

on the interval [0.5, 1], the EM-homo-PCD algorithm ties all the θi’s and estimates them to be

0.75. Therefore, the averaged absolute error is ∫_{0.5}^{1.0} 2|x − 0.75| dx = 0.125. Our EM-kernel-

PCD algorithm significantly outperforms the EM-binning-PCD algorithm and the EM-homo-PCD

algorithm. It is also observed that as the noise level of background knowledge increases, the

performance of the EM-kernel-PCD algorithm and the EM-binning-PCD algorithm deteriorates.

However, as long as the noise level is moderate, the performance of our EM-kernel-PCD algorithm

is satisfactory. The results from the tree-structure HMRFs and the grid-structure HMRFs are

comparable except that it takes more iterations to converge in grid-structure HMRFs than in tree-

structure HMRFs.
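The 0.125 figure is easy to check numerically: with θi ~ Unif(0.5, 1) and every estimate tied to the midpoint 0.75, the expected absolute error is E|θ − 0.75|.

```python
import numpy as np

# Dense grid over [0.5, 1.0]; the grid average approximates the integral
# of 2|x - 0.75| (the uniform density is 2 on an interval of width 0.5).
theta = np.linspace(0.5, 1.0, 1_000_001)
err = np.abs(theta - 0.75).mean()
print(round(err, 3))  # -> 0.125
```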

Behavior of the algorithms We then plot the estimated parameters against their background

knowledge in the iterations of our EM-kernel-PCD algorithm. We provide plots for after 100

iterations, after 200 iterations and after convergence respectively, to show how the EM-kernel-

PCD algorithm behaves during the gradient ascent. Figure 6.4 shows the plots for the background

knowledge ki = sin θi + ε and the background knowledge ki = θi² + ε with three levels of noise

(namely σε=0, 0.01 and 0.02) for both the tree-structure HMRFs and the grid-structure HMRFs.

It is observed that as the algorithm iterates, it gradually recovers the relationship between the

parameters and the background knowledge. There is still a gap between our estimate and the

ground truth. This is because we only have one hidden instantiation x and we have to infer x from

the observed y in the E-step. Especially at the boundaries, we can observe a certain amount of

estimate bias. The boundary bias is very common in kernel regression problems because there are

fewer data points at the boundaries [41].

Choosing parameter n One parameter in contrastive divergence algorithms is n, the number

of MCMC steps we need to perform under the current parameters in order to generate the particles.

The rationale of contrastive divergence is that it is enough to find the direction to update the

parameters by a few MCMC steps using the current parameters, and we do not have to reach the


Figure 6.4: The behavior of the EM-kernel-PCD algorithm during gradient ascent for differenttypes of background knowledge with different levels of noise in the tree-structure HMRFs and thegrid-structure HMRFs. The red dots show the mapping pattern between the ground truth of theparameters and their background knowledge.

equilibrium. Therefore, the parameter n is usually set to be very small to save computation when

we are learning general Markov random fields. Here we explore how we should choose the n

parameter in our EM-kernel-PCD algorithm for learning HMRFs. We choose three values for n

in the simulations, namely 1, 5 and 10. In Figure 6.5, the running time and absolute estimate


error are plotted for the three choices in the tree-structure HMRFs and grid-structure HMRFs

under different levels of noise in the background knowledge ki = sin θi + ε and the background

knowledge ki = θi² + ε. The running time increases as n increases, but the estimation accuracy

does not increase. This observation stays the same for different structures and different levels of

noise in different types of background knowledge. This suggests that we can simply choose n = 1

in our EM-kernel-PCD algorithm.

6.5 Real-world Application

We use our EM-kernel-PCD algorithm to learn a heterogeneous HMRF model in a real-world

genome-wide association study on breast cancer. The dataset is from NCI’s Cancer Genetics

Markers of Susceptibility (CGEMS) study [82]. Details about the CGEMS dataset are provided in

Subsection 3.4.1. We build a heterogeneous HMRF model to identify the associated SNPs. In

the HMRF model, the hidden vector X ∈ {0, 1}d denotes whether the SNPs are associated with

breast cancer, i.e. Xi = 1 means that the SNPi is associated with breast cancer. For each SNP,

we can perform a two-proportion z-test from the minor allele count in cases and the minor allele

count in controls. Denote Yi to be the test statistic from the two-proportion z-test for SNPi. It

can be derived that Yi|Xi=0 ∼ N(0, 1) and Yi|Xi=1 ∼ N(µ1, 1) for some unknown µ1 (µ1 ≠ 0).
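A statistic of this kind can be computed from the allele counts as follows. This is a sketch using the pooled-variance form of the two-proportion z-test, with hypothetical counts rather than CGEMS data:

```python
import math

def two_proportion_z(count1, n1, count2, n2):
    # z statistic comparing minor-allele frequency count1/n1 in cases
    # against count2/n2 in controls, with pooled variance under H0.
    p1, p2 = count1 / n1, count2 / n2
    p = (count1 + count2) / (n1 + n2)
    se = math.sqrt(p * (1.0 - p) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

# Hypothetical SNP: 300 of 2000 case alleles are minor vs. 240 of 2000
# control alleles.
y = two_proportion_z(300, 2000, 240, 2000)  # z is approximately 2.78
```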

We assume that X forms a pairwise Markov random field with respect to the graph G. The

graph G is built as follows. We query the squared correlation coefficients (r2 values) among

the SNPs from HapMap [175]. Each SNP becomes a node in the graph. For each SNP, we

connect it with the SNP having the highest r2 value with it. We also remove the edges whose

r2 values are below 0.25. There are in total 340,601 edges in the graph. The pairwise potential

function φi on edge i (connecting Xu and Xv) parameterized by θi (0 < θi < 1) is φi(X; θi) = θi^I(Xu=Xv) (1 − θi)^I(Xu≠Xv) for i = 1, ..., 340,601, where I(·) is the indicator function. It is believed

that two SNPs with a higher level of correlation are more likely to agree in their association with

breast cancer. Therefore, we set the background knowledge k about the parameters to be the

r2 values between the SNPs on the edge. We first perform the two-proportion z-test and set y


Figure 6.5: Absolute estimate error (plotted in blue, in the units on the right axes) and running time (plotted in black, in seconds on the left axes) of the EM-kernel-PCD algorithm in the tree-structure HMRFs and the grid-structure HMRFs for n = 1, 5 and 10, where n is the number of MCMC steps for advancing particles in the PCD algorithm, under noise levels σε = 0, 0.01 and 0.02 for both types of background knowledge. The absolute estimate error in the first 400 iterations is not shown in the plots.

to be the calculated test statistics. Then we estimate θ|y,k in the heterogeneous HMRF with

respect to G using our EM-kernel-PCD algorithm. After we estimate θ and µ1, we calculate the

marginal probabilities of the hidden X. Eventually, we rank the SNPs by the marginal probabilities


Figure 6.6: The estimated parameters against their background knowledge, namely the r2 valuesbetween the pairs of SNPs.

P (Xi = 1|y; θ, µ1), and select the SNPs with the largest marginal probabilities.
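The graph construction described above (connect each SNP to its highest-r² partner, then drop edges with r² below 0.25) can be sketched as follows; the r² matrix here is synthetic rather than queried from HapMap:

```python
import numpy as np

def build_ld_graph(r2, threshold=0.25):
    # r2: symmetric matrix of squared correlations with zero diagonal.
    # Returns undirected edges (u, v) with u < v.
    edges = set()
    for u in range(r2.shape[0]):
        v = int(np.argmax(r2[u]))          # strongest LD partner of u
        if r2[u, v] >= threshold:
            edges.add((min(u, v), max(u, v)))
    return edges

rng = np.random.default_rng(1)
m = rng.uniform(0.0, 1.0, (6, 6))
r2 = (m + m.T) / 2.0                       # symmetrize
np.fill_diagonal(r2, 0.0)
graph = build_ld_graph(r2)
```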

The algorithm ran for 46 days on a single processor (AMD Opteron Processor, 3300 MHz)

before it converged. We plotted the estimated parameters against their background knowledge,

namely the r2 values between the pairs of SNPs on the edges. The plot is provided in Figure 6.6. It

is observed that the mapping between the estimated parameters and the background knowledge is

monotone increasing, as we expect. Finally we calculated the marginal probabilities of the hidden

X, and ranked the SNPs by the marginal probabilities P (Xi = 1|y; θ, µ1). There are in total five

SNPs with P (Xi = 1|y; θ, µ1) greater than 0.99, which means they are associated with breast

cancer with a probability greater than 0.99 given the observed test statistics y under the estimated

parameters θ and µ1. There is strong evidence in the literature that supports the association with

breast cancer for three of them. The two SNPs rs2420946 and rs1219648 on chromosome 10 are

reported by Hunter et al. (2007), and have been further validated by 1,776 cases and 2,072 controls

from three additional studies. Their associated gene FGFR2 is very well known to be associated

with breast cancer in the literature. There is also strong evidence supporting the association of the

SNP rs7712949 on chromosome 5. The SNP rs7712949 is highly correlated (r2=0.948) with SNP

rs4415084 which has been identified to be associated with breast cancer by another six large-scale


studies.¹

6.6 Discussion

Capturing parameter heterogeneity is an important issue in machine learning and statistics, and it is particularly challenging in HMRFs due to both the intractable Z(θ) and the latent x. In this

chapter, we propose the EM-kernel-PCD algorithm for learning the heterogeneous parameters

with background knowledge. Our algorithm is built upon the PCD algorithm which handles the

intractable Z(θ). The EM part we add is for dealing with the hidden x. The kernel smoothing

part we add is to adaptively incorporate the background knowledge about the heterogeneity in

parameters in the gradient ascent learning. Eventually, the relation between the parameters and

the background knowledge is recovered in a nonparametric way, which is also adaptive to the

data. Simulations show that our algorithm is effective for learning heterogeneous HMRFs and

outperforms alternative binning methods.

Similar to other EM algorithms, our algorithm converges only to a local maximum of the likelihood L(θ, ϕ), although the lower bound F(qx(x), {θ, ϕ}) is nondecreasing over the EM iterations (except for some MCMC error introduced in the E-step). Our algorithm also suffers from a long running time due to the computationally expensive PCD algorithm within each M-step. These two issues are

important directions for future work.

The material in this chapter first appeared in the 17th International Conference on Artificial

Intelligence and Statistics (AISTATS’2014) as follows:

Jie Liu, Chunming Zhang, Elizabeth Burnside and David Page. Learning Heterogeneous

Hidden Markov Random Fields. The 17th International Conference on Artificial Intelligence and

Statistics (AISTATS), 2014.

This chapter discusses learning heterogeneous parameters in graphical model with some back-

ground knowledge about these parameters. When there is no such background knowledge, it can

be beneficial to group the parameters for more efficient learning. The next chapter imposes Dirichlet process priors over the parameters to specify these latent parameter groups, and estimates the parameters in a Bayesian framework. The chapter demonstrates that it can indeed be beneficial to group the parameters, even if we do not have domain-specific background knowledge about what the grouping should be.

¹ http://snpedia.com/index.php/rs4415084


Chapter 7

Bayesian Estimation of Latently-grouped Parameters in Graphical Models

In large-scale applications of undirected graphical models, such as social networks and biological

networks, similar patterns occur frequently and give rise to similar parameters. In this situation,

it is beneficial to group the parameters for more efficient learning. We show that even when

the grouping is unknown, we can infer these parameter groups during learning via a Bayesian

approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually

involves calculating intractable terms, and we propose two approximation algorithms, namely

a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with

“stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform

conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs

sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better

than the models learned by MLE on real-world Senate voting data.


7.1 Introduction

Undirected graphical models, a.k.a. Markov random fields (MRFs), have many real-world ap-

plications such as social networks and biological networks. In these large-scale networks, similar

kinds of relations can occur frequently and give rise to repeated occurrences of similar parameters,

but the grouping pattern among the parameters is usually unknown. For a social network example,

suppose that we collect voting data over the last 20 years from a group of 1,000 people who are

related to each other through different types of relations (such as family, co-workers, classmates,

friends and so on), but the relation types are usually unknown. If we use a binary pairwise MRF

to model the data, each binary node denotes one person’s vote, and two nodes are connected if the

two people are linked in the social network. Eventually we want to estimate the pairwise potential

functions on edges, which can provide insights about how the relations between people affect their

decisions. This can be done via standard maximum likelihood estimation (MLE), but the latent

grouping pattern among the parameters is totally ignored, and the model can be overparametrized.

Therefore, two questions naturally arise. Can MRF parameter learners automatically identify

these latent parameter groups during learning? Will this further abstraction make the model gen-

eralize better, analogous to the lessons we have learned from hierarchical modeling [61] and topic

modeling [20]?

This chapter shows that it is feasible and potentially beneficial to identify the latent parameter

groups during MRF parameter learning. Specifically, we impose a Dirichlet process prior on the

parameters to accommodate our uncertainty about the number of the parameter groups. Posterior

inference can be done by Markov chain Monte Carlo with proper approximations. We propose

two approximation algorithms, a Metropolis-Hastings algorithm with auxiliary variables and a

Gibbs sampling algorithm with stripped Beta approximation (Gibbs SBA). Algorithmic details

are provided in Section 7.3 after we review related parameter estimation methods in Section 7.2.

In Section 7.4, we evaluate our Bayesian estimates and the classical MLE on different models,

and both algorithms outperform classical MLE. The Gibbs SBA algorithm performs very close to

the Gibbs sampling algorithm with exact likelihood calculation. Models learned with Gibbs SBA


also generalize better than the models learned by MLE on real-world Senate voting data in Section

7.5. We finally conclude in Section 7.6.

7.2 Maximum Likelihood Estimation and Bayesian Estimation for MRFs

Let X = {0, 1, ...,m−1} be a discrete space. Suppose that we have an MRF defined on a random

vector X ∈ X d described by an undirected graph G(V, E) with d nodes in the node set V and r

edges in the edge set E . The probability of one sample x from the MRF parameterized by θ is

P(x; θ) = P̃(x; θ)/Z(θ), (7.1)

where Z(θ) is the partition function, P̃(x; θ) = ∏c∈C(G) φc(x; θc) is an unnormalized measure, C(G) is some subset of cliques in G, and φc is the potential function defined on the

clique c parameterized by θc. In this chapter, we consider binary pairwise MRFs for simplicity,

i.e., C(G) = E and m = 2. We also assume that each potential function φc is parameterized by one parameter θc, namely φc(X; θc) = θc^I(Xu=Xv) (1 − θc)^I(Xu≠Xv), where I(Xu=Xv) indicates whether the two nodes u and v connected by edge c take the same value, and 0 < θc < 1 for all c = 1, ..., r. Thus,

θ={θ1, ..., θr}. Suppose that we have n independent samples X={x1, ...,xn} from (7.1), and we

want to estimate θ.

Maximum Likelihood Estimate: The MLE of θ maximizes the log-likelihood function L(θ|X), which is concave w.r.t. θ. Therefore, we can use gradient ascent to find the global maximum of the likelihood function and thereby the MLE of θ. The partial derivative of L(θ|X) with respect to θi is ∂L(θ|X)/∂θi = (1/n) ∑_{j=1}^{n} ψi(x^j) − Eθψi = EXψi − Eθψi, where ψi is the sufficient statistic corresponding

to θi after we rewrite the density into the exponential family form, and Eθψi is the expectation of

ψi with respect to the distribution specified by θ. However the exact computation of Eθψi takes

time exponential in the treewidth of G. A few sampling-based methods have been proposed, with

different ways of generating particles and computing Eθψ from the particles, including MCMC-


MLE [66, 218], particle-filtered MCMC-MLE [5], contrastive divergence [80] and its variations

such as persistent contrastive divergence (PCD) [178] and fast PCD [179]. Note that contrastive

divergence is related to pseudo-likelihood [18], ratio matching [83, 84], and together with other

MRF parameter estimators [72, 184, 71] can be unified as minimum KL contraction [112].
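All of these sampling-based estimators share the gradient form EXψi − Eθψi, with the model expectation replaced by an average over particles. A generic sketch of one ascent step (the particles below are random stand-ins, not the output of any particular sampler from the cited work):

```python
import numpy as np

def gradient_step(theta, data_stats, particles, psi, lr=0.1):
    # One ascent step: gradient_i = E_X[psi_i] - E_theta[psi_i], with the
    # model expectation approximated by the particle average.
    model_stats = np.mean([psi(p) for p in particles], axis=0)
    return theta + lr * (data_stats - model_stats)

# Toy model: psi counts agreements on the 3 edges of a 4-node chain.
edges = [(0, 1), (1, 2), (2, 3)]
psi = lambda x: np.array([float(x[u] == x[v]) for u, v in edges])
rng = np.random.default_rng(0)
data = rng.integers(0, 2, (50, 4))
data_stats = np.mean([psi(x) for x in data], axis=0)
particles = rng.integers(0, 2, (100, 4))   # stand-in for MCMC particles
theta = gradient_step(np.zeros(3), data_stats, particles, psi)
```

The estimators differ mainly in how the particles are generated and refreshed between steps.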

Bayesian Estimate: Let π(θ) be a prior of θ; then its posterior is P(θ|X) ∝ π(θ)P̃(X; θ)/Z(θ).

The Bayesian estimate of θ is its posterior mean. Exact sampling from P (θ|X) is known as

doubly-intractable for general MRFs [128]. If we use the Metropolis-Hastings algorithm, then

Metropolis-Hastings ratio is

a(θ∗|θ) = [π(θ∗)P̃(X; θ∗)Q(θ|θ∗)/Z(θ∗)] / [π(θ)P̃(X; θ)Q(θ∗|θ)/Z(θ)], (7.2)

where Q(θ∗|θ) is some proposal distribution from θ to θ∗, and with probability min{1, a(θ∗|θ)}

we accept the move from θ to θ∗. The real hurdle is that we have to evaluate the intractable

Z(θ)/Z(θ∗) in the ratio. In [123], Møller et al. introduce one auxiliary variable y on the same

space as x, and the state variable is extended to (θ,y). They set the new proposal distribution for

the extended state Q(θ,y|θ∗,y∗)=Q(θ|θ∗)P (y;θ)/Z(θ) to cancel Z(θ)/Z(θ∗) in (7.2). There-

fore by ignoring y, we can generate the posterior samples of θ via Metropolis-Hastings. Techni-

cally, this auxiliary variable approach requires perfect sampling [143], but [123] pointed out that

other simpler Markov chain methods also work with the proviso that it converges adequately to

the equilibrium distribution.
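In log space, the cancellation of the partition functions leaves only unnormalized terms. A sketch of the resulting acceptance computation on a toy single-edge model (the potential and all numeric values here are hypothetical):

```python
import math

def log_p_tilde(x, th):
    # Unnormalized log density of one binary edge: log(th) if the two
    # endpoints agree, log(1 - th) otherwise.
    return math.log(th if x[0] == x[1] else 1.0 - th)

def moller_log_ratio(x, theta, theta_star, y, y_star, theta_hat,
                     log_prior_ratio=0.0, log_q_ratio=0.0):
    # Log MH ratio with one auxiliary variable: because y* is proposed
    # from the model at theta*, Z(theta)/Z(theta*) cancels.
    return (log_prior_ratio + log_q_ratio
            + log_p_tilde(x, theta_star) - log_p_tilde(x, theta)
            + log_p_tilde(y_star, theta_hat) - log_p_tilde(y, theta_hat)
            + log_p_tilde(y, theta) - log_p_tilde(y_star, theta_star))

x, y, y_star = (1, 1), (0, 1), (0, 1)
log_a = moller_log_ratio(x, theta=0.6, theta_star=0.7,
                         y=y, y_star=y_star, theta_hat=0.6)
# Accept the move with probability min(1, exp(log_a)).
```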

7.3 Bayesian Parameter Estimation for MRFs with Dirichlet Process Prior

In order to model the latent parameter groups, we impose a Dirichlet process prior on θ, which

accommodates our uncertainty about the number of groups. Then, the generating model is


G ∼ DP(α0, G0)
θi | G ∼ G,  i = 1, ..., r        (7.3)
xj | θ ∼ F(θ),  j = 1, ..., n,

where F (θ) is the distribution specified by (7.1). G0 is the base distribution (e.g. Unif(0, 1)),

and α0 is the concentration parameter. With probability 1.0, the distribution G drawn from

DP(α0, G0) is discrete, and places its mass on a countably infinite collection of atoms drawn

from G0. In this model, X={x1, ...,xn} is observed, and we want to perform posterior inference

for θ = (θ1, θ2, ..., θr), and regard its posterior mean as its Bayesian estimate. We propose two

Markov chain Monte Carlo (MCMC) methods. One is a Metropolis-Hastings algorithm with aux-

iliary variables, as introduced in Section 7.3.1. The second is a Gibbs sampling algorithm with

stripped Beta approximation, as introduced in Section 7.3.2. In both methods, the state of the

Markov chain is specified by two vectors, c and φ. In vector c = (c1, ..., cr), ci denotes the group

to which θi belongs. φ = (φ1, ..., φk) records the k distinct values in {θ1, ..., θr} with φci = θi

for i = 1, ..., r. This way of specifying the Markov chain is more efficient than setting the state

variable directly to be (θ1, θ2, ..., θr) [131].
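The discreteness of G means that draws of θ1, ..., θr cluster into groups, following the Chinese restaurant process. A sketch of sampling such a grouping c from the prior alone (ignoring the data):

```python
import random

def crp_assignments(r, alpha0, rng=None):
    # Item i joins existing group g with probability n_g / (i + alpha0),
    # or opens a new group with probability alpha0 / (i + alpha0).
    rng = rng or random.Random(0)
    c, counts = [], []
    for i in range(r):
        u = rng.random() * (i + alpha0)
        g, acc = len(counts), 0.0          # default: a new group
        for j, n in enumerate(counts):
            acc += n
            if u < acc:
                g = j
                break
        if g == len(counts):
            counts.append(0)
        counts[g] += 1
        c.append(g)
    return c

c = crp_assignments(r=300, alpha0=1.0)     # expect roughly alpha0 * ln(r) groups
```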

7.3.1 Metropolis-Hastings (MH) with Auxiliary Variables

In the MH algorithm (see Algorithm 3), the initial state of the Markov chain is set by performing

K-means clustering on the MLE of θ (e.g., from the PCD algorithm [178]) with K = ⌊α0 ln r⌋. The

Markov chain resembles Algorithm 5 in [131], and it is ergodic. We move the Markov chain

forward for T steps. In each step, we update c first and then update φ. We update each element

of c in turn; when resampling ci, we fix c−i, all elements in c other than ci. When updating ci,

we repeatedly for M times propose a new value c∗i according to proposal Q(c∗i |ci) and accept the

move with probability min{1, a(c∗i |ci)} where a(c∗i |ci) is the MH ratio. After we update every

element of c in the current iteration, we draw a posterior sample of φ according to the current


grouping c. We iterate T times, and get T posterior samples of θ. Unlike the tractable Algorithm

5 in [131], we need to introduce auxiliary variables to bypass MRF’s intractable likelihood in two

places, namely calculating the MH ratio and drawing samples of φ|c, as described in the following two subsections.

Calculating Metropolis-Hastings Ratio

Algorithm 3 The Metropolis-Hastings algorithm
Input: observed data X = {x1, ..., xn}
Output: θ(1), ..., θ(T); T samples of θ|X
Procedure:
Perform the PCD algorithm to get θ̂, the MLE of θ
Initialize c and φ via K-means on θ̂; K = ⌊α0 ln r⌋
for t = 1 to T do
  for i = 1 to r do
    for l = 1 to M do
      Draw a candidate c∗i from Q(c∗i|ci)
      If c∗i ∉ c, draw a value for φc∗i from G0
      Set ci = c∗i with probability min{1, a(c∗i|ci)}
    end for
  end for
  Draw a posterior sample of φ according to the current c, and set θ(t)i = φci for i = 1, ..., r
end for

The MH ratio of proposing a new value c∗i for ci according to proposal Q(c∗i |ci) is


a(c∗i|ci) = [π(c∗i, c−i)P(X; θ.∗i)Q(ci|c∗i)] / [π(ci, c−i)P(X; θ)Q(c∗i|ci)]
= [π(c∗i|c−i)P̃(X; θ.∗i)Q(ci|c∗i)/Z(θ.∗i)] / [π(ci|c−i)P̃(X; θ)Q(c∗i|ci)/Z(θ)],

where θ.∗i is the same as θ except its i-th element is replaced with φc∗i . The conditional prior

π(c∗i |c−i) is

π(ci = c|c−i) = n−i,c/(r − 1 + α0) if c ∈ c−i, and α0/(r − 1 + α0) if c ∉ c−i,

where n−i,c is the number of cj with j ≠ i and cj = c. We choose the proposal Q(c∗i|ci) to be the condi-

tional prior π(c∗i |c−i), and the Metropolis-Hastings ratio can be further simplified as

a(c∗i|ci) = P̃(X; θ.∗i)Z(θ) / [P̃(X; θ)Z(θ.∗i)].

However, Z(θ)/Z(θ.∗i ) is intractable. Similar to [123], we introduce an auxiliary variable

Z on the same space as X, and the state variable is extended to (c,Z). When proposing a

move, we propose c∗i first and then propose Z∗ with proposal P (Z;θ.∗i ) to cancel the intractable

Z(θ)/Z(θ.∗i). We set the target distribution of Z to be P(Z; θ̃), where θ̃ is some estimate of θ

(e.g. from PCD [178]). Then, the MH ratio with the auxiliary variable is

a(c∗i, Z∗|ci, Z) = [P(Z∗; θ̃)P̃(X; θ.∗i)P̃(Z; θ)] / [P(Z; θ̃)P̃(X; θ)P̃(Z∗; θ.∗i)].

Thus, the intractable computation of the MH ratio is replaced by generating particles Z∗ and

Z under θ.∗i and θ respectively. Ideally, we should use perfect sampling [143], but it is intractable

for general MRFs. As a compromise, we use standard Gibbs sampling with long runs to generate

these particles.
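A single systematic-scan sweep of such a Gibbs sampler, for the binary pairwise potentials defined in Section 7.2, can be sketched as follows (toy graph, not the real model):

```python
import random

def gibbs_sweep(x, neighbors, theta, rng):
    # One sweep over a binary pairwise MRF whose edge potentials are
    # theta_c^{I(Xu=Xv)} (1 - theta_c)^{I(Xu!=Xv)}.
    # neighbors[u] lists pairs (v, c): node v shares edge c with node u.
    for u in range(len(x)):
        w = [1.0, 1.0]                     # unnormalized P(x_u=0), P(x_u=1)
        for v, c in neighbors[u]:
            for val in (0, 1):
                w[val] *= theta[c] if x[v] == val else 1.0 - theta[c]
        x[u] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
    return x

# Toy 3-node chain: edge 0 joins nodes 0-1, edge 1 joins nodes 1-2.
neighbors = {0: [(1, 0)], 1: [(0, 0), (2, 1)], 2: [(1, 1)]}
theta = [0.9, 0.9]
rng = random.Random(0)
x = [0, 1, 0]
for _ in range(100):                       # "long run" toward equilibrium
    x = gibbs_sweep(x, neighbors, theta, rng)
```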


Drawing Posterior Samples of φ|c

We draw posterior samples of φ under grouping c via the MH algorithm, again following [123].

The state of the Markov chain is φ. The initial state of the Markov chain is set by running

PCD [178] with parameters tied according to c. The proposal Q(φ∗|φ) is a k-variate Gaussian

N(φ, σ²Q Ik), where σ²Q Ik is the covariance matrix. The auxiliary variable Y is on the same space

as X, and the state is extended to (φ,Y). The proposal distribution for the extended state variable

is Q(φ, Y|φ∗, Y∗) = Q(φ|φ∗)P̃(Y; φ)/Z(φ). We set the target distribution of Y to be P(Y; φ̃), where φ̃ is some estimate of φ, such as the estimate from the PCD algorithm [178]. Then, the MH

ratio for the extended state is

a(φ∗, Y∗|φ, Y) = I(φ∗ ∈ Θ) [P(Y∗; φ̃)P̃(X; φ∗)P̃(Y; φ)] / [P(Y; φ̃)P̃(X; φ)P̃(Y∗; φ∗)],

where I(φ∗ ∈Θ) indicates that every dimension of φ∗ is in the domain of G0. We set the state

to be the new values with probability min{1, a(φ∗,Y∗|φ,Y)}. We move the Markov chain for S

steps, and get S samples of φ by ignoring Y. Eventually we draw one sample from them at random.


7.3.2 Gibbs Sampling with Stripped Beta Approximation

Algorithm 4 The Gibbs sampling algorithm
Input: observed data X = {x1, x2, ..., xn}
Output: θ(1), ..., θ(T); T posterior samples of θ|X
Procedure:
Perform the PCD algorithm to get the MLE θ̂
Initialize c and φ via K-means on θ̂; K = ⌊α0 ln r⌋
for t = 1 to T do
  for i = 1 to r do
    If the current ci is unique in c, remove φci from φ
    Update ci according to (7.4)
    If the new ci ∉ c, draw a value for φci and add it to φ
  end for
  Draw a posterior sample of φ according to the current c, and set θ(t)i = φci for i = 1, ..., r
end for

In the Gibbs sampling algorithm (see Algorithm 4), the initialization of the Markov chain is

exactly the same as in the MH algorithm in Section 7.3.1. The Markov chain resembles Algorithm

2 in [131] and it can be shown to be ergodic. We move the Markov chain forward for T steps. In

each of the T steps, we update c first and then update φ. When we update c, we fix the values in

φ, except we may add one new value to φ or remove a value from φ. We update each element of c

in turn. When we update ci, we first examine whether ci is unique in c. If so, we remove φci from

φ first. We then update ci by assigning it to an existing group or a new group with a probability

proportional to a product of two quantities, namely


P(c_i = c | c_{−i}, X, φ_{c_{−i}}) ∝
  [n_{−i,c} / (r − 1 + α_0)] · P(X; φ_c, φ_{c_{−i}}),          if c ∈ c_{−i},
  [α_0 / (r − 1 + α_0)] · ∫ P(X; θ_i, φ_{c_{−i}}) dG_0(θ_i),   if c ∉ c_{−i}.      (7.4)

The first quantity is n−i,c, the number of members already in group c. For starting a new

group, the quantity is α0. The second quantity is the likelihood of X after assigning ci to the new

value c conditional on φc−i . When considering a new group, we integrate the likelihood w.r.t.

G0. After ci is resampled, it is either set to be an existing group or a new group. If a new group

is assigned, we draw a new value for φci , and add it to φ. After updating every element of c

in the current iteration, we draw a posterior sample of φ under the current grouping c. In total,

we run T iterations, and get T posterior samples of θ. This Gibbs sampling algorithm involves

two intractable calculations, namely (i) calculating P(X; φ_c, φ_{c_{−i}}) and ∫ P(X; θ_i, φ_{c_{−i}}) dG_0(θ_i) in (7.4), and (ii) drawing posterior samples for φ. We use a stripped Beta approximation in both places, as described in the following two subsections.
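The update rule in (7.4) can be sketched as a standard Chinese-restaurant-process Gibbs step. The snippet below substitutes a toy per-parameter Bernoulli likelihood for the full MRF likelihood P(X; ·), and omits the singleton-removal and posterior-redraw steps of Algorithm 4; all names and numbers are illustrative:

```python
import random

def bern_lik(k, n, p):
    """Likelihood of k successes in n Bernoulli(p) trials (toy stand-in
    for the intractable MRF likelihood P(X; phi_c, phi_{c-i}))."""
    return (p ** k) * ((1.0 - p) ** (n - k))

def update_c_i(i, c, phi, counts, n, alpha0, n_mc=50):
    """Resample group label c[i] per (7.4): an existing group c gets weight
    n_{-i,c}/(r-1+alpha0) times the likelihood at phi_c; a new group gets
    weight alpha0/(r-1+alpha0) times the likelihood integrated over
    G0 = Unif(0, 1), approximated here by Monte Carlo."""
    r, k = len(c), counts[i]
    others = [c[j] for j in range(r) if j != i]
    weights, labels = [], []
    for g in sorted(set(others)):
        weights.append(others.count(g) / (r - 1 + alpha0) * bern_lik(k, n, phi[g]))
        labels.append(g)
    integral = sum(bern_lik(k, n, random.random()) for _ in range(n_mc)) / n_mc
    weights.append(alpha0 / (r - 1 + alpha0) * integral)
    labels.append(max(phi) + 1)                  # fresh label for a new group
    u, acc = random.random() * sum(weights), 0.0
    for w, lab in zip(weights, labels):
        acc += w
        if u <= acc:
            break
    if lab not in phi:
        phi[lab] = random.random()               # draw its phi from G0
    return lab

random.seed(1)
counts = [8, 9, 1, 2, 8]          # successes out of n=10 trials per parameter
c = [0, 0, 1, 1, 0]
phi = {0: 0.8, 1: 0.15}
for _ in range(20):
    for i in range(len(c)):
        c[i] = update_c_i(i, c, phi, counts, n=10, alpha0=1.0)
print(c)
```

With counts like these, parameters with similar empirical rates tend to end up sharing a group label, mirroring how (7.4) trades off group popularity n_{−i,c} against fit to the data.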

Calculating P(X; φ_c, φ_{c_{−i}}) and ∫ P(X; θ_i, φ_{c_{−i}}) dG_0(θ_i) in (7.4)

In Formula (7.4), we evaluate P (X;φc, φc−i) for different φc values with φc−i fixed and X =

{x1,x2, ...,xn} observed. For ease of notation, we rewrite this quantity as a likelihood function

of θ_i, L(θ_i | X, θ_{−i}), where θ_{−i} = {θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_r} is fixed. Suppose that edge i connects variables X_u and X_v, and denote by X_{−uv} the variables other than X_u and X_v.

Then

L(θ_i | X, θ_{−i}) = ∏_{j=1}^n P(x_u^j, x_v^j | x_{−uv}^j; θ_i, θ_{−i}) P(x_{−uv}^j; θ_i, θ_{−i})
≈ ∏_{j=1}^n P(x_u^j, x_v^j | x_{−uv}^j; θ_i, θ_{−i}) P(x_{−uv}^j; θ_{−i})
∝ ∏_{j=1}^n P(x_u^j, x_v^j | x_{−uv}^j; θ_i, θ_{−i}).

Above we approximate P(x_{−uv}^j; θ_i, θ_{−i}) with P(x_{−uv}^j; θ_{−i}) because the density of X_{−uv}


mostly depends on θ_{−i}. The term P(x_{−uv}^j; θ_{−i}) can be dropped since θ_{−i} is fixed, and we only have to consider P(x_u^j, x_v^j | x_{−uv}^j; θ_i, θ_{−i}). Since θ_{−i} is fixed and we are conditioning on x_{−uv}^j, together they can be regarded as a fixed potential function encoding how strongly the rest of the graph believes that X_u and X_v should take the same value. Suppose that this fixed potential function (the message from the rest of the network, x_{−uv}^j) is parameterized by η_i (0 < η_i < 1). Then

∏_{j=1}^n P(x_u^j, x_v^j | x_{−uv}^j; θ_i, θ_{−i}) ∝ ∏_{j=1}^n λ^{I(x_u^j = x_v^j)} (1 − λ)^{I(x_u^j ≠ x_v^j)} = λ^{∑_{j=1}^n I(x_u^j = x_v^j)} (1 − λ)^{∑_{j=1}^n I(x_u^j ≠ x_v^j)}      (7.5)

where λ = θ_i η_i / {θ_i η_i + (1 − θ_i)(1 − η_i)}. The right-hand side of (7.5) resembles a Beta distribution with parameters (∑_{j=1}^n I(x_u^j = x_v^j) + 1, n − ∑_{j=1}^n I(x_u^j = x_v^j) + 1), except that only part of λ, namely θ_i, is random. We want to use a Beta distribution to approximate the likelihood with respect to θ_i, which requires removing the contribution of η_i and keeping only the contribution from θ_i. We choose Beta(⌊nθ̂_i⌋ + 1, n − ⌊nθ̂_i⌋ + 1), where θ̂_i is the MLE of θ_i (e.g. from the PCD algorithm). We name this approximation the stripped Beta approximation. The simulation results in Section 7.4.2 indicate that its performance is very close to that of the exact calculation. Moreover, the approximation requires only as much computation as tractable tree-structure MRFs, and it does not require generating expensive particles as the MH algorithm with auxiliary variables does. The integral ∫ P(X; θ_i, φ_{c_{−i}}) dG_0(θ_i) in (7.4) can be calculated via Monte Carlo approximation: we draw a number of samples of θ_i from G_0, evaluate P(X; θ_i, φ_{c_{−i}}) at each, and take the average.
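Evaluating the stripped Beta approximation is inexpensive; a minimal sketch (the function name is ours) is:

```python
import math

def stripped_beta_loglik(theta, theta_hat, n):
    """Log density of the stripped Beta approximation
    Beta(floor(n*theta_hat) + 1, n - floor(n*theta_hat) + 1) at theta,
    where theta_hat is the MLE (e.g. from the PCD algorithm)."""
    a = math.floor(n * theta_hat) + 1
    b = n - math.floor(n * theta_hat) + 1
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)

# the approximate likelihood peaks near the MLE and tightens as n grows
print(stripped_beta_loglik(0.6, 0.6, 100) > stripped_beta_loglik(0.4, 0.6, 100))  # True
```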

Drawing Posterior Samples of φ|c

The stripped Beta approximation also allows us to draw posterior samples from φ|c approximately. Suppose that there are k groups according to c, and that we have estimates for φ, denoted φ̂ = (φ̂_1, ..., φ̂_k). We denote the numbers of elements in the k groups by m = {m_1, ..., m_k}. For group i, we draw a posterior sample for φ_i from Beta(⌊m_i n φ̂_i⌋ + 1, m_i n − ⌊m_i n φ̂_i⌋ + 1).
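Drawing the approximate posterior sample for each group then reduces to a single Beta draw; a sketch with made-up estimates φ̂ and group sizes m:

```python
import random

def sample_phi_posterior(phi_hat, m, n):
    """Draw one approximate posterior sample per group from
    Beta(floor(m_i*n*phi_hat_i) + 1, m_i*n - floor(m_i*n*phi_hat_i) + 1)."""
    draws = []
    for ph, mi in zip(phi_hat, m):
        s = int(mi * n * ph)               # floor for non-negative values
        draws.append(random.betavariate(s + 1, mi * n - s + 1))
    return draws

random.seed(2)
samples = sample_phi_posterior(phi_hat=[0.2, 0.7], m=[3, 5], n=200)
print(samples)   # each draw concentrates near the corresponding phi_hat
```

The larger a group (or the sample size), the more concentrated its Beta posterior becomes, which is the intended pooling effect of tying parameters within a group.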


7.4 Simulations

We investigate the performance of our Bayesian estimators on three models: (i) a tree-MRF, (ii)

a small grid-MRF whose likelihood is tractable, and (iii) a large grid-MRF whose likelihood is

intractable. We first set the ground truth of the parameters, and then generate training and testing

samples. On training data, we apply our grouping-aware Bayesian estimators and two baseline

estimators, namely a grouping-blind estimator and an oracle estimator. The grouping-blind esti-

mator does not know groups exist in the parameters, and estimates the parameters in the normal

MLE fashion. The oracle estimator knows the ground truth of the groupings, and ties the parame-

ters from the same group and estimates them via MLE. For the tree-MRF, our Bayesian estimator

is exact since the likelihood is tractable. For the small grid-MRF, we have three variations for the

Bayesian estimator, namely Gibbs sampling with exact likelihood computation, MH with auxil-

iary variables, and Gibbs sampling with stripped Beta approximation. For the large grid-MRF, the

computational burden only allows us to apply Gibbs sampling with stripped Beta approximation.

We compare the estimators by three measures. The first is the average absolute error of estimate, (1/r) ∑_{i=1}^r |θ̂_i − θ_i|, where θ̂_i is the estimate of θ_i. The second measure is the log likelihood

of the testing data, or the log pseudo-likelihood [18] of the testing data when exact likelihood is

intractable. Thirdly, we evaluate how informative the grouping yielded by the Bayesian estimator is. We use the variation of information metric [117] between the inferred grouping Ĉ and the ground-truth grouping C, namely VI(Ĉ, C). Since VI(Ĉ, C) is sensitive to the number of groups in Ĉ, we contrast it with VI(C̃, C), where C̃ is a random grouping with the same number of groups as Ĉ. Eventually, we evaluate Ĉ via the VI difference, namely VI(C̃, C) − VI(Ĉ, C). A larger VI difference indicates a more informative grouping yielded by our Bayesian estimator. Because we obtain one grouping in each of the T MCMC steps, we average the VI differences over the T steps.
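The VI metric itself can be computed directly from label co-occurrence counts; a minimal sketch (with illustrative groupings) is:

```python
import math
from collections import Counter

def variation_of_information(c1, c2):
    """VI(C1, C2) = H(C1) + H(C2) - 2*I(C1; C2) between two groupings,
    each given as a list of group labels."""
    n = len(c1)
    p1, p2 = Counter(c1), Counter(c2)
    joint = Counter(zip(c1, c2))
    h1 = -sum(v / n * math.log(v / n) for v in p1.values())
    h2 = -sum(v / n * math.log(v / n) for v in p2.values())
    mi = sum(v / n * math.log((v / n) / (p1[a] / n * p2[b] / n))
             for (a, b), v in joint.items())
    return h1 + h2 - 2.0 * mi

truth    = [0, 0, 1, 1, 2, 2]
inferred = [5, 5, 7, 7, 9, 9]    # same partition, different labels: VI ~ 0
coarse   = [0, 0, 0, 0, 0, 0]    # everything in one group: VI = H(truth)
print(variation_of_information(truth, inferred),
      variation_of_information(truth, coarse))
```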


Figure 7.1: Performance of the grouping-blind MLE, the oracle MLE and our Bayesian estimator on tree-structure MRFs in terms of (a) error of estimate and (b) log-likelihood of test data. Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimator and a random grouping.


7.4.1 Simulations on Tree-structure MRFs

For the structure of the MRF, we choose a perfect binary tree of height 12 (i.e. 8,191 nodes and

8,190 edges). We assume there are 25 groups among the 8,190 parameters. The base distribution

G0 is Unif(0, 1). We first generate the true parameters for the 25 groups from Unif(0, 1). We

then randomly assign each of the 8,190 parameters to one of the 25 groups. We then generate

1,000 testing samples and n training samples (n=100, 200, ..., 1,000). Eventually, we apply the

grouping-blind MLE, the oracle MLE, and our grouping-aware Bayesian estimator on the training

samples. For tree-structure MRFs, both MLE and Bayesian estimation have a closed form solu-

tion. For the Bayesian estimator, we set the number of Gibbs sampling steps to be 500 and set

α0=1.0. We replicate the experiment 500 times, and the averaged results are in Figure 7.1.

Our grouping-aware Bayesian estimator has a lower estimate error and a higher log likelihood

of test data, compared with the grouping-blind MLE, demonstrating the “blessing of abstraction”.

Our Bayesian estimator performs worse than oracle MLE, as we expect. In addition, as the training

sample size increases, the performance of our Bayesian estimator approaches that of the oracle

MLE. The VI difference in Figure 7.1(c) indicates that the Bayesian estimator also recovers the

latent grouping to some extent, and the inferred groupings become more and more reliable as the

training size increases. The number of groups inferred by the Bayesian estimator and its running

time are in Figure 7.2.

7.4.2 Simulations on Small Grid-MRFs

For the structure of the MRF, we choose a 4×4 grid with 16 nodes and 24 edges. Exact likeli-

hood is tractable in this small model, which allows us to investigate how good the two types of

approximation are. We apply the grouping-blind MLE (the PCD algorithm), the oracle MLE (the

PCD algorithm with the parameters from same group tied) and three Bayesian estimators: Gibbs

sampling with exact likelihood computation (Gibbs ExactL), Metropolis-Hastings with auxiliary

variables (MH AuxVar), and Gibbs sampling with stripped Beta approximation (Gibbs SBA). We

assume there are five parameter groups. The base distribution is Unif(0, 1). We first generate



Figure 7.2: Number of groups inferred by the Bayesian estimator and its run time.



Figure 7.3: The number of groups inferred by Gibbs ExactL, MH AuxVar and Gibbs SBA.

the true parameters for the five groups from Unif(0, 1). We then randomly assign each of the

24 parameters to one of the five groups. We then generate 1,000 testing samples and n training

samples (n=100, 200, ..., 1,000). For Gibbs ExactL and Gibbs SBA, we set the number of Gibbs

sampling steps to be 100. For MH AuxVar, we set the number of MH steps to be 500 and its pro-

posal number M to be 5. The parameter σQ in Section 7.3.1 is set to be 0.001 and the parameter

S is set to be 100. For all three Bayesian estimators, we set α0=1.0. We replicate the experiment

50 times, and the averaged results are in Figure 7.4.

Our grouping-aware Bayesian estimators have a lower estimate error and a higher log like-

lihood of test data, compared with the grouping-blind MLE, demonstrating the blessing of ab-

straction. All three Bayesian estimators perform worse than oracle MLE, as we expect. The VI

difference in Figure 7.4(c) indicates that the Bayesian estimators also recover the grouping to

some extent, and the inferred groupings become more and more reliable as the training size in-

creases. In Figure 7.3, we provide the boxplots of the number of groups inferred by Gibbs ExactL,

MH AuxVar and Gibbs SBA. All three methods recover a reasonable number of groups, and

Gibbs SBA slightly over-estimates the number of groups.

Among the three Bayesian estimators, Gibbs ExactL has the lowest estimate error and the

highest log likelihood of test data. Gibbs SBA also performs well, with performance close to that of Gibbs ExactL. MH AuxVar performs slightly worse, especially when there is less training data. However, MH AuxVar recovers better groupings than Gibbs SBA


Figure 7.4: Performance of grouping-blind MLE, oracle MLE, Gibbs ExactL, MH AuxVar, and Gibbs SBA on the small grid-structure MRFs in terms of (a) error of estimate and (b) log-likelihood of test data. Subfigure (c) shows the VI difference between the groupings yielded by our Bayesian estimators and random groupings.


                n=100       n=500       n=1,000
Gibbs_ExactL    88,136.3    91,055.0    92,503.4
MH_AuxVar          540.2     3,342.2     4,546.7
Gibbs_SBA            8.1        10.8        14.2

Table 7.1: The run time (in seconds) of Gibbs ExactL, MH AuxVar and Gibbs SBA when training size is n.

when there are more training data. The run times of the three Bayesian estimators are listed in Ta-

ble 7.1. Gibbs ExactL has a computational complexity that is exponential in the dimensionality d,

and cannot be applied to situations when d > 20. MH AuxVar is also computationally intensive

because it has to generate expensive particles. Gibbs SBA runs fast, with its burden mainly from

running PCD under a specific grouping in each Gibbs sampling step, and it scales well.

7.4.3 Simulations on Large Grid-MRFs

The large grid consists of 30 rows and 30 columns (i.e. 900 nodes and 1,740 edges). Exact like-

lihood is intractable for this large model, and we cannot run Gibbs ExactL. The high dimension

also prohibits MH AuxVar. Therefore, we only run the Gibbs SBA algorithm on this large grid-

structure MRF. We assume that there are 10 groups among the 1,740 parameters. We also evaluate

the estimators by the log pseudo-likelihood of testing data. The other settings of the experiments

stay the same as Section 7.4.2. We replicate the experiment 50 times, and the averaged results are

in Figure 7.5.

For all 10 training sets, our Bayesian estimator Gibbs SBA has a lower estimate error and a

higher log likelihood of test data, compared with the grouping-blind MLE (via the PCD algorithm).

Gibbs SBA has a higher estimate error and a lower pseudo-likelihood of test data than the oracle

MLE. The VI difference in Figure 7.5(c) indicates that Gibbs SBA gradually recovers the grouping

as the training size increases. The number of groups inferred by Gibbs SBA and its running

time are provided in Figure 7.6. Similarly to the observation in Section 7.4.2, Gibbs SBA over-

estimates the number of groups. Gibbs SBA finishes the simulations on 900 nodes and 1,740


Figure 7.5: Performance of the grouping-blind MLE, the oracle MLE and the Bayesian estimator (Gibbs SBA) on large grid-structure MRFs in terms of (a) error of estimate and (b) log pseudo-likelihood of test data. Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimator and a random grouping.



Figure 7.6: Number of groups inferred by Gibbs SBA and its run time.


          LPL-train                LPL-test
          MLE        Gibbs_SBA     MLE         Gibbs_SBA    # groups   Runtime (mins)
Exp 1     -10716.75  -10721.34     -9022.01    -8989.87     7.89       204
Exp 2     -8306.17   -8322.34      -11490.47   -11446.45    7.29       183

Table 7.2: Log pseudo-likelihood (LPL) of training and testing data from MLE (PCD) and Bayesian estimate (Gibbs SBA), the number of groups inferred by Gibbs SBA, and its run time in the Senate voting experiments.

edges in hundreds of minutes (depending on the training size), which is fast for an MRF of this size.

7.5 Real-world Application

We apply the Gibbs SBA algorithm on US Senate voting data from the 109th Congress (available

at www.senate.gov). The 109th Congress has two sessions, the first session in 2005 and the second

session in 2006. There are 366 votes and 278 votes in the two sessions, respectively. There are 100

senators in both sessions, but Senator Corzine only served the first session and Senator Menendez

only served the second session. We remove them. In total, we have 99 senators in our experiments,

and we treat the votes of the 99 senators as the 99 variables in the MRF. We only consider contested votes; that is, we remove the votes with fewer than ten or more than ninety supporters.

In total, there are 292 votes and 221 votes left in the two sessions, respectively. The structure of

the MRF is from Figure 13 in [7]. There are in total 279 edges. The votes are coded as −1 for no

and 1 for yes. We replace all missing votes with −1, staying consistent with [7]. We perform two

experiments. First, we train the MRF using the first session data, and test on the second session

data. Then, we train on the second session and test on the first session. We compare our Bayesian

estimator (via Gibbs SBA) and MLE (via PCD) by the log pseudo-likelihood of testing data since

exact likelihood is intractable. We set the number of Gibbs sampling steps to be 3,000. Both experiments finish in around three hours on a single CPU. The results are summarized

in Table 7.2. In the first experiment, the log pseudo-likelihood of test data is−9022.01 from MLE,

whereas it is −8989.87 from our Bayesian estimate. In the second experiment, the log pseudo-


likelihood of test data is −11490.47 from MLE, whereas it is −11446.45 from our Bayesian

estimate. The increase of log pseudo-likelihood is comparable to the increase of log (pseudo-

)likelihood we gain in the simulations (please refer to Figures 7.1b, 7.4b and 7.5b at the points

when we simulate 200 and 300 training samples). Both experiments indicate that the models

trained with the Gibbs SBA algorithm generalize considerably better than the models trained with

MLE. Gibbs SBA also infers there are around eight different types of relations among the senators.

The estimated parameters in the two models are consistent.
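The preprocessing just described (keeping only contested votes and coding them as ±1, with missing votes coded −1) can be sketched as follows; the toy vote matrix and the function name are illustrative, not the actual pipeline:

```python
def preprocess_votes(vote_matrix, low=10, high=90):
    """Code each roll-call vote as +1 (yes) / -1 (no or missing) and keep
    only contested votes, i.e. those with between low and high supporters."""
    kept = []
    for vote in vote_matrix:
        coded = [1 if v == 'y' else -1 for v in vote]   # missing -> -1, as in [7]
        supporters = sum(1 for v in coded if v == 1)
        if low <= supporters <= high:
            kept.append(coded)
    return kept

votes = [['y'] * 95 + ['n'] * 4 + [None],    # near-unanimous: removed
         ['y'] * 50 + ['n'] * 49 + [None]]   # contested: kept
print(len(preprocess_votes(votes)))          # prints 1
```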

7.6 Discussion

Bayesian nonparametric approaches [135, 65], such as the Dirichlet process [48], provide an ele-

gant way of modeling mixtures with an unknown number of components. These approaches have

yielded advances in different machine learning areas, such as the infinite Gaussian mixture models

[145], the infinite mixture of Gaussian processes [146], infinite HMMs [9, 54], infinite HMRFs

[29], DP-nonlinear models [161], DP-mixture GLMs [77], infinite SVMs [217, 216], and the in-

finite latent attribute models [137]. In this chapter, we play the same trick of replacing the prior

distribution with a prior stochastic process to accommodate our uncertainty about the number of

parameter groups. To the best of our knowledge, this is the first time a Bayesian nonparamet-

ric approach is applied to models whose likelihood is intractable. Accordingly, we propose two

types of approximation, namely a Metropolis-Hastings algorithm with auxiliary variables and a

Gibbs sampling algorithm with stripped Beta approximation. Both algorithms show superior per-

formance over conventional MLE, and Gibbs SBA can also scale well to large-scale MRFs. The

Markov chains in both algorithms are ergodic, but may not satisfy detailed balance because we rely on approximations. Thus, both algorithms are guaranteed to converge for general MRFs, though not necessarily to exactly the target distribution.

In this chapter, we only consider the situation where the potential functions are pairwise and

there is only one parameter in each potential function. For graphical models with more than

one parameter in the potential functions, it is appropriate to group the parameters on the level


of potential functions. A more sophisticated base distribution G0 (such as some multivariate

distribution) needs to be considered. In this chapter, we also assume the structures of the MRFs

are given. When the structures are unknown, we still need to perform structure learning. Allowing

structure learners to automatically identify structure modules will be another interesting topic to explore in future research.

The material in this chapter first appeared in the Advances in Neural Information Processing

Systems (NIPS’2013) as follows:

Jie Liu and David Page. Bayesian Estimation of Latently-grouped Parameters in Undirected

Graphical Models. Advances in Neural Information Processing Systems (NIPS), 2013.

Chapters 3, 4 and 5 focus on the statistical inference aspect of GWAS with the help of graphical

models. Chapters 6 and 7 further discuss issues related to learning graphical models. The next

chapter shifts gears to the application aspect of GWAS, namely the clinical translation of GWAS

discoveries to personalized breast cancer diagnosis.


Chapter 8

Genetic Variants Improve Personalized

Breast Cancer Diagnosis

Recently, a number of genome-wide association studies have identified genetic variants associated

with breast cancer. However, the degree to which these genetic variants improve breast cancer

diagnosis in concert with mammography remains unknown. We conducted a retrospective case-

control study, collecting mammographic findings and high-frequency/low-penetrance genetic vari-

ants from an existing personalized medicine data repository. A Bayesian network was developed

on the mammographic findings, with and without the genetic variants collected. We analyzed the

predictive performance using the area under the ROC curve, and found that the genetic variants

significantly improved breast cancer diagnosis on mammograms.

8.1 Introduction

Large multi-relational databases containing variables that confer disease risk are increasingly

available, providing the opportunity for informatics tools to better stratify individuals for appropri-

ate healthcare decisions and explore disease mechanism and behavior. Coincident with this, policymakers have recommended that interventions, like breast cancer screening with mammography,


be increasingly based on individualized risk and shared decision-making [132, 158]. Targeting at-risk individuals for intervention after mammographic screening has the potential to decrease

recommendations for breast biopsy in women most likely to have an unnecessary procedure for

benign findings. Recent large-scale genome-wide association studies have identified 77 suscepti-

bility loci associated with breast cancer. In addition, there is a long history of development and

codification of features observed by radiologists on mammography that also predict a woman’s

risk of breast cancer. However, genetics and mammography abnormality findings have not yet

been used together to predict risk. Furthermore, the opportunity to use this data to interpret geno-

type/phenotype association, explain family aggregation of breast cancer, and shed light on disease

mechanism or natural history is just becoming possible.

There have been several attempts to incorporate these genetic variants into the Gail model [57]

which is a standard clinical breast cancer risk model including the number of first-degree relatives

with a diagnosis of breast cancer, age at menarche, age at first live birth and the number of previous

breast biopsies. Seven associated SNPs, when added to the Gail model, increase the area under the

receiver operating characteristic (ROC) curve from 0.607 to 0.632 [55, 56]. When ten associated

SNPs are added to the Gail model, the area under the ROC curve of the risk model increases

from 0.580 to 0.618 on another dataset [186]. However, the Gail model does not include any

mammography features which are clinically used by radiologists. Therefore, it is still unknown

how much these genetic variants improve breast cancer diagnosis and clinical decision-making

after an abnormal mammogram.

The main purpose of this chapter is to examine the impact of genetic information on im-

proving breast cancer risk prediction on mammograms. We incorporate genetic polymorphisms

with the descriptors that radiologists observe on mammograms while making medical decisions,

the American College of Radiology Breast Imaging Reporting and Data System (BI-RADS) [1],

version 4, including the shape and the margin of masses, the shape and the distribution of micro-

calcifications, background breast density and other associated findings as defined by this standard

lexicon in breast imaging. We also include a small number of predictive variables not included in


BI-RADS currently. Specifically, we employ these mammographic findings (49 mammography

descriptors) and the 77 genetic variants associated with breast cancer in a personalized medicine

data repository at the Marshfield Clinic. We train a Bayesian network on the mammographic

findings, with and without the 77 genetic variants.

8.2 Materials and Methods

8.2.1 Data

Subjects

The Personalized Medicine Research Project [115] at the Marshfield Clinic was used as the sam-

pling frame to identify breast cancer cases and controls. The project was reviewed and approved

by the Marshfield Clinic IRB. Subjects were selected using clinical data from Marshfield Clinic

Cancer Registry and Data Warehouse. We employed a retrospective case-control design. Women

with a plasma sample available, a mammogram, and a breast biopsy within 12 months after the

mammogram were included in the study. Cases were defined as women having a confirmed di-

agnosis of breast cancer obtained from the institutional cancer registry. Controls were confirmed

through the electronic medical records (and absence from the cancer registry) as never having had

a breast cancer diagnosis. In our case cohort, we included both invasive breast cancer (ductal

and lobular) as well as ductal carcinoma in situ. In order to construct case and control cohorts

that were similar in age distribution, we employed an age matching strategy. Specifically, we se-

lected a control whose age was within five years of the age of each case. Of note, we decided to

focus on high-frequency/low-penetrance genes that affect breast cancer risk as opposed to low fre-

quency genes with high penetrance (BRCA1 and BRCA2) or intermediate penetrance (CHEK-2).

High-frequency/low-penetrance SNPs generally have frequencies for the rarest allele of > 25%, as opposed to the low-frequency, high-penetrance mutations with population frequencies of < 1%. We excluded individuals who had a known high-penetrance genetic mutation.


Genetic Variants

Our study included 77 genetic variants which have been identified by recent large-scale genome-

wide association studies. Table 1 in [106] summarizes detailed information about the 77 SNPs,

including the IDs, the original publications associating them with breast cancer and their chro-

mosomes. The seven SNPs used in Gail study [55, 56] were also included in our study. Nine of

the ten SNPs used in Wacholder et al study [186] were included in our study, and the remaining

SNP rs7716600 from that study had a proxy rs10941679 in our study. We observed that each SNP

only confers a slight increase or decrease in the risk of breast cancer, in accordance with prior

literature. Among the 77 SNPs, 22 were evaluated in the previous study [105]. Among the 55

new SNPs, 41 were identified by COGS [120], and 14 SNPs were included based on several other

recent studies [193, 165, 162, 68, 50, 180, 2]. It is estimated that the current list of SNPs explains

14% of familial breast cancer risk [120].

Mammography Features

The American College of Radiology developed the BI-RADS lexicon [1] to homogenize mammo-

graphic findings and recommendations. The BI-RADS lexicon consists of a number of mammog-

raphy descriptors, including the characteristics of masses and microcalcifications, background

breast density and other associated findings, which can be organized in a hierarchy as shown

in Figure 8.1. Datasets containing mammography descriptors have been used to build several

successful breast cancer risk models and classifiers [6, 22]. Mammography data was originally

recorded as free text reports in the Marshfield database, and thus it was difficult to directly access

the information contained therein. We used a parser to extract mammography features from the

text reports; the parser has been shown to outperform manual extraction [130, 140]. After ex-

traction, every mammography feature takes the value “present” or “not present” except that the

variable mass size is discretized into three values, “not present”, “small” and “large”, depending

on whether there is a reported mass size and whether any dimension of the reported mass size is larger
than 30 mm.
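As a concrete illustration, the discretization rule just described can be written as follows (a minimal sketch; the function name and input encoding are our own assumptions, while the 30 mm threshold comes from the text):

```python
def discretize_mass_size(reported_dims_mm):
    """Map a reported mass size to one of three values.

    reported_dims_mm: the reported dimensions of the mass in mm, or
    None/empty when no mass size was reported. Per the rule in the
    text, a mass is "large" when any dimension exceeds 30 mm.
    """
    if not reported_dims_mm:
        return "not present"
    return "large" if max(reported_dims_mm) > 30 else "small"
```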


Figure 8.1: Mammography features adopted from the American College of Radiology (BI-RADS lexicon). [The figure organizes the 49 BI-RADS descriptors into a hierarchy: mass (shape, margins, density), calcifications (morphology, distribution), associated findings, special cases, and breast composition; starred features (size and palpability) are predictive features not included in BI-RADS.]


Each mammogram also has a BI-RADS category assigned by the radiologist who read the

mammogram. The BI-RADS category indicates the radiologist’s opinion of the absence or pres-

ence of breast cancer. In our study, the BI-RADS assessment category can take the values 1, 2, 3,
0, 4a, 4, 4b, 4c and 5, in order of increasing probability of malignancy. We used the

BI-RADS assessment category as the predictions from the radiologists. Our experiment only in-

cluded diagnostic mammograms, and all the screening mammograms were excluded. Since most

of the subjects have multiple diagnostic mammograms in the electronic medical records, we se-

lected one mammogram for each subject as follows, to mimic the scenario of the most important

doctor visit before diagnosis. For cases, we selected the mammograms within one year prior to di-

agnosis. For controls, we selected the mammograms within one year prior to biopsy. If there were

still multiple mammograms left for each subject, we selected the mammogram with a more suspi-

cious BI-RADS category, with subsequent tiebreakers being, in order, recency and the number of

extracted mammography features.
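The per-subject selection just described amounts to a lexicographic maximum over (suspicion, recency, feature count); here is a minimal sketch, assuming each mammogram record carries a BI-RADS category, an ISO date string, and a count of extracted features (the record field names are our own):

```python
# Order of increasing suspicion for BI-RADS categories, as given in the text.
SUSPICION = {c: i for i, c in enumerate(["1", "2", "3", "0", "4a", "4", "4b", "4c", "5"])}

def select_mammogram(mammograms):
    """Pick one diagnostic mammogram per subject: the most suspicious
    BI-RADS category first, then the most recent date, then the most
    extracted features. Each mammogram is a dict with keys 'birads',
    'date' (ISO string, so lexicographic order is chronological), and
    'n_features'."""
    return max(mammograms,
               key=lambda m: (SUSPICION[m["birads"]], m["date"], m["n_features"]))
```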

8.2.2 Model

We built breast cancer diagnosis models using Naive Bayes, which can be regarded as the weighted

average of risk factors. Naive Bayes assumes that all features are conditionally independent of one

another given the class [111]. Although this assumption seems strong, it generally works well in

practical problems and provides easy interpretation of the risk contribution from different factors.

In our experiments, we used the Naive Bayes implementation in WEKA [75].
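To make the model family concrete, here is a minimal categorical Naive Bayes with Laplace smoothing (an illustrative sketch only; the experiments themselves used WEKA's implementation, whose smoothing details may differ):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels, alpha=1.0):
    """Fit a categorical Naive Bayes: class priors plus, per feature,
    smoothed counts for P(value | class). Each row is a dict mapping
    feature name -> categorical value."""
    priors = Counter(labels)
    counts = defaultdict(Counter)   # (feature, class) -> value counts
    values = defaultdict(set)       # feature -> set of observed values
    for row, y in zip(rows, labels):
        for f, v in row.items():
            counts[(f, y)][v] += 1
            values[f].add(v)
    return priors, counts, values, alpha

def predict_nb(model, row):
    """Posterior P(class | row), proportional to the class prior times
    the product of Laplace-smoothed conditional likelihoods."""
    priors, counts, values, alpha = model
    n = sum(priors.values())
    scores = {}
    for y, ny in priors.items():
        p = ny / n
        for f, v in row.items():
            p *= (counts[(f, y)][v] + alpha) / (ny + alpha * len(values[f]))
        scores[y] = p
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}
```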

In total, we constructed three types of models on different sets of features. The first model was
built purely on the 49 mammography features; we call it the Breast Imaging model. The second type
of model was based purely on genetic variants; we call these the genetic models. Since we would like

to align our study with previous work [105], we tested three sets of genetic variants. The first set

consisted of the 10 SNPs in [186]. The second included the 22 SNPs in the study [105]. The last

set was our full list of the 77 SNPs. We denote the three genetic models as Genetic-10, Genetic-22

and Genetic-77 models. The third type of model was built on the 49 mammography features and


the genetic variants together; we call these the combined models. Since we had three sets of genetic

variants with different sizes, we had three combined models, namely Combined-10, Combined-22

and Combined-77 models. In both the genetic models and the combined models, we handled the

genetic variants in the following way rather than using the original genotypes of each SNP. We only
introduced one additional variable: the total count of risk alleles the person carries in their DNA.

This way of coding genetic variants was used in several models such as [186], and is helpful to

build risk models when each SNP only has a small contribution to the risk.
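The allele-count coding can be sketched as follows (illustrative; the genotype encoding and names are our own assumptions, not the thesis code):

```python
def risk_allele_count(genotypes, risk_alleles):
    """Total count of risk alleles across SNPs.

    genotypes: dict mapping SNP id -> genotype string, e.g. 'AG';
    risk_alleles: dict mapping SNP id -> the risk allele for that SNP.
    Each SNP contributes 0, 1, or 2 copies to the total, which becomes
    the single extra feature in the model."""
    return sum(genotypes[snp].count(risk_alleles[snp]) for snp in risk_alleles)
```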

We treated the BI-RADS category scores assigned by the radiologists as the radiologists' predictions,
which we call the baseline clinical assessment. We constructed ROC curves for each model, and

used the area under the curve (AUC) as a measure of performance. We also provided the precision-

recall (PR) curves for the models. We evaluated the models using 10-fold cross-validation.

8.3 Results

We identified 362 cases and 377 controls. Among the cases, there were 358 Caucasians, three

non-Caucasians and one case whose race information was unknown. Among the controls, there

were 373 Caucasians and four non-Caucasians. We do not disclose the race/ethnicity information

of these non-Caucasians due to privacy concerns. Subject characteristics including age distribution

and family history of breast cancer are described in Table 8.1. There were more young people

(age < 50) in the case group than in the control group, and the proportion of elderly people (age

≥ 65) was roughly the same in the case group and in the control group. For the family history of

breast cancer, we observed a considerably larger proportion of people with family history in the
case group (45.3%) than in the control group (33.7%), which demonstrates the familial aggregation

of breast cancer.


AGE              CASES        CONTROLS     ALL
< 50             81 (22.4%)   58 (15.4%)   139 (18.8%)
≥ 50, < 65       123 (34.0%)  168 (44.6%)  291 (39.4%)
≥ 65             158 (43.6%)  151 (40.0%)  309 (41.8%)

FAMILY HISTORY   CASES        CONTROLS     ALL
YES              164 (45.3%)  127 (33.7%)  291 (39.4%)
NO               188 (51.9%)  236 (62.6%)  424 (57.4%)
UNKNOWN          10 (2.8%)    14 (3.7%)    24 (3.2%)

Table 8.1: The distribution of age at mammogram and family breast cancer history in the cases and the controls.

Figure 8.2: The ROC curves and PR curves for the baseline clinical assessment, the Breast Imaging model and the three combined models. [Plots omitted; area under the ROC curve: Combined-77 (0.760), Combined-22 (0.733), Combined-10 (0.712), Breast Imaging model (0.693); area under the PR curve: Combined-77 (0.775), Combined-22 (0.754), Combined-10 (0.739), Breast Imaging model (0.730).]

8.3.1 Performance of Combined Models

The ROC and the PR curves for the baseline clinical assessment, the Breast Imaging model and

the three combined models are provided in Figure 8.2. For each model, we vertically average [47]

the ROC curves from the ten replications of the 10-fold cross-validation to obtain the final curve;

we do likewise for the PR curves. The area under the ROC curves for the Breast Imaging model,


Figure 8.3: The ROC and PR curves for the three genetic models. [Plots omitted; area under the ROC curve: Genetic-77 (0.684), Genetic-22 (0.622), Genetic-10 (0.591); area under the PR curve: Genetic-77 (0.668), Genetic-22 (0.613), Genetic-10 (0.578).]

the Combined-10 model, the Combined-22 model and the Combined-77 model are 0.693, 0.712,

0.733 and 0.760. The ROC curve of the Combined-77 model almost completely dominates the

ROC curve of the Breast Imaging model, which suggests that the 77 genetic variants can help to

improve breast cancer diagnosis based on mammographic findings. We perform a two-sided paired

t-test on the area under the ten ROC curves of the Breast Imaging model and the area under the ten

ROC curves of the combined model from the 10-fold cross-validation, and the difference between

them is significant with a P-value of 0.00047. We further compare the AUROC of the Combined-77 model and the Combined-22 model with a two-sided paired t-test, and the difference between

them is significant with a P-value of 0.0046, which demonstrates the discriminative power of the

55 recently identified SNPs. From the PR curves, we note that the combined models dominate the
Breast Imaging model and the baseline clinical assessment in the high-recall region (> 0.8), which
is the region in which clinicians operate and which we therefore want to optimize.
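The vertical averaging used to pool the per-fold curves can be sketched as follows (a pure-Python illustration of the standard construction; the grid size and function names are our own, and the thesis implementation may differ in detail):

```python
def interp(x, xs, ys):
    """Piecewise-linear interpolation of the curve (xs, ys) at x;
    xs is nondecreasing."""
    if x <= xs[0]:
        return ys[0]
    pts = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1 and x1 > x0:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return ys[-1]

def vertical_average_roc(curves, n_points=101):
    """Vertically average ROC curves: at each false-positive rate on a
    common grid, average the true-positive rates of the per-fold
    curves. Each curve is a pair (fpr, tpr) of nondecreasing lists."""
    grid = [i / (n_points - 1) for i in range(n_points)]
    avg = [sum(interp(x, fpr, tpr) for fpr, tpr in curves) / len(curves)
           for x in grid]
    return grid, avg
```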


Figure 8.4: The ROC curves and PR curves for the Breast Imaging model, the Genetic-77 model and the Combined-77 model. [Plots omitted; area under the ROC curve: Combined-77 (0.760), Breast Imaging model (0.693), Genetic-77 (0.684); area under the PR curve: Combined-77 (0.775), Breast Imaging model (0.730), Genetic-77 (0.668).]

8.3.2 Performance of Genetic Models

Furthermore, we compare the discriminative power of the three genetic models, namely the Genetic-

10 model, the Genetic-22 model and the Genetic-77 model. The ROC curves and the PR curves

for the three genetic models are provided in Figure 8.3. For each model, we vertically average the curves from the 10-fold cross-validation to obtain the final curve. The area under

the ROC curves for the Genetic-10 model, the Genetic-22 model and the Genetic-77 model are

0.591, 0.622 and 0.684, which demonstrates that the more associated SNPs the genetic model

includes, the more discriminative the model becomes. We also use a two-sided paired t-test to

compare the area under the ROC curves yielded by the three genetic models. The Genetic-77

model outperforms both the Genetic-22 model (P=0.028) and the Genetic-10 model (P=0.0068).
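The paired t-tests in this section compare the per-fold AUCs of two models; the test statistic can be sketched as follows (illustrative only; in practice a standard routine such as scipy.stats.ttest_rel computes the two-sided P-value from this statistic):

```python
import math

def paired_t_statistic(auc_a, auc_b):
    """Paired t statistic for per-fold AUC differences between two
    models evaluated on the same cross-validation folds. Returns
    (t, degrees_of_freedom); the two-sided P-value is read from the
    t distribution with that many degrees of freedom."""
    d = [a - b for a, b in zip(auc_a, auc_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1
```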

8.3.3 Comparing Breast Imaging Model and Genetic Model

We compare the performance of the Breast Imaging model, the Genetic-77 model and the Combined-

77 model. The corresponding ROC curves and the PR curves for the three models are shown in


Figure 8.4. We observe that the mammography features are more predictive for women with a

high probability of cancer (low FPR region in ROC space) whereas genetic variants are more pre-

dictive for women with a low probability of cancer (mid/high FPR region in ROC space). Note

that the Genetic-77 model describes the patient's inherited breast cancer risk in DNA. However,

after the patient starts developing malignant features on mammograms, mammographic findings

(Breast Imaging model) provide superior discrimination. Still, knowing the genetic information

can further improve the accuracy of breast cancer diagnosis even at higher baseline risk.

8.4 Discussion

The primary contribution of this chapter is to show that the genetic variants can significantly

improve breast cancer diagnosis based on mammographic findings, resulting in fewer false positives
and a reduced risk of overdiagnosis. This result indicates promise for translating discoveries from

massive collaborative GWAS into clinical breast cancer diagnosis. Our study includes the most up-

to-date breast cancer associated SNPs, the majority identified and/or verified through the massive

COGS (over 55k cases and over 54k controls), and therefore these new SNPs are credible and can

explain a larger proportion of familial breast cancer risk. Indeed, we observe that the Combined-

77 model significantly outperforms the Combined-22 model used in our previous study [105].

We also demonstrate that the Genetic-77 model significantly outperforms the Genetic-22 model.

The increased discriminative power derived from the new 55 SNPs identified by recently published

studies [186] highlights the rapid progress the breast cancer GWAS community has made since

2010. Furthermore, we make a novel discovery that mammography features are more predictive

for high-risk women whereas genetic variants are more predictive for low-risk women, which

explains the benefit of combining genetic variants and mammographic findings for personalized

breast cancer diagnosis.

Our study differs in an important way from the previous study of Wacholder et al. (2010) [186]

which adds ten genetic variants to the Gail model, a risk model based on self-reported demographic

and personal risk factors. The unique contribution in our study is that we include mammography


features which represent richer phenotypic data directly relevant to breast cancer diagnosis and

thus provide high signal. Therefore, our study demonstrates the potential clinical impact of translating exciting discoveries from GWAS to the patient experience at diagnosis. The additional

discriminative power from these genetic variants can help rule out false positives from mammography screening, and therefore has the potential to decrease recommendations for unneces-

sary breast biopsies. Of course, it will be interesting to combine the epidemiology features in the Gail

model, the mammography features and the SNPs for more accurate personalized breast cancer

diagnosis.

Limitations of our study include small sample size and the pitfalls of data extraction from text

reports. We understand that parsing mammography features from text reports may introduce noise

into the data. However, despite the challenges inherent in extracting accurate data, which may

affect our results, we are encouraged that improvements in predictive accuracy remain, especially

after observing the discriminative power of genetic factors alone in the genetic models. Further-

more, we recognize that methodological issues in our study may represent shortcomings but also

signify opportunities for future investigation. First, we do not explicitly model how individual

SNPs function to alter breast cancer risk, nor do we model potential SNP interactions [181]. Our

current model only adds one extra feature which simply counts the total number of risk alleles,
assuming that the effect sizes of the genetic variants are the same and that their genetic effect is
non-mechanistic and additive. We do not model the individual SNPs because of the curse of
dimensionality; each individual SNP only confers a fairly mild relative risk, and

if we model them individually, the model will perform poorly on test data unless a larger cohort

of training data is available. Modeling SNP-SNP interaction is even harder and requires more

training data.

Second, we do not differentiate the different subtypes of breast cancers (for example, the

estrogen-receptor status and progesterone-receptor status) in the current study. Breast cancer is

a complex and heterogeneous disease with different subtypes, including two main subtypes of

estrogen receptor (ER) negative tumors (basal-like and human epidermal growth factor receptor-2


positive/ER- subtype) and at least two types of ER positive tumors (luminal A and luminal B)

[24, 141]. These molecular subtypes are important predictors of breast cancer mortality [79] and

have different genetic susceptibility [59]. Therefore it is desirable to tease them apart in the pursuit

of increasingly personalized breast cancer care.

Nevertheless, we are encouraged by these promising results in our current study, especially

after the disappointment [70] and caution [96] in the early years of translating GWAS discoveries

to personalized risk prediction. We hope that the rapid progress being made through these massive

collaborative studies together with our growing knowledge about breast cancer mechanisms and

genotype-phenotype relationships will bring us even closer to the practical personalized breast

cancer diagnosis and treatment.

The material in this chapter first appeared in AMIA’2013 and AMIA-TBI’2014 as follows:

Jie Liu, David Page, Houssam Nassif, Jude Shavlik, Peggy Peissig, Catherine McCarty, Ade-

dayo A. Onitilo and Elizabeth Burnside. Genetic Variants Improve Breast Cancer Risk Prediction

on Mammograms. American Medical Informatics Association Symposium (AMIA), 2013.

Jie Liu, David Page, Peggy Peissig, Catherine McCarty, Adedayo A. Onitilo, Amy Trentham-

Dietz and Elizabeth Burnside. New Genetic Variants Improve Personalized Breast Cancer Diag-

nosis. AMIA Summit on Translational Bioinformatics (AMIA-TBI), 2014.


Chapter 9

Future Work

This thesis develops approaches for statistical and probabilistic methods which are designed for

genome-wide genetic variation data (single-nucleotide polymorphisms), and will become insuf-

ficient for the forthcoming next-generation genomics data due to their lack of scalability and in-

ability to handle heterogeneous structured data from multiple sources. Therefore, it is urgent to

extend these methods in synchronization with the rapid development of biotechnologies. There

are three important directions for future research on the next generation genomics data, including

integration, probabilistic modeling and statistical inference.

First, there is an emerging problem related to big data analysis: integration methods for multi-

source, multi-assay next generation genomics data. Nowadays, biotechnology is moving forward

at a speed much faster than our capacity to process and understand it. On one side, the volume

of the generated data keeps increasing, introducing the ever-worsening data-rich/information-poor

dilemma. On the other side, these new technologies bring in new types of genomics data from

new perspectives, making it extremely challenging to analyze the data jointly and coherently. For

example, the next generation sequencing technologies will assay a host of meta-genomic features,

including DNA methylation, nucleosome position, binding of transcription factors to genomic

DNA, histone modifications and the 3D structure of the DNA. On the transcriptomics and pro-

teomics levels, expression data such as RNA-seq data and mass spectrometry data, respectively,


provide more comprehensive molecular portraits of cells. How to analyze the biological data from

all these different types of data, in isolation and in combination, to better understand basic molec-

ular biology will be extremely important and exciting. One key step is to integrate the information

from individual types of assays, not only producing a representation containing all important infor-

mation from the original individual platforms, but also preserving the inter-platform information

and facilitating downstream analysis. This area is new and extremely important for analyzing next

generation panomic data at the right granularity.

Second, it is desirable to develop machine learning methods, especially probabilistic methods

that are useful for modeling heterogeneous, hierarchical and dynamic information within struc-

tured data. Graphical models are powerful and elegant tools modeling joint probabilities by com-

pactly embedding the structured dependencies among random variables, and have been one of the

most popular areas in machine learning in the last 20 years. However, there are a few new chal-

lenges in learning graphical models from genomics, transcriptomics and proteomics data (even

after the data have been integrated), such as (1) capturing heterogeneity in the data from different

cell types, different types of assays, and different species; (2) adaptively recovering the hierar-

chical structure embedded with genomics, transcriptomics and proteomics data; (3) modeling dy-

namic information such as time series of cellular process. In order to deal with heterogeneity and

hierarchical structures, Bayesian techniques and nonparametric approaches can be used to strike

a balance between richness and simplicity of the models, while maintaining the interpretability

of the resulting graphical models. For capturing the dynamic information in the data, we can use

time series models, continuous time models and their nonparametric variations.

Last but not least, it is desirable to continue with the work in this thesis on large-scale sta-

tistical inference (a.k.a. multiple testing), especially the unsolved challenges such as dependence

among the hypotheses and massive scale hypotheses testing. Multiple testing has emerged as one

of the most active research areas in statistics over the last 15 years, currently contributing about

8% of the articles in the leading methodological statistics journals [10]. We still need multiple

testing in the era of big data. In practice, hypothesis testing remains the standard way for biol-


ogists and geneticists to report their scientific discoveries, and multiple testing (such as control

of the false discovery rate) is still one important tool to quantify the credibility of the discoveries

from genomics and proteomics data. However, more and more genomics and proteomics problems

involve moderate or strong dependencies among the hypotheses, due to the correlation structure

in the biological data. This dependence has been ignored or under-utilized historically. Therefore,

multiple testing under dependence is and will continue to be the most important direction in mul-

tiple testing research, and improving the power of testing via leveraging dependencies will be very

important to make discoveries from next generation genomics and proteomics data. In addition,

emerging genomics and proteomics data also present new types of heterogeneity and latent struc-

ture. Furthermore, the area of large-scale inference still contains many open questions of great

theoretical interest, such as optimality and consistency of multiple testing procedures.

All the three proposed directions will prominently enhance the linkage between genomics and

proteomics on the one hand and the field of big data analytics on the other. With the advent of

the next generation meta-genomic and meta-proteomic data, we will no longer be able to ana-

lyze the raw data directly without a further step of integration. The intrinsic heterogeneity and

ever-growing complex structures within data make probabilistic modeling and statistical inference

increasingly more difficult. These methodology changes and analytic challenges are likely to hap-

pen in other big data analytic problems such as social networks and business enterprise networks.

I hope that the advances achieved in the future work can further motivate and stimulate the related

research topics in the general big data analytic areas.


Bibliography

[1] American College of Radiology and American College of Radiology. BI-RADS Committee. Breast imaging

reporting and data system. American College of Radiology, 1998.

[2] Antonis C Antoniou, Xianshu Wang, Zachary S Fredericksen, Lesley McGuffog, Robert Tarrell, Olga M Sinil-

nikova, Sue Healey, Jonathan Morrison, Christiana Kartsonaki, Timothy Lesnick, et al. A locus on 19p13

modifies risk of breast cancer in brca1 mutation carriers and is associated with hormone receptor-negative breast

cancer in the general population. Nature genetics, 42(10):885–892, 2010.

[3] Peter Armitage. Tests for linear trends in proportions and frequencies. Biometrics, 11:375–386, 1955.

[4] Arthur U. Asuncion, Qiang Liu, Alexander T. Ihler, and Padhraic Smyth. Particle filtered MCMC-MLE with

connections to contrastive divergence. In ICML, 2010.

[5] Arthur U. Asuncion, Qiang Liu, Alexander T. Ihler, and Padhraic Smyth. Particle filtered MCMC-MLE with

connections to contrastive divergence. In ICML, 2010.

[6] Jay A Baker, Phyllis J Kornguth, Joseph Y Lo, Margaret E Williford, and Carey E Floyd. Breast cancer:

prediction with artificial neural network based on bi-rads standardized lexicon. Radiology, 196(3):817–822,

1995.

[7] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d’Aspremont. Model selection through sparse maximum

likelihood estimation for multivariate gaussian or binary data. J. Mach. Learn. Res., 9:485–516, 2008.

[8] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the

statistical analysis of probabilistic functions of Markov chains. ANN MATH STAT, 41(1):164–171, 1970.

[9] Matthew J. Beal, Zoubin Ghahramani, and Carl E. Rasmussen. The infinite hidden Markov model. In NIPS,

2002.


[10] Yoav Benjamini. Simultaneous and selective inference: current successes and future challenges. Biometrical

Journal, 52(6):708–721, 2010.

[11] Yoav Benjamini and Ruth Heller. False discovery rates for spatial signals. Journal of the American Statistical

Association, 102:1272–1281, 2007.

[12] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach

to multiple testing. Journal of The Royal Statistical Society Series B-Statistical Methodology, 57(1):289–300,

1995.

[13] Yoav Benjamini and Yosef Hochberg. Multiple hypotheses testing with weights. Scandinavian Journal of

Statistics, 24:407–418, 1997.

[14] Yoav Benjamini and Yosef Hochberg. On the adaptive control of the false discovery rate in multiple testing with

independent statistics. Journal of Educational and Behavioral Statistics, 25(1):60–83, 2000.

[15] Yoav Benjamini, Abba M. Krieger, and Daniel Yekutieli. Adaptive linear step-up procedures that control the

false discovery rate. Biometrika, 93:491–507, 2006.

[16] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under depen-

dency. Annals of Statistics, 29:1165–1188, 2001.

[17] Berit Bernert, Helena Porsch, and Paraskevi Heldin. Hyaluronan synthase 2 (HAS2) promotes breast cancer

cell invasion by suppression of tissue metalloproteinase inhibitor 1 (TIMP-1). J BIOL CHEM, 286(49):42349–

42359, 2011.

[18] Julian Besag. Statistical analysis of non-lattice data. JRSS-D, 24(3):179–195, 1975.

[19] Gilles Blanchard and Etienne Roquain. Adaptive false discovery rate control under independence and depen-

dence. J MACH LEARN RES, 10:2837–2871, December 2009.

[20] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.

[21] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE

Trans. Pattern Anal. Mach. Intell., 23(11):1222–1239, November 2001.

[22] Elizabeth S. Burnside, Jesse Davis, Jagpreet Chhatwal, Oguzhan Alagoz, Mary J. Lindstrom, Berta M. Geller,

Benjamin Littenberg, Katherine A. Shaffer, Charles E. Kahn, and C. David Page. A probabilistic computer

model developed from clinical data in the national mammography database format to classify mammographic

findings. Radiology, 251:663–672, 2009.


[23] Emmanuel Candes and Terence Tao. Rejoinder: the Dantzig selector: statistical estimation when p is much

larger than n, 2007.

[24] Lisa A Carey, Charles M Perou, Chad A Livasy, Lynn G Dressler, David Cowan, Kathleen Conway, Gamze

Karaca, Melissa A Troester, Chiu Kit Tse, Sharon Edmiston, et al. Race, breast cancer subtypes, and survival in

the Carolina Breast Cancer Study. JAMA, 295(21):2492–2502, 2006.

[25] Miguel A. Carreira-Perpinan and Geoffrey E. Hinton. On contrastive divergence learning. In AISTATS, 2005.

[26] Gilles Celeux, Florence Forbes, and Nathalie Peyrard. EM procedures using mean field-like approximations for

Markov model-based image segmentation. Pattern Recognition, 36:131–144, 2003.

[27] J. M. Chapman, J. D. Cooper, J. A. Todd, and D. G. Clayton. Detecting disease associations due to linkage

disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Human Heredity,

56:18–31, 2003.

[28] Sotirio P. Chatzis and Theodora A. Varvarigou. A fuzzy clustering approach toward hidden Markov random field

models for enhanced spatially constrained image segmentation. IEEE Transactions on Fuzzy Systems, 16:1351

– 1361, 2008.

[29] Sotirios P. Chatzis and Gabriel Tsechpenakis. The infinite hidden Markov random field model. In ICCV, 2009.

[30] James M Cheverud. A simple correction for multiple comparisons in interval mapping genome scans. Heredity,

87:52–58, 2001.

[31] William G. Cochran. Some methods for strengthening the common chi-square tests. Biometrics, 10:417–451,

1954.

[32] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk. Wavelet-based statistical signal processing using hidden

Markov models. IEEE T SIGNAL PROCES, 46(4):886–902, April 1998.

[33] Douglas F. Easton, Karen A. Pooley, Alison M. Dunning, Paul D. P. Pharoah, Deborah Thompson, Dennis G.

Ballinger, Jeffery P. Struewing, Jonathan Morrison, Helen Field, Robert Luben, Nicholas Wareham, Shahana

Ahmed, Catherine S. Healey, Richard Bowman, Kerstin B. Meyer, Christopher A. Haiman, Laurence K. Kolonel,

Brian E. Henderson, Loic Le Marchand, Paul Brennan, Suleeporn Sangrajrang, Valerie Gaborieau, Fabrice Ode-

frey, Chen-Yang Shen, Pei-Ei Wu, Hui-Chun Wang, Diana Eccles, Gareth D. Evans, Julian Peto, Olivia Fletcher,

Nichola Johnson, Sheila Seal, Michael R. Stratton, Nazneen Rahman, Georgia Chenevix-Trench, Stig E. Bo-

jesen, Børge G. Nordestgaard, Christen K. Axelsson, Montserrat Garcia-Closas, Louise Brinton, Stephen

Chanock, Jolanta Lissowska, Beata Peplonska, Heli Nevanlinna, Rainer Fagerholm, Hannaleena Eerola, Dae-

hee Kang, Keun-Young Yoo, Dong-Young Noh, Sei-Hyun Ahn, David J. Hunter, Susan E. Hankinson, David G.

Cox, Per Hall, Sara Wedren, Jianjun Liu, Yen-Ling Low, Natalia Bogdanova, Peter Schurmann, Thilo Dork, Rob

A. E. M. Tollenaar, Catharina E. Jacobi, Peter Devilee, Jan G. M. Klijn, Alice J. Sigurdson, Michele M. Doody,

Bruce H. Alexander, Jinghui Zhang, Angela Cox, Ian W. Brock, Gordon Macpherson, Malcolm W. R. Reed,

Fergus J. Couch, Ellen L. Goode, Janet E. Olson, Hanne Meijers-Heijboer, Ans van den Ouweland, Andre Uit-

terlinden, Fernando Rivadeneira, Roger L. Milne, Gloria Ribas, Anna Gonzalez-Neira, Javier Benitez, John L.

Hopper, Margaret Mccredie, Melissa Southey, Graham G. Giles, Chris Schroen, Christina Justenhoven, Hiltrud

Brauch, Ute Hamann, Yon-Dschun Ko, Amanda B. Spurdle, Jonathan Beesley, Xiaoqing Chen, Arto Manner-

maa, Veli-Matti Kosma, Vesa Kataja, Jaana Hartikainen, Nicholas E. Day, David R. Cox, and Bruce A. J. Ponder.

Genome-wide association study identifies novel breast cancer susceptibility loci. Nature, 447:1087–1093, May

2007.

[34] Bradley Efron. Correlation and large-scale simultaneous significance testing. J AM STAT ASSOC, 102(477):93–

103, March 2007.

[35] Bradley Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cam-

bridge University Press, 2010.

[36] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.

[37] Bradley Efron and Robert Tibshirani. Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology, 23(1):70–86, 2002.

[38] Bradley Efron, Robert Tibshirani, John D. Storey, and Virginia Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96:1151–1160, 2001.

[39] V. A. Epanechnikov. Non-parametric estimation of a multivariate probability density. THEOR PROBAB APPL,

14(1):153–158, 1969.

[40] Eleazar Eskin. Increasing power in association studies by using linkage disequilibrium structure and molecular

function as prior information. Genome Research, 18(4):653–660, 2008.

[41] Jianqing Fan. Design-adaptive nonparametric regression. Journal of the American Statistical Association,

87(420):998–1004, 1992.

[42] Jianqing Fan, Xu Han, and Weijie Gu. Control of the false discovery rate under arbitrary covariance dependence. J AM STAT ASSOC (to appear), 2012.

[43] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties.

Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[44] Jianqing Fan, Richard Samworth, and Yichao Wu. Ultrahigh dimensional feature selection: Beyond the linear

model. Journal of Machine Learning Research, 10:2013–2038, 2009.

[45] Ruzong Fan and Michael Knapp. Genome association studies of complex diseases by case-control designs.

American Journal of Human Genetics, 72:850–868, 2003.

[46] Alessio Farcomeni. Some results on the control of the false discovery rate under dependence. SCAND J STAT,

34(2):275–297, June 2007.

[47] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.

[48] Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–

230, 1973.

[49] H. Finner and M. Roters. Multiple hypotheses testing and expected number of type I errors. ANN STAT, 30:220–

238, 2002.

[50] Olivia Fletcher, Nichola Johnson, Nick Orr, Fay J Hosking, Lorna J Gibson, Kate Walker, Diana Zelenika, Ivo Gut, Simon Heath, Claire Palles, et al. Novel breast cancer susceptibility locus at 9q31.2: results of a genome-wide association study. Journal of the National Cancer Institute, 103(5):425–435, 2011.

[51] L. R. Ford and D. R. Fulkerson. Constructing maximal dynamic flows from static flows. Operations Research,

6(3):419–433, 1958.

[52] B. Freidlin, G. Zheng, Z. Li, and J. L. Gastwirth. Trend tests for case-control studies of genetic markers: power,

sample size and robustness. Human Heredity, 53(3):146–152, 2002.

[53] Chloe Friguet, Maela Kloareg, and David Causeur. A factor model approach to multiple testing under depen-

dence. J AM STAT ASSOC, 104(488):1406–1415, December 2009.

[54] Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. Beam sampling for the infinite hidden

Markov model. In ICML, 2008.

[55] M. H. Gail. Discriminatory accuracy from single-nucleotide polymorphisms in models to predict breast cancer

risk. J Natl Cancer Inst, 100(14):1037–1041, 2008.

[56] M. H. Gail. Value of adding single-nucleotide polymorphism genotypes to a breast cancer risk model. J Natl

Cancer Inst, 101(13):959–963, 2009.

[57] Mitchell H Gail, Louise A Brinton, David P Byar, Donald K Corle, Sylvan B Green, Catherine Schairer, and

John J Mulvihill. Projecting individualized probabilities of developing breast cancer for white females who are

being examined annually. J Natl Cancer Inst, 81(24):1879–1886, 1989.

[58] Varun Ganapathi, David Vickrey, John Duchi, and Daphne Koller. Constrained approximate maximum entropy

learning of Markov random fields. In UAI, 2008.

[59] Montserrat Garcia-Closas, Fergus J Couch, Sara Lindstrom, Kyriaki Michailidou, Marjanka K Schmidt, Mark N Brook, Nick Orr, Suhn Kyong Rhie, Elio Riboli, Heather S Feigelson, et al. Genome-wide association studies identify four ER-negative-specific breast cancer risk loci. Nature Genetics, 45(4):392–398, 2013.

[60] Alan E. Gelfand and Adrian F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal

of the American Statistical Association, 85(410):398–409, 1990.

[61] Andrew Gelman and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cam-

bridge University Press, New York, 2007.

[62] Christopher Genovese and Larry Wasserman. Operating characteristics and extensions of the false discovery

rate procedure. Journal of The Royal Statistical Society Series B-Statistical Methodology, 64:499–517, 2002.

[63] Christopher Genovese and Larry Wasserman. A stochastic process approach to false discovery control. Annals

of Statistics, 32:1035–1061, 2004.

[64] Christopher Genovese, Kathryn Roeder, and Larry Wasserman. False discovery control with p-value weighting. Biometrika, 93:509–524, 2006.

[65] Samuel J. Gershman and David M. Blei. A tutorial on Bayesian nonparametric models. Journal of Mathematical

Psychology, 56(1):1–12, 2012.

[66] Charles J. Geyer. Markov chain Monte Carlo maximum likelihood. COMP SCI STAT, pages 156–163, 1991.

[67] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. Machine Learning, 29:245–273, 1997.

[68] Maya Ghoussaini, Olivia Fletcher, Kyriaki Michailidou, Clare Turnbull, Marjanka K Schmidt, Ed Dicks, Joe Dennis, Qin Wang, Manjeet K Humphreys, Craig Luccarini, et al. Genome-wide association analysis identifies three new breast cancer susceptibility loci. Nature Genetics, 44(3):312–318, 2012.

[69] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum flow problem. In Proceedings of the 18th

ACM Symposium on Theory of Computing, pages 136–146, 1986.

[70] DB Goldstein. Common genetic variation and human traits. The New England Journal of Medicine, 360(17):1696, 2009.

[71] Michael Gutmann and Jun-ichiro Hirayama. Bregman divergence as general framework to estimate unnormal-

ized statistical models. In UAI, pages 283–290, Corvallis, Oregon, 2011. AUAI Press.

[72] Michael Gutmann and Aapo Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnor-

malized statistical models. In AISTATS, 2010.

[73] Isabelle Guyon and Andre Elisseeff. An introduction to variable and feature selection. Journal of Machine

Learning Research, 3:1157–1182, 2003.

[74] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification

using support vector machines. MACH LEARN, 46(1-3):389–422, 2002.

[75] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, 2009.

[76] Buhm Han and Eleazar Eskin. Multiple testing in genetic epidemiology. Encyclopedia of Life Sciences, 2010.

[77] Lauren A. Hannah, David M. Blei, and Warren B. Powell. Dirichlet process mixtures of generalized linear

models. Journal of Machine Learning Research, 12:1923–1953, 2011.

[78] Chris Hans. Bayesian lasso regression. BIOMETRIKA, 96(4):835–845, December 2009.

[79] Reina Haque, Syed A Ahmed, Galina Inzhakova, Jiaxiao Shi, Chantal Avila, Jonathan Polikoff, Leslie Bernstein,

Shelley M Enger, and Michael F Press. Impact of breast cancer subtypes and treatment on survival: an analysis

spanning two decades. Cancer Epidemiology Biomarkers & Prevention, 21(10):1848–1855, 2012.

[80] Geoffrey Hinton. Training products of experts by minimizing contrastive divergence. NEURAL COMPUT,

14:1771–1800, 2002.

[81] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In

COLT, 2009.

[82] David J. Hunter, Peter Kraft, Kevin B. Jacobs, David G. Cox, Meredith Yeager, Susan E. Hankinson, Sholom Wacholder, Zhaoming Wang, Robert Welch, Amy Hutchinson, Junwen Wang, Kai Yu, Nilanjan Chatterjee, Nick Orr, Walter C. Willett, Graham A. Colditz, Regina G. Ziegler, Christine D. Berg, Saundra S. Buys, Catherine A. McCarty, Heather S. Feigelson, Eugenia E. Calle, Michael J. Thun, Richard B. Hayes, Margaret Tucker, Daniela S. Gerhard, Joseph F. Fraumeni, Robert N. Hoover, Gilles Thomas, and Stephen J. Chanock. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics, 39(7):870–874, 2007.

[83] Aapo Hyvarinen. Connections between score matching, contrastive divergence, and pseudolikelihood for

continuous-valued variables. IEEE T NEURAL NETWOR, 18(5):1529–1531, 2007.

[84] Aapo Hyvarinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–2512, 2007.

[85] Aapo Hyvarinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–

2512, 2007.

[86] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In

Proceedings of the International Conference on Machine Learning, 2009.

[87] Rodolphe Jenatton, Jean-Yves Audibert, and Francis Bach. Structured variable selection with sparsity-inducing

norms. Technical report, 2009.

[88] Finn V. Jensen. An Introduction to Bayesian Networks. UCL Press, London, 1996.

[89] Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence K. Saul. An introduction to variational

methods for graphical models. Machine Learning, 37:183–233, 1999.

[90] Junhwan Kim and Ramin Zabih. Factorial Markov random fields. In ECCV, pages 321–334, 2002.

[91] Ross Kindermann and J. Laurie Snell. Markov Random Fields and Their Applications (Contemporary Mathe-

matics ; V. 1). Amer Mathematical Society, 1980.

[92] Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In Proceedings of the ninth interna-

tional workshop on Machine learning, pages 249–256, 1992.

[93] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324,

1997.

[94] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[95] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell., 26(2):147–159, 2004.

[96] Peter Kraft and David J Hunter. Genetic risk prediction: are we there yet? The New England Journal of Medicine, 360(17):1701–1703, 2009.

[97] F.R. Kschischang, B.J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE T INFORM THEORY, 47(2):498–519, February 2001.

[98] Xiangyang Lan, Stefan Roth, Daniel Huttenlocher, and Michael J. Black. Efficient belief propagation with learned higher-order Markov random fields. In ECCV, pages 269–282, 2006.

[99] Steffen L. Lauritzen and David J. Spiegelhalter. Local computations with probabilities on graphical structures

and their application to expert systems. Journal of The Royal Statistical Society Series B-Statistical Methodology,

50(2):157–224, 1988.

[100] Jeffrey T. Leek and John D. Storey. A general framework for multiple testing dependence. P NATL ACAD SCI

USA, 105(48):18718–18723, 2008.

[101] Yuejuan Li, Lingli Li, Tracey J. Brown, and Paraskevi Heldin. Silencing of hyaluronan synthase 2 suppresses

the malignant phenotype of invasive breast cancer cells. INT J CANCER, 120(12):2557–2567, 2007.

[102] Paul Lichtenstein, Niels V. Holm, Pia K. Verkasalo, Anastasia Iliadou, Jaakko Kaprio, Markku Koskenvuo, Eero Pukkala, Axel Skytthe, and Kari Hemminki. Environmental and heritable factors in the causation of cancer–analyses of cohorts of twins from Sweden, Denmark, and Finland. New England Journal of Medicine, 343:78–85, 2000.

[103] D. Y. Lin. An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformatics, 21:781–787, 2005.

[104] Han Liu, Jian Zhang, Xiaoye Jiang, and Jun Liu. The group Dantzig selector. In AISTATS, 2010.

[105] Jie Liu, David Page, Houssam Nassif, Jude Shavlik, Peggy Peissig, Catherine McCarty, Adedayo A Onitilo, and

Elizabeth Burnside. Genetic variants improve breast cancer risk prediction on mammograms. In AMIA Summit

on Translational Bioinformatics (AMIA-TBI), 2014.

[106] Jie Liu, David Page, Peggy Peissig, Catherine McCarty, Adedayo A Onitilo, Amy Trentham-Dietz, and Elizabeth

Burnside. New genetic variants improve personalized breast cancer diagnosis. In AMIA Summit on Translational

Bioinformatics (AMIA-TBI), 2014.

[107] Jie Liu, Chunming Zhang, Catherine McCarty, Peggy Peissig, Elizabeth Burnside, and David Page. Graphical-

model based multiple testing under dependence, with applications to genome-wide association studies. In UAI,

2012.

[108] Jie Liu, Chunming Zhang, Catherine McCarty, Peggy Peissig, Elizabeth Burnside, and David Page. High-

dimensional structured feature screening using binary Markov random fields. In AISTATS, 2012.

[109] Hans-Andrea Loeliger. An introduction to factor graphs. IEEE Signal Processing Magazine, 21:28–41, 2004.

[110] A. Lorbert, D. Eis, V. Kostina, D. Blei, and P. Ramadge. Exploiting covariate similarity in sparse regression via

the pairwise elastic net. In AISTATS, 2010.

[111] Daniel Lowd and Pedro Domingos. Naive Bayes models for probability estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 529–536, 2005.

[112] Siwei Lyu. Unifying non-maximum likelihood learning objectives with minimum KL contraction. In NIPS,

pages 64–72, 2011.

[113] Teri A Manolio, Francis S Collins, Nancy J Cox, David B Goldstein, Lucia A Hindorff, David J Hunter, Mark I

McCarthy, Erin M Ramos, Lon R Cardon, Aravinda Chakravarti, et al. Finding the missing heritability of

complex diseases. Nature, 461(7265):747–753, 2009.

[114] J. C. Marioni, N. P. Thorne, and S. Tavare. BioHMM: a heterogeneous hidden Markov model for segmenting

array CGH data. Bioinformatics, 22:1144–1146, 2006.

[115] CA McCarty, RA Wilke, PF Giampietro, SD Wesbrook, and MD Caldwell. Marshfield Clinic Personalized

Medicine Research Project (PMRP): design, methods and recruitment for a large population-based biobank.

Personalized Med, 2:49–79, 2005.

[116] Catherine A McCarty, Rex L Chisholm, Christopher G Chute, Iftikhar J Kullo, Gail P Jarvik, Eric B Larson, Rongling Li, Daniel R Masys, Marylyn D Ritchie, Dan M Roden, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Medical Genomics, 4(1):13, 2011.

[117] Marina Meila. Comparing clusterings by the variation of information. In COLT, pages 173–187, 2003.

[118] Xiao-Li Meng and Wing Hung Wong. Simulating ratios of normalizing constants via a simple identity: a

theoretical exploration. Statistica Sinica, 6(4):831–860, 1996.

[119] Patrick Emmanuel Meyer, Colas Schretter, and Gianluca Bontempi. Information-theoretic feature selection in

microarray data using variable complementarity. Selected Topics in Signal Processing, IEEE Journal of, 2:261–

274, 2008.

[120] Kyriaki Michailidou, Per Hall, Anna Gonzalez-Neira, Maya Ghoussaini, Joe Dennis, Roger L Milne, Marjanka K Schmidt, Jenny Chang-Claude, Stig E Bojesen, Manjeet K Bolla, et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nature Genetics, 45(4):353–361, 2013.

[121] Sujit Kumar Mitra. On the limiting power function of the frequency chi-square test. Ann. Math. Statist., 29:1221–

1233, 1958.

[122] J. Møller, A.N. Pettitt, R. Reeves, and K.K. Berthelsen. An efficient Markov chain Monte Carlo method for

distributions with intractable normalising constants. Biometrika, 93(2):451–458, 2006.

[123] J. Møller, A.N. Pettitt, R. Reeves, and K.K. Berthelsen. An efficient Markov chain Monte Carlo method for

distributions with intractable normalising constants. Biometrika, 93(2):451–458, 2006.

[124] Valentina Moskvina and Karl Michael Schmidt. On multiple-testing correction in genome-wide association

studies. Genetic Epidemiology, 32(6):567–573, 2008.

[125] I. Mukhopadhyay, E. Feingold, D. E. Weeks, and A. Thalamuthu. Association tests using kernel-based measures

of multi-locus genotype similarity between individuals. Genetic Epidemiology, 34(3):213–221, April 2010.

[126] I. Mukhopadhyay, E. Feingold, D. E. Weeks, and A. Thalamuthu. Association tests using kernel-based measures

of multi-locus genotype similarity between individuals. GENET EPIDEMIO, 34(3):213–221, April 2010.

[127] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proceedings of Uncertainty in AI, pages 467–475, 1999.

[128] Iain Murray, Zoubin Ghahramani, and David J. C. MacKay. MCMC for doubly-intractable distributions. In UAI,

2006.

[129] E. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 9(1):141–142, 1964.

[130] H Nassif, R Wood, E S Burnside, M Ayvaci, J Shavlik, and D Page. Information extraction for clinical data

mining: a mammography case study. In IEEE International Conference on Data Mining (ICDM’09) Workshops,

pages 37–42, Miami, Florida, 2009.

[131] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computa-

tional and Graphical Statistics, 9(2):249–265, 2000.

[132] Heidi D Nelson, Kari Tyne, Arpana Naik, Christina Bougatsos, Benjamin K Chan, and Linda Humphrey. Screening for breast cancer: an update for the US Preventive Services Task Force. Ann Intern Med, 151:727–737, 2009.

[133] Roland Nilsson, Jose M. Pena, Johan Bjorkegren, and Jesper Tegner. Consistent feature selection for pattern

recognition in polynomial time. Journal of Machine Learning Research, 8:589–612, 2007.

[134] Dale R. Nyholt. A simple correction for multiple testing for single-nucleotide polymorphisms in linkage dise-

quilibrium with each other. American Journal of Human Genetics, 74(4):765–769, 2004.

[135] P. Orbanz and Y. W. Teh. Bayesian nonparametric models. In Encyclopedia of Machine Learning. Springer,

2010.

[136] Art B. Owen. Variance of the number of false discoveries. J ROY STAT SOC B, 67:411–426, 2005.

[137] Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. An infinite latent attribute model for network

data. In ICML, 2012.

[138] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann

Publishers Inc., San Francisco, CA, USA, 1988.

[139] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information: criteria of max-

dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelli-

gence, 27(8):1226–1238, 2005.

[140] B Percha, H Nassif, J Lipson, E Burnside, and D Rubin. Automatic classification of mammography reports by

BI-RADS breast tissue composition class. J. Am. Med. Inform. Assn., 19(5):913–916, 2012.

[141] Charles M Perou, Therese Sørlie, Michael B Eisen, Matt van de Rijn, Stefanie S Jeffrey, Christian A Rees,

Jonathan R Pollack, Douglas T Ross, Hilde Johnsen, Lars A Akslen, et al. Molecular portraits of human breast

tumours. Nature, 406(6797):747–752, 2000.

[142] Paul D. P. Pharoah, Antonis C. Antoniou, Douglas F. Easton, and Bruce A. J. Ponder. Polygenes, risk prediction,

and targeted prevention of breast cancer. New England Journal of Medicine, 358(26):2796–2803, 2008.

[143] James Gary Propp and David Bruce Wilson. Exact sampling with coupled Markov chains and applications to

statistical mechanics. Random structures and Algorithms, 9(1-2):223–252, 1996.

[144] Yuan Qi and Feng Yan. EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning.

In NIPS, 2011.

[145] Carl Edward Rasmussen. The infinite Gaussian mixture model. In NIPS, 2000.

[146] Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. In NIPS, 2001.

[147] Anat Reiner, Daniel Yekutieli, and Yoav Benjamini. Identifying differentially expressed genes using false dis-

covery rate controlling procedures. Bioinformatics, 19(3):368–375, 2003.

[148] Herbert Robbins. An empirical Bayes approach to statistics. In The 3rd Berkeley Symposium I, pages 157–163,

1956.

[149] K Roeder, SA Bacanu, V Sonpar, X Zhang, and B Devlin. Analysis of single-locus tests to detect gene/disease

associations. Genetic Epidemiology, 28(3):207–219, 2005.

[150] Joseph Romano, Azeem Shaikh, and Michael Wolf. Control of the false discovery rate under dependence using

the bootstrap and subsampling. TEST, 17:417–442, 2008.

[151] Lorenzo Rosasco, Matteo Santoro, Sofia Mosci, Alessandro Verri, and Silvia Villa. A regularization approach

to nonlinear variable selection. In AISTATS, 2010.

[152] Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. ANN MATH STAT,

27(3):832–837, 1956.

[153] Ruslan Salakhutdinov. Learning in Markov random fields using tempered transitions. In NIPS, pages 1598–

1606, 2009.

[154] Daria Salyakina, Shaun R Seaman, Brian L Browning, Frank Dudbridge, and Bertram Muller-Myhsok. Evaluation of Nyholt's procedure for multiple testing correction. Human Heredity, 60(1):19–25, 2005.

[155] Sanat K. Sarkar. False discovery and false nondiscovery rates in single-step multiple testing procedures. ANN

STAT, 34(1):394–415, 2006.

[156] Pål Sætrom, Jacob Biesinger, Sierra M Li, David Smith, Laurent F Thomas, Karim Majzoub, Guillermo E Rivas, Jessica Alluin, John J Rossi, Theodore G Krontiris, Jeffrey Weitzel, Mary B Daly, Al B Benson, John M Kirkwood, Peter J O'Dwyer, Rebecca Sutphen, James A Stewart, David Johnson, and Garrett P Larson. A risk variant in a miR-125b binding site in BMPR1B is associated with breast cancer pathogenesis. CANCER RES, 69(18):7459–7465, 2009.

[157] Daniel J. Schaid, Shannon K. McDonnell, Scott J. Hebbring, Julie M. Cunningham, and Stephen N. Thibodeau.

Nonparametric tests of association of multiple genes with human disease. American Journal of Human Genetics,

76(5):780–793, May 2005.

[158] John T Schousboe, Karla Kerlikowske, Andrew Loh, and Steven R Cummings. Personalizing mammography

by breast density and other risk factors for breast cancer: analysis of health benefits and cost-effectiveness. Ann

Intern Med, 155:10–20, 2011.

[159] Nicol N. Schraudolph. Polynomial-time exact inference in NP-hard binary MRFs via reweighted perfect match-

ing. In AISTATS, 2010.

[160] Nicol N. Schraudolph and Dmitry Kamenetsky. Efficient exact inference in planar Ising models. In NIPS, 2009.

[161] Babak Shahbaba and Radford Neal. Nonlinear models using Dirichlet process mixtures. Journal of Machine

Learning Research, 10:1829–1850, 2009.

[162] Afshan Siddiq, Fergus J Couch, Gary K Chen, Sara Lindstrom, Diana Eccles, Robert C Millikan, Kyriaki Michailidou, Daniel O Stram, Lars Beckmann, Suhn Kyong Rhie, et al. A meta-analysis of genome-wide association studies of breast cancer identifies two novel susceptibility loci at 6q14 and 20q11. Human Molecular Genetics, 21(24):5373–5384, 2012.

[163] S. L. Slager and D. J. Schaid. Case-control studies of genetic markers: power and sample size approximations

for armitage’s test for trend. Human Heredity, 52(3):149–153, 2001.

[164] L. Song, B. Boots, S. Siddiqi, G. Gordon, and A. Smola. Hilbert space embeddings of hidden Markov models.

In ICML, 2010.

[165] Kristen N Stevens, Zachary Fredericksen, Celine M Vachon, Xianshu Wang, Sara Margolin, Annika Lindblom, Heli Nevanlinna, Dario Greco, Kristiina Aittomaki, Carl Blomqvist, et al. 19p13.1 is a triple-negative-specific breast cancer susceptibility locus. Cancer Research, 72(7):1795–1803, 2012.

[166] John D. Storey. A direct approach to false discovery rates. Journal of The Royal Statistical Society Series

B-Statistical Methodology, 64:479–498, 2002.

[167] John D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31(6):2013–2035, 2003.

[168] John D Storey, Jonathan E Taylor, and David Siegmund. Strong control, conservative point estimation and

simultaneous conservative consistency of false discovery rates: a unified approach. JRSS-B, 66(1):187–205,

2004.

[169] Korbinian Strimmer. A unified approach to false discovery rate estimation. BMC bioinformatics, 9(1):303, 2008.

[170] Zhan Su, Jonathan Marchini, and Peter Donnelly. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics, 2011.

[171] Wenguang Sun and T. Tony Cai. Oracle and adaptive compound decision rules for false discovery rate control.

J AM STAT ASSOC, 102(479):901–912, 2007.

[172] Wenguang Sun and T. Tony Cai. Large-scale multiple testing under dependence. Journal of The Royal Statistical

Society Series B-Statistical Methodology, 71:393–424, 2009.

[173] I. Sutskever and T. Tieleman. On the convergence properties of Contrastive Divergence. In AISTATS, 2010.

[174] Ulrika Svenson, Katarina Nordfjall, Birgitta Stegmayr, Jonas Manjer, Peter Nilsson, Bjorn Tavelin, Roger Hen-

riksson, Per Lenner, and Goran Roos. Breast cancer survival is associated with telomere length in peripheral

blood cells. CANCER RES, 68(10):3618–3623, 2008.

[175] The International HapMap Consortium. The International HapMap Project. Nature, 426:789–796, 2003.

[176] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of The Royal Statistical Society

Series B-Statistical Methodology, 58(1):267–288, 1996.

[177] Robert Tibshirani and Michael Saunders. Sparsity and smoothness via the fused lasso. Journal of The Royal

Statistical Society Series B-Statistical Methodology, 67:91–108, 2005.

[178] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In

ICML, pages 1064–1071, 2008.

[179] Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In

ICML, pages 1033–1040, 2009.

[180] Clare Turnbull, Shahana Ahmed, Jonathan Morrison, David Pernet, Anthony Renwick, Mel Maranian, Sheila Seal, Maya Ghoussaini, Sarah Hines, Catherine S Healey, et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nature Genetics, 42(6):504–507, 2010.

[181] Clare Turnbull, Sheila Seal, Anthony Renwick, Margaret Warren-Perry, Deborah Hughes, Anna Elliott, David Pernet, Susan Peock, Julian W Adlard, Julian Barwell, et al. Gene–gene interactions in breast cancer susceptibility. Human Molecular Genetics, 21(4):958–962, 2012.

[182] Lishanthi Udabage, Gary R. Brownlee, Susan K. Nilsson, and Tracey J. Brown. The over-expression of HAS2,

Hyal-2 and CD44 is implicated in the invasiveness of breast cancer. EXP CELL RES, 310(1):205 – 217, 2005.

[183] Mark J. van der Laan, Sandrine Dudoit, and Katherine S. Pollard. Augmentation procedures for control of

the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical

Applications in Genetics and Molecular Biology, 3, 2004.

[184] David Vickrey, C Lin, and Daphne Koller. Non-local contrastive objectives. In ICML, 2010.

[185] Matthieu Vignes and Florence Forbes. Gene clustering via integrated Markov models combining individual and pairwise features. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 6:260–270, April 2009.

[186] S. Wacholder, P. Hartge, R. Prentice, M. Garcia-Closas, H. S. Feigelson, W. R. Diver, M. J. Thun, D. G. Cox,

S. E. Hankinson, P. Kraft, B. Rosner, C. D. Berg, L. A. Brinton, J. Lissowska, M. E. Sherman, R. Chlebowski,

C. Kooperberg, R. D. Jackson, D. W. Buckman, P. Hui, R. Pfeiffer, K. B. Jacobs, G. D. Thomas, R. N. Hoover,

M. H. Gail, S. J. Chanock, and D. J. Hunter. Performance of common genetic variants in breast-cancer risk

models. N Engl J Med, 362(11):986–993, 2010.

[187] Sholom Wacholder, Stephen Chanock, Montserrat Garcia-Closas, Laure El ghormli, and Nathaniel Rothman.

Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. Journal

of the National Cancer Institute, 96(6):434–442, March 2004.

[188] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. A new class of upper bounds on the log partition

function. In UAI, pages 536–543, 2002.

[189] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Tree-based reparameterization framework for

analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 49:2003, 2003.

[190] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Tree-reweighted belief propagation algorithms

and approximate ML estimation via pseudo-moment matching. In AISTATS, 2003.

[191] Martin J. Wainwright and Michael I. Jordan. Log-determinant relaxation for approximate inference in discrete

Markov random fields. IEEE T SIGNAL PROCES, 54(6):2099–2109, 2006.

[192] Martin J Wainwright and Michael I Jordan. Graphical Models, Exponential Families, and Variational Inference.

Now Publishers Inc., Hanover, MA, USA, 2008.

[193] Helen Warren, Frank Dudbridge, Olivia Fletcher, Nick Orr, Nichola Johnson, John L Hopper, Carmel Apicella,

Melissa C Southey, Maryam Mahmoodi, Marjanka K Schmidt, et al. 9q31. 2-rs865686 as a susceptibility locus

for estrogen receptor-positive breast cancer: evidence from the breast cancer association consortium. Cancer

Epidemiology Biomarkers & Prevention, 21(10):1783–1791, 2012.

[194] Larry Wasserman and Kathryn Roeder. High-dimensional variable selection. Annals of Statistics, 37(5):2178–

2201, 2009.

[195] Geoffrey S. Watson. Smooth regression analysis. The Indian Journal of Statistics, Series A, 26(4):359–372,

1964.


[196] Zhi Wei, Kai Wang, Hui-Qi Qu, Haitao Zhang, Jonathan Bradfield, Cecilia Kim, Edward Frackleton, Cuiping Hou, Joseph T. Glessner, Rosetta Chiavacci, Charles Stanley, Dimitri Monos, Struan F. A. Grant, Constantin Polychronakos, and Hakon Hakonarson. From disease association to risk assessment: An optimistic view from genome-wide association studies on type 1 diabetes. PLoS Genetics, 5:e1000678, 2009.

[197] Yair Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1–41, January 2000.

[198] Max Welling and Charles Sutton. Learning in Markov random fields with contrastive free energies. In AISTATS, 2005.

[199] Jennifer Wessel and Nicholas J. Schork. Generalized genomic distance-based regression methodology for multilocus association analysis. American Journal of Human Genetics, 79(5):792–806, November 2006.

[200] Fa-Yueh Wu. The Potts model. Reviews of Modern Physics, 54:235–268, 1982.

[201] Michael C. Wu, Peter Kraft, Michael P. Epstein, Deanne M. Taylor, Stephen J. Chanock, David J. Hunter, and Xihong Lin. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics, 86(6):929–942, June 2010.

[202] Michael C. Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare variant association testing for sequencing data using the sequence kernel association test (SKAT). American Journal of Human Genetics, 2011.

[203] Tong Tong Wu, Yi Fang Chen, Trevor Hastie, Eric M. Sobel, and Kenneth Lange. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721, 2009.

[204] Wei Biao Wu. On false discovery control under dependence. Annals of Statistics, 36(1):364–380, 2008.

[205] Gui-Bo Ye, Yifei Chen, and Xiaohui Xie. Efficient variable selection in support vector machines via the alternating direction method of multipliers. In AISTATS, 2011.

[206] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Generalized belief propagation. In NIPS, pages 689–695. MIT Press, 2000.

[207] Daniel Yekutieli and Yoav Benjamini. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference, 82:171–196, 1999.

[208] Laurent Younes. Estimation and annealing for Gibbsian fields. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, 24(2):269–294, 1988.


[209] Lei Yu and Huan Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5:1205–1224, 2004.

[210] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68:49–67, 2006.

[211] Alan Yuille. The convergence of contrastive divergences. In NIPS, 2004.

[212] Chunming Zhang, Jianqing Fan, and Tao Yu. Multiple testing via FDRL for large-scale imaging data. Annals of Statistics, 39(1):613–642, 2011.

[213] Hao Helen Zhang, Jeongyoun Ahn, and Xiaodong Lin. Gene selection using support vector machines with nonconvex penalty. Bioinformatics, 22(1):88–95, 2006.

[214] Yongyue Zhang, Michael Brady, and Stephen Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 2001.

[215] Yang Zhou, Rong Jin, and Steven Hoi. Exclusive lasso for multi-task feature selection. In AISTATS, 2010.

[216] Jun Zhu, Ning Chen, and Eric P. Xing. Infinite latent SVM for classification and multi-task learning. In NIPS, 2011.

[217] Jun Zhu, Ning Chen, and Eric P. Xing. Infinite SVM: a Dirichlet process mixture of large-margin kernel machines. In ICML, 2011.

[218] Song Chun Zhu and Xiuwen Liu. Learning in Gibbsian fields: How accurate and how fast can it be? IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:1001–1006, 2002.

[219] Hui Zou and Trevor Hastie. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:301–320, 2005.

[220] Hui Zou and Hao Helen Zhang. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37:1733, 2009.

[221] Verena Zuber and Korbinian Strimmer. Gene ranking and biomarker discovery under correlation. Bioinformatics, 25(20):2700–2707, 2009.

[222] Verena Zuber and Korbinian Strimmer. High-dimensional regression and variable selection using CAR scores. Statistical Applications in Genetics and Molecular Biology, 10(1), 2011.

