Information Bottleneck for Pathway-Centric Gene Expression Analysis

Date post: 10-Nov-2023
David Adametz, Mélanie Rey, and Volker Roth

Department of Mathematics and Computer Science, University of Basel, Switzerland

{david.adametz,melanie.rey,volker.roth}@unibas.ch

Abstract. While DNA microarrays enable us to conveniently measure expression profiles in the scope of thousands of genes, the subsequent association studies typically suffer from a tremendous imbalance between the number of variables (genes) and observations (subjects). Even more so, each gene is heavily perturbed by noise which prevents any meaningful analysis on the single-gene level [6]. Hence, the focus shifted to pathways as groups of functionally related genes [4], in the hope that aggregation potentiates the underlying signal. Technically, this leads to a problem of feature extraction which was previously tackled by principal component analysis [5]. We reformulate the task using an extension of the Meta-Gaussian Information Bottleneck method as a means to compress a gene set while preserving information about a relevance variable. This opens up new possibilities, enabling us to make use of clinical side information in order to uncover hidden characteristics in the data.

1 Introduction

Gene expression analysis is concerned with finding individual genes that are connected to a given clinical trait or disease pattern. Typically, however, this task is inherently difficult since (i) a single gene is highly prone to noise and (ii) the number of genes heavily outweighs the number of subjects. Due to these circumstances it is possible to observe genes which perfectly explain an outcome, but cannot be reproduced independently, see Ein-Dor et al. [6].

Therefore, in an attempt to alleviate said shortcomings, the focus was turned to pathways [4], which are formally defined as high-level systemic functions in biological processes. These can be categorized into five groups: metabolism, genetic and environmental information processing, cellular processes, organismal systems and human diseases. For our purposes, a pathway is simply a set of genes given by biological prior knowledge. The underlying idea is that genes will exhibit detectable patterns only when grouped with functionally related entities. As a result, the conceptual pipeline in gene expression analysis is now extended by an intermediate step to identify common features within such a group. Our focus will be solely on this process of data fusion and feature extraction.

A recognized solution was introduced by Drier et al. [5] and implemented in the software package Pathifier. Here, principal component analysis (PCA) is used to project all genes of a pathway onto the directions of largest variance. While preserving variance is one possible choice to find a compact representation, it can very well be contradictory to what is of interest in a biological sense. Hence, the question is: What is a substitute for variance and how do we specify such an interest? In that regard, note that we often have access to rich clinical side information, although it is typically not used at this stage of an analysis. This is a prime example for the Information Bottleneck, where a variable $X$ (e.g. a gene) is compressed while maintaining information about a second variable $Y$ (a clinical feature, e.g. tumor stage). Hence, it gives rise to the notion of relevance in the data.

For a better understanding of the problem, Figure 1 depicts the structure and composition of our data. Hereby, we think of each gene as a random variable $X_j$ where every subject is an independent realization. Further, we combine these into the $p$-dimensional random vector $X$ with observations stored in rows of matrix $Z_X$. The same holds for $q$ clinical variables in random vector $Y$ and the corresponding data matrix $Z_Y$.

$Z_X$ =

               gene 1   ...   gene p
    subject 1   −0.25   ...     0.72
    subject 2   −0.02   ...    −1.49
    subject 3    0.82   ...     1.23
    ...           ...   ...      ...
    subject n   −0.17   ...     1.89

$Z_Y$ =

    age   sex   tumor stage   ...   trait q
     57    M         A        ...     1.21
     38    M         C        ...     0.67
     63    F         ?        ...      ?
    ...   ...       ...       ...     ...
     48    M         D        ...      ?

Fig. 1. Data scheme considered throughout the paper. Matrix $Z_X$ contains the expression values of $p$ genes on a pathway across $n$ patients/subjects. All its entries are continuous. In addition, matrix $Z_Y$ stores the corresponding clinical information over $q$ different traits/features. Typically, this includes discrete, categorical and/or continuous variables. Question marks denote missing values.

Applying the Information Bottleneck to pathway-centric analysis entails a number of challenges. As can be seen in Figure 1, gene expression values are continuous and in general follow unknown distributions. Clinical data, however, can exhibit various forms: continuous (viral load), discrete (age), categorical (tumor stage) or binary (sex). Additionally, parts of information could be missing due to practical or ethical reasons. In the following, we introduce the Information Bottleneck and an extension that is capable of handling the above data scheme.

2 Information Bottleneck (IB)

Consider the random vectors $X$ and $Y$. The framework of the Information Bottleneck [17] is concerned with finding a compression $T$ of $X$ while maintaining information about $Y$. This leads to an optimization problem

$$\min_{p(t|x)} \mathcal{L} \;\Big|\; \mathcal{L} \equiv I(X;T) - \beta\, I(T;Y), \tag{1}$$

where $\mathcal{L}$ denotes the IB objective functional (a Lagrangian), $I(\cdot\,;\cdot)$ is the mutual information and $\beta > 0$ determines the tradeoff between best compression of $X$ and being most informative about $Y$. For the general problem, there exists no closed-form solution.
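To make the objective in Eq. (1) concrete, the following minimal sketch evaluates it for small discrete distributions. The function names `mutual_information` and `ib_objective`, the toy joint table `p_xy` and the two encoders are illustrative assumptions and not part of the original method; the sketch only assumes the Markov chain T - X - Y that underlies the IB.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint probability table given as a list of rows."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (pa[i] * pb[j]))
    return mi

def ib_objective(p_xy, p_t_given_x, beta):
    """L = I(X;T) - beta * I(T;Y) for a stochastic encoder p(t|x)."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(p_t_given_x[0])
    p_x = [sum(row) for row in p_xy]
    # Joint p(x,t) = p(x) p(t|x).
    p_xt = [[p_x[x] * p_t_given_x[x][t] for t in range(n_t)] for x in range(n_x)]
    # Joint p(t,y) = sum_x p(t|x) p(x,y), which uses the Markov chain T - X - Y.
    p_ty = [[sum(p_t_given_x[x][t] * p_xy[x][y] for x in range(n_x))
             for y in range(n_y)] for t in range(n_t)]
    return mutual_information(p_xt) - beta * mutual_information(p_ty)

# Toy example (hypothetical numbers): X and Y are correlated binary variables.
p_xy = [[0.4, 0.1], [0.1, 0.4]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # T copies X: no compression
constant = [[1.0, 0.0], [1.0, 0.0]]   # T ignores X: full compression
```

For this toy table the constant encoder always gives an objective of zero, while the identity encoder is preferred (lower objective) only once $\beta$ is large enough to outweigh the cost $I(X;T)$.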

2.1 Gaussian Information Bottleneck (GIB)

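An interesting special case arises when $X$ and $Y$ are jointly distributed Gaussian random vectors of dimension $p$ and $q$, respectively (see Chechik et al. [3]):

$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}(0_{p+q}, \Sigma) \quad\text{with}\quad \Sigma = \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix}. \tag{2}$$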

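It can be shown that the compression is also Gaussian [8],

$$T = AX + \xi, \tag{3}$$

given by projection matrix $A \in \mathbb{R}^{p \times p}$ and noise component $\xi \sim \mathcal{N}(0_p, \Sigma_\xi)$. Therefore, rewriting Eq. (1) using Gaussian random vectors leads to the problem of finding the optimal pair $(A, \Sigma_\xi)$. For $\Sigma_\xi \equiv I_p$ (see footnote 1), we have

$$A = \begin{bmatrix} \alpha_1 v_1^{t} \,;\; \alpha_2 v_2^{t} \,;\; \ldots \,;\; \alpha_p v_p^{t} \end{bmatrix}, \tag{4}$$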

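where $v_i$ and $\lambda_i \le 1$ are the left eigenvectors and eigenvalues of the $p \times p$ matrix

$$B = \Sigma_{x|y}\Sigma_x^{-1} = I_p - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{yx}\Sigma_x^{-1} \tag{5}$$

in the order of $\lambda_1 < \ldots < \lambda_p$. The scaling of each row in $A$ is given by

$$\alpha_i = \begin{cases} \sqrt{\dfrac{\beta(1-\lambda_i) - 1}{\lambda_i\, v_i^{t}\Sigma_x v_i}} & \text{if } \beta > (1-\lambda_i)^{-1}, \\ 0 & \text{else.} \end{cases} \tag{6}$$

In the limit of $\beta \to 0$, we have $\alpha_i = 0 \;\forall i$ and thus $A = 0_{p \times p}$. As $\beta$ increases, a growing number of $\alpha_i$ becomes non-zero and $A$ is filled row-wise from top to bottom. For a sufficiently large $\beta > (1-\lambda_p)^{-1}$, all $v_i$ are present, meaning $T$ is most informative about $Y$.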
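The construction of $A$ in Eqs. (4) to (6) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gib_projection` and the toy covariance blocks are hypothetical, left eigenvectors of $B$ are obtained as eigenvectors of $B^{t}$, and eigenvalues numerically equal to 1 are skipped because their threshold $(1-\lambda_i)^{-1}$ is infinite. Eigenvalues equal to 0 are not handled.

```python
import numpy as np

def gib_projection(sigma_x, sigma_xy, sigma_y, beta):
    """Return (A, eigenvalues of B) following Eqs. (4)-(6) with Sigma_xi = I_p."""
    p = sigma_x.shape[0]
    # Eq. (5): B = I_p - Sigma_xy Sigma_y^{-1} Sigma_yx Sigma_x^{-1}
    B = np.eye(p) - sigma_xy @ np.linalg.solve(sigma_y, sigma_xy.T) @ np.linalg.inv(sigma_x)
    # Left eigenvectors of B are ordinary (right) eigenvectors of B^t.
    lam, V = np.linalg.eig(B.T)
    lam, V = lam.real, V.real            # eigenvalues of B are real in theory
    order = np.argsort(lam)              # ascending: lambda_1 <= ... <= lambda_p
    lam, V = lam[order], V[:, order]
    A = np.zeros((p, p))
    for i in range(p):
        # Eq. (6): row i becomes active once beta exceeds 1 / (1 - lambda_i).
        if lam[i] < 1.0 - 1e-12 and beta > 1.0 / (1.0 - lam[i]):
            v = V[:, i]
            r = v @ sigma_x @ v
            alpha = np.sqrt((beta * (1.0 - lam[i]) - 1.0) / (lam[i] * r))
            A[i] = alpha * v
    return A, lam

# Toy example (hypothetical numbers): only the first gene correlates with Y.
sigma_x = np.eye(2)
sigma_xy = np.array([[0.8], [0.0]])
sigma_y = np.array([[1.0]])
A_small, _ = gib_projection(sigma_x, sigma_xy, sigma_y, beta=1.0)    # below threshold
A_large, lam = gib_projection(sigma_x, sigma_xy, sigma_y, beta=10.0)  # above threshold
```

In this example $B$ has eigenvalues 0.36 and 1, so the first row of $A$ switches on at $\beta > 1/0.64$ and the second row stays zero for every $\beta$, since the second gene carries no information about $Y$.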

This concludes the results for Gaussian random vectors. Speaking in terms of gene expression data, we seek the joint covariance between genes and clinical features. Since the assumption of normality is too restrictive, the next section explores a solution.

2.2 Meta-Gaussian Information Bottleneck (MGIB)

The Gaussian formulation of the IB leads to an analytic solution; however, its scope is severely limited in light of our present setting. In general, gene expression data, while being continuous, does not necessarily obey normality. Even more critically, clinical information is far from being Gaussian. To that effect, Rey and Roth [13] greatly relaxed the problem as shown in the following.

Footnote 1: $(A, \Sigma_\xi)$ and $(A^{*}, I_p)$ are equivalent under $\mathcal{L}$ with a linear transformation between $A$ and $A^{*}$.

Using Sklar's theorem [15], every multivariate distribution can be expressed in terms of its univariate margins and their dependence structure. This separation allows us to analyze both properties independently from one another. In general, if random vector $(X, Y)$ follows joint distribution $F$, it decomposes as

$$F(x_1, \ldots, x_p, y_1, \ldots, y_q) = C\big(F_1(x_1), \ldots, F_p(x_p), F_{p+1}(y_1), \ldots, F_{p+q}(y_q)\big), \tag{7}$$

where $F_j(\cdot) =: u_j \in [0, 1]$ is a univariate margin and $C(\cdot)$ is the copula, a mapping $[0, 1]^{p+q} \to [0, 1]$. In the Gaussian case, both margins $\Phi(\cdot)$ and copula $C_P(\cdot)$ with density $c_P(\cdot)$ are Gaussian. Since the Gaussian copula only depends on correlation matrix $P$, we explicitly denote the subscript $P$. This is a crucial observation regarding $I(X) = -H(c_{P_x})$, which means the GIB problem can be expressed solely using copula densities. Consequently, as the margins $\Phi(\cdot)$ are not involved anymore, they can be replaced by arbitrary distributions $F(x_i)$ without changing the optimal solution. Hereby, we arrive at the family of so-called Meta-Gaussian distributions, which all have the Gaussian copula in common, but allow arbitrary margins. Analogous to GIB, the optimal projection $A$ is found by the eigenvectors and eigenvalues of $B = I_p - P_{xy}P_y^{-1}P_{yx}P_x^{-1}$.
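For purely continuous data without ties, one common way to obtain a Gaussian-copula correlation matrix is the normal-scores estimator sketched below. The text does not name the estimator it has in mind, so this choice, together with the function names `normal_scores` and `copula_correlation`, is an assumption made for illustration. The sketch shows the key property used above: the estimate depends only on ranks and is therefore unchanged by strictly monotone transformations of the margins.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(column):
    """Map a 1-D array without ties to Gaussian scores via its empirical margin."""
    n = len(column)
    ranks = np.argsort(np.argsort(column)) + 1     # ranks 1..n (no ties assumed)
    u = ranks / (n + 1.0)                          # pseudo-observations in (0, 1)
    inv = NormalDist().inv_cdf
    return np.array([inv(x) for x in u])

def copula_correlation(Z):
    """Correlation matrix of the column-wise normal scores of an n x d array."""
    scores = np.column_stack([normal_scores(Z[:, j]) for j in range(Z.shape[1])])
    return np.corrcoef(scores, rowvar=False)

# Toy example (hypothetical data): two correlated columns, then the same data
# after strictly increasing transformations of each margin.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
Z[:, 1] += Z[:, 0]
Z_warped = np.column_stack([np.exp(Z[:, 0]), Z[:, 1] ** 3])
P_plain = copula_correlation(Z)
P_warped = copula_correlation(Z_warped)
```

Both calls return the same matrix because the ranks are identical, which is exactly why the margins can be left unspecified in the Meta-Gaussian setting.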

The goal now is to estimate the joint correlation matrix $P$ of $X$ and $Y$ from their $n$ instances stored in data matrices $Z_X$ and $Z_Y$. Typically, though, copula models are studied strictly in the context of continuous margins for which there exist robust estimators. Regarding pathway analysis, this approach will fail for discrete or categorical variables; hence, these data types (and mixed data in particular) must be treated differently.

3 Extensions for Pathway Analysis

3.1 Mixed Data and Missing Values

Empirically estimating the margins of gene expression values is a convenient way to derive $P$; however, such a scheme is only applicable for continuous variables, since it does not consider ties (i.e. non-unique values). We follow the approach of Hoff [9], which avoids specifying the margins altogether. Here, it is assumed that each discrete variable $Y_j$ is a function of a hidden Gaussian variable $\tilde{Y}_j$. In contrast to a continuous variable, the mapping $\tilde{Y}_j \mapsto Y_j$ is surjective and thus leads to ties among the realizations of $Y_j$. We can use the fact that regardless of the specifics of this mapping (i.e. the margin), $(Y_j, \tilde{Y}_j)$ are always linked via the ordering of their instances. In other words, $y_{j1} > y_{j2}$ implies $\tilde{y}_{j1} > \tilde{y}_{j2}$. Hence, the model is

$$(\tilde{X}_1, \ldots, \tilde{X}_p, \tilde{Y}_1, \ldots, \tilde{Y}_q) \sim \mathcal{N}(0_{p+q}, P) \tag{8}$$

with

$$X_m = F_m^{-1}\big(\Phi(\tilde{X}_m)\big) \quad\text{and}\quad Y_j = F_j^{-1}\big(\Phi(\tilde{Y}_j)\big). \tag{9}$$

Note that due to surjectivity, the converse $\tilde{Y}_j = \Phi^{-1}(F_j(Y_j))$ is not unique and therefore violates the requirements of a copula.
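The ordering argument can be illustrated with a tiny simulation. The thresholds in `cuts` and the variable names are hypothetical; the sketch merely assumes that a discrete variable arises by cutting a hidden Gaussian value into ordered levels, which is one instance of a many-to-one monotone mapping.

```python
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(size=10)               # hidden Gaussian values
cuts = np.array([-0.5, 0.5])               # hypothetical level boundaries
observed = np.searchsorted(cuts, latent)   # ordered levels 0, 1, 2 (with ties)

# Whenever one observed level is strictly larger than another, the same holds
# for the hidden values; tied observations leave the hidden order undetermined.
order_preserved = all(
    latent[i] > latent[k]
    for i in range(len(latent))
    for k in range(len(latent))
    if observed[i] > observed[k]
)
```

The observed levels therefore only constrain each hidden value to an interval, which is what the sampler in the next part of this section exploits.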

In order to make inference about $P$, Hoff [9] proposed a Gibbs sampling scheme, which can be used to obtain an estimate by averaging. More specifically, data matrix $\tilde{Z}$ of latent variables $(\tilde{X}, \tilde{Y})$ follows truncated Gaussians (due to the constraint of ordering) combined with a conjugate Wishart prior for $P$. For missing values $Z_{ij}$, the truncation is replaced by a standard Gaussian without bounds, thereby imputing the latent values. Algorithm 1 shows a computationally more efficient variant of the original sampler that avoids matrix inversion in the innermost loop. Here, subscript $-j$ is a shorthand for set $\{1, \ldots, (p+q)\} \setminus j$.

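Algorithm 1. Sampling correlation matrix $P$ and data matrix $\tilde{Z}$

Input: $n \times (p+q)$ data matrix $Z = [Z_X, Z_Y]$
Set $B_0 \leftarrow I_{p+q}$, $B \leftarrow I_{p+q}$, $\nu \leftarrow (p+q) + 1$
Initialize $\tilde{Z} = \big\{ \tilde{Z}_{\bullet j} \leftarrow \Phi^{-1}\big( \tfrac{1}{\#\mathrm{levels}+1}\, \mathrm{ranks}(Z_{\bullet j}) \big) \;\big|\; j \in \{1, \ldots, (p+q)\} \big\}$
for $k = 1 \ldots N_{\mathrm{samples}}$ do
    for variable $j = 1 \ldots (p+q)$ do
        Set $\sigma^2 \leftarrow 1/B_{jj}$
        for level $r$ in $Z_{\bullet j}$ do
            Find lower bound: $a \leftarrow \max\{ \tilde{Z}_{\bullet j} \mid Z_{\bullet j} < r \}$
            Find upper bound: $b \leftarrow \min\{ \tilde{Z}_{\bullet j} \mid Z_{\bullet j} > r \}$
            for every $i \in \{1, \ldots, n\}$ where $Z_{ij} = r$ do
                Set $\mu_{ij} \leftarrow -\sigma^2\, \tilde{Z}_{i,-j} B_{-j,j}$
                Sample from a truncated Gaussian: $\tilde{Z}_{ij} \sim \mathcal{TN}(\mu_{ij}, \sigma^2, a, b)$
            end for
        end for
    end for
    Sample $B \sim \mathcal{W}\big(\nu + n, [B_0 + \tilde{Z}^{t}\tilde{Z}]^{-1}\big)$
    Compute $P = \big\{ P_{ij} \leftarrow (B^{-1})_{ij} / \sqrt{(B^{-1})_{ii}(B^{-1})_{jj}} \;\big|\; (i, j) \in \{1, \ldots, (p+q)\} \big\}$
    Compute $\tilde{Z}_{\mathrm{norm}} = \big\{ (\tilde{Z}_{\mathrm{norm}})_{\bullet j} \leftarrow \tilde{Z}_{\bullet j} / \sqrt{(B^{-1})_{jj}} \;\big|\; j \in \{1, \ldots, (p+q)\} \big\}$
end for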
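Algorithm 1 can be sketched in a few dozen lines. The following is an illustrative reading of the pseudocode, not the authors' code: the names `sample_truncnorm` and `gibbs_copula` are hypothetical, missing values are encoded as NaN and drawn from the untruncated conditional Gaussian, the truncated Gaussian is sampled by CDF inversion, the Wishart draw uses sums of Gaussian outer products (valid for integer degrees of freedom), and all sweeps are averaged without burn-in.

```python
import numpy as np
from statistics import NormalDist

_N = NormalDist()

def sample_truncnorm(rng, mu, sd, a, b):
    """Draw from N(mu, sd^2) restricted to [a, b] by inverting the CDF."""
    lo, hi = _N.cdf((a - mu) / sd), _N.cdf((b - mu) / sd)
    u = min(max(rng.uniform(lo, hi), 1e-12), 1.0 - 1e-12)
    return float(np.clip(mu + sd * _N.inv_cdf(u), a, b))

def gibbs_copula(Z, n_samples=200, seed=0):
    """Return averaged (P, Z_norm) for an n x d data matrix Z (NaN = missing)."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    nu, B0, B = d + 1, np.eye(d), np.eye(d)
    # Initialization from scaled level ranks; missing entries start at 0.
    Zt = np.zeros((n, d))
    for j in range(d):
        obs = ~np.isnan(Z[:, j])
        levels = np.unique(Z[obs, j])
        ranks = np.searchsorted(levels, Z[obs, j]) + 1
        Zt[obs, j] = [_N.inv_cdf(r / (len(levels) + 1)) for r in ranks]
    P_sum, Zn_sum = np.zeros((d, d)), np.zeros((n, d))
    for _ in range(n_samples):
        for j in range(d):
            var = 1.0 / B[j, j]
            sd = np.sqrt(var)
            others = np.arange(d) != j
            mu = -(Zt[:, others] @ B[others, j]) * var   # conditional means
            col = Z[:, j]
            for r in np.unique(col[~np.isnan(col)]):
                below, above = col < r, col > r
                a = Zt[below, j].max() if below.any() else -np.inf
                b = Zt[above, j].min() if above.any() else np.inf
                for i in np.flatnonzero(col == r):
                    Zt[i, j] = sample_truncnorm(rng, mu[i], sd, a, b)
            for i in np.flatnonzero(np.isnan(col)):
                Zt[i, j] = rng.normal(mu[i], sd)         # missing: no bounds
        # B ~ Wishart(nu + n, S) via nu + n Gaussian draws with covariance S.
        S = np.linalg.inv(B0 + Zt.T @ Zt)
        S = (S + S.T) / 2.0
        G = rng.multivariate_normal(np.zeros(d), S, size=nu + n)
        B = G.T @ G
        C = np.linalg.inv(B)
        s = np.sqrt(np.diag(C))
        P_sum += C / np.outer(s, s)                      # correlation matrix P
        Zn_sum += Zt / s                                 # normalized latent data
    return P_sum / n_samples, Zn_sum / n_samples
```

A call such as `P, Z_norm = gibbs_copula(Z, n_samples=500)` would then return the averaged correlation matrix and latent data; in practice one would likely discard an initial burn-in phase and monitor convergence, which this sketch omits.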

With regard to the MGIB compression, we are interested in data matrix $\tilde{Z}$ of latent variables, which is only a by-product in [9]. This is due to the problem that categorical values (e.g. high, medium, low) cannot be used in conjunction with projection $A$. Therefore, we estimate the hidden variables $\tilde{Z}$ by averaging over samples, analogous to the scheme for $P$. As the individual $\tilde{Z}$ samples are not bound by scale, however, we use an additional normalizing step ($\tilde{Z}_{\mathrm{norm}}$) as shown in the algorithm.

Although the sampling scheme was initially motivated by discrete or categorical margins, it also handles continuous variables, and, hence, the setup is suitable for mixed data in general.


3.2 Relation to PCA

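Interestingly, (M)GIB is related to principal component analysis (PCA), which is a projection that preserves variance in the data. If data matrix $Z_X$ has zero-mean columns, its PC transformation is given by

$$Z_{\mathrm{PCA}} = Z_X V \quad\text{with}\quad \Sigma_x = V \Lambda V^{t}. \tag{10}$$

Here, $V$ and $\Lambda$ refer to eigenvectors and eigenvalues of $\Sigma_x$. In (M)GIB, we define $Y$ to be a noisy representation of $X$, i.e. $Y = X + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0_p, I_p)$ independent of $X$. This means we have

$$\Sigma_y = \Sigma_x + I_p \tag{11}$$

$$\Sigma_{xy} = \Sigma_{yx} = \Sigma_x \tag{12}$$

leading to

$$B = I_p - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{yx}\Sigma_x^{-1} \tag{13}$$

$$= I_p - \Sigma_x(\Sigma_x + I_p)^{-1} \tag{14}$$

$$= I_p - V \Lambda V^{t}(V \Lambda V^{t} + I_p)^{-1} \tag{15}$$

$$= V\big(I_p - \Lambda[\Lambda + I_p]^{-1}\big)V^{t} \tag{16}$$

$$= V \tilde{\Lambda} V^{t} \tag{17}$$

with $\tilde{\Lambda} \equiv I_p - \Lambda[\Lambda + I_p]^{-1}$. In this form, it is easy to see that $\tilde{\Lambda}$ and $V$ correspond to the eigendecomposition of $B$, therefore the directions coincide with PCA. For a sufficiently large $\beta$, $A$ contains all eigenvectors $v_i$. Note, however, that the compression $T$ will differ from the PC projection in terms of scales $\alpha_i$ (as a function of $\beta$).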
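The chain of equalities in Eqs. (13) to (17) can be checked numerically. The sketch below uses an arbitrary positive-definite matrix as a stand-in for $\Sigma_x$; the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
sigma_x = M @ M.T + np.eye(4)            # arbitrary positive-definite covariance
lam, V = np.linalg.eigh(sigma_x)         # Sigma_x = V diag(lam) V^t

I = np.eye(4)
# Eq. (14): B for Y = X + noise with unit noise covariance.
B = I - sigma_x @ np.linalg.inv(sigma_x + I)
# Eqs. (16)-(17): same eigenvectors V, eigenvalues 1 - lam/(lam + 1) = 1/(lam + 1).
B_from_pca = V @ np.diag(1.0 / (1.0 + lam)) @ V.T
same_matrix = np.allclose(B, B_from_pca)
```

Since the eigenvalues of $B$ equal $1/(1+\lambda_i)$, the smallest eigenvalue of $B$ belongs to the direction of largest variance, so the rows of $A$ are switched on in the same order in which PCA ranks its components.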

3.3 Irrelevance Variables

Aside from the IB concept to compress a random variable with respect to a relevance entity, one might also be interested in the converse, meaning a compression that is explicitly devoid of some information. In a practical setting, this may include sex and/or age, which often have an undesirable effect on the data.

Conventionally, one would perform multiple regression and control for the variables in question, where categorical inputs require some form of dummy coding. However, as we deal with multiple variables of possibly many levels each, this approach quickly becomes complicated and cumbersome.

In contrast, MGIB offers an alternative which naturally arises from the IB framework. We refer to these features as irrelevance variables $\bar{Y} = (\bar{Y}^1, \ldots, \bar{Y}^k)$. In a first step, we compress $X$ with regard to $\bar{Y}$ and hereby obtain an optimal projection $A$. This specifically captures all unwanted information about $X$, thus we compute the nullspace of $A$:

$$Q = \mathrm{Null}(A) \in \mathbb{R}^{p \times p} \;\Leftrightarrow\; A Q^{t} = 0_{p \times p}. \tag{18}$$

As a result, compression $T = QX$ cancels all information about $\bar{Y}$. Notice the difference between excluding a variable from the relevance set $Y$ and explicitly treating it as nuisance $\bar{Y}$.
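One way to realize the nullspace step numerically is via the singular value decomposition. The sketch below assumes that $Q$ is taken as the orthogonal projector onto the nullspace of $A$, which is one possible $p \times p$ choice; the function name `nullspace_projector` and the toy matrix are hypothetical.

```python
import numpy as np

def nullspace_projector(A, tol=1e-10):
    """Orthogonal projector Q onto Null(A), so that A @ Q.T is numerically zero."""
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    N = Vt[rank:]                 # rows form an orthonormal basis of Null(A)
    return N.T @ N                # p x p, symmetric and idempotent

# Toy example (hypothetical numbers): A only reads the first of three genes.
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
Q = nullspace_projector(A)
x = np.array([3.0, 4.0, 5.0])
t = Q @ x                          # the component seen by A is removed
```

For data stored row-wise, the same operation reads `Z_T = Z_X @ Q.T`, which matches the notation used in the experiments.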


4 Experiments

MGIB is a general tool for data fusion and feature extraction with the flexibility to specify which aspect is of interest to the user. Therefore, it can either be used as a preprocessor to reduce dimensionality or as a self-contained module in an existing pipeline.

For the purpose of visualization, we integrate the method into the previously mentioned Pathifier software package, thereby replacing PCA. The conventional workflow is as follows: Pathifier takes all gene expression values associated with a given pathway and projects them onto their first few principal components. This produces a point cloud, where each point represents a subject living in a reduced set of dimensions. The shape of the cloud carries meaningful information and is captured by the so-called principal curve, which is the same as finding the underlying 'skeleton'. When projecting all points onto this path, we can measure the distance between two points (= subjects) along the curve. Pathifier assumes two groups of subjects (healthy and diseased), where the average of healthy subjects serves as a reference point. The distance of a diseased subject to this reference gives the deregulation score, the final output of Pathifier. Finally, the subject with maximum distance defines a score of 1.

We use the publicly available colon cancer dataset of Sheffer et al. [14], which consists of $p = 13\,437$ genes measured for $n = 313$ subjects. The latter can be classified into 53 healthy (H), 184 primary tumor (T), 46 polyp (P) and 30 metastasis (M) patients. A separate clinical data table contains the properties age, sex and TNM tumor staging (T = size $\in \{1, 2, 3, 4\}$, N = spread to regional lymph nodes $\in \{0, 1, 2, 3\}$, M = presence of distant metastasis $\in \{0, 1\}$). Since healthy subjects are not affected by cancer, their disease-related values are zero.

As for pathways, we use KEGG (footnote 2) gene sets obtained from the curated Molecular Signatures Database (MSigDB, footnote 3), version 4.0. In general, these sets contain everything from 5 genes up to 200; thus, for reasons of tractability regarding 313 subjects, we limit the maximum to 60 genes. If a pathway exceeds this number, we cluster the pairwise mutual information using Ward's method and cut the resulting tree on the level of 60 clusters. For each cluster, the gene with highest entropy is selected as a representative.

As we aim for a pathway compression $T$ being most informative about clinical data $Y$, $\beta$ is always set to a small value that attains the maximum number of eigenvectors in $A$. Further increasing $\beta$ would only alter the scaling, which is not of interest to us.

4.1 Comparison to Classic Pathifier using PCA

In order to highlight the flexibility of MGIB, we first attempt to recreate the PC projection. This is achieved when $Y$ is a noisy representation of $X$, i.e. we aim to compress $X$ while maintaining information about itself. The scheme is:

Footnote 2: http://www.genome.jp/kegg/
Footnote 3: http://www.broadinstitute.org/gsea/msigdb/

1. Find all genes corresponding to a pathway, store their expression values in $Z_X$ and a noisy copy in $Z_Y$.

2. Compress $X$ with regard to $Y$ in order to receive $Z_T = Z_X A^{t}$.

3. Run Pathifier on $Z_T$ to obtain the deregulation score for each subject.

We choose a set of 3 pathways that are directly related to (colon) cancer and run Pathifier on both conventional PC projection and MGIB compression. As can be seen in Figure 2, this leads to fairly similar deregulation scores, the differences being mainly due to scaling $\alpha_i$.

[Figure 2: two heat-map panels of deregulation scores (color scale 0 to 1), titled "Deregulation score from PCA" and "... and MGIB, Y = X + noise". Rows: P53_SIGNALING_PATHWAY, MISMATCH_REPAIR, CELL_CYCLE. A CLASS bar marks the subject classes H, P, M, T.]

Fig. 2. Pathifier using PCA (left) and MGIB compression $T$ (right). Each row is a pathway and the columns correspond to subjects (same ordering left and right). Subject classes are healthy (H), metastasis (M), polyp (P) and primary tumor (T).

4.2 Analysis with Relevance and Irrelevance Variables

The following experiment shows how to apply MGIB in a practical context to uncover specific characteristics in the data. In particular, the compression is supposed to highlight aspects related to the TNM tumor stage while discarding age and sex of a subject. The required steps are:

1. Find all genes corresponding to a pathway, store expression values in $Z_X$.
2. Compress $X$ while removing $\bar{Y}$ = (age, sex), calculate $Z_{T_1} = Z_X Q_1^{t}$.
3. Compress $T_1$ with regard to $Y$ = (T, N, M) and receive $Z_{T_2} = Z_{T_1} A_2^{t}$.
4. Run Pathifier on $Z_{T_2}$ to obtain the deregulation score for each subject.

For comparison purposes, we follow this scheme with and without irrelevance variables. The outcome of both experiments can be interpreted best when sorting the subjects according to age, sex and class. Hence, we report the same deregulation scores 3 times, but only show pathways which are known to be related to the ordered variable.

Irrelevance Variable Age. Figure 3 depicts age-related pathways, which are involved in inflammatory response [10], immune system [2, 11] and development of blood vessels (angiogenesis) [1]. The effect ranges from cancelling to amplifying scores of certain age groups.

Irrelevance Variable Sex. Sorting the subjects according to sex results in the score patterns of Figure 4. Pathways known to be affected by sex cover the categories colorectal cancer [12, 16], immune system and cellular processes [7]. As can be seen, subjects are sometimes normalized to a common level (cell cycle) and sometimes put into contrast (chemokine signaling).

[Figure 3: two heat-map panels of deregulation scores (color scale 0 to 1), titled "Deregulation score from MGIB, rel = (T, N, M)" and "... and MGIB, irrel = (age, sex), rel = (T, N, M)". Rows: VEGF_SIGNALING_PATHWAY, FC_GAMMA_R_MEDIATED_PHAGOCYTOSIS, ERBB_SIGNALING_PATHWAY, CHEMOKINE_SIGNALING_PATHWAY. Annotation bars: CLASS (H, P, M, T) and AGE (20 to 90).]

Fig. 3. Impact of age on the deregulation scores: with irrelevance variables (right) and without (left). Subjects (= columns) are sorted by age.

[Figure 4: two heat-map panels of deregulation scores (color scale 0 to 1), titled "Deregulation score from MGIB, rel = (T, N, M)" and "... and MGIB, irrel = (age, sex), rel = (T, N, M)". Rows: CHEMOKINE_SIGNALING_PATHWAY, ANTIGEN_PROCESSING_AND_PRESENTATION, CELL_CYCLE, COLORECTAL_CANCER. Annotation bars: CLASS (H, P, M, T) and SEX (M, F).]

Fig. 4. Impact of sex on the deregulation scores: with irrelevance variables (right) and without (left). Columns are ordered by sex.

Relevance Variables. In a practical analysis, one would evaluate the scores as in Figure 5. Here, we clearly see two distinct subgroups of primary tumor patients (T), possibly related to stage M = 0. Also, patients of class metastasis (M) appear to be similar to those of primary tumor (T) with stage M = 1.

[Figure 5: heat map of deregulation scores (color scale 0 to 1), titled "Deregulation score from MGIB: irrel = (age, sex), rel = (T, N, M)". Rows: ERBB_SIGNALING_PATHWAY, PATHWAYS_IN_CANCER, MTOR_SIGNALING_PATHWAY, VEGF_SIGNALING_PATHWAY, CHEMOKINE_SIGNALING_PATHWAY, TOLL_LIKE_RECEPTOR_SIGNALING_PATHWAY, NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY, CYTOSOLIC_DNA_SENSING_PATHWAY, GAP_JUNCTION, COLORECTAL_CANCER, PROSTATE_CANCER. Annotation bars: CLASS (H, P, M, T), T (0 to 4), N (0 to 3) and M (0, 1).]

Fig. 5. Deregulation scores for cancer-related pathways, sorted by patient class. The MGIB compression used both relevance and irrelevance variables.

Pathway Map. The MGIB compression can also be analyzed in terms of the projection weights to identify the importance of single genes. Since we used two compression steps (irrelevance and relevance variables), the combined projection of $T_1 = Q_1 X$ and $T_2 = A_2 T_1$ is $T_2 = A_2 Q_1 X$. Therefore, the contribution of gene $j$ is found by $\sum_{i=1}^{p} |(A_2 Q_1)_{ij}|$, where the absolute value prevents weights from cancelling each other. Figure 6 illustrates the map of the mTOR pathway, where the impact of genes is overlaid by color. This pathway is centered on mTOR, the target of rapamycin [18], a key protein which triggers a cascade of processes related to (uncontrolled) cell growth.
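The per-gene contribution is a column-wise sum of absolute weights of the combined projection. A minimal sketch, with the hypothetical function name `gene_contributions` and made-up toy matrices:

```python
import numpy as np

def gene_contributions(A2, Q1):
    """Sum over rows i of |(A2 Q1)_ij|, giving one non-negative weight per gene j."""
    return np.abs(A2 @ Q1).sum(axis=0)

# Toy example (hypothetical numbers) with two genes.
A2 = np.array([[1.0, -2.0],
               [0.0,  0.0]])
weights_full = gene_contributions(A2, np.eye(2))            # nothing removed
weights_reduced = gene_contributions(A2, np.diag([1.0, 0.0]))  # gene 2 projected out
```

Without the absolute value, weights of opposite sign in different rows could cancel and hide a gene that is in fact used by the compression.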

Fig. 6. Map of the mTOR pathway. Each squared box represents a gene and is colored according to the sum of absolute compression weights. Genes with white color were not measured by the microarray.

5 Conclusion

We presented an extension to the Meta-Gaussian Information Bottleneck as a general way to compress a random variable in the presence of a second relevance variable. This allows us to guide the compression and formalize which aspect is of actual interest to the user. Applied to the pathway-centric analysis of gene expression data, we can finally make practical use of the clinical side information and treat all features in a unified way regardless of their data type. Also, the exclusive dependence on the Gaussian copula does not impose any restrictions on the marginal distributions, which makes the setup widely applicable, even where data are missing.

Using software package Pathifier, the experiments demonstrate that the approach can easily be incorporated into a given workflow. Not only is it possible to mimic principal component analysis as a special case, but the method can also conveniently remove nuisance variables in an information-theoretic fashion. In summary, we believe this tool greatly enhances gene expression analysis, while at the same time being suitable for a wide class of biological applications.


References

1. Baffert, F., Thurston, G., Rochon-Duck, M., Le, T., Brekken, R., McDonald, D.M.: Age-Related Changes in Vascular Endothelial Growth Factor Dependency and Angiopoietin-1-Induced Plasticity of Adult Blood Vessels. Circulation Research 94(984), 984–992 (2004)

2. Castle, S.C.: Clinical Relevance of Age-Related Immune Dysfunction. Clinical Infectious Diseases 31(2), 578–585 (2000)

3. Chechik, G., Globerson, A., Tishby, N., Weiss, Y.: Information Bottleneck for Gaussian Variables. Journal of Machine Learning Research 6, 165–188 (2005)

4. Curtis, R.K., Orešič, M., Vidal-Puig, A.: Pathways to the Analysis of Microarray Data. Trends in Biotechnology 23(8), 429–435 (2005)

5. Drier, Y., Sheffer, M., Domany, E.: Pathway-Based Personalized Analysis of Cancer. In: Proceedings of the National Academy of Sciences. pp. 6388–6393 (2013)

6. Ein-Dor, L., Zuk, O., Domany, E.: Thousands of Samples are Needed to Generate a Robust Gene List for Predicting Outcome in Cancer. In: Proceedings of the National Academy of Sciences. pp. 5923–5928 (2006)

7. Elsaleh, H., Joseph, D., Grieu, F., Zeps, N., Spry, N., Iacopetta, B.: Association of Tumour Site and Sex with Survival Benefit from Adjuvant Chemotherapy in Colorectal Cancer. The Lancet 355(9217), 1745–1750 (2000)

8. Globerson, A., Tishby, N.: On the Optimality of the Gaussian Information Bottleneck Curve. Tech. rep., The Hebrew University of Jerusalem (2004)

9. Hoff, P.D.: Extending the Rank Likelihood for Semiparametric Copula Estimation. The Annals of Applied Statistics pp. 265–283 (2007)

10. Licastro, F., Candore, G., Lio, D., Porcellini, E., Colonna-Romano, G., Franceschi, C., Caruso, C.: Innate Immunity and Inflammation in Ageing: A Key for Understanding Age-Related Diseases. Immunity & Ageing 2(1), 8 (2005)

11. Migliore, L., Coppedè, F.: Genetic and Environmental Factors in Cancer and Neurodegenerative Diseases. Mutation Research 512(2–3), 135–153 (2002)

12. Pal, S.K., Hurria, A.: Impact of Age, Sex, and Comorbidity on Cancer Therapy and Disease Progression. Journal of Clinical Oncology 28(26), 4086–4093 (2010)

13. Rey, M., Roth, V.: Meta-Gaussian Information Bottleneck. In: Advances in Neural Information Processing Systems 25. pp. 1925–1933 (2012)

14. Sheffer, M., Bacolod, M.D., Zuk, O., Giardina, S.F., Pincas, H., Barany, F., Paty, P.B., Gerald, W.L., Notterman, D.A., Domany, E.: Association of Survival and Disease Progression with Chromosomal Instability: A Genomic Exploration of Colorectal Cancer. In: Proceedings of the National Academy of Sciences. pp. 7131–7136 (2009)

15. Sklar, A.: Fonctions de répartition à n dimensions et leurs marges. Université Paris (1959)

16. Söderlund, S., Granath, F., Broström, O., Karlén, P., Löfberg, R., Ekbom, A., Askling, J.: Inflammatory Bowel Disease Confers a Lower Risk of Colorectal Cancer to Females Than to Males. Gastroenterology 138(5), 1697–1703 (2010)

17. Tishby, N., Pereira, F.C., Bialek, W.: The Information Bottleneck Method. In: Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing. pp. 368–377 (1999)

18. Wullschleger, S., Loewith, R., Hall, M.N.: TOR Signaling in Growth and Metabolism. Cell 124(3), 471–484 (2006)

