SPECIAL ISSUE OF PROCEEDINGS OF IEEE - FOUNDATIONS & APPLICATIONS OF SCIENCE OF INFORMATION, VOL. X, NO. X, X 2015 1

A Critical Survey of Deconvolution Methods for Separating Cell-Types in Complex Tissues

Shahin Mohammadi∗,1, Neta Zuckerman†,2, Andrea Goldsmith2, and Ananth Grama∗,1

1 Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA
2 Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA


Abstract—Identifying properties and concentrations of components from an observed mixture, known as deconvolution, is a fundamental problem in signal processing. It has diverse applications in fields ranging from hyperspectral imaging to denoising readings from biomedical sensors. This paper focuses on in-silico deconvolution of signals associated with complex tissues into their constitutive cell-type specific components, along with a quantitative characterization of the cell-types. Deconvolving mixed tissues/cell-types is useful in the removal of contaminants (e.g., surrounding cells) from tumor biopsies, as well as in monitoring changes in the cell population in response to treatment or infection. In these contexts, the observed signal from the mixture of cell-types is assumed to be a convolution, using a linear instantaneous (LI) mixing process, of the expression levels of genes in constitutive cell-types. The goal is to use known signals corresponding to individual cell-types, along with a model of the mixing process, to cast the deconvolution problem as a suitable optimization problem.

In this paper, we present a survey and in-depth analysis of models, methods, and assumptions underlying deconvolution techniques. We investigate the choice of different loss functions for evaluating estimation error, constraints on solutions, preprocessing and data filtering, feature selection, and regularization to enhance the quality of solutions, along with the impact of these choices on the performance of commonly used regression-based methods for deconvolution. We assess different combinations of these factors and use detailed statistical measures to evaluate their effectiveness. Some of these combinations have been proposed in the literature, whereas others represent novel algorithmic choices for deconvolution. We identify shortcomings of current methods and avenues for further investigation. For many of the identified shortcomings, such as normalization issues and data filtering, we provide new solutions. We summarize our findings in a prescriptive step-by-step process, which can be applied to a wide range of deconvolution problems.

Index Terms—Deconvolution, Gene expression, Linear regression, Loss function, Range filtering, Feature selection, Regularization

1 INTRODUCTION

SOURCE separation, or deconvolution, is the problem of estimating individual signal components from their mixtures. This problem arises when source signals are transmitted through a mixing channel and the mixed sensor readings are observed. Source separation has applications in a variety of fields. One of its early applications was in processing audio signals [10–13]. Here, mixtures of different sound sources, such as speech or music, are recorded simultaneously using several microphones. Various frequencies are convolved by the impulse response of the room, and the goal is to separate one or several sources from this mixture. This has direct applications in speech enhancement, voice removal, and noise cancellation in recordings from populated areas. In hyperspectral imaging, the spectral signature of each pixel is observed. This signal is the combination of pure spectral signatures of constitutive elements mixed according to their relative abundance. In satellite imaging, each pixel represents sensor readings for different patches of land at multiple wavelengths. Individual sources correspond to reflectances of materials at different wavelengths that are mixed according to the material composition of each pixel [1–5].

∗ Corresponding authors: [email protected], [email protected]
† Currently at: Genentech Inc., South San Francisco, CA 94080, USA

Beyond these domains, deconvolution has applications in denoising biomedical sensors. Tracing electrical currents in the brain is widely used as a proxy for spatiotemporal patterns of brain activity. These patterns have significant clinical applications in the diagnosis and prediction of epileptic seizures, as well as in characterizing different stages of sleep in patients with sleep disorders. Electroencephalography (EEG) and magnetoencephalography (MEG) are two of the most commonly used techniques for cerebral imaging. These techniques measure voltage fluctuations and changes in the electromagnetic field, respectively. Superconducting QUantum Interference Device (SQUID) sensors used in the latter technology are susceptible to magnetic coupling due to their geometry and must be shielded carefully against magnetic noise. Deconvolution techniques are used to separate different noise sources and ameliorate the effect of electrical and magnetic coupling in these devices [6–9].

At a high level, mixing channels can be classified as follows: (i) linear or nonlinear, (ii) instantaneous, delayed, or convolutive, and (iii) over- or under-determined. When neither the sources nor the mixing process is available, the problem is known as blind source separation (BSS). This problem is highly under-determined in general, and additional constraints, such as independence among sources, sparsity, or non-negativity, are typically enforced on the sources in practical applications. On the other hand, a new class of methods has been developed recently, known as semi- or guided BSS [7, 10, 12, 13]. In these methods, additional information on the approximate behavior of either the sources or the mixing process is available a priori. In this paper, we focus on the class of over-determined, linear instantaneous (LI) mixing processes, for which a deformed prior on the sources is available. In this case, the parameters of the linear mixer, as well as the true identities of the original sources, are to be determined.

arXiv:1510.04583v1 [cs.CE] 15 Oct 2015

In the context of molecular biology, deconvolution methods have been used to identify constituent cell-types in a tissue, along with their relative proportions. The inherent heterogeneity of tissue samples makes it difficult to identify separated, cell-type specific signatures, i.e., the precise gene expression levels for each cell-type. Relative changes in cell proportions, combined with variations attributed to changes in biological conditions, such as disease state, complicate the identification of true biological signals among mere technical variations. Changes in tissue composition are often indicative of disease progression or drug response. For example, the coupled depletion of specific neuronal cells with a gradual increase in the glial cell population is indicative of neurodegenerative disorders. An increasing proportion of malignant cells, as well as a growing fraction of tumor-infiltrating lymphocytes (TIL) compared to surrounding cells, directly influences tumor growth, metastasis, and clinical outcomes for patients [14, 15]. Deconvolving tissue biopsies allows further investigation of the interaction between tumor and micro-environmental cells, along with its role in the progression of cancer.

The expression level of genes, which is a proxy for the number of copies of each gene product present, is one of the most common source factors used for separating cell-types and tissues. In the linear mixing model, the expression of each gene in a complex mixture is estimated as a linear combination of the expression of the same gene in the constitutive cell-types. In silico deconvolution methods for separating complex tissues can be coarsely classified as either full deconvolution methods, in which both cell-type specific expressions and the percentages of each cell-type are estimated, or partial deconvolution methods, in which one of these data sources is used to compute the other [16]. These two classes loosely relate to the BSS and guided-BSS problems, respectively. Note that in cases where relative abundances are used to estimate cell-type-specific expressions, the problem is highly under-determined, whereas in the complementary case of computing percentages from noisy expressions of purified cells, it is highly over-determined. The challenge is to select the most reliable features that satisfy the linearity assumption. We provide an in-depth review of recent deconvolution methods in Section 2.5.

In contrast to computational methods, a variety of experimental cell separation techniques have been proposed to enrich samples for cell-types of interest. However, these experimental methods not only involve significant time, effort, and expense, but may also result in insufficient RNA abundance for further quantification of gene expression. In this case, amplification steps may introduce technical artifacts into the gene expression data. Furthermore, sorting of cell-types must be embedded in the experiment design for the desired subset of cells, and any subsequent separation is infeasible. Computational methods, on the other hand, are capable of sorting mixtures at different levels of resolution and for arbitrary cell-type subsets of interest.

The organization of the remainder of the paper is as follows: in Section 2.1, we introduce the basic terminology from biology needed to formally define the deconvolution problem in the context of quantifying cell-type fractions in complex tissues. The formal definition of the deconvolution problem and its relationship to linear regression is given in Section 2.2. Sections 2.3 and 2.4 review different choices and examples of the objective function used in regression. An overview of computational methods for biological deconvolution is provided in Section 2.5. Datasets and evaluation measures used in this study are described in Sections 3.1 and 3.2, respectively. The effect of the loss function, constraint enforcement, range filtering, and feature selection choices on the performance of deconvolution methods is evaluated systematically in Sections 3.4–3.8. Finally, a summary of our findings is provided in Section 4.

2 BACKGROUND AND NOTATION

2.1 Biological Underpinnings

The central dogma of biology describes the flow of genetic information within cells: the genetic code, represented in DNA molecules, is first transcribed to an intermediate construct, called messenger RNA (mRNA), which in turn is translated into proteins. These proteins are the functional workhorses of the cell. Genes, defined as the minimal coding sections of the DNA, contain the recipe for making proteins. These instructions are utilized dynamically by the cell to adapt to different conditions. The amounts of various proteins in a cell can be measured at a time point; this corresponds to the level of protein expression. Such measurement is limited by the availability of high-quality antibodies that can specifically target each protein. The amount of active mRNA in a cell, however, can be measured at the genome scale using high-throughput technologies such as microarrays and RNASeq. The former is an older technology that relies on the binding affinity of complementary base pairs (the alphabet used in DNA/RNA molecules), while the latter is a newer technique based on next-generation sequencing (NGS) that estimates gene expression from the overlap of mRNA fragments with known genomic features. Since microarrays have been used for years, extensive databases from different studies are publicly available.


RNASeq datasets, in comparison, are relatively smaller but growing rapidly in scale and coverage. Both of these technologies provide reliable proxies for the amount of proteins in cells, with RNASeq being more sensitive, especially for lowly expressed genes. A drawback of these methods, however, is that true protein expression is also regulated by additional mechanisms, such as post-transcriptional modifications, which cannot be assayed at the mRNA level.

The expression level of genes is tightly regulated in different stages of cellular development and in response to environmental changes. In addition to these biological variations due to cellular state, intermediate steps in each technology introduce technical variations in repeated measurements of gene expression in the same cell-type. To enhance the reproducibility of measurements, one normally includes multiple instances of the same cell-type in each experiment, known as technical replicates. The expression profiles from these experiments provide a snapshot of the cell under different conditions. In addition to the biological variation of genes within the same cell-type, there is an additional level of variation when we look across different cell-types. Some genes are ubiquitously expressed in all cell-types to perform housekeeping functions, whereas other genes exhibit specificity or selectivity for one cell-type or a group of cell-types, respectively. A compendium of expression profiles of different cells at different developmental stages is the data substrate for in silico deconvolution of complex tissues.

2.2 Deconvolution: Formal Definition

We introduce formalisms and notation used in discussing different aspects of in silico deconvolution of biological signals. We focus on models that assume linearity; that is, the expression signature of the mixture is a weighted sum of the expression profiles of its constitutive cell-types. In this case, sources are cell-type specific references and the mixing process is determined by the relative fraction of cell-types in the mixture.

We first introduce the mathematical constructs used:
• M ∈ R^{n×p}: Mixture matrix, where each entry M(i, j) represents the raw expression of gene i, 1 ≤ i ≤ n, in heterogeneous sample j, 1 ≤ j ≤ p. Each sample, represented by m, is a column of the matrix M, and is a combination of gene expression profiles from the constituting cell-types in the mixture.
• H ∈ R^{n×r}: Reference signature matrix for the expression of primary cell-types, with multiple biological/technical replicates for each cell-type. In this matrix, rows correspond to the same set of genes as in M, columns represent replicates, and there is an underlying grouping among columns that collects profiles corresponding to the same cell-type.
• G ∈ R^{n×q}: Reference expression profile, in which the expression of similar cell-types in matrix H is represented by their average value.
• C ∈ R^{q×p}: Relative proportions of each cell-type in the mixture samples. Here, rows correspond to cell-types and columns represent samples in the mixture matrix M.

Using this notation, we can formally define deconvolution as an optimization problem that seeks to identify "optimal" estimates for matrices G and C, denoted by Ĝ and Ĉ, respectively. Since G and/or C are not known a priori, we use an approximation based on the linearity assumption. In this case, we aim to find Ĝ and Ĉ such that their product is close to the mixture matrix M. Specifically, given a function δ that measures the distance between the true and approximated solutions, also referred to as the loss function, we aim to solve:

    min_{0≤G,C} δ(GC, M)    (1)

In partial deconvolution, either C or G, or a noisy representation thereof, is known a priori and the goal is to find the other, unknown matrix. When the matrix G, referred to as the reference profile, is known, the problem is over-determined and we seek to distinguish features (genes) that closely conform to the linearity assumption from the rest of the (variable) genes. In this case, we can solve the problem individually for each mixture sample. Let us denote by m and c the expression profile and estimated cell-type proportions of a mixture sample, respectively. Then, we can rewrite Equation 1 as:

    min_{0≤c} δ(Gc, m)    (2)
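With the squared loss and the non-negativity constraint, Equation 2 is exactly a non-negative least squares (NNLS) problem. A minimal sketch on synthetic data follows; the reference matrix, noise model, and dimensions are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Hypothetical reference profile G: n = 200 genes, q = 3 cell-types.
G = rng.gamma(shape=2.0, scale=50.0, size=(200, 3))

# Ground-truth proportions c for one synthetic mixture sample.
c_true = np.array([0.6, 0.3, 0.1])

# Mixture m = Gc plus mild Gaussian noise.
m = G @ c_true + rng.normal(scale=1.0, size=200)

# Solve min_{0 <= c} ||Gc - m||_2, i.e., Equation 2 with the squared loss.
c_hat, residual = nnls(G, m)
print(np.round(c_hat, 2))  # close to [0.6, 0.3, 0.1]
```

With a well-conditioned reference and mild noise, the recovered proportions closely match the ground truth; real reference profiles are far noisier, which is what the later discussion of filtering and feature selection addresses.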

This formulation is essentially a linear regression problem, with an arbitrary loss function. On the other hand, in the case of full deconvolution, we can still estimate C in a column-by-column fashion. However, estimating G is highly under-determined and we must use additional sources of information to restrict the search space. One such source of information is the variation across samples in M, depending on the cell-type concentrations in the latest estimated value of Ĉ. In general, most regression-based methods for full deconvolution use an iterative scheme that starts from either noisy estimates of G and C, or a random sample that satisfies given constraints on these matrices, and successively improves over this initial approximation. This iterative process can be formalized as follows:

    Ĉ ← argmin_{0≤C} δ(ĜC, M)    (3)
    Ĝ ← (argmin_{0≤G} δ(ĈᵀGᵀ, Mᵀ))ᵀ

Please note that the update of G is typically row-wise (for each gene), whereas the update of C is column-wise (for each sample). Non-negative matrix factorization (NMF) is a dimension reduction technique that aims to factor each column of a given input matrix as a non-negative weighted sum of non-negative basis vectors, with the number of basis vectors being equal to or less than the number of columns in the original matrix. The alternating non-negative least squares (ANLS) formulation for solving NMF can be expressed in the framework introduced in Equation 3. There are additional techniques for solving NMF, including the multiplicative updating rule and the hierarchical alternating least squares (HALS) method, all of which are special cases of block-coordinate descent [25]. Two of the most common loss functions used in NMF are the Frobenius norm and the Kullback-Leibler (KL) divergence [25].
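The iterative scheme in Equation 3, instantiated with the Frobenius loss, is exactly ANLS. A minimal sketch on noiseless synthetic data; the dimensions and the column-wise/row-wise NNLS solver are illustrative choices, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Synthetic ground truth: n = 100 genes, q = 3 cell-types, p = 20 samples.
G_true = rng.gamma(2.0, 50.0, size=(100, 3))
C_true = rng.dirichlet(np.ones(3), size=20).T  # columns sum to one
M = G_true @ C_true                            # noiseless mixtures

# Alternating non-negative least squares, as in Equation 3:
# fix G and solve each column of C, then fix C and solve each row of G.
G = rng.uniform(0.0, 100.0, size=(100, 3))     # random non-negative start
C = np.zeros((3, 20))
for _ in range(50):
    for j in range(M.shape[1]):                # update C column-wise
        C[:, j], _ = nnls(G, M[:, j])
    for i in range(M.shape[0]):                # update G row-wise
        G[i, :], _ = nnls(C.T, M[i, :])

rel_err = np.linalg.norm(G @ C - M) / np.linalg.norm(M)
print(rel_err)  # small: the product GC reproduces M closely
```

Note that ANLS only guarantees a monotone decrease of the residual, not recovery of the true factors; G and C are identifiable at best up to permutation and scaling.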

In addition to non-negativity (NN), an additional sum-to-one (STO) constraint is typically applied over the columns of the matrix C, or the sample-specific vector c. This constraint restricts the search space, which can potentially enhance the accuracy of the results, and simplifies the interpretation of the values in c as relative percentages. Finally, another fundamental assumption that is mostly neglected in prior work is the similar cell quantity (SCQ) constraint. The similar cell quantity assumption states that all reference profiles and corresponding mixtures must be normalized to ensure that they represent the expression level of the "same number of cells." If this constraint is not satisfied, differences in cell-type counts directly affect the estimated concentrations by rescaling the coefficients to adjust for the difference.
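A simple, common way to impose STO on a non-negative estimate is to renormalize it after the fact (solving the fully constrained problem over the simplex, e.g. by quadratic programming, is the stricter alternative). A tiny sketch with made-up numbers:

```python
import numpy as np

# A hypothetical non-negative, unconstrained estimate of cell-type fractions.
c_hat = np.array([1.2, 0.6, 0.2])

# Post-hoc sum-to-one (STO): rescale so the entries read as relative fractions.
c_sto = c_hat / c_hat.sum()
print(c_sto)        # [0.6 0.3 0.1]
print(c_sto.sum())  # 1.0
```

The pre-normalization total (here 2.0) is itself informative: a total far from 1 suggests a violation of the SCQ assumption between the reference and the mixture.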

In this paper, we focus on different loss functions (δ functions), as well as the role of constraint enforcement strategies, in estimating c. These constitute the key building blocks of both partial and full deconvolution methods.

2.3 Choice of Objective Function

In linear regression, a slightly different notation is often used, which we describe here and subsequently relate to the deconvolution problem. Given a set of samples {(x_i, y_i)}_{i=1}^{m}, where x_i ∈ R^k and y_i ∈ R, the regression problem seeks a function f(x) that minimizes the aggregate error over all samples. Let us denote the fitting error by r_i = y_i − f(x_i). Using this notation, we can write the regression problem as:

    argmin_{f∈F} Σ_{i=1}^{m} L(r_i)    (4)

where the loss function L measures the cost of the estimation error. We focus on the class of linear functions, that is, f_w(x) = w^T x, for which we have r_i = y_i − w^T x_i. In this formulation, y_i corresponds to the expression level of a gene in the mixture, the vector x_i is the expression level of the same gene in the reference cell-types, and w is the fraction of each cell-type in the mixture. We can represent {x_i}_{i=1}^{m} in compact form by the matrix X, in which row i corresponds to x_i.

In cases where the number of parameters is greater than the number of samples, that is, when the matrix X is a fat matrix, minimizing Equation 4 directly can result in over-fitting. Furthermore, when features (columns of X) are highly correlated, the solution may change drastically in response to small changes in the samples, specifically among the correlated features. This condition, known as multicollinearity, can result in inaccurate estimates, in which the coefficients of similar features are greatly different. To remedy these problems, we can add a regularization term, which incorporates additional constraints (such as sparsity or flatness) to enhance the stability of the results. We re-write the problem with the added regularizer as:

    argmin_{w∈R^k} { Σ_{i=1}^{m} L(y_i − w^T x_i) + λ R(w) }    (5)

where the first term is the overall loss, the second term is the regularizer, and the parameter λ controls the relative importance of estimation error versus regularization. There are different choices and combinations for the loss function L and the regularizer function R, which we describe in the following sections.

2.3.1 Choice of Loss Functions

There are a variety of options for suitable loss functions. Some of these functions are known to be asymptotically optimal for a given noise density, whereas others may yield better performance in practice when the assumptions underlying the noise model are violated. We summarize the most commonly used loss functions:

• If we assume that the underlying model is perturbed by Gaussian white noise, the squared or quadratic loss, denoted by L2, is known to be asymptotically optimal. This loss function is used in classical least squares regression and is defined as:

    L2(r_i) = r_i² = (y_i − w^T x_i)²

• The absolute deviation loss, denoted by L1, is the optimal choice if the noise follows a Laplacian distribution. Formally, it is defined as:

    L1(r_i) = |r_i| = |y_i − w^T x_i|

Compared to L2, the choice of L1 is preferred in the presence of outliers, as it is less sensitive to extreme values.

• Huber's loss function, denoted by L^{(M)}_{Huber}, is a parametrized combination of L1 and L2. The main idea is that the L2 loss is more susceptible to outliers, while it is more sensitive to small estimation errors. To combine the best of these two functions, we can define a half-length parameter M at which we transition from L2 to L1. More formally:

    L^{(M)}_{Huber}(r_i) = r_i²             if |r_i| ≤ M
                         = M(2|r_i| − M)    otherwise

• The loss function used in support vector regression (SVR) is the ε-insensitive loss, denoted by L^{(ε)}_ε. Similar to the Huber loss, there is a transition between small and large estimation errors. However, the ε-insensitive loss does not penalize errors that are smaller than a threshold. Formally, we define the ε-insensitive loss as:

    L^{(ε)}_ε(r_i) = max(0, |r_i| − ε)
                   = 0             if |r_i| ≤ ε
                   = |r_i| − ε     otherwise

[Fig. 1: Comparison of different loss functions (L1, L2, Huber, and ε-insensitive; plot drawn with ε = 0.75, M = 1.00).]

Figure 1 provides a visual representation of these loss functions, in which we use M = 1 and ε = 1/2 for the Huber and ε-insensitive loss functions, respectively. Note that for small residual values, |r_i| ≤ M = 1, the Huber and squared losses are equivalent; outside this region, however, the Huber loss becomes linear.
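Each of the four losses is a one-liner; the sketch below follows the formulas above (including the paper's scaling of the Huber loss, r² rather than r²/2), with M = 1 and ε = 0.5:

```python
import numpy as np

def l2(r):
    return r ** 2                                  # squared loss

def l1(r):
    return np.abs(r)                               # absolute deviation loss

def huber(r, M=1.0):
    # Quadratic up to |r| = M, linear beyond (paper's scaling).
    return np.where(np.abs(r) <= M, r ** 2, M * (2 * np.abs(r) - M))

def eps_insensitive(r, eps=0.5):
    return np.maximum(0.0, np.abs(r) - eps)        # zero inside the tube

r = np.linspace(-3, 3, 7)                          # [-3, -2, -1, 0, 1, 2, 3]
print(huber(r))            # [5. 3. 1. 0. 1. 3. 5.]
print(eps_insensitive(r))  # [2.5 1.5 0.5 0.  0.5 1.5 2.5]
```

For |r| ≤ M the Huber values coincide with the squared loss, matching the observation about Figure 1; at |r| = M the two pieces join continuously (M² on both sides).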

2.3.2 Choice of Regularizers

When the reference profile contains many cell-types that may not exist in the mixtures, or in cases where the constitutive cell-types are highly correlated, regularizing the objective function can sparsify the solution or enhance the conditioning of the problem. We describe two commonly used regularizers here:

• The norm-2 regularizer is used to shrink the regression coefficient vector w to ensure that it is as flat as possible. A common use of this regularizer is in conjunction with the L2 loss to remedy the multicollinearity problem in classical least squares regression. It is formally defined as:

    R2(w) = ‖w‖_2² = Σ_{i=1}^{k} w_i²    (6)

• Another common regularizer is the norm-1 regularizer, which is used to enforce sparsity over w. Formally, it can be defined as:

    R1(w) = ‖w‖_1 = Σ_{i=1}^{k} |w_i|    (7)

In addition to these two regularizers, combinations of them have also been introduced in the literature. One such example is the elastic net, which uses a convex combination of the two, that is, R_elastic(w) = αR1(w) + (1 − α)R2(w). Another example is the group Lasso, which, given a grouping G among cell-types, enforces flatness among the members of each group while enhancing the sparsity pattern across groups. This regularizer can be written as R_group(w) = Σ_{G_i} ‖w(G_i)‖_2, where w(G_i) is the weight vector of the cell-types in the i-th group.
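The regularizers, including the two combinations, are equally short to implement. In the sketch below, the weight vector and the two-group split are hypothetical, chosen only for illustration:

```python
import numpy as np

def r2(w):
    return np.sum(w ** 2)               # norm-2 regularizer, ||w||_2^2

def r1(w):
    return np.sum(np.abs(w))            # norm-1 regularizer, ||w||_1

def elastic(w, alpha=0.5):
    # Convex combination of the norm-1 and norm-2 regularizers.
    return alpha * r1(w) + (1 - alpha) * r2(w)

def group_lasso(w, groups):
    # groups: list of index arrays, one per (hypothetical) cell-type group;
    # the un-squared group norms promote sparsity across whole groups.
    return sum(np.linalg.norm(w[g]) for g in groups)

w = np.array([0.5, 0.0, -0.25, 0.25])
print(r1(w), r2(w), elastic(w))         # 1.0 0.375 0.6875
print(group_lasso(w, [np.array([0, 1]), np.array([2, 3])]))
```

In the deconvolution setting, a natural grouping collects the replicates (columns of H) belonging to the same cell-type, so that an entire cell-type can be zeroed out together.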

2.4 Examples of Objective Functions Used in Practice

2.4.1 Ordinary Least Squares (OLS)

The formulation of OLS is based on the squared loss, L2. Formally, we have:

    min_w Σ_{i=1}^{m} L2(r_i) = min_w Σ_{i=1}^{m} (y_i − w^T x_i)² = min_w ‖y − Xw‖_2²

where row i of the matrix X, also known as the design matrix, corresponds to x_i. This formulation has a closed-form solution given by:

    ŵ = (X^T X)^{-1} X^T y

From this formulation, we can see why norm-2 regularization is especially useful when the matrix X is ill-conditioned and near-singular, that is, when its columns are nearly linearly dependent: shifting X^T X towards the identity matrix moves the eigenvalues away from zero, which improves the conditioning of the resulting matrix.

2.4.2 Ridge Regression
One of the main issues with the OLS formulation is that the design matrix, X, must have full column rank k. Otherwise, if we have highly correlated variables, the solution suffers from the multicollinearity problem. This condition can be remedied by incorporating a norm-2 regularizer. The resulting formulation, known as ridge regression, is as follows:

\min_w \{ \sum_{i=1}^{m} L_2(r_i) + \lambda R_2(w) \} = \min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2

Similar to OLS, we can differentiate w.r.t. w to find the closed-form solution for ridge regression, given by:

w = (X^T X + \lambda I)^{-1} X^T y
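The ridge closed form, and the conditioning effect described above, can be checked numerically. The sketch below is ours: it builds a deliberately near-collinear design matrix and verifies that adding λI improves the conditioning of X^T X:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two nearly collinear reference profiles make X^T X ill-conditioned.
m = 100
base = rng.random(m)
X = np.column_stack([base, base + 1e-6 * rng.random(m)])
y = X @ np.array([0.6, 0.4])

lam = 1e-3
I = np.eye(X.shape[1])
# Ridge closed form: w = (X^T X + lambda I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

# The regularizer shifts the eigenvalues of X^T X away from zero:
print(np.linalg.cond(X.T @ X + lam * I) < np.linalg.cond(X.T @ X))   # True
```

Even though the individual coefficients are not identifiable under near-collinearity, the regularized solution still fits the observed mixture well.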

Page 6: SPECIAL ISSUE OF PROCEEDINGS OF IEEE - FOUNDATIONS ... · SPECIAL ISSUE OF PROCEEDINGS OF IEEE - FOUNDATIONS & APPLICATIONS OF SCIENCE OF INFORMATION, VOL. X, NO. X, X 2015 3 RNASeq

SPECIAL ISSUE OF PROCEEDINGS OF IEEE - FOUNDATIONS & APPLICATIONS OF SCIENCE OF INFORMATION, VOL. X, NO. X, X 2015 6

2.4.3 Least Absolute Shrinkage and Selection Operator (LASSO) Regression
Combining the OLS objective with a norm-1 regularizer, we have the LASSO formulation:

\min_w \{ \sum_{i=1}^{m} L_2(r_i) + \lambda R_1(w) \} = \min_w \|y - Xw\|_2^2 + \lambda \|w\|_1

This formulation is especially useful for producing sparse solutions by introducing zero elements in the vector w. However, while convex, it does not have a closed-form solution.
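Because no closed form exists, LASSO is solved iteratively in practice. The sketch below uses iterative soft-thresholding (ISTA), one standard choice; the function name, step-size rule, and synthetic data are ours, not the formulation of any particular deconvolution method:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=5000):
    """LASSO via iterative soft-thresholding (ISTA): minimizes
    ||y - Xw||_2^2 + lam * ||w||_1, which has no closed-form solution."""
    w = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        z = w + 2.0 * step * X.T @ (y - X @ w)       # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(2)
X = rng.random((50, 6))
w_true = np.array([0.7, 0.0, 0.3, 0.0, 0.0, 0.0])    # sparse ground truth
y = X @ w_true
w_hat = lasso_ista(X, y, lam=0.1)
print(np.round(w_hat, 3))
```

Increasing `lam` drives more coefficients exactly to zero, at the cost of additional shrinkage on the active ones.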

2.4.4 Robust Regression
It is known that L2(r) is dominated by the largest elements of the residual vector r, which makes it sensitive to outliers. To remedy this problem, different robust regression formulations have been proposed that use alternative loss functions. Two of the best-known formulations are based on the L1 and L_Huber loss functions. The L1 formulation can be written as:

\min_w \sum_{i=1}^{m} L_1(r_i) = \min_w \sum_{i=1}^{m} |y_i - w^T x_i| = \min_w \|y - Xw\|_1

The Huber loss function can be defined similarly, but it is usually formulated as an alternative convex Quadratic Program (QP):

\min_{w,z,t} \{ \frac{1}{2} \|z\|_2^2 + M 1^T t \}

subject to: -t \le Xw - y - z \le t    (8)

which can be solved more efficiently using the following equivalent QP variant [17]:

\min_{w,z,r,s} \{ \frac{1}{2} \|z\|_2^2 + M 1^T (r + s) \}

subject to: Xw - y - z = r - s, \quad 0 \le r, s    (9)

In both of these formulations, the scalar M corresponds to the half-length parameter of the Huber loss function.
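The sensitivity argument can be made concrete by evaluating the losses on a residual vector containing a single outlier. This is an illustrative sketch assuming the standard Huber form with half-length parameter M (function names are ours):

```python
import numpy as np

def l2_loss(r):
    """Squared loss: dominated by the largest residuals."""
    return r ** 2

def l1_loss(r):
    """Absolute loss: grows only linearly with residual size."""
    return np.abs(r)

def huber_loss(r, M):
    """Huber loss with half-length parameter M: quadratic for |r| <= M,
    linear beyond, so a single outlier cannot dominate the total."""
    return np.where(np.abs(r) <= M,
                    0.5 * r ** 2,
                    M * (np.abs(r) - 0.5 * M))

r = np.array([0.1, -0.2, 0.1, 10.0])   # one outlier residual
print(l2_loss(r).sum())                # ≈ 100.06, dominated by the outlier
print(l1_loss(r).sum())                # ≈ 10.4
print(huber_loss(r, M=1.0).sum())      # ≈ 9.53
```

A single residual of 10 contributes essentially all of the L2 total but only a linear share of the L1 and Huber totals, which is exactly why the latter two are preferred in the presence of outliers.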

2.4.5 Support Vector Regression
In machine learning, Support Vector Regression (SVR) is a commonly used technique that aims to find a regression function by maximizing the margin around the estimated separating hyperplane with respect to the closest data points on each side of it. This margin defines the region in which estimation errors are ignored. SVR has recently been used to deconvolve biological mixtures, where it has been shown to outperform other methods [15]. One of the variants of SVR is ε-SVR, in which the parameter ε defines the margin, or ε-tube. The primal formulation of ε-SVR with a linear kernel can be written as [18]:

\min_{w, \xi_i^+, \xi_i^-} \{ \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{m} (\xi_i^+ + \xi_i^-) \}

subject to:
    y_i - w \cdot x_i \le \varepsilon + \xi_i^+
    -(\varepsilon + \xi_i^-) \le y_i - w \cdot x_i
    0 \le \xi_i^+, \xi_i^-    (10)

in which, given the unit-norm assumption introduced in Section 2.2, we assume that b = 0. The dual problem for the primal in Equation 10 can be written in matrix form as:

\max_{\alpha^+, \alpha^-} \{ 1^T ((\alpha^+ - \alpha^-) \odot y) - \varepsilon 1^T (\alpha^+ + \alpha^-) - (\alpha^+ - \alpha^-)^T K (\alpha^+ - \alpha^-) \}

subject to: 1^T (\alpha^+ - \alpha^-) = 0, \quad 0 \le \alpha^+, \alpha^- \le C    (11)

In this formulation, 1 is a vector of all ones, ⊙ is the element-wise product, and K is the kernel matrix, defined as K = XX^T. The dual formulation is often used to solve ε-SVR, because it can easily be extended with different kernel functions that map x_i to a d-dimensional non-linear feature space. Additionally, when m ≪ k, as in the case of high-dimensional feature spaces, it provides a better way to solve the SVR problem. The primal problem, however, provides a more straightforward interpretation, and in the case where k ≪ m it provides superior performance. To show the similarity with Equation 5, we can rewrite Equation 10 using the ε-insensitive loss function as follows:

\min_w \{ \sum_{i=1}^{m} L_\varepsilon(y_i - w^T x_i) + \lambda R_2(w) \}    (12)

where \lambda = \frac{1}{2C} [19].
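The ε-insensitive loss in Equation 12 can be sketched directly. The `svr_objective` helper below is an illustrative evaluation of the primal objective in its loss-plus-regularizer form, not a solver, and both function names are ours:

```python
import numpy as np

def eps_insensitive(r, eps):
    """Epsilon-insensitive loss: residuals inside the eps-tube cost nothing."""
    return np.maximum(np.abs(r) - eps, 0.0)

def svr_objective(w, X, y, eps, C):
    """Primal eps-SVR objective in its loss-plus-regularizer form (Eq. 12),
    with lambda = 1 / (2C)."""
    lam = 1.0 / (2.0 * C)
    return eps_insensitive(y - X @ w, eps).sum() + lam * np.sum(w ** 2)

r = np.array([0.05, -0.2, 0.5])
print(eps_insensitive(r, eps=0.1))     # ≈ [0, 0.1, 0.4]
```

The first residual lies inside the ε-tube and contributes zero loss, which is precisely the margin-of-tolerance behavior described above.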

2.5 Overview of Prior in silico Deconvolution Methods

A majority of existing deconvolution methods fall into two groups: they either use a regression-based framework to compute G, C, or both; or they perform statistical inference over a probabilistic model. Abbas et al. [20] present one of the early regression-based methods for estimating C. This method is designed to identify cell-type concentrations from a known reference profile of immune cells. Their method is based on Ordinary Least Squares (OLS) regression and does not consider either the non-negativity or the sum-to-one constraint explicitly; rather, it enforces these constraints implicitly after the optimization procedure. An extension of this approach


is proposed by Qiao et al. [21], which uses non-negative least squares (NNLS) to explicitly enforce non-negativity as part of the optimization. Gong et al. [22] present a quadratic programming (QP) framework to explicitly encode both constraints in the optimization problem formulation. They also propose an extension of this method, called DeconRNASeq, which applies the same QP framework to RNASeq datasets. More recently, Newman et al. [15] propose robust linear regression (RLR) and ν-SVR regression instead of L2-based regression, which is highly susceptible to noise. Digital cell quantification (DCQ) [23] is another approach, designed for monitoring the immune system during infection. Compared to prior methods, DCQ forces sparsity by combining R2 and R1 regularization into an elastic net. This regularization is essential for successfully identifying the subset of active cells at each stage, given the large number of cell-types included in their panel (213 immune cell sub-populations). In contrast to these techniques, Shen-Orr et al. [24] propose a method, called csSAM, which is specifically designed to identify genes that are differentially expressed among purified cell-types. The core of this method is regression over matrix C to estimate matrix G.
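The constrained regression at the core of these estimators can be sketched with projected gradient descent onto the probability simplex, a simple stand-in for the NNLS/QP solvers the methods above actually use. All names, data, and the solver choice below are ours:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}, i.e. the set
    defined jointly by the NN and STO constraints."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def deconvolve(X, y, n_iter=2000):
    """Projected-gradient least squares over the simplex: a minimal stand-in
    for the NNLS/QP formulations discussed above."""
    w = np.full(X.shape[1], 1.0 / X.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        w = project_simplex(w + 2.0 * step * X.T @ (y - X @ w))
    return w

rng = np.random.default_rng(3)
G = rng.random((300, 3))             # reference profiles, 3 cell-types
c_true = np.array([0.5, 0.3, 0.2])   # true concentrations (NN, STO)
m_mix = G @ c_true                   # noise-free mixture signal
c_hat = deconvolve(G, m_mix)
print(np.round(c_hat, 3))            # close to c_true
```

Each iterate satisfies both constraints exactly, which is the "explicit" enforcement strategy; the implicit alternative would solve the unconstrained problem and renormalize afterwards.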

Full regression-based methods use variations of block-coordinate descent to successively identify better estimates for both C and G [25]. Venet et al. [26] present one of the early methods in this class, which uses an NMF-like method coupled with a heuristic to decorrelate the columns of G in each iteration. Repsilber et al. [27] propose an algorithm called deconf, which uses alternating non-negative least squares (ANLS) for solving NMF, without the decorrelation step of Venet et al., while implicitly applying constraints on C and G at each iteration. Inspired by the work of Pauca et al. on hyperspectral image deconvolution [4], Zuckerman et al. [28] propose an NMF method based on the Frobenius norm for gene expression deconvolution. They use gradient descent to solve for C and G at each step, which converges to a local optimum of the objective function. Given that the expression domain of cell-type-specific markers is restricted to unique cells in the reference profile, Gaujoux et al. [29] present a semi-supervised NMF (ssNMF) method that explicitly enforces an orthogonality constraint at each iteration over the subset of markers in the reference profile. This constraint both enhances the convergence of the NMF algorithm and simplifies the matching of columns in the estimated cell-type expression to the columns of the reference panel, G. The Digital Sorting Algorithm (DSA) [30] works as follows: if the concentration matrix C is known a priori, it directly uses quadratic programming (QP), with added constraints on the lower/upper bounds of gene expression, to estimate matrix G. Otherwise, if fractions are also unknown, it first uses the average expression of given marker genes that are only expressed in one cell-type, combined with the STO constraint, to estimate the concentration matrix C. Population-specific expression analysis (PSEA) [36] performs a linear least squares regression to estimate quantitative measures of cell-type-specific expression levels, in a fashion similar to the update equation for estimating G in Equation 3. In cases where the matrix C is not known a priori, PSEA exploits the average expression of marker genes that are exclusively expressed in one of the reference profiles as reference signals to track the variation of cell-type fractions across multiple mixture samples.

In addition to regression-based methods, a large class of methods is based on probabilistic modeling of gene expression. Erkkilä et al. [31] introduce a method, called DSection, which formulates the deconvolution problem using a Bayesian model. It incorporates a Bayesian prior over the noisy observations of the given concentration parameters to account for their uncertainty, and employs an MCMC sampling scheme to estimate the posterior distribution of the parameters/latent variables, including G and a denoised version of C. The in-silico NanoDissection method [32] uses a classification algorithm based on a linear SVM, coupled with an iterative adjustment process, to refine a set of provided positive and negative marker genes and infer a ranked list of genome-scale predictions for cell-type-specific markers. Quon et al. [33] propose a probabilistic deconvolution method, called PERT, which estimates a global, multiplicative perturbation vector to correct for the differences between the provided reference profiles and the true cell-types in the mixture. PERT formulates the deconvolution problem in a framework similar to Latent Dirichlet Allocation (LDA), and uses the conjugate gradient descent method to cyclically optimize the joint likelihood function with respect to each latent variable/parameter. Finally, microarray microdissection with analysis of differences (MMAD) [34] incorporates the concept of the effective RNA fraction to account for source- and sample-specific bias in the cell-type fractions for each gene. The authors propose different strategies depending on the availability of additional data sources. In cases where no additional information is available, they identify the genes with the highest variation in mixtures as markers, assign them to different reference cell-types using k-means clustering, and finally use these de novo markers to compute cell-type fractions. MMAD uses an MLE approach over the residual sum of squares to estimate the unknown parameters in its formulation.

3 RESULTS AND DISCUSSION

We now present a comprehensive evaluation of various formulations for solving deconvolution problems. Some of these algorithmic combinations have been proposed in the literature, while others represent new algorithmic choices. We systematically assess the impact of these choices on the performance of in-silico deconvolution.

3.1 Datasets
1) In vivo mixtures with known percentages: We use a total of five datasets with known mixtures.


We use CellMix [35] to download and normalize these datasets; it uses the SOFT-format data available from the Gene Expression Omnibus (GEO).

• BreastBlood [22] (GEO ID: GSE29830): Breast and blood from human specimens are mixed in three different proportions, and each mixture is measured three times, for a total of nine samples.

• CellLines [20] (GEO ID: GSE11058): Mixture of the human cell lines Jurkat (T-cell leukemia), THP-1 (acute monocytic leukemia), IM-9 (B-lymphoblastoid multiple myeloma), and Raji (Burkitt B-cell lymphoma) in four different concentrations, each repeated three times, for a total of 12 samples.

• LiverBrainLung [24] (GEO ID: GSE19830): This dataset contains three different rat tissues, namely brain, liver, and lung, which are mixed in 11 different concentrations, with each mixture having three technical replicates, for a total of 33 samples.

• RatBrain [36] (GEO ID: GSE19380): This dataset contains four different cell-types, namely rat neuronal, astrocytic, oligodendrocytic, and microglial cultures, with two replicates of five different mixing proportions, for a total of 10 samples.

• Retina [37] (GEO ID: GSE33076): This dataset pools retinas from two different mouse lines, mixed in eight different combinations with three replicates for each mixture, for a total of 24 samples.

2) Mixtures with available cell-sorting data through flow cytometry: For this experiment, we use two datasets available from Qiao et al. [21]. We downloaded these datasets directly from the supplementary material of the paper. These datasets are post-processed with the supervised normalization of microarrays (SNM) method to correct for batch effects. Raw expression profiles are also available for download under GEO ID GSE40830. This dataset contains two sub-datasets:

• PERT Uncultured: This dataset contains uncultured human cord blood mononucleated and lineage-depleted (Lin−) cells on the first day.

• PERT Cultured: This dataset contains culture-derived lineage-depleted human blood cells after four days of cultivation.

Table 1 summarizes overall statistics for each of these datasets.

3.2 Evaluation Measures
Let us denote the actual and estimated coefficient matrices by C and Ĉ, respectively. We first normalize these matrices to ensure that each column sums to one. Then, we define the corresponding percentages as P = 100 × C_norm

TABLE 1
Summary statistics of each dataset

Dataset           # features   # samples   # references
BreastBlood          54675          9            2
CellLines            54675         12            4
LiverBrainLung       31099         33            3
PERT Cultured        22215          2           11
PERT Uncultured      22215          4           11
RatBrain             31099         10            4
Retina               22347         24            2

and P̂ = 100 × Ĉ_norm. Finally, let r_jk = p_jk − p̂_jk be the residual estimation error of cell-type k in sample j. Using this notation, we can define three commonly used measures of estimation error as follows:

1) Mean absolute difference (mAD): This is among the easiest measures to interpret. It is defined as the average of the absolute differences between true and estimated cell-type percentages over all mixture samples. More specifically:

mAD = \frac{1}{p \times q} \sum_{j=1}^{p} \sum_{k=1}^{q} |r_{jk}|

2) Root mean squared distance (RMSD): This measure is one of the most commonly used distance functions in the literature. It is formally defined as:

RMSD = \sqrt{\frac{1}{p \times q} \sum_{j=1}^{p} \sum_{k=1}^{q} r_{jk}^2}

3) Pearson's correlation distance: Pearson's correlation measures the linear dependence between estimated and actual percentages. Let us vectorize the percentage matrices as p = vec(P) and p̂ = vec(P̂). Using this notation, the correlation between these two vectors is defined as:

\rho_{p,\hat{p}} = \frac{cov(p, \hat{p})}{\sigma(p)\,\sigma(\hat{p})}    (13)

where cov and σ correspond to the covariance and standard deviation of the vectors, respectively. Finally, we define the correlation distance measure as R2D = 1 − ρ_{p,p̂}.
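The three measures above fit in a few lines of numpy. This is an illustrative sketch (the function name `evaluation_errors` is ours), not the paper's Matlab code:

```python
import numpy as np

def evaluation_errors(C_true, C_est):
    """mAD, RMSD, and correlation distance (R2D) between true and estimated
    concentration matrices (rows: cell-types, columns: samples)."""
    # Normalize each column to sum to one, then express as percentages.
    P = 100 * C_true / C_true.sum(axis=0, keepdims=True)
    P_hat = 100 * C_est / C_est.sum(axis=0, keepdims=True)
    R = P - P_hat                      # residuals r_jk
    mad = np.mean(np.abs(R))
    rmsd = np.sqrt(np.mean(R ** 2))
    rho = np.corrcoef(P.ravel(), P_hat.ravel())[0, 1]
    return mad, rmsd, 1.0 - rho

C_true = np.array([[0.6, 0.2],
                   [0.4, 0.8]])
C_est = np.array([[0.5, 0.3],
                  [0.5, 0.7]])
mad, rmsd, r2d = evaluation_errors(C_true, C_est)
print(mad, rmsd)                       # ≈ 10.0 10.0
```

In this toy case every entry is off by exactly 10 percentage points, so mAD and RMSD coincide; they diverge as soon as the error distribution becomes uneven.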

3.3 Implementation

All codes and experiments have been implemented in Matlab. To implement the different formulations of the deconvolution problem, we used CVX, a package for specifying and solving convex programs [38, 39]. Together with CVX, we used Mosek, a high-performance solver for large-scale linear and quadratic programs [40]. All codes and datasets are freely available at github.com/shmohammadi86/DeconvolutionReview.


Fig. 2. Average computational time for each loss function in different datasets (bar chart; y-axis: average time (s) per sample; series: L2, L1, Huber, Hinge; chart omitted).

3.4 Effect of Loss Function and Constraint Enforcement on Deconvolution Performance

We perform a systematic evaluation of the four different loss functions introduced in Section 2.3.1, as well as implicit and explicit enforcement of the non-negativity (NN) and sum-to-one (STO) constraints over the concentration matrix (C), on the overall performance of deconvolution methods for each dataset. There are 16 configurations of loss functions/constraints for each test case. Additionally, for the Huber and Hinge loss functions, where M and ε are unknown, we perform a grid search with 15 values in multiples of 10 spanning the range {10^-7, ..., 10^7} to find the best values for these parameters. In order to evaluate an upper bound on the "potential" performance of these two loss functions, we use the true concentrations in each sample, c, to evaluate each parameter choice. In practical applications, the RMSD of the residual error between m and Gc is often used to select the optimal parameter; this is not always in agreement with the choice made based on the known c.

For each test dataset, we compute the three evaluation measures defined in Section 3.2. Additionally, for each of these measures, we compute an empirical p-value by sampling random concentrations from a uniform distribution and enforcing the NN and STO constraints on the resulting random sample. In our study, we sampled 10,000 concentrations for each dataset/measure, which results in a lower bound of 10^-4 on the estimated p-values. Figure 2 presents the time each loss function takes per sample, averaged over all constraint combinations. The actual times taken for the Huber and Hinge losses are roughly 15 times those reported here, corresponding to the number of experiments performed to find the optimal parameters for these loss functions. From these results, L2 can be observed to have the fastest computation time, whereas L_Huber is the slowest. L1 and L_Hinge fall between these two extremes, with L1 being faster the majority of the time. We can directly compare these computation times because we formulate all methods within the same framework; thus, differences in implementation do not impact direct comparisons.
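The empirical p-value computation described above can be sketched as follows. The helper name and the uniform-then-renormalize null model are our reading of the procedure, not the paper's exact code:

```python
import numpy as np

def empirical_pvalue(observed_mad, P_true, n_samples=10_000, seed=0):
    """Empirical p-value of an observed mAD score against a null of random
    concentration vectors satisfying NN and STO (uniform draws renormalized
    so that each sample's concentrations sum to one)."""
    rng = np.random.default_rng(seed)
    q, p = P_true.shape                       # cell-types x samples
    hits = 0
    for _ in range(n_samples):
        C_rand = rng.random((q, p))
        P_rand = 100 * C_rand / C_rand.sum(axis=0, keepdims=True)
        if np.mean(np.abs(P_rand - P_true)) <= observed_mad:
            hits += 1
    # The smallest resolvable p-value with n_samples draws is 1 / n_samples.
    return max(hits, 1) / n_samples

P_true = 100 * np.array([[0.6, 0.2],
                         [0.4, 0.8]])         # percentages; columns sum to 100
print(empirical_pvalue(5.0, P_true, n_samples=1000))
```

The `max(hits, 1)` floor reflects the resolution limit mentioned above: with 10,000 samples, no estimated p-value can fall below 10^-4.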

Fig. 3. Agreement among different evaluation measures (RMSD, mAD, R2D) across different datasets (chart omitted).

Computation time, while important, is not the critical measure in our evaluation. The true performance of a configuration (selection of loss function and constraints) is measured by its estimation error. In order to rank the different configurations, we first assess the agreement among the different measures. To this end, we evaluate each dataset as follows: for each experiment, we compute mAD, RMSD, and R2D independently. Then, we apply the Kendall rank correlation, a non-parametric hypothesis test for statistical dependence between two random variables, to each pair of measures and compute a log-transformed p-value for each correlation. Figure 3 shows the agreement among these measures across different datasets. Overall, the RMSD and mAD measures show higher consistency compared to the R2D measure. However, the mAD measure is easier to interpret as a measure of percentage loss for each configuration. Consequently, we choose this measure for our evaluation in this study.

Using mAD as the measure of performance, we evaluate each configuration over each dataset and sort the results. Figure 4 shows the various combinations for each dataset. The RatBrain, LiverBrainLung, BreastBlood, and CellLines datasets achieve high performance. Among these, RatBrain, LiverBrainLung, and BreastBlood had the L2 loss function in their best configuration, while the CellLines dataset was less sensitive to the choice of loss function. Another surprising observation is that for the majority of configurations, enforcing the sum-to-one constraint worsens the results. We investigate this issue in greater depth in Section 3.5.

For Retina, as well as both PERT datasets, the overall performance is worse than for the other datasets. In the case of PERT, this is expected, since the flow-sorted proportions are used only as an estimate of the cell-type proportions. Furthermore, the reference profiles come from a different study and therefore differ more from the true cell-types in the mixture. However, the Retina dataset exhibits unusually low performance, which may be attributed to multiple factors. As an initial investigation, we performed quality control (QC) over the different samples to see whether errors are similarly distributed across samples. Figure 5 presents per-sample error, measured


Fig. 4. Overall performance of the 16 loss/constraint combinations over all datasets (y-axis: mAD of estimation, lower is better; series: L2, L1, Huber, and Hinge losses, with NN and STO each implicit (Imp) or explicit (Exp); chart omitted).

Fig. 5. Sample-based error of the Retina dataset, based on L2 with explicit NN and STO (y-axis: mAD of estimated C per sample; x-axis: samples GSM819140 through GSM819163; chart omitted).

by mAD, with the median and median absolute deviation (MAD) marked accordingly. Interestingly, for the 4th, 6th, and 8th mixtures, the third replicate has much higher error than the rest. In the expression matrix, we observed a lower correlation between these replicates and the other two replicates in the batch. Additionally, for the 7th mixture, all three replicates show high error rates. We expand on these results in later sections to identify additional factors that contribute to the low deconvolution performance of the Retina dataset.

Finally, we note that in all test cases the performance of L1, L_Huber, and L_Hinge is comparable, while L_Huber and L_Hinge required an additional parameter-tuning step. Consequently, we only consider L1 as a representative of this "robust" group of loss functions in the rest of our study.

3.5 Agreement of Gene Expressions with the Sum-to-One (STO) Constraint

Considering the lower performance of configurations that explicitly enforce the STO constraint, we investigate whether the features (genes) in each dataset respect this constraint. Under the STO and NN constraints, we use simple bounds to identify violating features, for which no combination of concentration values can satisfy both STO and NN. Let m(i) be the expression value of the ith gene in the given mixture, and G(i, 1), ..., G(i, q) be the corresponding expressions in the different reference cell-types.

Let G_min(i) = min{G(i, 1), ..., G(i, q)} and G_max(i) = max{G(i, 1), ..., G(i, q)}. Given that all concentrations are bounded as 0 ≤ c(k) ≤ 1 for all 1 ≤ k ≤ q, the minimum and maximum values that an estimated mixture value for the ith gene can attain are G_min(i) and G_max(i), respectively (obtained by setting c(k) = 1 for the cell-type attaining the minimum/maximum, and 0 everywhere else). Using this notation, we can identify features that violate STO as follows:

m(i) \le G_{min}(i), \forall 1 \le i \le n    (violating reference)
G_{max}(i) \le m(i), \forall 1 \le i \le n    (violating mixture)

The first condition holds when expression values in the reference profiles are so large that the sum of concentrations would have to be smaller than one to match the corresponding gene expression in the mixture. The second condition holds when the expression of a gene in the mixture is so high that the sum of concentrations would have to be greater than one to match it. In other words, for feature i, these conditions identify extreme expression values in the reference profiles and mixture samples, respectively. Using these conditions, we compute the total number of features violating the STO condition in each dataset.
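The two violation conditions reduce to a vectorized bound check. This is an illustrative sketch (the function name and toy data are ours):

```python
import numpy as np

def violating_features(m, G):
    """Flag features (genes) for which no concentration vector with
    0 <= c_k <= 1 and sum(c) = 1 can reproduce the mixture value.

    m : (n,) mixture expression vector
    G : (n, q) reference expression matrix, one column per cell-type
    """
    g_min = G.min(axis=1)
    g_max = G.max(axis=1)
    violating_reference = m <= g_min    # every reference exceeds the mixture
    violating_mixture = g_max <= m      # mixture value is unreachably high
    return violating_reference, violating_mixture

m = np.array([5.0, 100.0, 8.0])
G = np.array([[10.0, 20.0],    # both references above the mixture value
              [30.0, 40.0],    # mixture above both reference values
              [6.0, 9.0]])     # feasible: 8 lies inside [6, 9]
ref_viol, mix_viol = violating_features(m, G)
print(ref_viol)   # [ True False False]
print(mix_viol)   # [False  True False]
```

Only the third feature is reachable by a convex combination of the references, which is exactly the feasibility interval [G_min(i), G_max(i)] derived above.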

Figure 6 presents violating features in mixtures and reference profiles, averaged over all mixture samples in each dataset. We normalize and report the percentage of features to account for differences in the total number of features in each dataset. We first observe that for the majority of datasets, except Retina and BreastBlood, the percentage of violating features in mixtures is much smaller than that in reference profiles. These two datasets also have the highest number of violating features in their reference profiles, summing to approximately 60% of all features. This observation is likely due to the normalization used in pre-processing microarray profiles. Specifically, one must not only normalize M and G independently, but also with respect to each other. We suggest using control genes that are expressed in all cell-types with low variation to normalize expression profiles. A recent study aimed to identify subsets of housekeeping genes in human tissues that satisfy these conditions [41]. Another choice is ribosomal proteins, the basic building blocks of the cellular translation machinery, which are expressed in a wide range of species. The Remove Unwanted Variation (RUV) [42] method was developed to remove batch effects from microarray and RNASeq expression profiles, but also to normalize them using control genes. A simple extension of this method can be adopted to resolve the normalization differences between mixtures and references.

Next, we evaluate how filtering these features affects the deconvolution performance of each dataset. For each case, we run deconvolution using all configurations and report the change (delta mAD) independently. Figure 7 presents the changes in mAD estimation error


Fig. 6. Percent of features in each dataset that violate the STO constraint (series: Mixtures, References; chart omitted).

Fig. 7. Performance of deconvolution methods after removing violating features (y-axis: delta mAD, higher is better; series: L2 and L1, with NN and STO each implicit or explicit; chart omitted).

after removing violating features from both m and G before performing deconvolution. As in previous experiments, the Retina dataset exhibits widely different behavior from the rest of the datasets. Removing this dataset from further consideration, we find that the overall performance improves across all datasets, with the exception of RatBrain. In the case of the RatBrain dataset, we hypothesize that the initially superior performance can be attributed to highly expressed features. These outliers, which happen to agree with the true solution, result in over-fitting. Finally, we note a correlation between the observed improvements and the level of violation of features in m. Consistent with this observation, we obtain similar results when we filter violating features only from the mixtures, but not from the reference profiles.

3.6 Range Filtering: Finding an Optimal Threshold

Different upper and lower bounds have been proposed in the literature to prefilter expression values prior to deconvolution. For example, Gong et al. [22] suggest an effective range of [0.5, 5000], whereas Ahn et al. [43] observe an optimal range of [2^4, 2^14]. To facilitate the choice of expression bounds, we seek a systematic way to identify an optimal range for different datasets. Kawaji et al. [44] recently report on an experiment to assess whether gene expression is quantified linearly in mixtures. To this end, they mix two cell-types (THP-1 and HeLa cell lines) and check whether experimentally measured expressions match computationally simulated

datasets. They observe that expression values from microarray measurements are skewed for lowly expressed genes (approximately < 10). This allows us to choose the lower bound based on experimental evidence. In our study, we search for the optimal bounds over a log2-linear space; thus, we set a threshold of 2^3 on the minimum expression values, which is closest to the bound proposed by Kawaji et al. [44].

Choosing an upper bound on the expression values is a harder problem, since it relates to enhancing the performance of deconvolution methods by removing outliers. Additionally, there is a known relationship between the mean expression value and its variance [45], which makes these outliers noisier than the rest of the features. This becomes even more important when dealing with purified cell-types that come from different labs, since highly expressed time- and micro-environment-dependent genes can differ significantly from those in the mixture [21]. A simple argument is that the range of expression values in Affymetrix microarray technology is bounded by 2^16 (due to the initial normalization and image processing steps); measurements close to this bound are not reliable, as they might be saturated and inaccurate. However, the practical bounds used in previous studies are far from these extreme values. In order to examine the overall distribution of expression values, we analyze the datasets independently. For each dataset, we separately analyze mixture samples and reference profiles, encoded by the matrices M and G, respectively. For each of these matrices, we vectorize the expression values and perform kernel smoothing with a Gaussian kernel to estimate the probability density function.

Figure 8(a) and Figure 8(b) show the distributions of log-transformed expression values for mixtures and reference profiles, respectively. These expression values are greater than our lower bound of 2^3. In agreement with our previous results, we observe an unusually skewed distribution for the Retina dataset, which in turn contributes to its lower performance compared to the other ideal mixtures. Additionally, we observe that approximately 80% of the features in this dataset are smaller than 2^3; these are filtered out and not shown in the distribution plot. For the rest of the datasets, in both mixtures and references, we observe a bell-shaped distribution with most of the features captured up to an upper bound of 2^8 to 2^10. Another exception to this pattern is the CellLines dataset, which has a heavier tail than the other datasets, especially in its reference profile.

Next, we systematically evaluate the effect of range filtering by analyzing upper bounds increasing in powers of two over the range {2^5, ..., 2^16}. In each case, we remove all features for which at least one of the reference profiles or mixture samples has a value exceeding this upper bound. Figure 9 illustrates the percent of features retained as we increase the upper bound. As mentioned earlier, approximately 80% of the features in the Retina dataset are lower than 2^3, which is evident from the


Fig. 8. Distribution of expression values (density vs. log2 expression) for all datasets: (a) Mixtures; (b) Reference Profiles

Fig. 9. Percent of covered features during range filtering, as a function of the log2 upper bound on expression values

cap of 20% on the percent of retained features in this figure. Additionally, consistent with our previous observation of the expression densities, more than 80% of the features are covered between 2^8 and 2^10, except for the CellLines dataset.
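The filtering rule used above, namely dropping any gene that exceeds the bound in at least one reference profile or mixture sample, can be sketched as follows. The matrix names and toy data are illustrative only.

```python
import numpy as np

def range_filter(M, G, lower=2**3, upper=2**12):
    """Return the boolean row mask of genes whose values lie within
    [lower, upper] in every column of both matrices."""
    within = lambda X: ((X >= lower) & (X <= upper)).all(axis=1)
    return within(M) & within(G)

rng = np.random.default_rng(1)
M = rng.lognormal(5.0, 2.0, size=(500, 4))  # genes x mixture samples
G = rng.lognormal(5.0, 2.0, size=(500, 3))  # genes x reference profiles

keep = range_filter(M, G)
Mf, Gf = M[keep], G[keep]  # filtered matrices passed on to deconvolution
```

Sweeping `upper` over {2^5, ..., 2^16} and recording `keep.mean()` reproduces the kind of coverage curve shown in Figure 9.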

Finally, we perform deconvolution using the remaining features for each upper bound. The results are mixed, but a common trend is that removing highly expressed genes decreases the performance of ideal mixtures with known concentrations, while enhancing the performance of the PERT datasets. Figure 10(a) and Figure 10(b) show the changes in mAD error, compared to unfiltered deconvolution, for the PERT datasets. In each case, we observe improvements of up to 7 and 8 percent, respectively. The red and green points on the diagram show the significance of deconvolution. Interestingly, while both

Fig. 10. Performance of PERT datasets during range filtering (change in mAD estimation error vs. log2 upper bound; higher is better): (a) Cultured; (b) Uncultured

methods show similar improvements, all data points for cultured PERT appear to be insignificant, whereas uncultured PERT shows significance for the majority of data points. This is due to the weakness of our random model, which depends on the number of samples and is therefore not comparable across datasets. Uncultured PERT has twice as many samples as cultured PERT, which makes it less likely for any random sample to achieve an mAD as good as the observed estimation error. This dependency on the number of samples can be addressed by defining sample-based p-values. Another observation is that for the uncultured dataset, all measures improve except L1 with explicit NN and STO constraints. On the other hand, for the cultured dataset, both L1 and L2 with the explicit NN constraint perform well, whereas implicitly enforcing NN deteriorates their performance. The cultured and uncultured datasets peak at 2^10 and 2^12, respectively.
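One simple form of the random-baseline significance test discussed above can be sketched as follows: the p-value is the fraction of random concentration matrices whose mAD is at least as good as the observed estimate. This is our own illustration, not the survey's exact random model; in particular, the normalized-uniform draw is just one convenient way to sample valid concentration columns, and all names are hypothetical.

```python
import numpy as np

def mad_error(C_est, C_true):
    """Mean absolute deviation between estimated and true concentrations."""
    return np.abs(C_est - C_true).mean()

def empirical_pvalue(C_est, C_true, n_random=10_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = mad_error(C_est, C_true)
    q, n = C_true.shape                              # cell-types x samples
    raw = rng.random((n_random, q, n))
    random_C = raw / raw.sum(axis=1, keepdims=True)  # columns sum to one
    random_mads = np.abs(random_C - C_true).mean(axis=(1, 2))
    return float((random_mads <= observed).mean())

C_true = np.array([[0.7, 0.3], [0.3, 0.7]])   # toy truth, columns sum to 1
C_est = np.array([[0.68, 0.31], [0.32, 0.69]])
p = empirical_pvalue(C_est, C_true)
```

Because the baseline tightens as more random draws (or more samples) are available, such empirical p-values inherit the sample-count dependency noted in the text.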

For the rest of the datasets, range filtering decreased performance in the majority of cases, except for the Retina dataset, whose performance improved at 2^6, with the best result achieved by L1 with both NN and STO enforced explicitly. This brought the best observed performance of this dataset, measured as mAD, to close to 7. These mixed results make it harder to choose a threshold for the upper bound, so we average results over all datasets to find a balance between improvements


Fig. 11. Average performance of range filtering over all datasets (mean mAD change vs. log2 upper bound; higher is better)

Fig. 12. Dataset-specific changes in the performance of deconvolution methods after filtering expression ranges to fit within [2^3, 2^12]

in PERT and overall deterioration in the other datasets. Figure 11 presents the averaged mAD difference across all datasets. This suggests a "general" upper-bound filter of 2^12 as optimal across all datasets.

We use this threshold to filter all datasets and perform deconvolution on them. Figure 12 presents the dataset-specific performance of range filtering with fixed bounds, measured by changes in the mAD value compared to the original deconvolution. As observed in the individual performance plots, range filtering is most effective in cases where the reference profiles differ significantly from the true cell-types in the mixture, as is the case for the PERT datasets. In ideal mixtures, since cell-types are measured and mixed at the same time and in the same laboratory, this distinction is negligible. In these cases, highly expressed genes in mixtures and references coincide with each other and provide additional clues for the regression. Consequently, removing these highly expressed genes often degrades the performance of deconvolution methods. This generalization of the upper-bound threshold, however, should be adopted with care, since each dataset responds differently to range filtering. Ideally, one should filter each dataset individually based on its distribution of expression values. Furthermore, in practical applications, gold standards are not available to aid in the choice of cutoff threshold.

We now introduce a new method that adaptively identifies an effective range for each dataset. Figure 13

Fig. 13. Sorted log2-transformed gene expressions in different datasets

Fig. 14. Example of adaptive filtering over the CellLines dataset

illustrates the log2-normalized value of the maximal expression of each gene in matrices M and G, sorted in ascending order. In all cases, intermediate values exhibit a gradual increase, whereas the top and bottom elements of the sorted list show a steep change in expression. We aim to identify the critical points corresponding to these sudden changes in expression values for each dataset. To this end, we select the middle point as a point of reference and analyze the upper and lower halves independently. For each half, we find the point on the curve that has the longest distance from the line connecting the first (last) element to the middle element. The application of this process to the CellLines dataset is visualized in Figure 14. Green points in this figure correspond to the critical points, which are used to define the lower and upper bounds for the expression values of this dataset.
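The maximum-distance-to-chord heuristic described above can be sketched as follows. This is a plausible reading of the procedure under stated assumptions (per-gene maxima over M and G, halves split at the median index); the toy data and names are illustrative.

```python
import numpy as np

def farthest_from_chord(y):
    """Index of the point on the sorted curve y with the largest
    perpendicular distance from the line joining its two endpoints."""
    x = np.arange(len(y), dtype=float)
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    d = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    d /= np.hypot(y1 - y0, x1 - x0)
    return int(d.argmax())

def adaptive_bounds(M, G):
    """Sorted log2 per-gene maxima; one critical point per half."""
    v = np.sort(np.log2(np.maximum(M.max(axis=1), G.max(axis=1))))
    mid = len(v) // 2
    lower = v[farthest_from_chord(v[: mid + 1])]   # knee of the lower half
    upper = v[mid + farthest_from_chord(v[mid:])]  # knee of the upper half
    return lower, upper

rng = np.random.default_rng(3)
M = rng.lognormal(5.0, 2.0, size=(400, 4))  # toy mixtures
G = rng.lognormal(5.0, 2.0, size=(400, 3))  # toy references
lower, upper = adaptive_bounds(M, G)
```

The two returned values play the role of the dataset-specific bounds reported in Table 2.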

We use this technique to identify adaptive ranges for each dataset prior to deconvolution. Table 2 summarizes the identified critical points for each dataset. Figure 15 presents the dataset-specific performance of each method after adaptive range filtering. While in most cases the results for fixed and adaptive range filtering are comparable, and in some cases adaptive filtering gives better results, the most notable difference is the degraded performance of the LiverBrainLung and, to some extent, RatBrain datasets. To further investigate this observation, we examine the individual experiments for these datasets under fixed thresholds. Figure 16 illustrates the individual plots for each dataset. The common trend here is that in both cases range filtering generally degrades the performance of deconvolution methods across all configurations. In other words, extreme values in these datasets are actually helpful in guiding the regression, and any filtering negatively impacts performance. This suggests that range filtering is not always helpful in enhancing deconvolution performance and, in fact, for some of the ideal datasets (such as LiverBrainLung, RatBrain, and BreastBlood) it can be counterproductive.

TABLE 2
Summary of adaptive ranges for each dataset

                    LowerBound    UpperBound
  BreastBlood           4.2842        9.4314
  CellLines             5.2814       11.6942
  LiverBrainLung        3.3245        9.9324
  PERT_Cultured         4.9416       10.9224
  PERT_Uncultured       5.1674       11.5042
  RatBrain              3.3726        9.9698
  Retina                2.4063        6.7499

Fig. 15. Dataset-specific changes in the performance of deconvolution methods after adaptive range filtering

3.7 Selection of Marker Genes – The Good, Bad, and Ugly

Selecting marker genes that uniquely identify a certain tissue or cell-type, prior to deconvolution, can help improve the conditioning of matrix G, thus improving its discriminating power and the stability of the results, as well as decreasing the overall computation time. A key challenge in identifying "marker" genes is the choice of method used to assess the selectivity of genes. Various parametric and nonparametric methods have been proposed in the literature to identify differentially expressed genes between two groups [46, 47] or between one group and all other groups [48]. Furthermore, different methods have been developed in parallel to identify tissue-specific and tissue-selective genes that are unique markers with high specificity to their host tissue/cell-type [49–52]. While choosing or developing accurate methods for identifying reliable markers is an important challenge, an in-depth discussion of the matter is beyond the scope of this article. Instead, we adopt two methods used in the literature. Abbas et al. [20] present a framework for choosing genes based on their overall differential expression. For each gene, they use a t-test to compare

Fig. 16. Individual performance plots for range filtering in datasets where range filtering has a negative effect on deconvolution (change in mAD estimation error vs. log2 upper bound): (a) LiverBrainLung; (b) RatBrain

the cell-type with the highest expression against the second- and third-highest expressing cell-types. They then sort all genes and construct a sequence of basis matrices of increasing size. Finally, they use the condition number to identify an "optimal" cut among the top-ranked genes, i.e., the one that minimizes the condition number. Newman et al. [15] propose a modification of the method of Abbas et al. in which genes are sorted not by their overall differential expression but by their tissue-specific expression when compared to all other cell-types. After prefiltering differentially expressed genes, they sort genes based on their expression fold ratio and use a similar cutoff that optimizes the condition number. Note that the former method increases the size of the basis matrix by one at each step, while the latter increases it by q (the number of cell-types). The method of Newman et al. has the benefit of choosing a similar number of markers per cell-type, which is useful in cases where one of the references has a significantly higher number of markers.
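The condition-number cut shared by both schemes can be sketched as follows. This is a simplified stand-in, not either published method: it assumes the per-cell-type marker rankings are already computed and only illustrates growing the basis by q genes per step and keeping the size that minimizes cond(G).

```python
import numpy as np

def select_markers(G, markers_per_type, max_per_type=10):
    """Grow the basis by the next-best marker of each cell-type and keep
    the row set whose submatrix has the smallest condition number.
    G: genes x cell-types; markers_per_type: one best-first index array
    per cell-type."""
    best_rows, best_cond = None, np.inf
    rows = []
    for k in range(max_per_type):
        for idx in markers_per_type:
            if k < len(idx):
                rows.append(int(idx[k]))
        cond = np.linalg.cond(G[rows, :])
        if cond < best_cond:
            best_cond, best_rows = cond, list(rows)
    return best_rows, best_cond

rng = np.random.default_rng(2)
q = 3
G = rng.random((30, q)) * 0.1
for t in range(q):
    G[10 * t : 10 * (t + 1), t] += 1.0  # toy block of markers per type
markers_per_type = [np.arange(10 * t, 10 * (t + 1)) for t in range(q)]
rows, cond = select_markers(G, markers_per_type)
```

Growing by one gene per step instead of q (as in Abbas et al.) only changes the inner loop.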

We implement both methods and assess their performance over the datasets. We observe slightly better performance with the second method and use it for the rest of our experiments. Due to the unexpected behavior of the Retina dataset, as well as the low number of significant markers in all our trials, we eliminate this dataset from further study. In identifying differentially


Fig. 17. Effect of marker selection on the performance of deconvolution methods

expressed genes, a key parameter is the q-value cutoff used to report significant features. The distribution of corrected p-values exhibits high similarity among ideal mixtures, while differing significantly in the CellLines mixtures and both PERT datasets. We find the range of 10^-3 to 10^-5

to be an optimal balance between these two cases and perform experiments to test different cutoff values. Figure 17 shows changes in the mAD measure after applying marker detection using a q-value cutoff of 10^-3, which resulted in the best overall performance in our study. We observe that the PERT_Uncultured and LiverBrainLung datasets have the highest gain across the majority of configurations, while BreastBlood and RatBrain exhibit an improvement in experiments with L1 while their L2 performance is greatly decreased. Finally, for the PERT_Cultured and CellLines datasets, we observe an overall decrease in performance in almost all configurations.

Next, we note that the internal sorting based on fold ratio intrinsically prioritizes highly expressed genes and is susceptible to noisy outliers. To test this hypothesis, we perform range selection using a global upper bound of 2^12 prior to the marker-selection method and examine whether this combination can enhance our previous results. We find the q-value threshold of 10^-5 to be the better choice in this case. Figure 18 shows the changes in the performance of different methods when we prefilter expression ranges prior to marker selection. The most notable change is that both the PERT_Cultured and CellLines datasets, which were among the datasets with the lowest performance in the previous experiment, are now among the best-performing datasets in terms of overall mAD enhancement. We still observe a higher negative impact on L2 in this case, but the overall amplitude of the effect has been dampened in both the BreastBlood and RatBrain datasets.

We note that there is no prior knowledge in the literature as to the "proper" choice of marker-selection method, and the effect of this choice on deconvolution performance is unclear. An in-depth comparison of marker-detection methods would benefit future developments in this field. An ideal marker should serve two purposes: (i) it should be highly informative of the cell-type in which it is expressed, and (ii) it should show low variance under spatiotemporal changes in the environment (changes in time or microenvironment).

Fig. 18. Effect of marker selection, after range filtering, on the performance of deconvolution methods

Fig. 19. High-level classification of genes

Figure 19 shows a high-level classification of genes. An ideal marker is an invariant, cell-type-specific gene, marked in green in the diagram. On the other hand, variant genes, both universally expressed and tissue-specific, are not good candidates, especially in cases where references are adopted from a different study. These genes, however, comprise the ideal subset of genes to be updated in full deconvolution while updating matrix G, since their expression in the reference profile may differ significantly from that in the true cell-types of the mixture. It is worth mentioning that the proper ordering for identifying the best markers is to first identify tissue-specific genes and then prune them based on their variability. Otherwise, when selecting invariant genes first, we may select many housekeeping genes, since their expression is known to be more uniform than that of tissue-specific genes.

Another observation relates to the case in which groups of cell-type profiles have high similarity within the group but are significantly distant from the rest. This makes identifying marker genes more challenging for these groups of cell-types. An instance of this problem arises when we consider markers in the PERT datasets: erythrocytes have a much larger number of distinguishing markers than the other references. This phenomenon is primarily attributed to the underlying similarity between the undifferentiated cell-types in the PERT datasets and their distance from fully differentiated red blood cells. In these cases, it is beneficial to summarize each group of similar tissues using a "representative profile" for the whole group, and to use a hierarchical structure to recursively identify markers at different levels of resolution [52].

Finally, we examine the common choice of condition


number as the criterion for selecting the number of markers. First, unlike the "U"-shaped plot reported in previous studies, in which the condition number initially decreases to an optimal point and then starts increasing, we observe variable behavior in the condition-number plot for both the Newman et al. and Abbas et al. methods. This makes it infeasible to generalize the condition number as a measure applicable to all datasets. Additionally, we note that the lowest condition number is achieved when G is fully orthogonal, that is, G^T G = κI for some constant κ. By selecting tissue-selective markers, we can ensure that the inner products of columns in the resulting matrix are close to zero. However, the 2-norms of the columns can still differ. We developed a method that grows the basis matrix while explicitly accounting for norm equality across columns. We find that in all cases our basis matrix has a lower condition number than those produced by the Newman et al. and Abbas et al. methods, but it does not always improve the overall performance of deconvolution methods using different loss functions. The optimal choice of the number of markers is another key question that requires further investigation.
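The claim above, that orthogonality alone is not enough, is easy to verify numerically. This toy example (values are illustrative) shows that a matrix with orthogonal but unequal-norm columns has a condition number equal to the ratio of the largest to the smallest column norm, while equal-norm orthogonal columns achieve the minimum value of 1.

```python
import numpy as np

Q = np.eye(4)[:, :3]                     # orthonormal columns: cond = 1
unequal = Q * np.array([1.0, 5.0, 0.2])  # still orthogonal, unequal norms

cond_equal = np.linalg.cond(Q)
cond_unequal = np.linalg.cond(unequal)   # sigma_max / sigma_min = 5 / 0.2 = 25
```

This is why a marker-selection scheme that balances column norms can drive the condition number lower than orthogonality-based selection alone.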

3.8 To Regularize or Not to Regularize

We now evaluate the impact of regularization on the performance of different deconvolution methods. To isolate the effect of the regularizer from prior filtering/feature-selection steps, we apply regularization to the original datasets. The R1 regularizer is typically applied in cases where the solution space is large, that is, the total number of available reference cell-types is a superset of the true cell-types in the mixture. This type of regularization acts as a "selector" that chooses the most relevant cell-types and zeroes out the coefficients of the rest, which has the effect of enforcing a sparsity pattern. The datasets used in this study are all controlled benchmarks in which references are hand-picked to match the cell-types in the mixture; thus, sparsifying the solution does not add value to the deconvolution process. On the other hand, an R2 regularizer, also known as Tikhonov regularization, is most commonly used when the problem is ill-posed. This is the case, for example, when the underlying cell-types are highly correlated with each other, which introduces dependency among the columns of the basis matrix. In order to quantify the impact of this type of regularization on the performance of deconvolution methods, we perform an experiment similar to the one in Section 3.4 with an added R2 regularizer. In this experiment, we use the L1 and L2 loss functions, as we previously showed that the performance of the other two loss functions is similar to L1. Instead of the Ridge regression introduced in Section 2.4.2, we implement an equivalent formulation, ‖m − Gc‖_2 + λ‖c‖_2, which traces the same path but has higher numerical accuracy. To identify the optimal value of the λ parameter that balances the relative importance of solution

Fig. 20. Effect of L2 regularization on the performance of deconvolution methods

fit versus regularization, we search over the range {10^-7, ..., 10^7}. Note that when λ is close to zero, the solution is identical to the one without regularization, whereas when λ → ∞ the deconvolution process is guided only by the size of the solution. Similar to the range-filtering step in Section 3.6, we use the minimum mAD error to choose the optimal value of λ.
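The λ sweep described above can be sketched with the classical squared-penalty (Tikhonov) form solved via an augmented least-squares system. This is a hedged illustration only: the survey's exact formulation uses unsquared norms, NN/STO constraints are omitted here, and the minimum-mAD selection assumes the true concentrations are known, as in these benchmark datasets.

```python
import numpy as np

def tikhonov_solve(G, m, lam):
    """Minimize ||m - G c||_2^2 + lam ||c||_2^2 via an augmented system."""
    q = G.shape[1]
    A = np.vstack([G, np.sqrt(lam) * np.eye(q)])
    b = np.concatenate([m, np.zeros(q)])
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c

def best_lambda(G, m, c_true, lambdas=None):
    """Pick the lambda on the grid with the lowest mAD against c_true."""
    if lambdas is None:
        lambdas = 10.0 ** np.arange(-7, 8)          # the {1e-7, ..., 1e7} grid
    errs = [np.abs(tikhonov_solve(G, m, lam) - c_true).mean() for lam in lambdas]
    i = int(np.argmin(errs))
    return lambdas[i], errs[i]

rng = np.random.default_rng(4)
G = rng.random((50, 3))                # toy reference basis
c_true = np.array([0.5, 0.3, 0.2])     # known concentrations
m = G @ c_true                         # noiseless toy mixture
lam, err = best_lambda(G, m, c_true)
```

On noiseless data the smallest λ wins, as expected; with correlated references and noise, larger λ values become optimal, matching the behavior discussed next.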

Figure 20 presents the changes in mAD error, compared to the original errors, after regularizing the loss functions with the R2 regularizer. From these observations, it appears that PERT_Cultured gains the most from regularization, whereas for PERT_Uncultured the changes are smaller. A detailed investigation, however, suggests that in the majority of cases for PERT_Cultured, the performance gain is due to over-shrinkage of the vector c toward an almost uniform solution. Interestingly, a uniform c has a lower mAD error for this dataset than most other results. Overall, both PERT datasets show significant improvements over the original solution, which can be attributed to the underlying similarity among hematopoietic cells. On the other hand, an unexpected observation is the performance gain in the L1 configurations for the BreastBlood dataset. This is primarily explained by the limited number of cell-types (only two), combined with the similar concentrations used in all samples (only combinations of 67% and 33%).

To gain additional insight into the parameters used during deconvolution, we plot the optimal λ values for each configuration in each dataset. Figure 21 summarizes the optimal values of the λ parameter. Large values indicate a beneficial effect of regularization, whereas small values suggest a negative impact. In all cases where the overall mAD score improved, the corresponding λ parameter was large. However, large values of λ do not necessarily indicate a significant impact on the final solution, as is evident in the CellLines and LiverBrainLung datasets. Finally, we observe that the cases where the value of λ is close to zero are primarily associated with the L2 loss function.

3.9 Summary

Based on our observations, we propose the following guidelines for the deconvolution of expression datasets:


Fig. 21. Optimal value of λ (log10 scale) for each dataset/configuration pair

1) Pre-process reference profiles and mixtures using invariant, universally expressed (housekeeping) genes to ensure that the similar cell quantity (SCQ) constraint is satisfied.

2) Filter violating features that cannot satisfy the sum-to-one (STO) constraint.

3) Filter lower and upper bounds of gene expression using adaptive range filtering.

4) Select cell-type-specific markers that are invariant (among references, and between references and samples) to enhance the discriminating power of the basis matrix.

5) Solve the regression using the L1 loss function with explicit constraints, together with an R2 regularizer, or with group LASSO if sparsity is desired among groups of tissues/cell-types.

6) Use the L-curve method to identify the optimal balance between the regression fit and the regularization penalty.

4 CONCLUDING REMARKS

In this paper, we present a comprehensive review of methods for deconvolving linear mixtures of cell-types in complex tissues. We perform a systematic analysis of the impact of different algorithmic choices on the performance of deconvolution methods, including the choice of loss function, constraints on solutions, data filtering, feature selection, and regularization. We find the L2 loss to be superior in cases where the reference cell-types are representative of the constitutive cell-types in the mixture, while L1 outperforms L2

in cases where this condition does not hold. Explicit enforcement of the sum-to-one (STO) constraint typically degrades the performance of deconvolution. We propose simple bounds to identify features violating this constraint and evaluate the total number of violating features in each dataset. We observe an unexpectedly high number of features that cannot satisfy the STO condition, which can be attributed to problems with the normalization of expression profiles, specifically the normalization of references and samples with respect to each other. In terms of filtering the range of expression values, we

find that fixed thresholding is not effective and develop an adaptive method that filters each dataset individually. Furthermore, we observe that range filtering is not always beneficial for deconvolution and, in fact, in some cases can deteriorate performance. We implement two commonly used marker-selection methods from the literature to assess their effect on the deconvolution process. Orthogonalizing reference profiles can enhance the discriminating power of the basis matrix. However, due to the known correlation between the mean and variance of expression values, this process alone does not always provide satisfactory results. Another key factor to consider is the low biological variance of genes, which enhances the reproducibility of results and allows deconvolution with noisy references. The combination of range filtering and marker selection eliminates genes with high mean expression, which in turn enhances the observed results. Finally, we address the application of Tikhonov regularization in cases where reference cell-types are highly correlated and the regression problem is ill-posed.

We summarize our findings in a simple set of guidelines and identify open problems that need further investigation. Areas of particular interest for future research include: (i) identifying the proper set of filters for a given dataset, (ii) expanding the deconvolution problem to cases with a more complex, hierarchical structure among reference vectors, and (iii) selecting optimal features to reduce computation time while maximizing discriminating power.

ACKNOWLEDGMENT

This work is supported by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370, and by NSF Grant BIO 1124962.

REFERENCES

[1] N. Gillis, “Successive Nonnegative Projection Al-gorithm for Robust Nonnegative Blind Source Sep-aration,” SIAM Journal on Imaging Sciences, vol. 7,no. 2, pp. 1420–1450, Jan. 2014.

[2] W.-K. Ma et al., “A Signal Processing Perspectiveon Hyperspectral Unmixing: Insights from RemoteSensing,” IEEE Signal Processing Magazine, vol. 31,no. 1, pp. 67–81, Jan. 2014.

[3] D. Nuzillard and A. Bijaoui, “Blind source sepa-ration and analysis of multispectral astronomicalimages,” Astronomy& astrophysics supplement series,vol. 147, pp. 129–138, Nov. 2000.

[4] V. P. Pauca, J. Piper, and R. J. Plemmons, “Nonneg-ative matrix factorization for spectral data analy-sis,” Linear Algebra and its Applications, vol. 416, no.1, pp. 29–47, Jul. 2006.

Page 18: SPECIAL ISSUE OF PROCEEDINGS OF IEEE - FOUNDATIONS ... · SPECIAL ISSUE OF PROCEEDINGS OF IEEE - FOUNDATIONS & APPLICATIONS OF SCIENCE OF INFORMATION, VOL. X, NO. X, X 2015 3 RNASeq

REFERENCES 18

[5] E. Villeneuve and H. Carfantan, “Hyperspec-tral data deconvolution for galaxy kinematicswith mcmc,” in Signal Processing Conference (EU-SIPCO), 2012 Proceedings of the 20th European, 2012,pp. 2477–2481.

[6] A. C. Tang, B. A. Pearlmutter, M. Zibulevsky, andS. A. Carter, “Blind source separation of multi-channel neuromagnetic responses,” Neurocomput-ing, vol. 3233, pp. 1115 –1120, 2000.

[7] C. Hesse and C. James, “On semi-blind sourceseparation using spatial constraints with applica-tions in eeg analysis,” Biomedical Engineering, IEEETransactions on, vol. 53, no. 12, pp. 2525–2534, 2006.

[8] C. Vaya, J. J. Rieta, C. Sanchez, and D. Moratal,“Convolutive blind source separation algorithmsapplied to the electrocardiogram of atrial fibrilla-tion: Study of performance,” IEEE Transactions onBiomedical Engineering, vol. 54, no. 8, pp. 1530–1533,2007.

[9] K. Zhang and A. Hyvarinen, “Source separationand higher-order causal analysis of MEG andEEG,” CoRR, vol. abs/1203.3533, 2012.

[10] M. Pedersen, U. Kjems, K. Rasmussen, andL. Hansen, “Semi-blind source separation usinghead-related transfer functions [speech signal sep-aration],” in Acoustics, Speech, and Signal Processing,2004. Proceedings. (ICASSP ’04). IEEE InternationalConference on, vol. 5, 2004, V–71316 vol.5–.

[11] M. Yu, W. Ma, J. Xin, and S. Osher, “Multi-channel l1 regularized convex speech enhance-ment model and fast computation by the splitbregman method,” Audio, Speech, and LanguageProcessing, IEEE Transactions on, vol. 20, no. 2,pp. 661–675, 2012.

[12] E. Vincent, N. Bertin, R. Gribonval, and F. Bimbot,“From blind to guided audio source separation:How models and side information can improvethe separation of sound,” IEEE Signal ProcessingMagazine, vol. 31, no. 3, pp. 107–115, May 2014.

[13] N. Souviraa-Labastie, A. Olivero, E. Vincent, andF. Bimbot, “Multi-channel audio source separationusing multiple deformed references,” IEEE Trans-actions on Audio, Speech and Language Processing,vol. 23, no. 11, pp. 1775–1787, Jun. 2015.

[14] A. Kuhn, A. Kumar, A. Beilina, A. Dillman, M. R. Cookson, and A. B. Singleton, "Cell population-specific expression analysis of human cerebellum," BMC Genomics, vol. 13, p. 610, 2012.

[15] A. M. Newman et al., "Robust enumeration of cell subsets from tissue expression profiles," Nature Methods, pp. 1–10, 2015.

[16] S. S. Shen-Orr and R. Gaujoux, "Computational deconvolution: extracting cell type-specific information from heterogeneous samples," Current Opinion in Immunology, vol. 25, no. 5, pp. 571–578, Oct. 2013.

[17] O. L. Mangasarian and D. R. Musicant, "Robust linear and support vector regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 9, pp. 950–955, 2000.

[18] V. Vapnik, Statistical Learning Theory. Wiley, 1998.

[19] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.

[20] A. R. Abbas, K. Wolslegel, D. Seshasayee, Z. Modrusan, and H. F. Clark, "Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus," PLoS ONE, vol. 4, no. 7, e6098, 2009.

[21] W. Qiao, G. Quon, E. Csaszar, M. Yu, Q. Morris, and P. W. Zandstra, "PERT: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions," PLoS Computational Biology, vol. 8, no. 12, e1002838, 2012.

[22] T. Gong et al., "Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples," PLoS ONE, vol. 6, no. 11, e27156, 2011.

[23] Z. Altboum et al., "Digital cell quantification identifies global immune cell dynamics during influenza infection," Molecular Systems Biology, vol. 10, no. 2, p. 720, Mar. 2014.

[24] S. S. Shen-Orr et al., "Cell type-specific gene expression differences in complex tissues," Nature Methods, vol. 7, no. 4, pp. 287–289, 2010.

[25] J. Kim, Y. He, and H. Park, "Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework," Journal of Global Optimization, vol. 58, no. 2, pp. 285–319, 2013.

[26] D. Venet, F. Pecasse, C. Maenhaut, and H. Bersini, "Separation of samples into their constituents using gene expression data," Bioinformatics, vol. 17, no. Suppl 1, pp. S279–S287, 2001.

[27] D. Repsilber et al., "Biomarker discovery in heterogeneous tissue samples - taking the in-silico deconfounding approach," BMC Bioinformatics, vol. 11, p. 27, 2010.

[28] N. S. Zuckerman, Y. Noam, A. J. Goldsmith, and P. P. Lee, "A self-directed method for cell-type identification and separation of gene expression microarrays," PLoS Computational Biology, vol. 9, no. 8, e1003189, 2013.

[29] R. Gaujoux and C. Seoighe, "Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: a case study," Infection, Genetics and Evolution, vol. 12, no. 5, pp. 913–921, 2012.

[30] Y. Zhong, Y.-W. Wan, K. Pang, L. M. L. Chow, and Z. Liu, "Digital sorting of complex tissues for cell type-specific gene expression profiles," BMC Bioinformatics, vol. 14, p. 89, 2013.

[31] T. Erkkila, S. Lehmusvaara, P. Ruusuvuori, T. Visakorpi, I. Shmulevich, and H. Lahdesmaki, "Probabilistic analysis of gene expression measurements from heterogeneous tissues," Bioinformatics, vol. 26, no. 20, pp. 2571–2577, 2010.

[32] W. Ju et al., "Defining cell-type specificity at the transcriptional level in human disease," Genome Research, vol. 23, no. 11, pp. 1862–1873, 2013.

[33] G. Quon, S. Haider, A. G. Deshwar, A. Cui, P. C. Boutros, and Q. Morris, "Computational purification of individual tumor gene expression profiles leads to significant improvements in prognostic prediction," Genome Medicine, vol. 5, no. 3, p. 29, 2013.

[34] D. A. Liebner, K. Huang, and J. D. Parvin, "MMAD: microarray microdissection with analysis of differences is a computational tool for deconvoluting cell type-specific contributions from tissue samples," Bioinformatics, vol. 30, no. 5, pp. 682–689, 2014.

[35] R. Gaujoux and C. Seoighe, "CellMix: a comprehensive toolbox for gene expression deconvolution," Bioinformatics, vol. 29, no. 17, pp. 2211–2212, 2013.

[36] A. Kuhn, D. Thu, H. J. Waldvogel, R. L. M. Faull, and R. Luthi-Carter, "Population-specific expression analysis (PSEA) reveals molecular changes in diseased brain," Nature Methods, vol. 8, no. 11, pp. 945–947, 2011.

[37] S. Siegert et al., "Transcriptional code and disease map for adult retinal cell types," 2012.

[38] M. Grant and S. Boyd, "Graph implementations for nonsmooth convex programs," in Recent Advances in Learning and Control, V. Blondel, S. Boyd, and H. Kimura, Eds., ser. Lecture Notes in Control and Information Sciences, Springer-Verlag Limited, 2008, pp. 95–110. http://stanford.edu/~boyd/graph_dcp.html

[39] CVX: Matlab software for disciplined convex programming, version 2.1, http://cvxr.com/cvx, Mar. 2014.

[40] The MOSEK optimization software, http://www.mosek.com/.

[41] E. Eisenberg and E. Y. Levanon, "Human housekeeping genes, revisited," 2013.

[42] J. A. Gagnon-Bartsch and T. P. Speed, "Using control genes to correct for unwanted variation in microarray data," Biostatistics, vol. 13, no. 3, pp. 539–552, 2012.

[43] J. Ahn et al., "DeMix: deconvolution for mixed cancer transcriptomes using raw measured data," Bioinformatics, vol. 29, no. 15, pp. 1865–1871, 2013.

[44] H. Kawaji et al., "Comparison of CAGE and RNA-seq transcriptome profiling using clonally amplified and single-molecule next-generation sequencing," Genome Research, vol. 24, no. 4, pp. 708–717, Apr. 2014.

[45] Y. Tu, G. Stolovitzky, and U. Klein, "Quantitative noise analysis for gene expression microarray experiments," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 22, pp. 14031–14036, 2002.

[46] M. Jeanmougin, A. de Reynies, L. Marisa, C. Paccard, G. Nuel, and M. Guedj, "Should we abandon the t-test in the analysis of gene expression microarray data: a comparison of variance modeling strategies," PLoS ONE, vol. 5, no. 9, e12336, Jan. 2010.

[47] N. R. Clark et al., "The characteristic direction: a geometrical approach to identify differentially expressed genes," BMC Bioinformatics, vol. 15, no. 1, p. 79, 2014.

[48] K. Van Deun, H. Hoijtink, L. Thorrez, L. Van Lommel, F. Schuit, and I. Van Mechelen, "Testing the hypothesis of tissue selectivity: the intersection-union test and a Bayesian approach," Bioinformatics, vol. 25, no. 19, pp. 2588–2594, Oct. 2009.

[49] F. M. G. Cavalli, R. Bourgon, W. Huber, J. M. Vaquerizas, and N. M. Luscombe, "SpeCond: a method to detect condition-specific gene expression," Genome Biology, vol. 12, no. 10, R101, Jan. 2011.

[50] K. Kadota, J. Ye, Y. Nakai, T. Terada, and K. Shimizu, "ROKU: a novel method for identification of tissue-specific genes," BMC Bioinformatics, vol. 7, p. 294, Jan. 2006.

[51] K. D. Birnbaum and E. Kussell, "Measuring cell identity in noisy biological systems," Nucleic Acids Research, vol. 39, no. 21, pp. 9093–9107, Nov. 2011.

[52] S. Mohammadi and A. Grama, "A novel method to enhance the sensitivity of marker detection using a refined hierarchical prior of tissue similarities," Tech. Rep., Jun. 2015.

Shahin Mohammadi received his Master's degree in Computer Science from Purdue University in Dec. 2012 and is currently a Ph.D. candidate at Purdue. His research interests include computational biology, machine learning, and parallel computing. His current work spans different areas of Bioinformatics/Systems Biology and aims to develop computational methods coupled with statistical models for data-intensive problems, with applications in mining the human tissue-specific transcriptome and interactome.

Neta Zuckerman received her PhD degree in computational biology from Bar-Ilan University, Israel, in June 2010. She completed her post-doctorate in 2015 as a computational biologist at the Stanford University School of Medicine and City of Hope, as well as a visiting scholar at the Department of Electrical Engineering, Stanford University. Her research interests focus on investigating the role of immune cells in the setting of various diseases, specifically cancer, utilizing algorithm development and microarray data analysis. Neta is currently a computational biologist at Genentech Inc.


Andrea Goldsmith is the Stephen Harris Professor in the School of Engineering and a Professor of Electrical Engineering at Stanford University. Her research interests are in information theory and communication theory, and their application to wireless communications as well as biology and neuroscience. She has received several awards for her work, including the IEEE ComSoc Edwin H. Armstrong Achievement Award, the IEEE ComSoc and Information Theory Society Joint Paper Award, and the National Academy of Engineering Gilbreth Lecture Award. She is the author of three textbooks, including Wireless Communications, all published by Cambridge University Press, as well as an inventor on 28 patents.

Ananth Grama received the PhD degree from the University of Minnesota in 1996. He is currently a Professor of Computer Sciences and Associate Director of the Center for Science of Information at Purdue University. His research interests span areas of parallel and distributed computing architectures, algorithms, and applications. On these topics, he has authored several papers and texts. He is a member of the American Association for the Advancement of Science.

