Comparative Analysis of Linear and Nonlinear Dimension Reduction Techniques on Mass Cytometry Data

Anna Konstorum∗1, Nathan Jekel2, Emily Vidal3 and Reinhard Laubenbacher1,4

1 Center for Quantitative Medicine, UConn Health, Farmington, CT
2 Department of Mathematics, Indiana University East, Richmond, IN
3 Department of Mathematics, Angelo State University, San Angelo, TX
4 Jackson Laboratory for Genomic Medicine, Farmington, CT

Abstract

Mass cytometry, also known as CyTOF, is a newly developed technology for quantification and classification of immune cells that can allow for analysis of up to hundreds of markers per cell. The high dimensional data that is generated requires innovative methods for analysis and visualization. We conducted a comparative analysis of four dimension reduction techniques, namely principal component analysis (PCA), isometric feature mapping (Isomap), t-distributed stochastic neighbor embedding (t-SNE), and Diffusion Maps, by implementing them on a benchmark mass cytometry data set. We compare the results of these reductions using computation time, residual variance, neighborhood proportion error (NPE), and two-dimensional visualizations. We find that t-SNE and Diffusion Maps are the two most effective methods for preserving local distance relationships among cells and providing informative visualizations. In low dimensional embeddings, t-SNE exhibits well-defined phenotypic clustering. Additionally, Diffusion Maps can represent cell differentiation pathways with long projections along each diffusion component. We thus recommend a complementary approach using t-SNE and Diffusion Maps to visualize CyTOF data in order to extract diverse information in a two-dimensional setting from the high-dimensional CyTOF data.

1 Introduction

Many current questions in the field of immunology require cell classification to analyze the responses of immune cells to external stimulants [19]. Flow cytometry, which allows the single-cell analysis of marker expression via fluorescent-antibody labeling, has been used by immunologists to classify and sort cells since the 1960s [33]. Mass cytometry, also known as CyTOF (cytometry by time of flight), is a newly developed technique for single cell measurement which has opened the door to new functional and phenotypic information that can help immunologists in these analyses. Unlike flow cytometry, where the number of markers analyzed is limited by fluorescence spectral overlap, in mass cytometry antibodies are labeled with heavy metal ion tags that can be read using inductively coupled plasma mass spectrometry (ICP-MS) [14]. Hence, a main advantage of CyTOF over flow cytometry is its ability to measure over 100 biomarkers simultaneously. While this capability is a breakthrough in the observation of single cell parameters, it comes with its own set of challenges. As with any high throughput technique, mass cytometry produces such large amounts of data that it is difficult to extract the salient information. For the data to be useful, it is necessary to apply innovative data analysis techniques [4].

∗Corresponding author: [email protected]

bioRxiv preprint doi: https://doi.org/10.1101/273862; this version posted March 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

One of the most powerful tools for analyzing high dimensional data is dimension reduction. Dimension reduction renders high dimensional data in a few dimensions while preserving their significant characteristics. There exists a wide array of techniques for dimension reduction, each of which aims to preserve a specific feature of the data by finding the intrinsic degrees of freedom with respect to that feature. While marker-specific information is lost in this process, critical information regarding intercellular relationships is retained in dimension-reduced data.

In order to decide which dimension reduction technique to use, it can be helpful to know how each technique performs on mass cytometry data with respect to a given criterion. We thus consider how each of four popular dimension reduction techniques, principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), isometric feature mapping (Isomap), and Diffusion Maps, performs on a manually gated benchmark mass cytometry data set with respect to computation time, neighborhood proportion error, residual variance, and ability to cluster known cell types and track differentiation trajectories. We chose these four dimension reduction methods for the following reasons. PCA is a highly-utilized and computationally efficient linear method for dimension reduction. T-SNE exaggerates cluster formation and has been extensively employed in CyTOF analysis via the viSNE toolkit [2]. Isomap, on the other hand, does not exaggerate cluster formation, and preserves geodesic distances between points, which can allow for preservation of nonlinear interactions (unlike PCA) and global relationships between cells and cell clusters (unlike t-SNE). Finally, Diffusion Maps, like Isomap, can preserve nonlinear interactions and global relationships between cells, and has the additional property of being highly suitable for extracting differentiation trajectories between cells [18], which is often of interest in experimental studies utilizing CyTOF.

In addition to dimension-reduction techniques, there exist several computational algorithms for analysis of mass cytometry data, including but not limited to methods based on agglomerative clustering (SPADE [31], FLOW-MAP [44], and Citrus), neural networks (FLOW-SOM [41]), and graph-based, trajectory-seeking algorithms (Wanderlust [7], Wishbone [35]) (see [9, 27, 34] for recent reviews of the applications of these and other algorithms to mass cytometry). The SPADE and FLOW-SOM algorithms organize clusters of points into a minimal spanning tree (MST), while FLOW-MAP uses a highly connected graph structure to connect clusters instead. The assumption that all clusters are connected can create artificial relationships between clusters and/or cells if the common progenitor of two cell types is not included in the population. Similarly, the assumption by trajectory-seeking algorithms such as Wanderlust (which does not allow branching) or Wishbone (which does) that all cells are part of a developmental hierarchy may not hold for more heterogeneous cell populations. We focus on dimensionality-reduction methods since they require fewer assumptions about the nature of cell-cell relationships for any given analysis, and hence provide the most unsupervised analysis possible for this type of data.

2 Methods

2.1 Dimension Reduction Techniques

2.1.1 Principal Component Analysis (PCA)

PCA is a linear dimension reduction method that is designed to preserve the variability of a data set, and has been recognized as an effective method to analyze high-throughput biological data [32]. PCA has been used on CyTOF data in a variety of settings: to demonstrate that different subsets of CD8+ T cells actually form a continuum, with cells that can be considered intermediate phenotypes bridging different subsets [28], to help develop a T cell epitope binding prediction algorithm [29], and to show that a conserved T cell transcriptional profile including CD161 expression exists among different T cell subtypes [15].

Given a data matrix $X_{m \times n}$, with $m$ attributes and $n$ samples, its covariance matrix $C_X$ is given by

$$C_X = \frac{1}{n} X X^T. \tag{1}$$

The diagonal elements of the matrix $C_X$ represent the variances of the attributes, and the off-diagonal elements the covariances. Principal component analysis seeks a transformation

$$PX = Y \tag{2}$$

such that the diagonal elements of $C_Y$ are rank-ordered and the off-diagonal elements are equal to zero. It can be shown that this $P$ is the eigenvector matrix of $C_{X^*}$, where $X^*$ is the mean-centered matrix of $X$. Each of these eigenvectors, termed principal components, is a linear combination of the original attributes, and the eigenvectors are orthogonal to each other. The first principal component captures the most variance of the data that can be captured by a linear combination of the attributes, the second principal component captures the most variance of the data that can be captured by a linear combination of the attributes after the first principal component has been accounted for, and so on. In practice, the principal components are often calculated using the singular value decomposition (SVD), which can take any matrix $X_{m \times n}$ and convert it into a product of an $m \times m$ orthogonal matrix (containing the left singular vectors of $X$), an $m \times n$ rectangular diagonal matrix (containing the singular values of $X$), and an $n \times n$ orthogonal matrix (containing the right singular vectors of $X$). With the attributes in the rows of $X$, as above, the principal components of $X$ are the left singular vectors of $X^*$ (equivalently, the right singular vectors of its transpose). Since calculating the SVD of a matrix is more computationally efficient than finding the eigenvectors of its covariance matrix, the SVD is commonly used to determine the principal components of $X$ in practice. For more details regarding the technical background of PCA and SVD, see the tutorial by Shlens [37].
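As an illustration of Equations (1) and (2), the following minimal sketch computes the covariance matrix of a small mean-centered matrix and its leading principal component. It is not the implementation used in this study (which relies on R's princomp): power iteration is swapped in for the SVD to keep the example dependency-free, and the function names and the toy matrix are hypothetical.

```python
import math

def mean_center(X):
    """Subtract each attribute's mean; X has m rows (attributes) and n columns (samples)."""
    centered = []
    for row in X:
        mu = sum(row) / len(row)
        centered.append([v - mu for v in row])
    return centered

def covariance(Xc):
    """C_X = (1/n) X X^T for a mean-centered m x n matrix (Equation 1)."""
    n = len(Xc[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / n for rj in Xc] for ri in Xc]

def leading_component(C, iters=200):
    """Power iteration (an assumption of this sketch, not SVD): returns the
    eigenvector of C with the largest eigenvalue and the variance it captures."""
    m = len(C)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * C[i][j] * v[j] for i in range(m) for j in range(m))
    return v, variance

# Toy data: 2 attributes, 4 samples; the second attribute is twice the first.
X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0]]
Xc = mean_center(X)
pc1, var1 = leading_component(covariance(Xc))

# One row of Y = PX (Equation 2): the samples projected onto the first component.
scores = [sum(pc1[i] * Xc[i][s] for i in range(len(Xc))) for s in range(len(Xc[0]))]
```

For this toy matrix all of the variance lies along the direction (1, 2), so a single component reproduces the data; real CyTOF data would require further components, obtained for instance by deflating the covariance matrix or by a full SVD.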

2.1.2 T-distributed Stochastic Neighbor Embedding (t-SNE)

The t-SNE algorithm is a non-linear dimension reduction method that associates probability distributions to all high-dimensional points and seeks to maintain a probability distribution profile in the low-dimensional embedding. It has been used to visualize and cluster data in a variety of CyTOF-related studies, including healthy and leukemic bone marrow [2], the human mucosal immune system [42], and phenotypic diversity of human regulatory T cells (Tregs) [25]. T-SNE is the reduction technique behind the viSNE algorithm for mass cytometry data [2].

In t-SNE, the similarity $p_{ij}$ between two points $i$ and $j$ is calculated using a Gaussian joint probability distribution, which gives a joint probability distribution $P = \{p_{ij}\}$. To compute the joint probability distribution, $Q = \{q_{ij}\}$, in the low-dimensional space, a Cauchy distribution is used since it has a heavier tail than the Gaussian distribution and can thus mitigate the effect of the 'crowding problem', which can push points at moderate distances too far away in the low-dimensional map, and points that are nearby too close together [24]. The goal of t-SNE is to minimize the difference between the two distributions; hence a measure for the divergence of two probability distributions, the Kullback-Leibler (KL) divergence, is used as the cost function and minimized using gradient descent:

$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}. \tag{3}$$

T-SNE is effective at extracting natural clusters since data points further apart from each other in the high-dimensional space are assigned disproportionately smaller probabilities, which exaggerates the boundaries between clusters in the low dimensional embedding [24].
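The cost function in Equation (3) can be evaluated directly for a small example. The sketch below is illustrative only and makes simplifying assumptions: a single fixed Gaussian bandwidth `sigma` is used for all points (practical t-SNE implementations tune a bandwidth per point from a perplexity parameter), and the gradient descent step is omitted. All names and data are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def joint_probabilities(points, kernel):
    """Apply kernel(squared distance) to every ordered pair i != j and
    normalize so that all entries sum to 1."""
    n = len(points)
    w = [[0.0 if i == j else kernel(sq_dist(points[i], points[j]))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def kl_divergence(P, Q):
    """C = sum_i sum_j p_ij * log(p_ij / q_ij); zero-probability pairs are skipped."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0)

sigma = 1.0  # assumed fixed bandwidth
gaussian = lambda d2: math.exp(-d2 / (2 * sigma ** 2))
cauchy = lambda d2: 1.0 / (1.0 + d2)  # heavy-tailed kernel for the embedding

high = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]  # input space
low = [(0.0,), (1.0,), (7.0,)]                               # candidate embedding

P = joint_probabilities(high, gaussian)
Q = joint_probabilities(low, cauchy)
cost = kl_divergence(P, Q)
```

A gradient descent routine would repeatedly move the points in `low` to decrease `cost`; the cost is zero only when the two distributions coincide.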

2.1.3 Isometric Feature Mapping (Isomap)

Isomap is another well-known nonlinear dimension reduction method [40]. Unlike t-SNE, it does not exaggerate distances between clusters, and hence can be used to obtain more appropriate distance measures between different cell types and to investigate differentiation trajectories, which do not naturally lend themselves to clustering. Becher et al. [5] used Isomap to investigate the relationships between different myeloid subsets based on CyTOF data, and identified potential differentiation trajectories. Notably, they used it as an adjunct to t-SNE, which was used to aid in clustering different cell types in 2D, but was identified by the authors as not an appropriate method to identify differentiating cells. Similarly, Wong et al. [43] used Isomap to probe differentiation of CD4+ T cells in peripheral tissue (blood and tonsils) and, in conjunction with t-SNE, identified clusters of cells representing increasingly differentiated states along the trajectories. Using three different microarray datasets, Dawson et al. [12] showed that Isomap was able to effectively cluster and establish relationships between different treatments for spinal cord injury data, gene expression data from different rat tissues, and a high-throughput drug screen against acute myeloid leukemia (AML), demonstrating the effectiveness of Isomap in distinguishing important phenotypes amongst a diverse set of datasets.

In Isomap, the distance $\delta_{i,j}$ between two points $i$ and $j$ in the original data is measured by the geodesic distance: for a point $i$ in the original dataset $X$, points $j$ within $\epsilon$ Euclidean distance of $i$, or the $K$-nearest neighbors of $i$, are assigned their Euclidean distance to the entry $\delta_{i,j}$ in the geodesic distance matrix $D_G = \{\delta_{i,j}\}$. For points $j$ outside of a neighborhood of $i$, $\delta_{i,j}$ is assigned the shortest path distance between $i$ and $j$. Isomap seeks to preserve the geodesic distances recorded in $D_G$ for all points in $X$ by using the linear dimension-reduction technique multidimensional scaling (MDS) on $D_G$ [11, 16]. MDS can identify a low-dimensional representation of $X$, $Y$, such that the error function

$$E = \| \tau(D_G) - \tau(D_Y) \|_{L^2}, \tag{4}$$

is minimized, where $\| \cdot \|_{L^2}$ is the $L^2$ norm, $D_Y$ the matrix of Euclidean distances of points in $Y$, and $\tau$ is the transformation

$$\tau(D_G) = HAH, \tag{5}$$

where $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is a 'centering matrix' (with $\mathbf{1}$ the all-ones vector), $n$ the number of data points, and $A = \{-\frac{1}{2}\delta_{ij}^2\}$. The $\tau$ operator allows recovery of $Y$ via the eigendecomposition of $\tau(D_G)$.
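The construction of the geodesic distance matrix $D_G$ can be sketched as follows. This is an illustrative example rather than the vegan implementation used in this study: it assumes the $K$-nearest-neighbor variant of the neighborhood, symmetrizes the neighborhood graph, and uses the Floyd-Warshall algorithm for shortest paths (one of several possible choices). The function name and the toy points are hypothetical, and the subsequent MDS step is not shown.

```python
import math

def geodesic_distances(points, k):
    """Approximate geodesic distances: Euclidean distance between k-nearest
    neighbors, shortest-path distance along the neighbor graph otherwise."""
    n = len(points)
    D = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            # Symmetrize: an edge exists if either point lists the other as a neighbor.
            D[i][j] = D[j][i] = math.dist(points[i], points[j])
    # Floyd-Warshall: relax every pair through every intermediate point.
    for mid in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][mid] + D[mid][j] < D[i][j]:
                    D[i][j] = D[i][mid] + D[mid][j]
    return D

# Toy data: five points along an arch; the end points are 4 units apart in a
# straight line but further apart when walking along the arch.
arch = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.2), (3.0, 1.0), (4.0, 0.0)]
D_G = geodesic_distances(arch, k=1)
```

Here `D_G[0][4]` follows the arch and is therefore larger than the straight-line distance between the two end points. Entries remain infinite if the neighbor graph is disconnected, which is one practical reason the neighborhood size matters. The cubic cost of the shortest-path step also gives some intuition for why Isomap scales poorly with the number of cells.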

2.1.4 Diffusion Maps

Diffusion Maps is based on a spectral analysis of a diffusion matrix built from random walk probabilities between cells (see below), and has been championed for use in differentiation analysis of single-cell data, especially as it offers a robustness to noise that Isomap lacks [10]. For example, Haghverdi et al. [18] applied Diffusion Maps to a toy differentiation dataset and to qPCR and RNA-Seq data from differentiation processes including mouse hematopoietic differentiation, differentiating cells from mouse zygote to blastocyst, and human preimplantation embryos. They showed that Diffusion Maps can identify differentiation trajectories on a low-dimensional representation of these differentiation processes that are robust to noise and sampling heterogeneity.

We base our discussion of Diffusion Maps on [13]. If we take $p(x, y)$ as the probability of jumping, in a random walk, between two points $x$ and $y$ in a dataset $X$, then we can calculate $p$ via a symmetric and positivity-preserving Gaussian kernel, $k(x, y)$. We take the diffusion matrix $P = \{p(x_i, x_j)\}$, with $x_i, x_j \in X$. Note that powers of $P$, $P^t = \{p_t(x_i, x_j)\}$, give probabilities of moving from $x_i$ to $x_j$ in $t$ steps. The diffusion distance is then taken to be

$$D_t(x_i, x_j)^2 = \sum_{u \in \Omega} |p_t(x_i, u) - p_t(x_j, u)|^2. \tag{6}$$

A large number of high probability paths between two points $x_i$ and $x_j$ will give a small $D_t(x_i, x_j)$, and vice-versa. Diffusion Maps seeks to identify a low-dimensional embedding $Y$ of the points in $X$ such that the Euclidean distance between two points $y_i, y_j \in Y$ mapped from $x_i$ and $x_j$ will approximate the diffusion distance $D_t(x_i, x_j)$. The low-dimensional embedding $Y$ is found by assigning to $Y$ the dominant eigenvectors associated with a spectral decomposition of $P$ [10, 13].
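The diffusion distance of Equation (6) can be computed explicitly for a handful of points. The sketch below is a simplified illustration and not the destiny implementation used in this study: it assumes a Gaussian kernel with a single hand-picked bandwidth `sigma`, row-normalizes the kernel to obtain the transition probabilities, takes the sum in Equation (6) over all points in the dataset, and does not perform the spectral decomposition. All names and data are hypothetical.

```python
import math

def transition_matrix(points, sigma):
    """Row-normalize a Gaussian kernel to get random walk probabilities p(x_i, x_j)."""
    K = [[math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2)) for q in points]
         for p in points]
    return [[v / sum(row) for v in row] for row in K]

def mat_mul(A, B):
    """Plain matrix product of two square matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def mat_power(P, t):
    """P^t for an integer t >= 1: the t-step transition probabilities."""
    result = P
    for _ in range(t - 1):
        result = mat_mul(result, P)
    return result

def diffusion_distance(Pt, i, j):
    """Equation (6): Euclidean distance between rows i and j of P^t."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(Pt[i], Pt[j])))

# Toy data: two tight pairs of points that are far from each other.
cells = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
P = transition_matrix(cells, sigma=1.0)
P3 = mat_power(P, 3)

within = diffusion_distance(P3, 0, 1)   # same pair: many high-probability paths
between = diffusion_distance(P3, 0, 2)  # different pairs: few such paths
```

Points in the same pair are connected by many high-probability paths, so `within` is small, while `between` is large, in line with the interpretation given above.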

2.2 Benchmark Data

The dataset we use to compare the various dimension reduction methods was first generated by Bendall et al. [8] and consists of ∼170,000 cells with 13 markers, of which ∼50% have been manually gated into 24 cell populations. The cells are derived from healthy primary human bone marrow (bone marrow mononuclear cells, BMMCs), and hence represent the hematopoietic differentiation spectrum in the bone marrow (Figure 1). We repeat our analysis on an additional benchmark BMMC dataset from two healthy human donors generated by Levine et al. [22] that consists of ∼160,000 cells with 32 markers, of which ∼35% have been manually gated into 14 cell populations. Both datasets were obtained from [1]. Additional information on the benchmark datasets and the analysis of the 32 marker dataset is found in the Supplementary Material. The availability of manual gating data alongside markers for the cells allows for a comparison of how well the dimension reduction techniques can identify known cell populations and the relationships between the known populations.

Figure 1: Differentiation of cells from hematopoietic stem cells. Cells from different stages in the hematopoietic differentiation process were collected in both the 13- and 32-marker benchmark datasets. Figure adapted from [26].

2.3 Comparison Metrics

2.3.1 Computation Time

The computation time for each method depends heavily on the software and the implementation used. For each technique, we chose the most optimized implementation currently available for R [39] (Table 1). All dimension reductions were executed on a MacBook Air (2.2GHz Intel Core i7 processor and 8 GB memory). Currently, the most widely used dimension reduction technique for mass cytometry data is t-SNE, which is the method behind the popular viSNE algorithm. ViSNE takes a random sample of between 6,000 and 12,000 cells to perform its reductions [2]. We chose to take three random subsets of the data, each consisting of 10,000 cells, and measured the run time for one reduction from n = 13 to m = 2 dimensions on each subset. Our final computation time result is the average of the three reductions for each technique.

Method          R Package         Repository
PCA             stats (princomp)  R
t-SNE           Rtsne             CRAN
Isomap          vegan             CRAN
Diffusion Maps  destiny           Bioconductor

Table 1: The R packages used to implement each dimension reduction technique, as well as the repositories from which they are available [30, 3, 20, 21].

T-SNE is not designed to operate efficiently on more than 10,000 cells, although a random walk based version has been proposed that will reduce a subset of data while using the information from the entire data set [24]. The Rtsne package is a version of the Barnes-Hut implementation of t-SNE, which is an accelerated implementation and the best suited for large data sets [23]. The vegan package provides a traditional implementation of the Isomap algorithm, which involves complex computations that render it highly inefficient on large data sets. The most notable inefficiencies in the algorithm are the construction of neighborhood graphs and the MDS eigenvalue decomposition. A newer variation of Isomap called L-Isomap uses landmark points to calculate distances, which simplifies the computation [38]. Due to potential topological instability exacerbated by the landmark approach, questions of how to choose the landmark points [36], and the lack of an implementation in R, we do not use L-Isomap in our analysis, although it can be considered for future applications to CyTOF. The destiny package for R provides an implementation of Diffusion Maps developed by Angerer et al. [3] specifically for high-throughput single-cell data. Parameters used for each dimension reduction method employed in R are summarized in Supplementary Table S3.

2.3.2 Neighborhood Proportion Error (NPE)

An important motivation for analyzing mass cytometry data is the clustering of cells by phenotype. The cells in the benchmark data set have been manually assigned to subtypes based on their marker expression. The degree to which the cells cluster together in the input space is a highly informative characteristic of any mass cytometry data set. Using the subtype assignments provided by manual gating, we developed the NPE to measure how effectively each dimension reduction technique translates the cell proximities within a subtype from the input space to the low-dimensional embeddings (Figure 2).

In the NPE calculation, all columns (which represent the markers measured in the experiment) are first normalized to mean 0 and variance 1. Every data point (or cell) is then assigned a neighborhood of the k closest points. For each neighborhood, the fraction of neighbors that belong to the subtype of the cell is then calculated (Figure 3).

The results are converted to an empirical density estimate $P(s)$ in the original space and $Q(s)$ in the dimension-reduced space, for each subtype $s \in S$, where $S$ is the set of all manually gated subtypes. The error between the density estimates for each $s$ is calculated using total variation distance [17],

$$\delta_s(P, Q) = \sup_{a \in [0,1]} |P(a) - Q(a)|, \tag{7}$$

where $a$ represents the fraction of like neighbors that constitutes the domain of $P$ and $Q$. The total variation distance, also known as the "statistical distance", is a measure of the largest difference between the probabilities that $P$ and $Q$ can assign to the same event. The NPE is calculated as the sum of $\delta_s$ over all the subtypes $s_i \in S$,

$$NPE = \sum_{i=1}^{n} \delta_{s_i}(P, Q). \tag{8}$$

NPE Algorithm

Purpose: to measure how effectively a dimension reduction technique preserves cell proximities to like phenotypes.

Method:

• For each cell, calculate the fraction $a$ of $k$ neighbors that belong to the same subtype as the cell (Figure 3), both in the original data and in the low-dimensional embedding.

• For each manually gated subtype $s \in S$, calculate the total variation distance, $\delta_s = \sup_{a \in [0,1]} |P(a) - Q(a)|$, where $P$ and $Q$ represent the empirical density distributions of subtype $s$ in the original data and embedding, respectively.

• The NPE is taken to be the sum of the total variation distance over all $s \in S$, $NPE = \sum_{i=1}^{n} \delta_{s_i}(P, Q)$.

Figure 2: Overview of the Neighborhood Proportion Error (NPE) algorithm.

Figure 3: A sample neighborhood of k = 20 points showing the fraction of cells that are of the same subtype as cell i (black outline). In the example shown, 7 of the 20 neighbors are like neighbors, so the proportion that belongs to the 'red' subtype is 0.35.
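The steps of the NPE algorithm (Figure 2) can be sketched in a few lines. The code below is one possible reading of Equations (7) and (8), not the implementation used to produce the results in this paper. It assumes that the marker columns have already been standardized, that the empirical distribution of each subtype is a histogram over the $k + 1$ attainable values of $a$ (0, 1/k, ..., 1), and that the supremum in Equation (7) is taken over those histogram bins. Function names and toy data are hypothetical.

```python
import math
from collections import Counter

def like_neighbor_fractions(points, labels, k):
    """For each cell, the fraction a of its k nearest neighbors sharing its subtype."""
    fractions = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        fractions.append(sum(labels[j] == labels[i] for j in others[:k]) / k)
    return fractions

def empirical_distribution(fractions, k):
    """Histogram of a over its k + 1 attainable values, normalized to sum to 1."""
    counts = Counter(round(a * k) for a in fractions)
    return [counts[c] / len(fractions) for c in range(k + 1)]

def npe(high, low, labels, k):
    """Sum over subtypes of the largest gap between the two histograms."""
    a_high = like_neighbor_fractions(high, labels, k)
    a_low = like_neighbor_fractions(low, labels, k)
    total = 0.0
    for s in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == s]
        P = empirical_distribution([a_high[i] for i in idx], k)
        Q = empirical_distribution([a_low[i] for i in idx], k)
        total += max(abs(p - q) for p, q in zip(P, Q))
    return total

# Toy data: two well-separated subtypes in the input space.
high = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = ["a", "a", "a", "b", "b", "b"]
faithful = [(0.0,), (1.0,), (2.0,), (20.0,), (21.0,), (22.0,)]  # keeps subtypes apart
scrambled = [(0.0,), (2.0,), (4.0,), (1.0,), (3.0,), (5.0,)]     # interleaves subtypes

npe_faithful = npe(high, faithful, labels, k=2)
npe_scrambled = npe(high, scrambled, labels, k=2)
```

An embedding that keeps each subtype's neighborhoods intact yields an NPE of zero, whereas the interleaved embedding changes every cell's like-neighbor fraction and is penalized for both subtypes.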

2.3.3 Residual Variance

In addition to the measure of local preservation that NPE provides, we also want to compare our methods with respect to global error. Residual variance gives a measure of the global variance not accounted for by the dimension reduction. It is defined as [40]

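$$RV = 1 - r^2(D_M, D_Y), \tag{9}$$

where $r$ is the Pearson correlation coefficient,

$$r(D_M, D_Y) = \frac{\sum_{i=1}^{N} \left(d_M^i - \bar{D}_M\right)\left(d_Y^i - \bar{D}_Y\right)}{\sqrt{\sum_{i=1}^{N} \left(d_M^i - \bar{D}_M\right)^2} \times \sqrt{\sum_{i=1}^{N} \left(d_Y^i - \bar{D}_Y\right)^2}}, \tag{10}$$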

and $D_M$ and $D_Y$ are the respective high- and low-dimensional distance matrices. $D_Y$ consists of the pairwise Euclidean distances between the points in the reduced space; however, $D_M$ is different for each method. In PCA, $D_M$ is composed of the pairwise Euclidean distances in the input space. In Isomap, $D_M$ consists of the geodesic distances derived from the weighted graph $D_G$ (Section 2.1.3). In Diffusion Maps, $D_M = D_t$, where $D_t$ is defined in Equation (6). Because the cost function for t-SNE (Equation (3)) is nonlinear, the relationships between the input and output probability distributions, $P$ and $Q$, respectively, are nonlinear (Section 2.1.2) and the Pearson correlation coefficient will not give a proper measure of the information preserved in the dimension reduction; thus residual variance was not calculated for t-SNE. For the other methods, residual variance was calculated for low-dimensional projections of dimension m = 1, 2, ..., 7.
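Equations (9) and (10) translate directly into code. The following sketch covers the PCA-style case in which both distance sets are pairwise Euclidean distances; for Isomap or Diffusion Maps the high-dimensional distances would instead be geodesic or diffusion distances. It assumes that the sum in Equation (10) runs over the unique pairs of points, and the names and toy data are hypothetical.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distances over all unique pairs of points, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (Equation 10)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den

def residual_variance(d_high, d_low):
    """RV = 1 - r^2(D_M, D_Y) (Equation 9)."""
    return 1.0 - pearson(d_high, d_low) ** 2

# Toy data: four points on a line in 3D, embedded in 1D in two different ways.
high = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (4.0, 4.0, 4.0)]
faithful = [(0.0,), (1.0,), (2.0,), (4.0,)]  # preserves distances up to scale
folded = [(0.0,), (1.0,), (0.0,), (1.0,)]    # collapses distant points together

rv_faithful = residual_variance(pairwise_distances(high), pairwise_distances(faithful))
rv_folded = residual_variance(pairwise_distances(high), pairwise_distances(folded))
```

Because the Pearson correlation is invariant to scaling, the faithful embedding has a residual variance of essentially zero even though its distances are smaller than the originals, while the folded embedding leaves a substantial fraction of the distance structure unexplained.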

2.3.4 Visualization: clustering and differentiation

Since an important goal of all the dimension reduction methods is visualization and identification of known and new cell subtypes, as well as the tracking of differentiation trajectories, we visualized the two dimensional reductions using a color overlay of the gated cell subtypes, and considered whether (1) manually gated subtypes clustered together in the visualizations and (2) the dimension reduction methods were able to capture known differentiation trajectories.

3 Results

The results presented are from application of the comparison metrics to one of the 10K subsets of the 13-dimensional dataset. Results for the other 10K subsets of the 13-dimensional dataset, as well as for a 10K subset of the 32-dimensional dataset, are found in the Supplementary Material, Sections S3-S4. We summarize the results in Table 2.

3.1 Computation Time

Computation time calculations show pronounced differences between the reduction methods (Figure 4). Isomap is the slowest by a wide margin, requiring approximately 3 hours. The other three methods proved more practical, with t-SNE and Diffusion Maps requiring 1.5-2 minutes, and PCA less than one second.

[Figure 4 plot. x-axis: Isomap, t-SNE, D-Maps, PCA; y-axis: log10(time + 1).]

Figure 4: Dimension reduction time for three random subsets of 10,000 cells from n = 13 dimensions to m = 2 from the benchmark dataset. Results are plotted using a semilogarithmic plot in increasing order of efficiency.

3.2 Neighborhood Proportion Error

[Figure 5 plot. x-axis: ticks 5 to 50 (labeled 'Dimension' in the plot); y-axis: NPE; series: Isomap, t-SNE, D-Maps, PCA.]

Figure 5: Neighborhood Proportion Error (NPE) as a function of k, the size of the neighborhood, and the dimension reduction method (PCA, t-SNE, Isomap, and Diffusion Maps).

NPE was calculated for two-dimensional reductions of all tested methods (Figure 5). We observe that as the neighborhood size increases, NPE values begin to stabilize and, most importantly, do not change rank order with respect to the dimension reduction methods, thus showing that the results are robust to the choice of k. PCA shows the most error, indicating that its neighborhood proportions with respect to cell subtype were the most altered through the process of dimension reduction. T-SNE and Diffusion Maps preserve neighborhood proportion significantly better than Isomap and PCA, signaling that these methods are better able to preserve the local structure and neighborhood proportions of gated populations, which may be important when considering inherent relationships between cells of known phenotypes vis-a-vis their CyTOF profiles.

Table 2: Summary of performance for the dimension reduction methods on the 13 marker dataset based on the various comparison metrics. Computation time, NPE, and RV results show averages over the three data subsets. NPE: neighborhood proportion error, RV: residual variance.

                     Dimension Reduction Method
Comparison Metric    PCA      Isomap      t-SNE      Diffusion Maps
Comp. Time (m=2)     0.034s   11428.42s   104.174s   122.676s
NPE (m=2, k=20)      16.241   15.461      9.144      10.472
RV (m=2)             0.230    0.451       N/A        0.916

Visualization:
PCA: Cell subtypes difficult to distinguish, no discernible differentiation trajectories.
Isomap: Cell subtypes difficult to distinguish, no discernible differentiation trajectories.
t-SNE: Different cell types easy to distinguish, no discernible differentiation trajectories.
Diffusion Maps: Different cell types can sometimes be distinguished, discernible differentiation trajectories.

3.3 Residual Variance

[Figure 6 plot. x-axis: Dimension (1 to 7); y-axis: Residual Variance (0 to 1); series: Isomap, D-Maps, PCA.]

Figure 6: Residual variance error for Isomap, Diffusion Maps, and PCA for low-dimensional embeddings of dimension m = 1, ..., 7.

Residual variance (RV) was calculated for PCA, Isomap, and Diffusion Maps for dimension reductions to m = 1, 2, ..., 7 dimensions (Figure 6). We observe that PCA and Isomap show a strong reduction in RV up to m = 3, after which the reductions are more modest. This 'elbow effect' can signal that the intrinsic dimensionality (m) of the data is between 3 and 4. Surprisingly, Diffusion Maps, although a strong performer on the NPE metric, shows a higher RV at every m than the other methods, and does not display an elbow effect. Although these results show that the intrinsic dimensionality of the dataset may be higher than m = 2, we proceed with our comparative analysis using reductions to two dimensions since this is the dimensionality at which the majority of users are interested in considering the data.

3.4 Visualization: clustering and differentiation

There are two specific phenomena that experimentalists are generally interested in observing in the two-dimensional embeddings: phenotypic clusters and differentiation trajectories. Phenotypic clustering refers to cells with the same manually-gated subtypes occupying a region of the plot near to each other, preferably to a degree that allows distinct populations with clear boundaries to be identified. Differentiation trajectories refer to cell subtypes that are arranged along a clear path in the order that cell differentiation occurs.

[Figure 7 plots. Panels: PCA, Diffusion Maps, Isomap, t-SNE; axes: Reduced Axis 1 and Reduced Axis 2. Population legend as extracted: CD11b, Erythroblasts, Immature B, Mature CD38lo B, Mature CD38mid B, Mature CD4+ T, Mature CD8+ T, Megakaryocytes, Myelocytes, Naive CD4+ T, Naive CD8+ T, NK.]

Figure 7: Two-dimensional embeddings of a random sample of 10,000 cells from the benchmark dataset for the four dimension reduction techniques. Manually gated cell subtypes are labeled.

The visualizations show that the different dimensionreduction techniques differ in their ability to identifyphenotypic clusters and differentiation trajectories(Figure 7). PCA does not demonstrate clear sepa-ration of cell subtypes, there is a significant mixingbetween the different populations, making this em-bedding ineffective for subtype classification. PCAalso does not produce any clear paths along whichwe can observe the process of cell differentiation.Isomap shows similarly extensive mixing of different

cell subtypes, displaying very little distinct cluster-ing and no observable differentiation patterns. Dif-fusion Maps, however, shows both clustering of cellsubtypes and trajectories that may correspond todifferentiating cells (see below). T-SNE also showsthe clearly defined clusters, with large gaps betweensome distinct groups. From these preliminary obser-vations, we conclude that our t-SNE and DiffusionMaps embeddings warrant further examination.Indeed, if we view the embedding results for just

10



https://doi.org/10.1101/273862


cells of the common lymphoid progenitor (CLP) lin-eage, which include natural killer (NK) cells, naiveand mature CD8+, naive and mature CD4+, andimmature, mature CD38mid, and mature CD38loB cells, for all the four methods (Figure 8(a), wesee that while all reductions show naive and matureCD8+ (and, CD4+ cells) near each other (respec-tively), the greatest separation of the naive and ma-ture subsets, as well as the CD4+ and CD8+ cellsis achieved by t-SNE and Diffusion Maps (Figure 8(a) and (b) top panels of (i) and (ii). CD38 is a

marker of developing B cells, with a loss of expres-sion as B cells mature [6]. The differentiation tra-jectory from immature B, to CD38 mid, to CD38lo B cells is observable in Diffusion Maps (Figure8(b)(i) bottom panel), but much less so in t-SNE(Figure 8(b)(ii) bottom panel). T-SNE thus showsan enhanced ability over the other methods to dis-play different cell subtypes, whereas Diffusion Mapsshows an enhanced ability to display differentiationtrajectories.

[Figure 8 image: scatter plots of Reduced Axis 1 versus Reduced Axis 2, colored by population. Panel (a): PCA, Diffusion Maps, Isomap, and t-SNE, showing Immature B, Mature CD38lo B, Mature CD38mid B, Mature CD4+ T, Mature CD8+ T, Naive CD4+ T, Naive CD8+ T, and NK cells. Panel (b): (i) Diffusion Maps and (ii) t-SNE, each with one plot of Mature and Naive CD8+ T cells and one plot of Immature B, Mature CD38lo B, and Mature CD38mid B cells.]

Figure 8: Common lymphoid progenitor (CLP) populations present in the two-dimensional embeddings of the four dimension reduction methods. (a) CLP populations for each method and (b) subsets of naive and mature CD8+ T cells and immature, mature CD38mid and mature CD38lo B cells for (i) Diffusion Maps and (ii) t-SNE.


4 Discussion

We have compared several popular dimension reduction techniques for their utility in extracting a variety of information from data generated by mass cytometry, a new and powerful technique to perform single-cell analysis on immune cell populations. We chose to focus on these techniques since they make minimal assumptions about the nature of relationships between the input cells, and thus can be used on datasets comprising both differentiating cells and cells of different lineages.

We used several measures for the quality and utility of a reduction. Computation time is a relevant measure since CyTOF experiments are often performed on up to hundreds of thousands of cells, and the end-users of these analyses are often experimental biologists and immunologists, who may not have access to high-throughput computing facilities; computation time may therefore become a challenge with certain methods. We developed a supervised heuristic algorithm, Neighborhood Proportion Error (NPE), to help answer the question of how well each technique preserves the empirical distribution of like-neighbor frequencies amongst cell subtypes. We consider this an important property of the data that serves as a proxy for the relative clustering of manually gated cell subtypes, and one that an employed dimension reduction algorithm should ideally seek to preserve. We sought out similar information via visualization of the two-dimensional reductions vis-a-vis the manually gated cell types, which also provided visual information as to whether the methods preserved differentiation trajectories. Finally, we used residual variance (RV) to measure the global total information loss with each dimension reduction method.

Isomap performed poorly on all measures except for RV, and PCA performed poorly on all measures except computation time and RV. Since preservation of local relationships between known cell subtypes, as measured by the NPE and visualization, is of import in a dimension reduction, we consider Isomap and PCA poor performers for CyTOF, despite the good RV results. Diffusion Maps and t-SNE were the second-best performers for computation time, after PCA, and Diffusion Maps was able to clearly separate known cell subtypes and identify differentiation trajectories. T-SNE was able to separate known cell types more thoroughly than Diffusion Maps, but was not necessarily able to capture differentiation trajectories, and does not provide information regarding the strength of similarity between different clusters, as was also noted in [5, 43]. While Becher et al. [5] and Wong et al. [43] used Isomap to determine relationships between cell subtypes, we recommend using Diffusion Maps for this purpose. Indeed, Haghverdi et al. [18] found that for qPCR data of mouse hematopoietic and progenitor stem cells, as well as qPCR data of mouse cells in early stages of embryonic development, Diffusion Maps showed clearer differentiation trajectories than Isomap.

We thus recommend the complementary approach of using t-SNE and Diffusion Maps on CyTOF data. If this data contains cell types of heterogeneous lineages, each method will be able to contribute different information that in concert will allow researchers to obtain a better understanding of both the different cell types and potential differentiation lineages comprising the population.
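The idea behind the NPE, as summarized in this section, can be illustrated with a short sketch. It is not the authors' implementation and should be read as one plausible interpretation under stated assumptions: neighbors are found by brute-force Euclidean k-nearest-neighbor search, the two empirical distributions of like-neighbor counts are compared with the total variation distance, and the per-subtype distances are summed. All function names and the choice of k are hypothetical.

```python
import math
from collections import Counter


def knn_indices(points, k):
    """Brute-force k nearest neighbours (Euclidean), excluding the point itself."""
    result = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append([j for _, j in dists[:k]])
    return result


def like_neighbor_counts(points, labels, k):
    """For each cell, count how many of its k nearest neighbours share its label."""
    neighbours = knn_indices(points, k)
    return [sum(labels[j] == labels[i] for j in nbrs)
            for i, nbrs in enumerate(neighbours)]


def total_variation(counts_a, counts_b, k):
    """Total variation distance between two empirical distributions on {0, ..., k}."""
    hist_a, hist_b = Counter(counts_a), Counter(counts_b)
    n_a, n_b = len(counts_a), len(counts_b)
    return 0.5 * sum(abs(hist_a[c] / n_a - hist_b[c] / n_b) for c in range(k + 1))


def neighborhood_proportion_error(high, low, labels, k):
    """Sum over subtypes of the distance between like-neighbour distributions
    in the original space (high) and in the embedding (low).
    Assumption: summation over subtypes and total variation distance are
    illustrative choices, not taken from the paper's Methods section."""
    counts_high = like_neighbor_counts(high, labels, k)
    counts_low = like_neighbor_counts(low, labels, k)
    error = 0.0
    for subtype in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == subtype]
        error += total_variation([counts_high[i] for i in idx],
                                 [counts_low[i] for i in idx], k)
    return error
```

Under these assumptions the score is 0 when every subtype has the same like-neighbor distribution before and after the reduction, and it can be at most the number of subtypes. Because the comparison relies on manually gated labels, the measure is supervised, in line with the description above.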

Acknowledgments

This research was supported by the National Science Foundation - Division of Mathematical Sciences Award 1460967. Additionally, A.K. acknowledges support from the National Cancer Institute of the National Institutes of Health postdoctoral fellowship award F32CA214030.

References

[1] https://github.com/lmweber/benchmark-data-levine-13-dim.

[2] El-ad David Amir, Kara L Davis, Michelle D Tadmor, Erin F Simonds, Jacob H Levine, Sean C Bendall, Daniel K Shenfeld, Smita Krishnaswamy, Garry P Nolan, and Dana Pe’er. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nature biotechnology, 31(6):545–552, 2013.

[3] Philipp Angerer, Laleh Haghverdi, Maren Büttner, Fabian J Theis, Carsten Marr, and Florian Buettner. Destiny: Diffusion Maps for large-scale single-cell data in R. Bioinformatics, 32(8):1241–1243, 2016.

[4] Dmitry R Bandura, Vladimir I Baranov, Olga I Ornatsky, Alexei Antonov, Robert Kinach, Xudong Lou, Serguei Pavlov, Sergey Vorobiev, John E Dick, and Scott D Tanner. Mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry. Analytical chemistry, 81(16):6813–6822, 2009.

[5] Burkhard Becher, Andreas Schlitzer, Jinmiao Chen, Florian Mair, Hermi R Sumatoh, Karen Wei Weng Teng, Donovan Low, Christiane Ruedl, Paola Riccardi-Castagnoli, Michael Poidinger, Melanie Greter, Florent Ginhoux, and Evan W Newell. High-dimensional analysis of the murine myeloid cell system. Nat Immunol, 15(12):1181–9, Dec 2014.

[6] Mats Bemark. Translating transitions - how to decipher peripheral human B cell development. J Biomed Res, 29(4):264–84, Jul 2015.

[7] Sean C Bendall, Kara L Davis, El-Ad David Amir, Michelle D Tadmor, Erin F Simonds, Tiffany J Chen, Daniel K Shenfeld, Garry P Nolan, and Dana Pe’er. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell, 157(3):714–25, Apr 2014.

[8] Sean C Bendall, Erin F Simonds, Peng Qiu, El-ad D Amir, Peter O Krutzik, Rachel Finck, Robert V Bruggner, Rachel Melamed, Angelica Trejo, Olga I Ornatsky, Robert S Balderas, Sylvia K Plevritis, Karen Sachs, Dana Pe’er, Scott D Tanner, and Garry P Nolan. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science, 332(6030):687–96, May 2011.

[9] Cariad Chester and Holden T Maecker. Algorithmic tools for mining high-dimensional cytometry data. J Immunol, 195(3):773–9, Aug 2015.

[10] Ronald R. Coifman and Stéphane Lafon. Diffusion Maps. Applied and Computational Harmonic Analysis, 21:5–30, 2006.

[11] Trevor F. Cox and Michael A.A. Cox. Multidimensional Scaling. Chapman and Hall/CRC, 2nd edition, 2001.

[12] Kevin Dawson, Raymond L Rodriguez, and Wasyl Malyj. Sample phenotype clusters in high-density oligonucleotide microarray data sets are revealed using isomap, a nonlinear algorithm. BMC Bioinformatics, 6:195, Aug 2005.

[13] J De la Porte, BM Herbst, W Hereman, and SJ Van Der Walt. An introduction to Diffusion Maps. In The 19th Symposium of the Pattern Recognition Association of South Africa. Citeseer, 2008.

[14] Allison Doerr. A flow cytometry revolution. Nat Methods, 8(7):531, Jul 2011.

[15] Joannah R Fergusson, Kira E Smith, Vicki M Fleming, Neil Rajoriya, Evan W Newell, Ruth Simmons, Emanuele Marchi, Sophia Björkander, Yu-Hoi Kang, Leo Swadling, Ayako Kurioka, Natasha Sahgal, Helen Lockstone, Dilair Baban, Gordon J Freeman, Eva Sverremark-Ekström, Mark M Davis, Miles P Davenport, Vanessa Venturi, James E Ussher, Christian B Willberg, and Paul Klenerman. CD161 defines a transcriptional and functional phenotype across distinct human T cell lineages. Cell Rep, 9(3):1075–88, Nov 2014.

[16] Ali Ghodsi. Dimensionality reduction: A short tutorial. Technical report, Department of Statistics and Actuarial Science, University of Waterloo, 2006.

[17] Alison L. Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.

[18] Laleh Haghverdi, Florian Buettner, and Fabian J Theis. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics, 31(18):2989–2998, 2015.

[19] Leonore A Herzenberg, James Tung, Wayne A Moore, Leonard A Herzenberg, and David R Parks. Interpreting flow cytometry data: a guide for the perplexed. Nature immunology, 7(7):681–685, 2006.

[20] Wolfgang Huber, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, Sean Davis, Laurent Gatto, Thomas Girke, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nature methods, 12(2):115–121, 2015.

[21] J Krijthe. Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut implementation. R package version 0.10, URL http://CRAN.R-project.org/package=Rtsne, 2015.

[22] Jacob H Levine, Erin F Simonds, Sean C Bendall, Kara L Davis, El-ad D Amir, Michelle D Tadmor, Oren Litvin, Harris G Fienberg, Astraea Jager, Eli R Zunder, Rachel Finck, Amanda L Gedman, Ina Radtke, James R Downing, Dana Pe’er, and Garry P Nolan. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell, 162(1):184–97, Jul 2015.

[23] Laurens van der Maaten. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, January 2014.

[24] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[25] Gavin M Mason, Katie Lowe, Rossella Melchiotti, Richard Ellis, Emanuele de Rinaldis, Mark Peakman, Susanne Heck, Giovanna Lombardi, and Timothy I M Tree. Phenotypic complexity of the human regulatory T cell compartment revealed by mass cytometry. J Immunol, 195(5):2030–7, Sep 2015.

[26] Kenneth M. Murphy and Casey Weaver. Janeway’s Immunobiology. Garland Science, New York, NY, 9th edition, 2017.

[27] Evan W Newell and Yang Cheng. Mass cytometry: blessed with the curse of dimensionality. Nat Immunol, 17(8):890–5, Jul 2016.

[28] Evan W Newell, Natalia Sigal, Sean C Bendall, Garry P Nolan, and Mark M Davis. Cytometry by time-of-flight shows combinatorial cytokine expression and virus-specific cell niches within a continuum of CD8+ T cell phenotypes. Immunity, 36(1):142–52, Jan 2012.

[29] Evan W Newell, Natalia Sigal, Nitya Nair, Brian A Kidd, Harry B Greenberg, and Mark M Davis. Combinatorial tetramer staining and mass cytometry analysis facilitate T-cell epitope mapping and characterization. Nat Biotechnol, 31(7):623–9, Jul 2013.

[30] Jari Oksanen, Roeland Kindt, Pierre Legendre, Bob O’Hara, M Henry H Stevens, Maintainer Jari Oksanen, and MASS Suggests. The vegan package. Community ecology package, 10, 2007.

[31] Peng Qiu, Erin F Simonds, Sean C Bendall, Kenneth D Gibbs, Jr, Robert V Bruggner, Michael D Linderman, Karen Sachs, Garry P Nolan, and Sylvia K Plevritis. Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE. Nat Biotechnol, 29(10):886–91, Oct 2011.

[32] Markus Ringnér. What is principal component analysis? Nat Biotechnol, 26(3):303–4, Mar 2008.

[33] J Paul Robinson and Mario Roederer. History of science. Flow cytometry strikes gold. Science, 350(6262):739–40, Nov 2015.

[34] Yvan Saeys, Sofie Van Gassen, and Bart N Lambrecht. Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nat Rev Immunol, 16(7):449–62, Jul 2016.

[35] Manu Setty, Michelle D Tadmor, Shlomit Reich-Zeliger, Omer Angel, Tomer Meir Salame, Pooja Kathail, Kristy Choi, Sean Bendall, Nir Friedman, and Dana Pe’er. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat Biotechnol, 34(6):637–45, Jun 2016.

[36] Hao Shi, Baoqun Yin, Yu Kang, Chao Shao, and Jie Gui. Robust l-Isomap with a novel landmark selection method. Mathematical Problems in Engineering, 2017, 2017.

[37] Jon Shlens. A tutorial on principal component analysis. arXiv preprint, (arXiv:1404.1100), 2014.

[38] Vin de Silva and Joshua B Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Advances in neural information processing systems, pages 705–712, 2002.

[39] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.

[40] J B Tenenbaum, V de Silva, and J C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–23, Dec 2000.

[41] Sofie Van Gassen, Britt Callebaut, Mary J Van Helden, Bart N Lambrecht, Piet Demeester, Tom Dhaene, and Yvan Saeys. FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data. Cytometry A, 87(7):636–45, Jul 2015.

[42] Vincent van Unen, Na Li, Ilse Molendijk, Mine Temurhan, Thomas Höllt, Andrea E van der Meulen-de Jong, Hein W Verspaget, M Luisa Mearin, Chris J Mulder, Jeroen van Bergen, Boudewijn P F Lelieveldt, and Frits Koning. Mass cytometry of the human mucosal immune system identifies tissue- and disease-associated immune subsets. Immunity, 44(5):1227–39, May 2016.

[43] Michael T Wong, Jinmiao Chen, Sriram Narayanan, Wenyu Lin, Rosslyn Anicete, Henry Tan Kun Kiaang, Maria Alicia Curotto De Lafaille, Michael Poidinger, and Evan W Newell. Mapping the diversity of follicular helper T cells in human blood and tonsils using high-dimensional mass cytometry analysis. Cell Rep, 11(11):1822–33, Jun 2015.

[44] Eli R. Zunder, Ernesto Lujan, Yury Goltsev, Marius Wernig, and Garry P. Nolan. A continuous molecular roadmap to iPSC reprogramming through progression analysis of single-cell mass cytometry. Cell Stem Cell, 16(3):323–337, 2015.
