
Exploiting heterogeneity in single-cell transcriptomic analyses: how to move beyond comparisons of averages

Keegan Korthauer1,2*, Li-Fang Chu4, Michael A. Newton3, Yuan Li3, James Thomson4, Ron M. Stewart4, and Christina Kendziorski3

1Harvard T.H. Chan School of Public Health, 2Dana-Farber Cancer Institute, 3University of Wisconsin-Madison, 4Morgridge Institute for Research

Contact

Keegan Korthauer
Harvard T.H. Chan School of Public Health, Dana-Farber Cancer Institute
[email protected]
@keegankorthauer

References

[1] Korthauer, K. D., Chu, L. F., Newton, M. A., Li, Y., Thomson, J., Stewart, R., & Kendziorski, C. (2016). A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biology, 17(1), 222.
[2] Korthauer, K. D. (2017). scDD: Mixture modeling of single-cell RNA-seq data to identify genes with differential distributions. Bioconductor R Package version 1.0.0 (BioC Release 3.5), https://bioconductor.org/Packages/scDD.
[3] Lahav, G., et al. (2004). Dynamics of the p53-Mdm2 feedback loop in individual cells. Nature Genetics, 36(2), 147-150.
[4] Dobrzyński, M., et al. (2014). Nonlinear signalling networks and cell-to-cell variability transform external signals into broadly distributed or bimodal responses. Journal of The Royal Society Interface, 11(98), 20140383.
[5] Jubelin, G., et al. (2013). FliZ is a global regulatory protein affecting the expression of flagellar and virulence genes in individual Xenorhabdus nematophila bacterial cells. PLoS Genetics, 9(10), e1003915.
[6] Kharchenko, P. V., et al. (2014). Bayesian approach to single-cell differential expression analysis. Nature Methods, 11(7), 740-742.
[7] Finak, G., et al. (2015). MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology, 16(1), 278.

Abstract

The ability to quantify cellular heterogeneity is a major advantage of single-cell technologies. It is now possible to elucidate gene expression dynamics that were invisible using bulk RNA-seq, such as the presence of distinct expression states. However, statistical methods often treat cellular heterogeneity as a nuisance. We have developed a novel method to characterize differences in expression in the presence of distinct expression states within and among biological conditions. This framework can detect differential expression patterns under a wide range of settings. Compared to alternative approaches, this method has higher power to detect subtle differences in gene expression distributions that are more complex than a mean shift, and can characterize those differences. The R package scDD implements the approach, and is available on Bioconductor [2].

Differential Expression Analysis in Bulk RNA-seq is blind to cellular heterogeneity

In contrast to single-cell RNA-seq, which provides a measurement for each cell, differential expression (DE) analysis in traditional (or bulk) RNA-seq is blind to any cellular heterogeneity. The illustration above shows an example where the bulk RNA-seq experiment would not detect differential expression, even though the pattern of expression clearly differs between the two populations. This type of pattern may be of great biological significance, so it is important that DE methods for scRNA-seq account for it. However, doing so is complicated by the fact that these patterns result in multi-modal expression distributions, which existing approaches generally do not accommodate.

Biological mechanisms leading to multi-modality

Biological mechanisms such as stochastic burst-like fluctuations [3], unsynchronized oscillations [4], and bistable feedback loops [5] (illustrated above) can give rise to a mixed population of cells at multiple different expression states, which manifests as multi-modal distributions. This multimodality complicates DE analysis for single-cell data, since most methods (such as SCDE [6] and MAST [7]) assume a parametric distribution with a single mode representing the expressed cells.

scDD Algorithm

The scDD [1] algorithm (summarized above) tests whether the distribution (possibly multi-modal) of expression is different between biological conditions and classifies genes into categories that summarize the salient characteristics of the differences.

scDD detects and classifies complex patterns

In an analysis of human embryonic stem cell (hESC) types (detailed below), we evaluated pairwise comparisons of four cell lines. Identifying which genes are expressed differently between these conditions can give insight into the differentiation process. scDD generally detects more differential genes than other methods, and the additional genes are enriched for complex patterns. As expected, cell cycle and pluripotency genes are among those detected only by scDD.

Summary

scDD is a novel statistical framework and R package that detects gene expression differences in scRNA-seq experiments while explicitly accounting for potential multimodality among expressed cells. It has comparable performance to alternative methods at detecting mean shifts, but is able to detect and characterize more complex differences that are masked under unimodal assumptions.

Website:

Korthauer et al. Page 27 of 31

Table 2 Power to detect DD genes in simulated data

                       True Gene Category
Sample Size  Method    DE     DP     DM     DB     Overall (FDR)
50           scDD      0.893  0.418  0.898  0.572  0.695 (0.029)
             SCDE      0.872  0.026  0.817  0.260  0.494 (0.004)
             MAST      0.908  0.400  0.871  0.019  0.550 (0.026)
75           scDD      0.951  0.590  0.960  0.668  0.792 (0.031)
             SCDE      0.948  0.070  0.903  0.387  0.577 (0.003)
             MAST      0.956  0.633  0.943  0.036  0.642 (0.022)
100          scDD      0.972  0.717  0.982  0.727  0.850 (0.033)
             SCDE      0.975  0.125  0.946  0.478  0.631 (0.003)
             MAST      0.977  0.752  0.970  0.045  0.686 (0.022)
500          scDD      1.000  0.983  1.000  0.905  0.972 (0.035)
             SCDE      1.000  0.855  0.998  0.787  0.910 (0.004)
             MAST      1.000  0.993  1.000  0.170  0.791 (0.022)

Average power to detect simulated DD genes by true category. Averages are calculated over 20 replications. Standard errors were < 0.025 (not shown).
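As a quick arithmetic check on Table 2, the Overall column appears to equal the unweighted mean of the four per-category powers. The source does not state this; it is an observation from the reported numbers. A minimal Python sketch (variable and function names are illustrative) recomputes the mean for every row and reports the largest gap from the reported value:

```python
# Rows copied from Table 2: (sample size, method) -> (DE, DP, DM, DB, Overall).
table2 = {
    (50, "scDD"): (0.893, 0.418, 0.898, 0.572, 0.695),
    (50, "SCDE"): (0.872, 0.026, 0.817, 0.260, 0.494),
    (50, "MAST"): (0.908, 0.400, 0.871, 0.019, 0.550),
    (75, "scDD"): (0.951, 0.590, 0.960, 0.668, 0.792),
    (75, "SCDE"): (0.948, 0.070, 0.903, 0.387, 0.577),
    (75, "MAST"): (0.956, 0.633, 0.943, 0.036, 0.642),
    (100, "scDD"): (0.972, 0.717, 0.982, 0.727, 0.850),
    (100, "SCDE"): (0.975, 0.125, 0.946, 0.478, 0.631),
    (100, "MAST"): (0.977, 0.752, 0.970, 0.045, 0.686),
    (500, "scDD"): (1.000, 0.983, 1.000, 0.905, 0.972),
    (500, "SCDE"): (1.000, 0.855, 0.998, 0.787, 0.910),
    (500, "MAST"): (1.000, 0.993, 1.000, 0.170, 0.791),
}


def mean_power(row):
    """Unweighted mean of the four per-category powers (DE, DP, DM, DB)."""
    return sum(row[:4]) / 4


# Largest gap between the recomputed mean and the reported Overall value.
max_gap = max(abs(mean_power(row) - row[4]) for row in table2.values())
print(f"largest discrepancy: {max_gap:.4f}")
```

A discrepancy below 0.001 is what rounding to three decimals would produce, so the reported Overall values are consistent with a simple average across the four categories.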

Table 3 Correct Classification Rate in simulated data

             Gene Category
Sample Size  DE     DP     DM     DB
50           0.719  0.801  0.557  0.665
75           0.760  0.732  0.576  0.698
100          0.782  0.678  0.599  0.706
500          0.816  0.550  0.583  0.646

Average Correct Classification Rate for detected DD genes. Averages are calculated over 20 replications. Standard errors were < 0.025 (not shown).

Table 4 Average correct classification rates by component mean distance

                            Component mean distance Δµ
Sample Size  Gene Category  2     3     4     5     6
50           DP             0.02  0.20  0.78  0.94  0.98
             DM             0.10  0.23  0.59  0.81  0.89
             DB             0.08  0.22  0.59  0.80  0.80
75           DP             0.02  0.18  0.77  0.94  0.97
             DM             0.08  0.27  0.69  0.86  0.90
             DB             0.09  0.29  0.71  0.83  0.84
100          DP             0.03  0.16  0.74  0.93  0.95
             DM             0.10  0.32  0.76  0.87  0.91
             DB             0.08  0.32  0.80  0.85  0.84
500          DP             0.01  0.15  0.72  0.91  0.93
             DM             0.12  0.33  0.72  0.85  0.89
             DB             0.03  0.43  0.85  0.85  0.85

Average Correct Classification Rates stratified by Δµ. Averages are calculated over 20 replications. Standard errors were < 0.025 (not shown).

When simulating gene expression from mixtures of negative binomial distributions that represent the patterns depicted above, scDD is comparable to or slightly better than the alternative methods at detecting the DD genes that have an overall mean shift (see Table 2). As expected, however, it is superior at detecting the DB category, which has no overall mean shift.

Stochastic burst fluctuations: Fig 2, Lahav et al. 2004, Nature Genetics [3]

Unsynchronized oscillations: Fig 2, Dobrzynski et al. 2012, CSMB [4]

Bistable feedback loops: Fig 3, Jubelin et al. 2013, PLOS Genetics [5]

Modality in scRNA-seq

[Figure: Modality of Bulk (Reds) vs Single-cell (Blues) RNA-seq datasets. x-axis: Number of Modes (1, 2, 3+); y-axis: Proportion of genes (or transcripts), 0.00 to 1.00; datasets: GE.50, GE.75, GE.100, LC.77, H1.78, DEC.64, NPC.86, H9.87. Source: Fig 2, Korthauer et al. 2016, Genome Biology [1].]

[Figure: Bulk RNA-seq of Gene X. Condition 1 (Sample 1, Sample 2, …, Sample N1) and Condition 2 (Sample 1, Sample 2, …, Sample N2) each yield one measurement per sample. Do not observe individual cell states in bulk; Gene X is not DE in bulk.]

[Figure: Expression states in a population of single cells. (A) Expression States of Gene X for Individual Cells Over Time (Cell 1, Cell 2, Cell 3, …, Cell J; Low Expression State: µ1, High Expression State: µ2; x-axis: Time). (B) Snapshot of Population of Single Cells. (C) Histogram of Observed Expression Level of Gene X (x-axis marks: µ1, µ2; y-axis: Number of Cells).]
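The bulk-versus-single-cell contrast can be made concrete with a toy simulation. The sketch below uses made-up expression values (not data from the study): Condition 1 has one expression state, Condition 2 splits evenly between a low and a high state, and both share the same average. A bulk-like summary that averages over cells sees almost no difference, while a per-cell summary separates the two conditions clearly.

```python
import random
import statistics

random.seed(1)
n_cells = 1000

# Hypothetical log-expression values for gene X (illustrative only).
# Condition 1: every cell sits near a single state with mean 4.
cond1 = [random.gauss(4.0, 0.5) for _ in range(n_cells)]
# Condition 2: half the cells in a low state (mean 2), half in a high state (mean 6).
cond2 = ([random.gauss(2.0, 0.5) for _ in range(n_cells // 2)]
         + [random.gauss(6.0, 0.5) for _ in range(n_cells // 2)])

# Bulk-like summary: the average over all cells in each condition.
mean_diff = abs(statistics.mean(cond1) - statistics.mean(cond2))


def frac_near(values, center=4.0, width=1.0):
    """Per-cell summary: fraction of cells within `width` of `center`."""
    return sum(abs(v - center) < width for v in values) / len(values)


print(f"difference in means: {mean_diff:.3f}")
print(f"fraction of cells near 4: cond1={frac_near(cond1):.2f}, "
      f"cond2={frac_near(cond2):.2f}")
```

The two state means (2 and 6) and the common spread are arbitrary choices made so that the averages coincide; any symmetric split around the Condition 1 mean would show the same effect.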

Preprocessing

1. Obtain log Expected Counts normalized for library size
2. Filter genes that are detected in fewer than 25% of cells
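A minimal sketch of these two preprocessing steps, assuming library-size-normalized expected counts are already available as a plain dictionary. The gene names, values, and helper names are hypothetical, "detected" is assumed to mean a nonzero count, and the filter is applied across all cells at once; this is not the scDD package's own code.

```python
import math


def log_transform(counts):
    """Apply log(EC + 1) to a list of normalized expected counts."""
    return [math.log(c + 1.0) for c in counts]


def detection_rate(counts):
    """Fraction of cells with a nonzero count (assumed meaning of 'detected')."""
    return sum(c > 0 for c in counts) / len(counts)


def preprocess(expr, min_rate=0.25):
    """Drop genes detected in fewer than `min_rate` of cells, then log-transform."""
    return {gene: log_transform(counts)
            for gene, counts in expr.items()
            if detection_rate(counts) >= min_rate}


# Hypothetical normalized expected counts for 8 cells.
expr = {
    "geneA": [0, 12.0, 30.5, 0, 8.2, 0, 15.1, 22.0],  # detected in 5 of 8 cells
    "geneB": [0, 0, 0, 0, 0, 0, 0, 3.4],              # detected in 1 of 8 cells
}
filtered = preprocess(expr)
print(sorted(filtered))  # geneB falls below the 25% threshold and is dropped
```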

Detection

1. Model expressed cells for each gene: Dirichlet process mixture (DPM) of Normals
2. Quantify evidence of Differential Distributions (DD):
   - Bayes Factor (BF) with permutation for expressed component
   - Generalized linear model (GLM) likelihood ratio test (LRT) for dropout component
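The permutation idea used for the expressed component can be sketched generically. The example below does not compute the scDD Bayes Factor score: it substitutes the two-sample Kolmogorov-Smirnov statistic as a simple stand-in test statistic, and all function names and data are hypothetical. Condition labels are shuffled repeatedly, and the p-value is the share of shuffles whose statistic is at least as large as the observed one.

```python
import bisect
import random


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / len(xs)
        fy = bisect.bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


def permutation_pvalue(x, y, n_perm=500, seed=0):
    """P-value from shuffling condition labels (with an add-one correction)."""
    rng = random.Random(seed)
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


rng = random.Random(42)
# Hypothetical expressed-cell values: unimodal vs. bimodal with the same mean.
cond1 = [rng.gauss(4.0, 0.5) for _ in range(60)]
cond2 = ([rng.gauss(2.0, 0.5) for _ in range(30)]
         + [rng.gauss(6.0, 0.5) for _ in range(30)])
p_value = permutation_pvalue(cond1, cond2)
print(f"permutation p-value: {p_value:.4f}")
```

The add-one correction keeps the estimated p-value strictly positive, so its smallest possible value is 1 / (n_perm + 1).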

Classification

Classify significant DD genes into patterns DE, DP, DM, DB, DZ:

DE: Traditional Differential Expression
DP: Differential Proportion
DM: Differential Modality
DB: Both DM and Differential Component means
DZ: Differential Proportion of Zeroes

Evaluate evidence of DD of expressed cells using an approximate Bayes Factor score comparing:
•  Global model for all cells
•  Independent models for each biological condition

Assess significance via permutation test or (for large datasets) the Kolmogorov-Smirnov test. If the expressed component does not display significant DD, assess evidence for differential proportion of zeroes.
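For the differential proportion of zeroes, a simplified stand-in is a likelihood ratio test comparing one shared zero proportion against a separate proportion per condition. The poster describes a GLM likelihood ratio test for the dropout component; the sketch below omits any covariates such a GLM might include, and the function names and counts are hypothetical.

```python
import math


def binom_loglik(zeros, total, p):
    """Binomial log-likelihood, omitting the constant combinatorial term."""
    ll = 0.0
    if zeros > 0:
        ll += zeros * math.log(p)
    if total - zeros > 0:
        ll += (total - zeros) * math.log(1.0 - p)
    return ll


def zero_proportion_lrt(z1, n1, z2, n2):
    """LRT for equal zero proportions in two conditions (1 degree of freedom)."""
    p_pooled = (z1 + z2) / (n1 + n2)
    ll_null = binom_loglik(z1, n1, p_pooled) + binom_loglik(z2, n2, p_pooled)
    ll_alt = binom_loglik(z1, n1, z1 / n1) + binom_loglik(z2, n2, z2 / n2)
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Chi-square survival function with 1 df: P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical: 10 of 80 cells are zero in condition 1, 45 of 80 in condition 2.
stat, p_value = zero_proportion_lrt(10, 80, 45, 80)
print(f"LRT statistic = {stat:.2f}, p-value = {p_value:.2e}")
```

With identical zero proportions in both conditions the statistic is 0 and the p-value is 1, which is a useful sanity check when adapting the sketch.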

hESC types

Undifferentiated: H1, H9
Differentiated: NPC, DEC

[Figure: Expression of selected genes in DEC and H1 cells. y-axis: log(EC+1); x-axis: DEC, H1. (A) scDD-exclusive Genes: DZ: SLAMF7, DP: FASTKD3, DM: KCNE3, DB: NCOA3. (B) Cell Cycle Genes: CHEK2, CDK7. (C) Pluripotency Genes: FOXP1, PSMD12.]

471 DD genes not detected by SCDE or MAST are enriched for complex patterns (1 gene categorized as DE)


Figure 1 Schematic of the presence of two cell states within a cell population which can lead to bimodal expression distributions. (A) Time series of the underlying expression state of gene X in a population of unsynchronized single cells, which switches back and forth between a low and high state with mean µ1 and µ2, respectively. The color of cells at each time point corresponds to the underlying expression state. (B) Population of individual cells shaded by expression state of gene X at a snapshot in time. (C) Histogram of the observed expression level of gene X for the cell population in (B).

Table 5 Number of DD genes identified in the hESC case study data for scDD, SCDE, and MAST. Note that the Total for scDD includes genes detected as DD but not categorized.

             scDD
Comparison   DE    DP   DM   DB   DZ    Total  SCDE  MAST
H1 vs NPC    1686  270  902  440  1603  5555   2921  5887
H1 vs DEC    913   254  890  516  911   5295   1616  3724
NPC vs DEC   1242  327  910  389  2021  5982   2147  5624
H1 vs H9     260   55   85   37   145   739    111   1119
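Because the scDD Total includes genes detected as DD but not categorized, the number of uncategorized genes in each comparison follows by subtraction. A short Python sketch of that arithmetic, using the scDD columns of Table 5 (the variable names are illustrative):

```python
# scDD columns from Table 5: comparison -> (DE, DP, DM, DB, DZ, Total).
table5 = {
    "H1 vs NPC": (1686, 270, 902, 440, 1603, 5555),
    "H1 vs DEC": (913, 254, 890, 516, 911, 5295),
    "NPC vs DEC": (1242, 327, 910, 389, 2021, 5982),
    "H1 vs H9": (260, 55, 85, 37, 145, 739),
}

# Uncategorized = Total minus the sum of the five categorized counts.
uncategorized = {name: row[5] - sum(row[:5]) for name, row in table5.items()}
for name, count in uncategorized.items():
    print(f"{name}: {count} detected as DD but not categorized")
```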

Table 6 Number of DD genes identified in the myoblast and mESC case studies for scDD and MAST. Note that the Total for scDD includes genes detected as DD but not categorized.

                      scDD
Comparison            DE    DP  DM    DB    DZ    Total  MAST
Myoblast: T0 vs T72   312   44  200   36    1311  2134   2904
mESC: Serum vs 2i     5233  76  1259  1128  670   9130   9706

Differentially expressed genes detected by each method

H1 vs DEC

Cyclin genes expressed constitutively in hESCs, oscillatory in differentiated cell types

PSMD12 encodes a subunit of the proteasome complex vital to maintenance of pluripotency and has shown decreased expression in differentiating hESCs

Date post:	01-Jan-2021