Computational Immunology

Steven H. Kleinstein

Department of Pathology, Yale University School of Medicine

You can still register until

April 28, 2008

Introductory!

Outline

Cover three broad topics – “new” computational methods

• Promoter Analysis / Cis-regulatory Analysis
  • Over-representation
  • Gene Set Enrichment Analysis

• Multiple Hypothesis Testing
  • Bonferroni
  • False Discovery Rate

• Dynamic Modeling
  • Labeling Models
  • Viral Dynamics

Promoter Analysis

Hands-on Mini course on May 1, 2008 @ 1PM

Sridhar Hannenhalli

Penn Center for Bioinformatics

Department of Genetics,University of Pennsylvania

To register:

http://tsb.mssm.edu/cgi-bin/g/reg/InSilico/reg.cgi

Identifying regulators of TLR responses

Hypothesize that clustered genes are co-regulated and that they share cis-regulatory elements


Temporal activation of macrophages by TLR4 agonist bacterial lipopolysaccharide (LPS)

K-means clustering defined 11 groups of genes comprising regulated ‘waves’ of transcription

Can we identify TFs driving B cell differentiation?

Target genes identified by presence of binding sites

Implicate TFs by analyzing behavior of target genes

[Figure: gene expression matrix; rows are genes, columns are experiments (B cell subsets: Naive, GC, Memory)]

If the genes targeted by a particular transcription factor are differentially expressed, then that transcription factor is likely to play a role

DNA Sequence Motifs for TF Binding Sites

For prediction of new sites, need to account for conservation

Short, recurring patterns in DNA with presumed biological function

Collection of binding sites (ROX1)

Consensus sequence

Frequency Matrix

Nature Biotechnology 24, 423 - 425 (2006)

Measuring Conservation in the Binding Site

Can be corrected for background frequencies (biased GC)

Information content measures conservation at each site i

[Figure: sequence ATGATCAATAAA with information content per position (axis 0 to 2)]

Total information content related to probability of finding motif in ‘random’ DNA sequence

[Equation labels: frequency of base b at position i; entropy or Shannon information]
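The equation on this slide did not survive extraction. A minimal sketch, assuming the standard definition for DNA with a uniform background, I_i = 2 + Σ_b f(b,i) log2 f(b,i), where f(b,i) is the frequency of base b at position i. The aligned sites below are hypothetical and are not the ROX1 sites from the slide.

```python
import math
from collections import Counter

def frequency_matrix(sites):
    """Per-position base frequencies from equal-length aligned sites."""
    n = len(sites)
    columns = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        columns.append({b: counts.get(b, 0) / n for b in "ACGT"})
    return columns

def information_content(column):
    """Bits of information at one position (uniform background assumed)."""
    entropy = -sum(f * math.log2(f) for f in column.values() if f > 0)
    return 2.0 - entropy

# Hypothetical aligned binding sites, for illustration only.
sites = ["ACGT", "ACGA", "ACTT", "ACGT"]
profile = [information_content(col) for col in frequency_matrix(sites)]
```

A fully conserved position scores 2 bits and a position where all four bases are equally frequent scores 0 bits; the total information content is the sum over positions.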

Sequence Logos

http://weblogo.berkeley.edu/

Visual expression of frequency and information content

Total information content related to probability of finding motif in ‘random’ DNA sequence

The TRANSFAC Database

Current version contains 834 matrices (601 vertebrate)

Eukaryotic transcription factors and their genomic binding sites

TRANSFAC has public (older version) and commercial (more features) versions

Other (free) possibilities:

The TRANSFAC Database

Assumes positions are independent

Eukaryotic transcription factors and their genomic binding sites

Frequency of nucleotide B at position i of the matrix (B ∈ {A, T, G, C})

Information Vector (higher for conserved positions)

MATCH Score

CCCTGACGTCAACG

CCCTGACGTCAACG

Identifying putative TF binding sites

Threshold can be determined by looking at “random” DNA

Search by scanning the promoter region

MacIsaac KD, Fraenkel E (2006) Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2: e36.
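Scanning a promoter with a matrix can be sketched as follows. This is a plain log-odds position weight matrix with a pseudocount, not the TRANSFAC MATCH score; the training sites, pseudocount, and threshold are hypothetical choices, and the scanned sequence is the one shown on the MATCH slide.

```python
import math

def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Log2-odds matrix from aligned sites (positions treated as independent)."""
    matrix = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background)
                       for b in "ACGT"})
    return matrix

def scan(sequence, matrix, threshold):
    """Slide the matrix along the sequence; report windows at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical training sites; threshold chosen by hand for this toy case.
pwm = build_log_odds(["TGACGTCA"] * 4)
hits = scan("CCCTGACGTCAACG", pwm, threshold=8.0)
```

In practice the threshold would be calibrated on “random” or background DNA, as the slide suggests, rather than set by hand.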

Identifying TF Target Genes

‘Gene Sets’ of target genes for each transcription factor

Look 2 Kb up/down-stream of transcription start site

1. Extract genomic sequence (+/- 2Kb around TSS)

2. Identify conserved regions (Human/Mouse/Rat/Dog)

3. Scan conserved regions for potential binding sites

          TF 1   TF 2   …   TF M
Gene 1
Gene 2
…
Gene N

Matrix linking transcription factors and potential target genes

Gene Sets of Transcription Factor Targets

Gene sets can also be defined manually

Molecular Signatures Database at Broad Institute (http://www.broad.mit.edu/gsea/msigdb)

ATP6V0A1 RPIP8 POU4F3 FLJ42486 L1CAM SLC17A6 TRIM9 MAPK11 DDX25 SNAP25 DRD3 FGF12 COL5A3 SYT4 BDNF POMC GABRB3 TMEM22 GRM1 HES1

MGAT5B TCF1 PCSK2 FLJ44674 VIP FLJ38377 ZNF335 GABRG2 LHX3 DNER CHKA NEFH ZNF579 CHAT SCAMP5

CDKN2B SST OGDHL KCNH4 SEZ6 GLRA1 HTR1A RPH3A PRG3 NPPB FGD2 RNF13 SYT6 CHGA SLC12A5 ELAVL3 KCNH8 GDAP1L1 HCN1 DRD2 HCN3 PAQR4 CALB1 BARHL1 SCN3B CRYBA2 TNRC4 VGF RASGRF1 NEF3 OMG KCNIP2 CDK5R1 ATP2B2 HTR5A PHYHIPL SARM1 GHSR INA PTPRN DBC1 CSPG3 CHRNB2 GRIN1 STMN2 POU4F2 APBB1 GLRA3

V$NRSF_01 (Neuron Restrictive Silencing Factor)

Genes with promoter regions [-2kb,2kb] around transcription start site containing the motif TTCAGCACCACGGACAGMGCC which matches annotation for REST: RE1-silencing transcription factor

Are ATF3 targets over-represented in Cluster 6?

Which transcription factors are driving dynamics of each cluster?

Temporal activation of macrophages by TLR4 agonist bacterial lipopolysaccharide (LPS)

Over-Representation Analysis

If you draw n marbles at random, what is the probability of k red ones?

[Figure: N total marbles, K red and N-K black; a pick of n marbles contains k red and n-k black]

Hypergeometric Distribution

Significance by Hypergeometric Distribution

$$P(k \mid n, K, N) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$

Over-Representation Analysis

Must choose threshold to define “differential expression”

Is set of TF targets over-represented in differentially expressed genes?

[Figure: of all N genes, K have a TFBS and N-K do not; of the n differentially expressed genes, k have a TFBS and n-k do not]

$$P(k \mid n, K, N) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$

Over-Representation Analysis

Must choose threshold to define “differential expression”

Assume 17 genes in cluster, 5 with binding site…

[Figure: of all 1000 genes, 100 have a TFBS and 1000-100 do not; of the 17 genes in the cluster, 5 have a TFBS and 17-5 do not]

$$P(5 \mid 17, 100, 1000) = \frac{\binom{100}{5}\binom{1000-100}{17-5}}{\binom{1000}{17}} = 0.017$$

$$P(x \ge 5 \mid 17, 100, 1000) = \sum_{x=5}^{17} P(x \mid 17, 100, 1000) = 0.02$$
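The worked example (17 genes in the cluster, 5 with a binding site, 100 of 1000 genes with a binding site overall) can be checked with a short sketch. The function names are illustrative; the probability mass function is the hypergeometric distribution P(k | n, K, N) = C(K,k) C(N-K,n-k) / C(N,n).

```python
from math import comb

def hypergeom_pmf(k, n, K, N):
    """Probability of exactly k marked items in a draw of n from N with K marked."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def hypergeom_tail(k, n, K, N):
    """Probability of k or more marked items (the over-representation P value)."""
    return sum(hypergeom_pmf(x, n, K, N) for x in range(k, min(n, K) + 1))

p_exact = hypergeom_pmf(5, 17, 100, 1000)   # should be close to the slide's 0.017
p_tail = hypergeom_tail(5, 17, 100, 1000)   # should be close to the slide's 0.02
```

The tail probability, not the single-point probability, is the quantity normally reported as the significance of over-representation.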

Gene Set Enrichment Analysis (GSEA)

Does not require a threshold for differential expression

Are TF targets enriched among most differentially expressed?

(Subramanian et al, PNAS, 2005)

[Figure: genes ranked by signal-to-noise between classes A and B, with the running sum (KS-like statistic) along the ranked list]

Gene Set Enrichment Analysis (GSEA)

Permute class labels or genes to estimate null distribution

What is distribution for enrichment score (ES) under null hypothesis?

[Figure: random permutations of data, calculate ES for each, giving the distribution of ES values for “random” data]

P value is fraction of “random” data with higher ES
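The running-sum statistic and its permutation P value can be sketched as below. This is a simplified, unweighted Kolmogorov-Smirnov-like score with gene permutation, not the weighted statistic of Subramanian et al.; the gene names and gene set are hypothetical, and the gene set is assumed to be non-empty and smaller than the ranked list.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of an unweighted running sum."""
    n_hit = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Fraction of gene-order permutations with ES at least the observed ES."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(shuffled, gene_set) >= observed:
            count += 1
    return observed, count / n_perm

# Hypothetical ranking: 100 genes, with the 10 set members ranked first.
ranked = [f"g{i}" for i in range(100)]
targets = {f"g{i}" for i in range(10)}
es, p_value = permutation_p_value(ranked, targets)
```

Permuting class labels instead of genes preserves gene-gene correlation and is generally preferred when there are enough samples per class.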

GSEA Example: SHM Targeting

E2A binding sites enriched among AID-targeted genes

Are particular motifs over-represented among mutated genes?

Other Applications of Gene Set Enrichment Analysis


Molecular Signatures Database at Broad Institute


Gene sets from the pathway databases.

• BioCarta http://www.biocarta.com
• Signaling pathway database http://www.grt.kyushu-u.ac.jp/spad/menu.html
• Signaling gateway http://www.signaling-gateway.org/
• Signal transduction knowledge environment http://stke.sciencemag.org/
• Human protein reference database http://www.hprd.org/
• GenMAPP http://www.genmapp.org/
• KEGG http://www.genome.jp/kegg/
• Gene ontology http://www.geneontology.org
• Sigma-Aldrich pathways http://www.sigmaaldrich.com
• Gene arrays, BioScience Corp http://www.superarray.com/
• Human cancer genome anatomy consortium http://cgap.nci.nih.gov/
• NetAffx http://www.affymetrix.com/index.affx

Multiple Testing

P value cutoff (α) controls type I error

P values are not adequate when number of tests is large

Type I error (False Positive): the error of rejecting a null hypothesis when it is actually true

                    P ≤ α                 P > α
Null True           False Positive (FP)   True Negative (TN)
Alternative True    True Positive (TP)    False Negative (FN)

If the probability of rejecting a single hypothesis by mistake is no more than α = 5%, then out of 100 tests, 5 are expected to be significant even if there are no differences

Family-wise error rate (FWER)

                    P ≤ α                 P > α
Null True           False Positive (FP)   True Negative (TN)
Alternative True    True Positive (TP)    False Negative (FN)

Too conservative if expect many significant features (e.g., microarray)

Pr[FP ≥ 1]: probability of rejecting at least one true null hypothesis by mistake is no more than α

Bonferroni correction: require P < α/m, where m is the number of tests performed

So if α = 0.05 and m = 1000 tests, then we require P < 0.00005

False discovery rate (FDR)

$$\mathrm{FDR} = \frac{\#\,\text{False Positives}}{\#\,\text{Significant Features}} = \frac{FP}{FP + TP}$$

q value for particular feature is expected proportion of false positives incurred when calling that feature significant.

Expected proportion of false positive results among rejected hypotheses

So if FDR=0.05 and m=1000 tests, then we expect 5% of significant results to be false positives

So, if 100 significant results then expect 5 are false positives

Comparison of Methods

Conservative, controls FDR no matter how many of the m tests are true null cases (m0)

Threshold P values when 50 tests are performed with α = 0.05

[Figure: threshold P values by method, including FDR]
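The thresholds being compared can be tabulated with a short sketch. Bonferroni uses a single cutoff α/m, while the Benjamini and Hochberg step-up procedure compares the i-th smallest P value with α·i/m. Function names and the example P values are illustrative.

```python
def bonferroni_threshold(alpha, m):
    """Single cutoff applied to every test."""
    return alpha / m

def bh_thresholds(alpha, m):
    """Rank-dependent cutoffs for the sorted P values (rank 1 = smallest)."""
    return [alpha * rank / m for rank in range(1, m + 1)]

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the test is called significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            n_reject = rank          # step-up: keep the largest passing rank
    rejected = [False] * m
    for idx in order[:n_reject]:
        rejected[idx] = True
    return rejected

# Hypothetical P values for ten tests.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(example, alpha=0.05)
```

With 50 tests and α = 0.05, the Bonferroni cutoff is 0.001 for every test, while the Benjamini and Hochberg cutoffs rise from 0.001 for the smallest P value to 0.05 for the largest.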

Benjamini & Hochberg FDR is Conservative

Could improve if estimate the proportion of true null cases (m0/m)

Controls FDR no matter how many of the m tests are true null cases (m0)

Actually controls FDR at:

$$\alpha \, \frac{m_0}{m}$$

So if α = 0.05 and the null hypothesis is always true (m0/m = 1.0), then we control at:

$$\alpha \, \frac{m_0}{m} = 0.05 \times 1.0 = 0.05$$

but if the null hypothesis is really false in 20% of tests (m0/m = 0.8), then we control at:

$$\alpha \, \frac{m_0}{m} = 0.05 \times 0.8 = 0.04$$

Estimating the False Discovery Rate (FDR)

Fraction ‘null’ P values estimated by flat part of density histogram

Estimating the proportion of true null cases (Storey and Tibshirani, PNAS 2003)

[Figure: density histogram of P values; x axis: P value, y axis: density of P values]

P values have uniform distribution under null hypothesis
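Because null P values are uniform, the count of P values above a cutoff λ estimates the height of the flat part of the histogram. A minimal sketch using a single fixed λ; Storey and Tibshirani smooth the estimate over a range of λ, which is omitted here. The P values are synthetic.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the proportion of true nulls (m0/m) from P values above lam."""
    m = len(p_values)
    n_above = sum(1 for p in p_values if p > lam)
    return min(1.0, n_above / (m * (1.0 - lam)))

# Synthetic data: 80 evenly spread "null" P values plus 20 very small ones.
null_p = [(i + 0.5) / 80 for i in range(80)]
alt_p = [1e-4] * 20
pi0 = estimate_pi0(null_p + alt_p, lam=0.5)
```

The estimate of m0/m can then replace the conservative assumption m0/m = 1 when converting P values to q values.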

Multiple Testing Correction

Control of FWER only suitable if penalty of making even one type I error is severe

P values are not adequate when number of tests is large

Family-wise error rate (FWER) = Pr[FP ≥ 1]: probability of rejecting at least one true null hypothesis by mistake is no more than α

• Bonferroni
• Sequential Bonferroni (Holm’s step-down)

False discovery rate (FDR) = E[FP/(FP+TP)] = E[False Positives / Significant]: expected proportion of false positive results among rejected hypotheses

• Benjamini and Hochberg
• Storey and Tibshirani

BrdU Labeling Models

BrdU (Bromodeoxyuridine): thymidine analog incorporated into DNA of dividing cells during S-phase

How to estimate proliferation rate?

Flow cytometry to quantify labeled B cells…

BrdU incorporated during S phase

science.csustan.edu/confocal/Images/SCE/index.SCE.htm

BrdU labeling of CD4+ and CD8+ T lymphocytes in an SIV-infected and an uninfected macaque. Data are from Mohri et al., Science (1998)

Is there a difference in cell turnover?

Model of BrdU Labeling

Often can assume population in steady-state (i.e., constant)

Start with a basic model of cell population dynamics…

Rate of change in B cell population:

$$\frac{dB}{dt} = s + pB - dB$$


We can express these as sets of ordinary differential equations

Many experiments stop administering label after some time

$$\frac{dB_U}{dt} = s_U - pB_U - dB_U$$

$$\frac{dB_L}{dt} = s_L + 2pB_U + pB_L - dB_L$$

Solve or simulate these equations over time

Split the B cell population into Labeled (BL) and Unlabeled (BU) subsets

[Figure: labeled (BL) and unlabeled (BU) B cell compartments]


$$f_L(t) = A_1\left(1 - e^{-(d+p)t}\right)$$

$$A_1 = 1 - \frac{s_U (d - p)}{(s_U + s_L)(d + p)}$$

Labeling curve reflects both proliferation AND death

Many experiments stop administering label after some time

[Figure: labeled (BL) and unlabeled (BU) B cell compartments]
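A minimal sketch that checks the labeling curve f_L(t) = A_1(1 - e^{-(d+p)t}) against a direct simulation of the labeled and unlabeled equations. It assumes the population starts at steady state and fully unlabeled; the Euler step size is an arbitrary choice, and the parameter values in the usage line are borrowed from the simulated experiment later in the deck with all inflow assumed unlabeled.

```python
import math

def labeled_fraction_uptake(t, s_u, s_l, p, d):
    """Closed-form fraction labeled during continuous BrdU administration."""
    a1 = 1.0 - s_u * (d - p) / ((s_u + s_l) * (d + p))
    return a1 * (1.0 - math.exp(-(d + p) * t))

def simulate_uptake(t_end, s_u, s_l, p, d, dt=0.01):
    """Euler integration of the unlabeled (b_u) and labeled (b_l) populations."""
    b_u = (s_u + s_l) / (d - p)   # steady-state total, all unlabeled at t = 0
    b_l = 0.0
    for _ in range(int(round(t_end / dt))):
        db_u = s_u - p * b_u - d * b_u
        db_l = s_l + 2 * p * b_u + p * b_l - d * b_l
        b_u += dt * db_u
        b_l += dt * db_l
    return b_l / (b_u + b_l)

# Rates per hour; s_l = 0 is an assumption made for this illustration.
closed = labeled_fraction_uptake(24.0, s_u=0.003, s_l=0.0, p=0.01, d=0.013)
numeric = simulate_uptake(24.0, s_u=0.003, s_l=0.0, p=0.01, d=0.013)
```

The uptake rate constant is d + p, so the labeling phase alone cannot separate proliferation from death.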

Model of BrdU DE-Labeling

Can estimate proliferation AND death

Stop administering label after some time (te)

[Figure: labeled (BL) and unlabeled (BU) B cell compartments]

$$f_L(t) = A_2 + A_3 \, e^{-(d-p)(t - t_e)}, \qquad A_2 = \frac{s_L}{s_U + s_L}, \qquad A_3 = f_L(t_e) - A_2$$
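The full labeling and de-labeling curve can be sketched as a piecewise function: uptake at rate d + p until te, then relaxation toward A_2 at rate d - p. Parameter values in the usage lines are hypothetical.

```python
import math

def labeled_fraction(t, t_e, s_u, s_l, p, d):
    """Fraction labeled at time t when BrdU is withdrawn at time t_e."""
    a1 = 1.0 - s_u * (d - p) / ((s_u + s_l) * (d + p))
    if t <= t_e:
        return a1 * (1.0 - math.exp(-(d + p) * t))
    f_te = a1 * (1.0 - math.exp(-(d + p) * t_e))   # value at withdrawal
    a2 = s_l / (s_u + s_l)                         # long-time plateau
    a3 = f_te - a2
    return a2 + a3 * math.exp(-(d - p) * (t - t_e))

# Hypothetical rates (per hour) and withdrawal time (hours).
params = dict(t_e=24.0, s_u=0.001, s_l=0.002, p=0.01, d=0.013)
at_withdrawal = labeled_fraction(24.0, **params)
long_run = labeled_fraction(1e5, **params)
```

Fitting both phases gives d + p from the uptake and d - p from the decay, which is why de-labeling data allow proliferation and death to be estimated separately.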

[Figure: histogram of bootstrap frequency versus difference in proliferation rates (B1-8 - V23)]

Interaction of Computation & Experiment

Continuous cycle of modeling and experimentation

Compare simulation and experiment using least-squares objective

Least-squares objective function:

$$E = \sum_i \frac{(y_i - \hat{y}_i)^2}{\mathrm{VAR}(y_i)}$$

[Figure: cycle linking Experimental Observations, Fit Model to Data, Computational Model (BU and BL compartments with rates s, p, d), Model Predictions, New Experiments, and Bootstrapping Confidence Intervals]

Simulated Experiment

How can we estimate underlying rates?

Demonstrate full cycle of fitting model to data to estimate parameters

Parameters used to create synthetic data

s = 0.003 per hour

p = 0.01 per hour

d = p + s (to achieve steady state)

Random noise added to each data point

[Figure: fraction labeled versus time (hours), 0 to 80 hours, with the time of BrdU withdrawal marked]

Fitting the Model to Experimental Data

Many options for how to optimize the fit

Compare simulation and experiment using least-squares objective

Least-squares objective function:

$$\text{Error} = \sum_i r_i^2$$

Find parameters to minimize error

Difference between observed and predicted values:

$$r_i = f_L(t_i) - \hat{f}_L(t_i)$$
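A minimal sketch of the fitting step, using a brute-force grid search in place of a real optimizer. It assumes all inflow is unlabeled (s_U = s, s_L = 0) and d = p + s, so the uptake rate is 2p + s and the decay rate after withdrawal is s; the time points, withdrawal time, and grids are hypothetical, and the synthetic data are noise-free.

```python
import math

def model(t, s, p, t_e):
    """Fraction labeled under the stated assumptions (s_L = 0, d = p + s)."""
    rate_up = 2 * p + s
    plateau = 2 * p / rate_up
    if t <= t_e:
        return plateau * (1 - math.exp(-rate_up * t))
    f_te = plateau * (1 - math.exp(-rate_up * t_e))
    return f_te * math.exp(-s * (t - t_e))

def rss(params, times, observed, t_e):
    """Sum of squared residuals between observations and model predictions."""
    s, p = params
    return sum((y - model(t, s, p, t_e)) ** 2 for t, y in zip(times, observed))

def grid_search(times, observed, t_e, s_grid, p_grid):
    """Return the (s, p) pair on the grid with the smallest error."""
    return min(((s, p) for s in s_grid for p in p_grid),
               key=lambda sp: rss(sp, times, observed, t_e))

times = [6, 12, 24, 36, 48, 72]                       # hypothetical sampling
data = [model(t, 0.003, 0.01, t_e=24) for t in times]  # noise-free synthetic data
s_grid = [0.0005 * i for i in range(1, 13)]
p_grid = [0.005 + 0.0005 * i for i in range(21)]
s_hat, p_hat = grid_search(times, data, 24, s_grid, p_grid)
```

A grid search is slow but cannot get trapped in a local minimum within the grid; a local optimizer would normally refine the best grid point.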

Local and Global Optimization

Global optimization attempts to avoid local minima

Local optimization techniques find optimal fit around given starting point

[Figure: error in fit versus parameter value, showing a local and a global minimum]

[Figure: fraction labeled versus time (hours), 0 to 80 hours]

Optimal Parameter Estimates

Is inflow necessary to fit the data? Can we use simpler model?

Least-squares fit using lsqnonlin in MATLAB

Parameter estimates

s = 0.0017 per hour

p = 0.0099 per hour

Plot local curvature to check minimization…

[Figure: two panels of objective function (×10^-3) versus s and versus p, x axis 0.95 to 1.05]

Is inflow (s) significant?

                     Observations   Parameters   RSS       F test (1-fcdf in MATLAB)
(1) No flow          6              1            9.38e-7
(2) Including flow   6              2            0.95e-7   53.1 (p<0.0004)

Inflow is important to explain observations

[Figure: model (1) with compartment B and rates p, d; model (2) with compartment B and rates s, p, d]

$$F = \frac{(RSS_{smaller} - RSS_{larger}) / (df_{smaller} - df_{larger})}{RSS_{larger} / df_{larger}}$$

Numerator: reduction in RSS per extra parameter

Denominator: measure of ‘noise’ in model

Bootstrapping Parameter Confidence Intervals

Bootstrapping observations also possible – asymptotically equivalent

1) Fit model to data to obtain parameter estimates
2) Draw a bootstrap sample of the residuals
3) Create bootstrap sample of observations by adding randomly sampled residual to predicted value of each observation

Repeat 1000x: estimate parameters for bootstrap samples

[Figure: fraction labeled versus time (hours), 0 to 80 hours, with residuals ri marked]

Bootstrapping Parameter Confidence Intervals

May not have correct coverage when sampling distribution skewed

Percentile Method

Contains 95% of the estimates

Calculate the parameter for each bootstrap sample and select α (e.g., 0.05)

LCL = (α/2)th percentile.

UCL = (1-α/2)th percentile.

Parameter estimates for synthetic data

Estimate of s = 0.0017 [0.0009,0.0030]

Estimate of p = 0.0099 [0.0095,0.0100]

[Figure: histogram of bootstrap runs versus proliferation rate (p), x axis 0.009 to 0.012]
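The residual bootstrap with percentile confidence limits can be sketched as below. To keep it short, a straight line through the origin stands in for the labeling model; the data and the fitting function are hypothetical, and only the resampling logic is the point.

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope for a line through the origin (stand-in model)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def residual_bootstrap_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval from bootstrapped residuals."""
    rng = random.Random(seed)
    slope = fit_slope(xs, ys)                          # 1) fit model to data
    predicted = [slope * x for x in xs]
    residuals = [y - yhat for y, yhat in zip(ys, predicted)]
    estimates = []
    for _ in range(n_boot):
        # 2) and 3) add a randomly drawn residual to each predicted value
        resampled = [yhat + rng.choice(residuals) for yhat in predicted]
        estimates.append(fit_slope(xs, resampled))     # refit on bootstrap sample
    estimates.sort()
    lower = estimates[int(n_boot * alpha / 2)]              # (alpha/2)th percentile
    upper = estimates[int(n_boot * (1 - alpha / 2)) - 1]    # (1-alpha/2)th percentile
    return slope, lower, upper

# Hypothetical data: true slope 2 plus small fixed perturbations.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, -0.05, 0.2, -0.15, 0.0, 0.1]
ys = [2 * x + e for x, e in zip(xs, noise)]
slope, lcl, ucl = residual_bootstrap_ci(xs, ys)
```

For the labeling model, `fit_slope` would be replaced by the nonlinear least-squares fit, and each parameter would get its own percentile interval.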

Viral Dynamics

Hepatitis C Viral Dynamics and Interferon-α Therapy

Short delay followed by biphasic decline in viral load

Modeling 23 patients during 14 days of therapy (daily doses)

Model of Hepatitis C Viral Dynamics

Before therapy, virus load is approximately constant

Includes virus along with target (T) and infected (I) cells

Model of Hepatitis C Viral Dynamics

Before therapy, virus load is approximately constant

[Figure: model diagram with Target Cells, Infected Cells, and Virus (HCV RNA)]

Model of Interferon-α Therapy

Therapy can reduce the rate of infection, or production of virions

[Figure: model diagram with Target Cells, Infected Cells, and Virus (HCV RNA); X marks show the steps that therapy can block]
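The model equations did not survive extraction. A minimal sketch, assuming the commonly used target cell, infected cell, and virus model with therapy blocking virion production with efficacy ε and with target cells held constant. In scaled variables (v = V/V0, i = I/I0) this gives dv/dt = (1-ε)c·i - c·v and di/dt = δ·v - δ·i. The clearance rate c, infected cell loss rate δ, and ε below are illustrative values, not the fitted patient estimates, and the short pharmacological delay is ignored.

```python
import math

def simulate_hcv(days, epsilon, c=6.0, delta=0.14, dt=0.001):
    """Euler integration of scaled viral load; returns a list of (day, v)."""
    v, i = 1.0, 1.0                 # pre-therapy steady state, scaled to 1
    trajectory = [(0.0, v)]
    for step in range(1, int(round(days / dt)) + 1):
        dv = (1.0 - epsilon) * c * i - c * v   # production blocked by epsilon
        di = delta * v - delta * i             # infection minus cell loss
        v += dt * dv
        i += dt * di
        trajectory.append((step * dt, v))
    return trajectory

def viral_load_at(trajectory, day, dt=0.001):
    """Look up the scaled viral load at a given day."""
    return trajectory[int(round(day / dt))][1]

traj = simulate_hcv(14, epsilon=0.9)
v2, v7, v14 = (viral_load_at(traj, d) for d in (2, 7, 14))
second_phase_slope = math.log(v7 / v14) / 7.0   # per day
```

In this sketch the first phase drops the viral load to roughly (1-ε) of baseline within a day or two, and the second-phase slope is somewhat below δ, approaching δ as ε approaches 1.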


Average virion production rate of 1.3 × 10^12 virions per day

Patients with undetectable HCV after 3 months of therapy (filled symbols) had significantly faster cell death rates

Suggests immune control has important role in lowering viral load

HCV Viral Kinetics : Summary

• Biphasic clearance of serum HCV RNA

• 1st phase rapid; depends on IFN-α dose

– This appears to be due to dose-dependent efficacy in blocking HCV production

– Possible to estimate efficacy from measurements of HCV RNA decline over first few days of therapy!

• 2nd phase slower. Slope appears to be a measure of rate of infected cell loss

Slide from Alan Perelson

