Computational Immunology
Steven H. Kleinstein
Department of Pathology, Yale University School of Medicine
You can still register until
April 28, 2008
Introductory!
OUTLINE
Cover three broad topics: “new” computational methods
• Promoter Analysis / Cis-regulatory Analysis
  • Over-representation
  • Gene Set Enrichment Analysis
• Multiple Hypothesis Testing
  • Bonferroni
  • False Discovery Rate
• Dynamic Modeling
  • Labeling Models
  • Viral Dynamics
Promoter Analysis
Hands-on Mini course on May 1, 2008 @ 1PM
Sridhar Hannenhalli
Penn Center for Bioinformatics
Department of Genetics,University of Pennsylvania
To register:
http://tsb.mssm.edu/cgi-bin/g/reg/InSilico/reg.cgi
Identifying regulators of TLR responses
Hypothesize that clustered genes are co-regulated and that they share cis-regulatory elements
Temporal activation of macrophages by TLR4 agonist bacterial lipopolysaccharide (LPS)
K-means clustering defined 11 groups of genes comprising regulated ‘waves’ of transcription
Can we identify TFs driving B cell differentiation?
Target genes identified by presence of binding sites
Implicate TFs by analyzing behavior of target genes
[Figure: expression heat map; rows: Gene; columns: Experiment (B cell subset: Naive, GC, Memory)]
If the genes targeted by a particular transcription factor are differentially expressed, then the transcription factor is likely to play a role
DNA Sequence Motifs for TF Binding Sites
For prediction of new sites, need to account for conservation
Short, recurring patterns in DNA with presumed biological function
Collection of binding sites (ROX1)
Consensus sequence
Frequency Matrix
Nature Biotechnology 24, 423 - 425 (2006)
Measuring Conservation in the Binding Site
Can be corrected for background frequencies (biased GC)
Information content measures conservation at each site i
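[Figure: sequence logo for an example site (ATGATCAATAAA); y-axis: information content, 0 to 2 bits]
Information content at position i:
$$I_i = 2 + \sum_{b} f_{b,i} \log_2 f_{b,i}$$
where $f_{b,i}$ is the frequency of base b at position i, and $-\sum_{b} f_{b,i} \log_2 f_{b,i}$ is the entropy (Shannon information)
Total information content related to probability of finding motif in ‘random’ DNA sequence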
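As a rough illustration of the information-content idea, the sketch below computes bits per position from a frequency matrix, assuming a uniform background (no GC correction) and the usual form "2 minus Shannon entropy". The three-position matrix and the function name `information_content` are made up for the example and do not come from the slides.

```python
import math

def information_content(column):
    """Information content (bits) of one motif position, uniform background.

    `column` maps each base to its frequency at this position.
    """
    entropy = -sum(f * math.log2(f) for f in column.values() if f > 0)
    return 2.0 - entropy  # 2 bits is the maximum for a 4-letter alphabet

# Hypothetical frequency matrix for a 3-position motif (illustrative values).
motif = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},      # perfectly conserved
    {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5},      # two equally likely bases
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # no conservation
]

per_position = [information_content(col) for col in motif]
total_information = sum(per_position)
print(per_position, total_information)
```

A fully conserved position contributes the most information and a position with all four bases equally likely contributes none, which is what the height of each stack in a sequence logo conveys.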
Sequence Logos
http://weblogo.berkeley.edu/
Visual expression of frequency and information content
Total information content related to probability of finding motif in ‘random’ DNA sequence
The TRANSFAC Database
Current version contains 834 matrices (601 vertebrate)
Eukaryotic transcription factors and their genomic binding sites
TRANSFAC has public (older version) and commercial (more features) versions
Other (free) possibilities:
The TRANSFAC Database
Assumes positions are independent
Eukaryotic transcription factors and their genomic binding sites
Frequency of nucleotide b_i at position i of the matrix (B ∈ {A, T, G, C})
Information Vector (higher for conserved positions)
MATCH Score
CCCTGACGTCAACG
CCCTGACGTCAACG
Identifying putative TF binding sites
Threshold can be determined by looking at “random” DNA
Search by scanning the promoter region
MacIsaac KD, Fraenkel E (2006) Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2: e36.
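Scanning a promoter with a matrix can be sketched as follows. This is a generic log-odds scan, not the TRANSFAC MATCH score; the consensus `TGACGTCA`, the frequency values, the pseudocount, and the threshold of 10 are all assumptions chosen for illustration. The scanned sequence is the one shown on the slide.

```python
import math

def log_odds_matrix(freqs, background=0.25, pseudo=0.01):
    """Convert per-position base frequencies into log2-odds scores."""
    matrix = []
    for column in freqs:
        scores = {}
        for base in "ACGT":
            # Pseudocount keeps unseen bases from scoring minus infinity.
            smoothed = (column.get(base, 0.0) + pseudo) / (1.0 + 4.0 * pseudo)
            scores[base] = math.log2(smoothed / background)
        matrix.append(scores)
    return matrix

def scan(sequence, matrix, threshold):
    """Return (offset, score) for each window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Positions are treated as independent, so scores simply add up.
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical motif: each position favors one consensus base.
consensus = "TGACGTCA"
freqs = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in consensus]
matrix = log_odds_matrix(freqs)
hits = scan("CCCTGACGTCAACG", matrix, threshold=10.0)
print(hits)
```

In practice the threshold would be calibrated on "random" or background sequence, as the slide suggests, rather than fixed by hand, and both strands would be scanned.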
Identifying TF Target Genes
‘Gene Sets’ of target genes for each transcription factor
Look 2 Kb up/down-stream of transcription start site
1. Extract genomic sequence (+/- 2Kb around TSS)
2. Identify conserved regions (Human/Mouse/Rat/Dog)
3. Scan conserved regions for potential binding sites
TF 1 TF 2 … TF M
Gene 1
Gene 2
…
Gene N
Matrix linking transcription factors and potential target genes
Gene Sets of Transcription Factor Targets
Gene sets can also be defined manually
Molecular Signatures Database at Broad Institute (http://www.broad.mit.edu/gsea/msigdb)
ATP6V0A1 RPIP8 POU4F3 FLJ42486 L1CAM SLC17A6 TRIM9
MAPK11 DDX25 SNAP25 DRD3 FGF12 COL5A3 SYT4
BDNF POMC GABRB3 TMEM22 GRM1 HES1
MGAT5B TCF1 PCSK2 FLJ44674 VIP FLJ38377 ZNF335
GABRG2 LHX3 DNER CHKA NEFH ZNF579 CHAT SCAMP5
CDKN2B SST OGDHL KCNH4 SEZ6 GLRA1 HTR1A
RPH3A PRG3 NPPB FGD2 RNF13 SYT6 CHGA
SLC12A5 ELAVL3 KCNH8 GDAP1L1 HCN1 DRD2 HCN3
PAQR4 CALB1 BARHL1 SCN3B CRYBA2 TNRC4 VGF
RASGRF1 NEF3 OMG KCNIP2 CDK5R1 ATP2B2 HTR5A
PHYHIPL SARM1 GHSR INA PTPRN DBC1 CSPG3
CHRNB2 GRIN1 STMN2 POU4F2 APBB1 GLRA3
V$NRSF_01 (Neuron Restrictive Silencing Factor)
Genes with promoter regions [-2kb,2kb] around transcription start site containing the motif TTCAGCACCACGGACAGMGCC which matches annotation for REST: RE1-silencing transcription factor
Are ATF3 targets over-represented in Cluster 6?
Which transcription factors are driving dynamics of each cluster?
Temporal activation of macrophages by TLR4 agonist bacterial lipopolysaccharide (LPS)
Over-Representation Analysis
If you draw n marbles at random, what is the probability of k red ones?
[Figure: urn with N total marbles, K red and N−K black; a pick of n marbles contains k red and n−k black]
Hypergeometric Distribution
Significance by Hypergeometric Distribution
$$P(k \mid n, K, N) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$
Over-Representation Analysis
Must choose threshold to define “differential expression”
Is the set of TF targets over-represented among differentially expressed genes?
[Figure: all genes (N) split into genes with TFBS (K) and genes without TFBS (N−K); the differentially expressed genes (n) contain k with TFBS and n−k without]
Significance by Hypergeometric Distribution
$$P(k \mid n, K, N) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$
Over-Representation Analysis
Must choose threshold to define “differential expression”
Assume 17 genes in cluster, 5 with binding site…
[Figure: all genes (1000) split into genes with TFBS (100) and genes without TFBS (1000−100); the genes in the cluster (17) contain 5 with TFBS and 17−5 without]
Significance by Hypergeometric Distribution
$$P(5 \mid 17, 100, 1000) = \frac{\binom{100}{5}\binom{1000-100}{17-5}}{\binom{1000}{17}} = 0.017$$
$$P = \sum_{x=5}^{17} P(x \mid 17, 100, 1000) = 0.02$$
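The worked example (1000 genes, 100 with a binding site, 17 in the cluster, 5 of those with a binding site) can be checked with a few lines of standard-library Python. This is a minimal sketch; the function names `hypergeom_pmf` and `hypergeom_tail` are illustrative and not from the slides.

```python
from math import comb

def hypergeom_pmf(k, n, K, N):
    """P(exactly k marked items in a draw of n, given K marked among N)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def hypergeom_tail(k, n, K, N):
    """P(k or more marked items): the over-representation P value."""
    return sum(hypergeom_pmf(x, n, K, N) for x in range(k, min(n, K) + 1))

p_exact = hypergeom_pmf(5, 17, 100, 1000)
p_over = hypergeom_tail(5, 17, 100, 1000)
print(round(p_exact, 3), round(p_over, 3))
```

The tail sum, not the single-point probability, is the quantity normally reported as the significance of over-representation.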
Gene Set Enrichment Analysis (GSEA)
Does not require a threshold for differential expression
Are TF targets enriched among most differentially expressed?
(Subramanian et al, PNAS, 2005)
[Figure: genes ranked by Signal-to-Noise between classes A and B, with the Running Sum (KS-like Statistic) plotted along the ranking]
Gene Set Enrichment Analysis (GSEA)
Permute class labels or genes to estimate null distribution
What is distribution for enrichment score (ES) under null hypothesis?
Random permutations of data
Calculate ES
P value is fraction of “random” data with higher ES
Distribution of ES values for “random” data
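The running-sum statistic and its permutation null can be sketched as below. This is a simplified, unweighted KS-like version with gene permutation, not the weighted statistic of Subramanian et al.; the 20-gene ranking, the gene names, and the target set are invented for illustration.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of an unweighted running sum."""
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    if hits == 0 or misses == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up at a set member, step down otherwise; the walk ends at zero.
        running += 1.0 / hits if gene in gene_set else -1.0 / misses
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Fraction of gene-order permutations with ES at least the observed ES."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(shuffled, gene_set) >= observed:
            at_least += 1
    return observed, at_least / n_perm

# Hypothetical ranking: g0 is the most differentially expressed gene.
ranked = [f"g{i}" for i in range(20)]
targets = {"g0", "g1", "g2", "g4"}
es, p_value = permutation_p_value(ranked, targets)
print(es, p_value)
```

Permuting class labels instead of genes preserves gene-gene correlation and is generally preferred when there are enough samples per class.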
GSEA Example: SHM Targeting
E2A binding sites enriched among AID-targeted genes
Are particular motifs over-represented among mutated genes?
Other Applications of Gene Set Enrichment Analysis
Gene sets can also be defined manually
Molecular Signatures Database at Broad Institute
BioCarta: http://www.biocarta.com
Signaling pathway database: http://www.grt.kyushu-u.ac.jp/spad/menu.html
Signaling gateway: http://www.signaling-gateway.org/
Signal transduction knowledge environment: http://stke.sciencemag.org/
Human protein reference database: http://www.hprd.org/
GenMAPP: http://www.genmapp.org/
KEGG: http://www.genome.jp/kegg/
Gene ontology: http://www.geneontology.org
Sigma-Aldrich pathways: http://www.sigmaaldrich.com
Gene arrays, BioScience Corp: http://www.superarray.com/
Human cancer genome anatomy consortium: http://cgap.nci.nih.gov/
NetAffx: http://www.affymetrix.com/index.affx
Gene sets from the pathway databases.
Multiple Testing
P value cutoff (α) controls type I error
P values are not adequate when number of tests is large
Type I error (False Positive): the error of rejecting a null hypothesis when it is actually true
                    P < α                  P ≥ α
Null True           False Positive (FP)    True Negative (TN)
Alternative True    True Positive (TP)     False Negative (FN)
If the probability of rejecting a single hypothesis by mistake is no more than α = 5%, then out of 100 tests, 5 are expected to be significant even if there are no differences
Family-wise error rate (FWER)
Too conservative if expect many significant features (e.g., microarray)
Pr[FP ≥ 1] ≤ α: the probability of rejecting at least one hypothesis by mistake is no more than α
Bonferroni Correction: require P < α/m, where m is the number of tests performed
So if α = 0.05 and m = 1000 tests, then we require P < 0.00005
False discovery rate (FDR)
$$\mathrm{FDR} = \frac{\#\text{False Positives}}{\#\text{Significant Features}} = \frac{FP}{FP + TP}$$
q value for a particular feature is the expected proportion of false positives incurred when calling that feature significant.
Expected proportion of false positive results among rejected hypotheses
So if FDR=0.05 and m=1000 tests, then we expect 5% of significant results to be false positives
So, if 100 significant results then expect 5 are false positives
Comparison of Methods
Conservative, controls FDR no matter how many of the m tests are true null cases (m0)
Threshold P values when 50 tests are performed with α = 0.05
[Figure: threshold P values by method; the FDR curve is labeled]
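To make the comparison concrete, here is a minimal sketch of the Bonferroni cutoff next to the Benjamini and Hochberg step-up procedure. The ten P values are invented for illustration, and the function names are not from the slides.

```python
def bonferroni_threshold(m, alpha=0.05):
    """Per-test P value cutoff that controls the family-wise error rate."""
    return alpha / m

def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Keep the largest rank whose P value falls under its own threshold.
        if p_values[index] <= rank * alpha / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical P values from 10 tests.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.35, 0.50, 0.62, 0.78, 0.95]
cutoff = bonferroni_threshold(len(p_values))
bonferroni_hits = [i for i, p in enumerate(p_values) if p < cutoff]
bh_hits = benjamini_hochberg(p_values)
print(bonferroni_hits, bh_hits)
```

With these values the Bonferroni cutoff keeps only the smallest P value, while the FDR procedure keeps the three smallest, which reflects the trade-off between the two error criteria.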
Benjamini & Hochberg FDR is Conservative
Controls FDR no matter how many of the m tests are true null cases (m0)
Actually controls FDR at: $\frac{m_0}{m}\alpha$
So if α = 0.05 and the null hypothesis is always true (m0/m = 1.0), then we control at:
$$\frac{m_0}{m}\alpha = 1.0 \times 0.05 = 0.05$$
but if the null hypothesis is really false in 20% of tests (m0/m = 0.8), then we control at:
$$\frac{m_0}{m}\alpha = 0.8 \times 0.05 = 0.04$$
Could improve if we estimate the proportion of true null cases (m0/m)
Estimating the False Discovery Rate (FDR)
Fraction ‘null’ P values estimated by flat part of density histogram
Estimating the proportion of true null cases (Storey and Tibshirani, PNAS 2003)
[Figure: density histogram of P values; x-axis: P Value; y-axis: Density of P Values]
P values have uniform distribution under null hypothesis
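Because null P values are uniform, the height of the flat right-hand part of the histogram estimates the fraction of true nulls. A minimal sketch with a single fixed tuning value λ is shown below; Storey and Tibshirani smooth over a range of λ, which is omitted here. The 100 P values are constructed by hand so that 80% behave like nulls.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the proportion of true nulls from P values above `lam`.

    Null P values are uniform, so about (1 - lam) of them exceed `lam`.
    """
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))

# Hypothetical data: 80 evenly spread "null" P values plus 20 very small ones.
null_like = [(i + 0.5) / 80 for i in range(80)]
signal_like = [0.001] * 20
pi0 = estimate_pi0(null_like + signal_like)
print(pi0)
```

Multiplying the Benjamini and Hochberg threshold by 1/pi0 is one way to recover the power lost to the conservative assumption that every test is a true null.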
Multiple Testing Correction
Control of FWER only suitable if penalty of making even one type I error is severe
P values are not adequate when number of tests is large
Family-wise error rate (FWER) = Pr[FP ≥ 1]: the probability of rejecting at least one hypothesis by mistake is no more than α
• Bonferroni
• Sequential Bonferroni (Holm’s step-down)
False discovery rate (FDR) = E[FP/(FP+TP)] = E[False Positives / Significant]: expected proportion of false positive results among rejected hypotheses
• Benjamini and Hochberg
• Storey and Tibshirani
BrdU Labeling Models
BrdU (Bromodeoxyuridine)
Thymidine analog incorporated into DNA of dividing cells during S-phase
How to estimate proliferation rate?
Flow cytometry to quantify labeled B cells…
BrdU incorporated during S phase
science.csustan.edu/confocal/Images/SCE/index.SCE.htm
BrdU labeling of CD4+ and CD8+ T lymphocytes in an SIV-infected and an uninfected macaque. Data are from Mohri et al., Science (1998)
Is there a difference in cell turnover?
Model of BrdU Labeling
Often can assume population in steady-state (i.e., constant)
Start with a basic model of cell population dynamics…
Rate of change in B cell population, with inflow s, proliferation rate p, and death rate d:
$$\frac{dB}{dt} = s + pB - dB$$
Model of BrdU Labeling
We can express these as sets of ordinary differential equations
Many experiments stop administering label after some time
Model of BrdU Labeling
Split the B cell population into Labeled (BL) and Unlabeled (BU) subsets
$$\frac{dB_U}{dt} = s_U - pB_U - dB_U$$
$$\frac{dB_L}{dt} = s_L + 2pB_U + pB_L - dB_L$$
Solve or simulate these equations over time
Model of BrdU Labeling
$$f_L(t) = A_1\left(1 - e^{-(d+p)t}\right), \qquad A_1 = 1 - \frac{s_U\,(d-p)}{(s_U + s_L)(d+p)}$$
Labeling curve reflects both proliferation AND death
Many experiments stop administering label after some time
Model of BrdU DE-Labeling
Can estimate proliferation AND death
Stop administering label after some time (te)
$$f_L(t) = A_2 + A_3\, e^{-(d-p)(t - t_e)}, \qquad A_2 = \frac{s_L}{s_U + s_L}, \qquad A_3 = f_L(t_e) - A_2$$
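The closed-form labeling and de-labeling curves can be checked against a direct simulation of the labeled and unlabeled populations. The sketch below uses simple Euler steps and assumes that, once label is withdrawn, dividing cells keep their label status (unlabeled cells give unlabeled daughters, labeled cells give labeled daughters); that assumption is implied by the (d − p) decay rate but is not written out in the slides. The source split (s_U = 0.003, s_L = 0) and the withdrawal time of 24 hours are illustrative choices.

```python
import math

def analytic_fraction_labeled(t, s_u, s_l, p, d, t_e):
    """Closed-form fraction labeled during labeling and after withdrawal."""
    total_s = s_u + s_l
    a1 = 1.0 - (s_u * (d - p)) / (total_s * (d + p))
    if t <= t_e:
        return a1 * (1.0 - math.exp(-(d + p) * t))
    f_te = a1 * (1.0 - math.exp(-(d + p) * t_e))
    a2 = s_l / total_s
    a3 = f_te - a2
    return a2 + a3 * math.exp(-(d - p) * (t - t_e))

def simulated_fraction_labeled(t_end, s_u, s_l, p, d, t_e, dt=0.01):
    """Euler integration of the labeled/unlabeled ODEs from steady state."""
    b_u = (s_u + s_l) / (d - p)  # steady-state population, all unlabeled
    b_l = 0.0
    for step in range(int(round(t_end / dt))):
        if step * dt < t_e:  # label present: dividing cells become labeled
            d_u = s_u - p * b_u - d * b_u
            d_l = s_l + 2.0 * p * b_u + p * b_l - d * b_l
        else:  # label withdrawn (assumed: daughters keep parent's status)
            d_u = s_u + p * b_u - d * b_u
            d_l = s_l + p * b_l - d * b_l
        b_u += d_u * dt
        b_l += d_l * dt
    return b_l / (b_u + b_l)

# Hypothetical parameters (per hour); d = p + s keeps the population constant.
params = dict(s_u=0.003, s_l=0.0, p=0.01, d=0.013, t_e=24.0)
for t in (12.0, 24.0, 48.0, 80.0):
    print(t, analytic_fraction_labeled(t, **params),
          simulated_fraction_labeled(t, **params))
```

Agreement between the two columns is a useful sanity check before fitting the closed-form curve to data.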
[Figure: histogram of bootstrap estimates; x-axis: Difference in Proliferation Rates (B1-8 - V23); y-axis: Bootstrap Frequency]
Interaction of Computation & Experiment
Continuous cycle of modeling and experimentation
Compare simulation and experiment using least-squares objective
Least-squares objective function:
$$E = \sum_i \frac{(y_i - \hat{y}_i)^2}{\mathrm{VAR}(y_i)}$$
[Diagram: cycle linking Experimental Observations, Fit Model to Data, Bootstrapping Confidence Intervals, Computational Model (rates s, p, d; populations BU and BL), Model Predictions, and New Experiments]
Simulated Experiment
How can we estimate underlying rates?
Demonstrate full cycle of fitting model to data to estimate parameters
BrdU withdrawn
Parameters used to create synthetic data
s = 0.003 per hour
p = 0.01 per hour
d = p + s (to achieve steady state)
Random noise added to each data point
[Figure: synthetic data; x-axis: Time (hours), 0 to 80; y-axis: Fraction Labeled, 0 to 0.4]
Fitting the Model to Experimental Data
Many options for how to optimize the fit
Compare simulation and experiment using least-squares objective
Least-squares objective function: $\mathrm{Error} = \sum_i r_i^2$
Find parameters to minimize error
Difference between observed and predicted values: $r_i = f_L(t_i) - \hat{f}_L(t_i)$
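A least-squares fit of this kind can be sketched without any optimization library by repeatedly zooming a parameter grid around the best point. This is a stand-in for a proper optimizer (the slides use lsqnonlin in MATLAB), and it makes several assumptions: the population is normalized to 1, d = p + s, the inflow is entirely unlabeled, label is withdrawn at 24 hours, and the six time points and noise-free observations are synthetic.

```python
import math

def predict(t, s, p, t_e):
    """Fraction labeled, assuming B = 1, d = p + s, and unlabeled inflow."""
    rate_up = 2.0 * p + s        # d + p under these assumptions
    plateau = 2.0 * p / rate_up  # A1 under these assumptions
    if t <= t_e:
        return plateau * (1.0 - math.exp(-rate_up * t))
    f_te = plateau * (1.0 - math.exp(-rate_up * t_e))
    return f_te * math.exp(-s * (t - t_e))  # d - p = s, and A2 = 0

def sum_squared_residuals(params, times, observed, t_e):
    """Least-squares objective: sum of squared observed-minus-predicted."""
    s, p = params
    return sum((y - predict(t, s, p, t_e)) ** 2 for t, y in zip(times, observed))

def grid_search(times, observed, t_e, rounds=8, points=11):
    """Zooming grid search over (s, p); a crude substitute for an optimizer."""
    s_lo, s_hi, p_lo, p_hi = 1e-4, 0.02, 1e-4, 0.05
    best = None
    for _ in range(rounds):
        candidates = [
            (s_lo + i * (s_hi - s_lo) / (points - 1),
             p_lo + j * (p_hi - p_lo) / (points - 1))
            for i in range(points) for j in range(points)
        ]
        best = min(candidates,
                   key=lambda c: sum_squared_residuals(c, times, observed, t_e))
        # Halve the search window and recentre it on the current best point.
        s_w, p_w = (s_hi - s_lo) / 4.0, (p_hi - p_lo) / 4.0
        s_lo, s_hi = max(1e-6, best[0] - s_w), best[0] + s_w
        p_lo, p_hi = max(1e-6, best[1] - p_w), best[1] + p_w
    return best

# Synthetic, noise-free observations from s = 0.003 and p = 0.01 per hour.
t_e = 24.0
times = [6.0, 12.0, 24.0, 36.0, 48.0, 72.0]
observed = [predict(t, 0.003, 0.01, t_e) for t in times]
s_hat, p_hat = grid_search(times, observed, t_e)
print(s_hat, p_hat)
```

A grid search like this behaves like a coarse global method at first and a local one at the end; a gradient-based local optimizer would converge faster but depends on its starting point.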
Local and Global Optimization
Global optimization attempts to avoid local minima
Local optimization techniques find optimal fit around given starting point
[Figure: Error in Fit versus Parameter Value, with a local and a global minimum marked]
[Figure: model fit to data; x-axis: Time (hours), 0 to 80; y-axis: Fraction Labeled, 0 to 0.4]
Optimal Parameter Estimates
Is inflow necessary to fit the data? Can we use simpler model?
Least-squares fit using lsqnonlin in MATLAB
Parameter estimates
s = 0.0017 per hour
p = 0.0099 per hour
[Figure: two panels of Objective Function (×10^-3) versus s and versus p; x-axis ticks from 0.95 to 1.05]
Plot local curvature to check minimization…
Is inflow (s) significant?
Model                 Observations   Parameters   RSS        F test (1-fcdf in MATLAB)
(1) No flow           6              1            9.38e-7
(2) Including flow    6              2            0.95e-7    53.1 (p<0.0004)
Inflow is important to explain observations
[Diagram: model (1) with proliferation p and death d; model (2) with inflow s, proliferation p, and death d]
$$F = \frac{(RSS_{smaller} - RSS_{larger}) / (df_{smaller} - df_{larger})}{RSS_{larger} / df_{larger}}$$
Numerator: reduction in RSS per extra parameter
Denominator: measure of ‘noise’ in model
Bootstrapping Parameter Confidence Intervals
Bootstrapping observations also possible (asymptotically equivalent)
1) Fit model to data to obtain parameter estimates
2) Draw a bootstrap sample of the residuals
3) Create bootstrap sample of observations by adding randomly sampled residual to predicted value of each observation
4) Estimate parameters for bootstrap samples
Repeat 1000x
[Figure: fitted curve with residuals r_i; x-axis: Time (hours), 0 to 80; y-axis: Fraction Labeled, 0 to 0.4]
Bootstrapping Parameter Confidence Intervals
May not have correct coverage when sampling distribution skewed
Percentile Method
Contains 95% of the estimates
Calculate the parameter for each bootstrap sample and select α (e.g., 0.05)
LCL = (α/2)th percentile.
UCL = (1−α/2)th percentile.
Parameter estimates for synthetic data
Estimate of s = 0.0017 [0.0009, 0.0030]
Estimate of p = 0.0099 [0.0095, 0.0100]
[Figure: histogram of bootstrap estimates; x-axis: proliferation rate (p), 0.009 to 0.012; y-axis: Bootstrap runs, 0 to 300]
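The residual bootstrap with percentile limits can be sketched end to end on a deliberately simple case. The example assumes the one-parameter "no flow" labeling curve with d = p, so that f(t) = 1 − exp(−2pt), fits p by golden-section search (a substitute for the optimizer used in the slides), and uses invented observation times and noise values.

```python
import math
import random

def predict(t, p):
    """One-parameter labeling curve (no inflow, d = p): 1 - exp(-2 p t)."""
    return 1.0 - math.exp(-2.0 * p * t)

def fit_p(times, observed, lo=1e-4, hi=0.1, iters=60):
    """Least-squares estimate of p by golden-section search on [lo, hi]."""
    def sse(p):
        return sum((y - predict(t, p)) ** 2 for t, y in zip(times, observed))
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if sse(c) < sse(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0

def residual_bootstrap_ci(times, observed, n_boot=1000, alpha=0.05, seed=1):
    """Percentile confidence interval for p from resampled residuals."""
    rng = random.Random(seed)
    p_hat = fit_p(times, observed)                           # step 1
    fitted = [predict(t, p_hat) for t in times]
    residuals = [y - f for y, f in zip(observed, fitted)]
    estimates = []
    for _ in range(n_boot):
        resampled = [f + rng.choice(residuals) for f in fitted]  # steps 2-3
        estimates.append(fit_p(times, resampled))                # step 4
    estimates.sort()
    lower = estimates[int((alpha / 2.0) * n_boot)]
    upper = estimates[int((1.0 - alpha / 2.0) * n_boot) - 1]
    return p_hat, lower, upper

# Hypothetical data: true p = 0.01 per hour plus small fixed "noise" values.
times = [6.0, 12.0, 24.0, 36.0, 48.0, 72.0]
noise = [0.010, -0.012, 0.008, -0.009, 0.011, -0.007]
observed = [predict(t, 0.01) + e for t, e in zip(times, noise)]
p_hat, lcl, ucl = residual_bootstrap_ci(times, observed)
print(p_hat, lcl, ucl)
```

With only six observations the pool of residuals is small, so the interval is best read as a rough indication of precision rather than an exact 95% limit.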
Viral Dynamics
Hepatitis C Viral Dynamics and Interferon-α Therapy
Short delay followed by biphasic decline in viral load
Modeling 23 patients during 14 days of therapy (daily doses)
Model of Hepatitis C Viral Dynamics
Before therapy, virus load is approximately constant
Includes virus along with target (T) and infected (I) cells
Target Cells
Infected Cells
Virus (HCV RNA)
Model of Interferon-α Therapy
Therapy can reduce the rate of infection, or production of virions
Includes virus along with target (T) and infected (I) cells
Target Cells
Infected Cells
Virus (HCV RNA)
[Diagram: X marks on the infection and virion production steps indicate where therapy can act]
Hepatitis C Viral Dynamics and Interferon-α Therapy
Average virion production rate of 1.3 × 10^12 virions per day
Modeling 23 patients during 14 days of therapy (daily doses)
Hepatitis C Viral Dynamics and Interferon-α Therapy
Patients with undetectable HCV after 3 months of therapy (filled symbols) had significantly faster cell death rates
Modeling 23 patients during 14 days of therapy (daily doses)
Suggests immune control has
important role in lowering viral load
HCV Viral Kinetics : Summary
• Biphasic clearance of serum HCV RNA
• 1st phase rapid; depends on IFN-α dose
– This appears to be due to dose-dependent efficacy in blocking HCV production
– Possible to estimate efficacy from measurements of HCV RNA decline over first few days of therapy!
• 2nd phase slower. Slope appears to be a measure of rate of infected cell loss
Slide from Alan Perelson
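The biphasic decline can be reproduced with a small simulation. The model equations on the slides were images and did not survive in this text, so the sketch below assumes a commonly used form in which therapy blocks a fraction (the efficacy) of virion production, target cells stay constant over the two weeks, and infected cells I and virus V are scaled to 1 at the pre-treatment steady state. The clearance rate c = 6 per day, infected cell loss rate δ = 0.14 per day, and efficacy 0.9 are illustrative values, not estimates from the slides.

```python
import math

def simulate_viral_load(days, efficacy, c=6.0, delta=0.14, dt=0.001):
    """Euler simulation of scaled virus V under therapy; returns (time, V).

    Assumed scaled equations (constant target cells):
        dI/dt = delta * (V - I)
        dV/dt = c * ((1 - efficacy) * I - V)
    """
    infected, virus = 1.0, 1.0  # pre-treatment steady state
    trajectory = [(0.0, virus)]
    for step in range(1, int(round(days / dt)) + 1):
        d_infected = delta * (virus - infected)
        d_virus = c * ((1.0 - efficacy) * infected - virus)
        infected += d_infected * dt
        virus += d_virus * dt
        trajectory.append((step * dt, virus))
    return trajectory

def load_at(trajectory, day, dt=0.001):
    """Look up the simulated viral load at a given day."""
    return trajectory[int(round(day / dt))][1]

trajectory = simulate_viral_load(days=14.0, efficacy=0.9)
v1, v7, v14 = (load_at(trajectory, day) for day in (1.0, 7.0, 14.0))
second_phase_slope = math.log(v7 / v14) / 7.0  # per day
print(v1, v14, second_phase_slope)
```

In this sketch the first-phase drop settles near 1 minus the efficacy within about a day, and the second-phase slope is close to the efficacy times the infected cell loss rate, which matches the qualitative reading in the summary above.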