Statistical Analysis of cDNA Microarray Genomics Data
Yuehua Cui
Graduate student
Department of Statistics
December 4th, 2002
Outline of the topics
• Introduction
• Data preprocessing
  – Alignment
  – Background calculation
  – Data transformation
• An example
• Normalization Comparison
• Post Hoc Analysis
Introduction
• New technique introduced in 1995 by Schena.
• Quantitatively monitors expression levels for thousands of genes at a time.
• All the methods and applications here are based on nylon membrane microarrays and can be extended to DNA microarray analysis on other platforms.
• Why normalization:
  – A number of systematic variations can occur during experiments. For example, the samples being compared are hybridized on different nylon membranes. Normalization is needed to remove these sources of variation.
  – Well-normalized data are the foundation of good analysis results.
• Statistical analysis
AtlasImage Data Preprocessing
• Alignment: each gene is represented by two spots. Match these two spots to a schematic representation of the array. The final intensity for the gene is the average of the intensities of the two spots.
• Background calculation
  – external (global): median intensity of the blank space between different panels
  – user-defined external: median intensity of a user-defined area
  – local: median intensity of the space surrounding the gene spot
• Data transformation:
  – Adjusted intensity = raw intensity - background value
  – Log transform: log2, log10 or natural log.
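The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration, not AtlasImage's implementation: the function name and the `floor` guard (which keeps the adjusted intensity positive so the logarithm is defined) are assumptions that do not come from the slides.

```python
import math

def adjusted_log2_intensity(spot_a, spot_b, background, floor=1.0):
    """Average two replicate spots, subtract background, take log2.

    `floor` is a hypothetical guard (not from the slides) that keeps the
    adjusted intensity positive so that the logarithm is defined.
    """
    raw = (spot_a + spot_b) / 2.0            # final intensity = mean of the two spots
    adjusted = max(raw - background, floor)  # adjusted = raw - background
    return math.log2(adjusted)

# Two spots of 70 and 74 with background 8: mean 72, adjusted 64, log2 = 6.0
print(adjusted_log2_intensity(70.0, 74.0, 8.0))
```

Any of the three background definitions (external, user-defined external, local) can be passed as `background`; only the value changes, not the transformation.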
[Figure: part of an Atlas nylon membrane array]
*Note: the two spots above or below the white bar represent one gene, i.e. one gene has two spots.
An example: RL95 cell line data set
• Each Clontech Stress array contains 234 sequences expressed in response to stress.
• Each insert cDNA is denatured and UV cross-linked to a positively charged membrane.
• Samples are treated with DMSO alone or with BaP (benzo(a)pyrene) dissolved in DMSO, so DMSO is the control and BaP is the treatment.
• DMSO- and BaP-treated samples are hybridized under the same conditions each time. Two membranes are each used three times, one for the DMSO-treated samples and one for the BaP-treated samples.
• Three biological replicates are done with the same membrane(s), so the replicates are correlated.
Experimental workflow (figure)
• Control RNA sample and test RNA sample
• Reverse transcription with 33P-dCTP gives radio-labelled cDNA probes
• Hybridization to microarray filters
• Use a Phosphor Imager laser scanner to obtain the density of each spot on the filter.
• Compare densities at each spot to determine if treatment changes gene expression. Compile the subset of differentially expressed genes.

Gene   Control   Test
A      1X        3X
:      :         :
Z      1X        0.5X
Scatter plots of adjusted log intensities for paired experiments of DMSO vs BaP
[Figure: three scatter plots of DMSO vs BaP, one per replicate: log(dmso1) vs log(bap1), log(dmso2) vs log(bap2), log(dmso3) vs log(bap3)]
Normalization
• Global normalization (AtlasImageTM)
  – Assumption: given a large enough sample size, the average signal intensity (gene expression level) does not change.
  – Sum method: normalization coefficient
    $$k_j = \frac{\sum_{i=1}^{n} (I_{1i} - B_1)}{\sum_{i=1}^{n} (I_{2i} - B_2)}$$
    where $I_{mi}$ = intensity of gene i on Array m, m = 1, 2; $B_m$ = background intensity on Array m, m = 1, 2; n = number of genes on the array.
  – Problem: the assumption may not be valid; stronger signals dominate the summation.
  – Median method (robust with respect to outliers): normalization coefficient
    $$k_j = \frac{\mathrm{median}_i\,(I_{1i} - B_1)}{\mathrm{median}_i\,(I_{2i} - B_2)}$$
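A minimal Python sketch of the two global coefficients, assuming each is the ratio of Array 1 to Array 2 computed on background-subtracted intensities. The function names and the toy intensities are made up for illustration; they show how a single strong spot dominates the sum method while the median method is unaffected.

```python
from statistics import median

def sum_coefficient(array1, array2, bg1, bg2):
    """Sum method: ratio of total background-subtracted intensity."""
    return sum(x - bg1 for x in array1) / sum(x - bg2 for x in array2)

def median_coefficient(array1, array2, bg1, bg2):
    """Median method: ratio of median background-subtracted intensity."""
    return median([x - bg1 for x in array1]) / median([x - bg2 for x in array2])

# Hypothetical intensities; the last gene is a very strong spot on Array 1.
array1 = [12, 22, 42, 1002]
array2 = [7, 12, 22, 202]
print(sum_coefficient(array1, array2, 2, 2))     # pulled upward by the strong spot
print(median_coefficient(array1, array2, 2, 2))  # 30 / 15 = 2.0
```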
Normalization continued
• Housekeeping gene normalization
  – Housekeeping genes are a set of genes whose expression levels are not affected by the treatment.
  – The normalization coefficient is the ratio $m_C/m_T$, where $m_C$ and $m_T$ are the mean intensities of the selected housekeeping genes for control and treatment respectively.
  – Problem: housekeeping genes sometimes change their expression level, in which case the assumption does not hold.
• Trimmed mean normalization (adjusted global method)
  – Trim off the 5% highest and lowest extreme values, then globally normalize the data. The normalization coefficient is
    $$k_i = \frac{m_{C_i}}{m_{T_i}}$$
    where $m_{T_i}$ and $m_{C_i}$ are the trimmed means for the ith treatment and control respectively.
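The trimmed mean coefficient could be computed as in the sketch below. It assumes that 5% of the values are removed from each tail and that the coefficient is the control trimmed mean divided by the treatment trimmed mean; the function names are hypothetical.

```python
def trimmed_mean(values, trim=0.05):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim)           # number of values removed per tail
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

def trimmed_mean_coefficient(control, treatment, trim=0.05):
    """k = trimmed mean of control / trimmed mean of treatment."""
    return trimmed_mean(control, trim) / trimmed_mean(treatment, trim)

control = list(range(1, 21))                 # 20 hypothetical intensities
treatment = [2 * x for x in control]         # treatment array twice as bright
print(trimmed_mean_coefficient(control, treatment))  # 0.5
```

Multiplying the treatment intensities by the coefficient brings the two arrays onto the same scale.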
Normalization continued
• Regression normalization:
  – Fit the linear regression model
    $$y_i = \alpha + \beta x_i + \varepsilon_i$$
  – Assumption: all the genes on the array have the same variance (homogeneity).
  – Test the significance of the intercept $\alpha$. Fit a linear regression without $\alpha$ if it is not significant.
  – Transform the treatment data:
    $$y_i' = \frac{y_i - \hat{\alpha}}{\hat{\beta}}$$
  – Problems:
    • the assumption may not hold
    • nonlinear trend (the third replicate of the RL95 data has a slight quadratic trend).
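A sketch of regression normalization with ordinary least squares, assuming x is the control log intensity, y is the treatment log intensity, and the transformed value is (y - intercept) / slope. The significance test on the intercept is omitted here, and the function names are illustrative only.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = alpha + beta * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def regression_normalize(control, treatment):
    """Map treatment values back onto the control scale."""
    alpha, beta = fit_line(control, treatment)
    return [(yi - alpha) / beta for yi in treatment]

control = [2.0, 3.0, 4.0, 5.0, 6.0]              # hypothetical log intensities
treatment = [0.5 + 1.2 * x for x in control]     # systematic shift and scale
print(regression_normalize(control, treatment))  # approximately equal to control
```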
Scatter plot of log intensity before and after regression normalization
[Figure: scatter plots of log(dmso) vs log(bap) for the three replicates, before normalization ("scatter plot of DMSO vs BAP") and after normalization ("scatter plot after norm")]
Normalization continued
• Rank normalization (this method assumes only a small number of genes will be differentially expressed):
  – $R_{Cj} \pm c$ criterion, j = 1, …, g, where c = g × 10%, g is the total number of genes and $R_{Cj}$ is the rank of gene j in the control.
  – Choose a set of genes which have a similar expression pattern, i.e. $R_{Tj} \in (R_{Cj} \pm c)$.
  – Normalization coefficient:
    $$k_i = \frac{m_{C_i}}{m_{T_i}}$$
    where $m_{T_i}$ and $m_{C_i}$ are the means of the selected genes for the ith treatment and control respectively.
  – Question: how to choose c?
  – Rank invariant genes (Eric Schadt, 2001, Journal of Cellular Biochemistry (supplement) 37:120-125)
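One way to read the rank criterion is sketched below: a gene is kept when its treatment rank lies within c of its control rank, with c equal to 10% of the number of genes, and the coefficient is the ratio of the means of the kept genes. The rank helper ignores ties, and all names and data are assumptions for illustration.

```python
def ranks(values):
    """Rank from 1 (smallest) to g (largest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def rank_coefficient(control, treatment, fraction=0.10):
    """Return (k, kept) where kept lists genes with |R_T - R_C| <= c."""
    g = len(control)
    c = g * fraction
    r_c, r_t = ranks(control), ranks(treatment)
    kept = [j for j in range(g) if abs(r_t[j] - r_c[j]) <= c]
    m_c = sum(control[j] for j in kept) / len(kept)
    m_t = sum(treatment[j] for j in kept) / len(kept)
    return m_c / m_t, kept

control = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
treatment = [500] + [2 * x for x in control[1:]]  # gene 0 is strongly induced
print(rank_coefficient(control, treatment))       # gene 0 is excluded
```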
Normalization continued
• Intensity-dependent normalization (Yang, YH, 2002)
  – Do an M-A plot to check the data distribution, where
    $$M = \log_2(T/C) \quad \text{and} \quad A = \log_2\sqrt{T \cdot C}$$
  – Use the lowess function in R to perform the normalization:
    $$\log_2(T/C) \rightarrow \log_2(T/C) - c(A) = \log_2\big(T/(kC)\big)$$
    where c(A) is the lowess fit to the M-A plot.
  – Transform the data by M' = M - c(A).
  – The method is locally nonparametric and is robust to a small number of differentially expressed genes.
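The slides use the lowess function in R. As a rough stand-in, the Python sketch below fits a tricube-weighted local linear regression at each point, which is the core of lowess without its robustness iterations, and then subtracts the fitted curve from M. The function names and the span argument `f` are chosen here for illustration, and the results will not match R's lowess exactly.

```python
import math

def local_linear_fit(a, m, f=0.3):
    """Tricube-weighted local linear fit c(A) at each A (no robustness step)."""
    n = len(a)
    k = max(2, math.ceil(f * n))              # neighbours used per local fit
    fitted = []
    for a0 in a:
        dist = sorted(abs(ai - a0) for ai in a)
        h = dist[k - 1] or 1e-12              # bandwidth: k-th smallest distance
        w = [(1 - min(abs(ai - a0) / h, 1.0) ** 3) ** 3 for ai in a]
        sw = sum(w)
        mean_a = sum(wi * ai for wi, ai in zip(w, a)) / sw
        mean_m = sum(wi * mi for wi, mi in zip(w, m)) / sw
        saa = sum(wi * (ai - mean_a) ** 2 for wi, ai in zip(w, a))
        sam = sum(wi * (ai - mean_a) * (mi - mean_m) for wi, ai, mi in zip(w, a, m))
        slope = sam / saa if saa > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

def ma_normalize(treatment, control, f=0.3):
    """Return (A, M') with M' = M - c(A)."""
    m = [math.log2(t / c) for t, c in zip(treatment, control)]
    a = [0.5 * math.log2(t * c) for t, c in zip(treatment, control)]
    c_of_a = local_linear_fit(a, m, f)
    return a, [mi - ci for mi, ci in zip(m, c_of_a)]
```

Because each fit is local, a smooth intensity-dependent trend in M is removed while isolated genes that deviate strongly from their neighbours keep most of their deviation.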
M-A plot of DMSO vs BaP (before and after intensity-dependent normalization, f=0.3)
[Figure: M-A plots for the three replicate pairs, each shown before normalization (M vs A, "M-A plot") and after normalization (M' vs A, "M-A plot after Lowess norm")]
Conclusion
• Global or local, parametric or nonparametric method
• There is no unique normalization method for a given data set. The choice depends on what kind of experiment you have and what the data look like.
• There are no absolute criteria for normalization. Basically, the normalized log ratio should be centered around 0. Combine with post hoc analysis to choose the best method.
Post Hoc Analysis
• Before analysis
  – Data adjustment: for paired normalization, truncate big ratios first, using quantile criteria (1% or 5%, 95% or 99% quantiles).
  – Parametric tests assume that the data follow a certain distribution.
  – Non-parametric tests do not make such assumptions.
  – Check the validity of the assumptions made for a parametric test and make sure to use the right test.
AtlasImageTM 2-fold criteria
• AtlasImageTM software: reports genes with a 2-fold change as up- or down-regulated genes.
• Fails to account for sample variation.
• Low intensities tend to have higher ratios.
• Ignores the fact that a difference of less than 2-fold can also elicit meaningful biological effects.
One sample t-test
• Obtain the normalized log ratio for each pair (control vs treatment). Calculate the mean and SD for each gene.
• Hypothesis: under the null hypothesis, i.e. there is no expression difference, the mean of the log ratio for gene i is 0:
  $$T_i/C_i = 1 \;\Rightarrow\; \log(T_i/C_i) = 0$$
  $$H_0: \mu_i = 0 \quad \text{vs} \quad H_1: \mu_i \neq 0$$
• The test statistic is
  $$t_i = \frac{m_i}{sd_i/\sqrt{n}}$$
  where $m_i$ and $sd_i$ are the mean and standard deviation of the log ratio for gene i and n is the number of replicate pairs.
• Problems: small sample size; normality assumption; multiple test adjustment.
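A small sketch of the per-gene statistic, using the standard one-sample form in which the mean log ratio is divided by its standard error sd/sqrt(n). The function name and the example values are assumptions. A p-value would additionally need the t distribution with n - 1 degrees of freedom, which is not in the Python standard library and is left out here.

```python
import math
from statistics import mean, stdev

def one_sample_t(log_ratios):
    """t statistic for H0: mean log ratio = 0."""
    n = len(log_ratios)
    m = mean(log_ratios)
    sd = stdev(log_ratios)           # sample standard deviation (n - 1 divisor)
    return m / (sd / math.sqrt(n))

# Hypothetical log ratios of one gene in three paired replicates
print(one_sample_t([0.9, 1.1, 1.0]))
```

With only three replicates the standard deviation is estimated from two degrees of freedom, which is the small sample size problem noted above.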
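Two sample t-test
• Obtain the normalized log intensities.
• Let the sample means and variances of the $Y_{ij}$'s for gene i under the two conditions be $\bar{Y}_{i(1)}, \bar{Y}_{i(2)}, S^2_{i(1)}, S^2_{i(2)}$. With k replicates per condition, the test statistic is
  $$Z_i = \frac{\bar{Y}_{i(1)} - \bar{Y}_{i(2)}}{\sqrt{S^2_{i(1)}/k + S^2_{i(2)}/k}}$$
  with df $d_i$ (given below) if unequal variance is assumed, and
  $$Z_i = \frac{\bar{Y}_{i(1)} - \bar{Y}_{i(2)}}{S_p\sqrt{1/n_1 + 1/n_2}}$$
  with df $d_i = 2(n-1)$ if equal variance is assumed, where $S_p$ is the pooled standard deviation.
• Under the normality assumption for $Y_{ij}$, $Z_i$ approximately has a t-distribution with $d_i$ degrees of freedom:
  $$d_i = \frac{\left(S^2_{i(1)}/k + S^2_{i(2)}/k\right)^2}{\left(S^2_{i(1)}/k\right)^2/(k-1) + \left(S^2_{i(2)}/k\right)^2/(k-1)}$$
• Problems: small sample size; normality assumption; multiple test adjustment.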
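Both versions of the two-sample statistic can be written compactly. The sketch below uses the standard Welch (unequal variance) and pooled (equal variance) forms and allows different group sizes; the function names and example data are illustrative assumptions.

```python
import math
from statistics import mean, variance

def welch_t(y1, y2):
    """Unequal-variance statistic and its approximate degrees of freedom."""
    k1, k2 = len(y1), len(y2)
    v1, v2 = variance(y1) / k1, variance(y2) / k2
    z = (mean(y1) - mean(y2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (k1 - 1) + v2 ** 2 / (k2 - 1))
    return z, df

def pooled_t(y1, y2):
    """Equal-variance statistic with df = n1 + n2 - 2."""
    n1, n2 = len(y1), len(y2)
    sp2 = ((n1 - 1) * variance(y1) + (n2 - 1) * variance(y2)) / (n1 + n2 - 2)
    z = (mean(y1) - mean(y2)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return z, n1 + n2 - 2

# Hypothetical log intensities for one gene, three replicates per condition
print(welch_t([5.0, 6.0, 7.0], [3.0, 4.0, 5.0]))
print(pooled_t([5.0, 6.0, 7.0], [3.0, 4.0, 5.0]))
```

When both groups have the same size and the same sample variance, as in this example, the two statistics and their degrees of freedom coincide.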
Multiple test adjustment
• Hundreds of genes are tested at the same time. Assume 1000 genes are not differentially expressed. A p-value cutoff of 0.01 (false positive rate) means that around 10 genes will nevertheless be called significant.
• Bonferroni correction: we want to make sure that P[at least 1 gene significant from 1000] ≤ 0.05. Consequently, the p-value required for a single gene to be declared significant is P[single gene] ≤ 0.05/1000 = 0.00005.
• Conservative, with lower power.
• Alternatively, keep the family-wise error rate (FWR) manageable and try some p-value, say 0.001, as the significance level.
• Westfall and Young’s step-down adjusted p-value.
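The Bonferroni rule is short enough to show directly. The sketch below is illustrative only: the function name and the p-values are made up, and the threshold is simply the overall level divided by the number of genes tested.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag genes whose p-value is at most alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# 1000 hypothetical p-values; the per-gene threshold is 0.05 / 1000 = 0.00005
p_values = [0.00001, 0.0001, 0.04] + [0.5] * 997
flags = bonferroni_significant(p_values)
print(sum(flags))  # 1: only the first gene passes
```

The second gene, with p = 0.0001, would pass the looser 0.001 level mentioned above but not the Bonferroni threshold, which illustrates the loss of power.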
Predictive Interval (PI) method
• Use the normalization methods discussed above to normalize the data.
• Obtain the average log ratio (ALR), which is centered around zero.
• Use the normal approximation method:
  – Step I: if the maximum or minimum value of ALR is greater than mean + 3*sd or less than mean - 3*sd, treat it as an outlier, delete it from ALR and take it as a differentially expressed gene.
  – Step II: calculate the mean and sd for the remaining genes and repeat Step I.
  – Repeat the above steps until no more outliers exist. Then calculate the 95% predictive interval for the remaining genes. Values outside the PI are significant.
  – The final set of differentially expressed genes includes the outliers detected in Steps I and II and the genes outside the PI.
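The iterative procedure above can be sketched as follows. The slides do not give the exact form of the 95% predictive interval, so this sketch assumes mean ± 1.96·sd under the normal approximation; the function name and the example data are likewise hypothetical.

```python
from statistics import mean, stdev

def predictive_interval_genes(alr, z=1.96):
    """Return (indices of significant genes, (lower, upper) PI bounds)."""
    remaining = dict(enumerate(alr))          # gene index -> average log ratio
    outliers = []
    while len(remaining) > 2:
        values = list(remaining.values())
        m, sd = mean(values), stdev(values)
        # Most extreme remaining gene (largest distance from the mean)
        idx = max(remaining, key=lambda i: abs(remaining[i] - m))
        if abs(remaining[idx] - m) <= 3 * sd:
            break                             # no more 3-sd outliers
        outliers.append(idx)                  # Steps I and II: remove, repeat
        del remaining[idx]
    values = list(remaining.values())
    m, sd = mean(values), stdev(values)
    lower, upper = m - z * sd, m + z * sd     # assumed 95% predictive interval
    outside = [i for i, v in remaining.items() if v < lower or v > upper]
    return sorted(outliers + outside), (lower, upper)

alr = [0.0, 0.1, -0.1, 0.05, -0.05] * 4 + [5.0]   # gene 20 is an obvious outlier
print(predictive_interval_genes(alr))
```

Removing extreme genes before estimating the interval keeps a few very large ratios from inflating the standard deviation and hiding moderate changes.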
Yidong’s algorithm
• Assumptions:
  – There is a constant coefficient of variation c for the entire gene set:
    $$\sigma_{T_k} = c\,\mu_{T_k}, \qquad \sigma_{C_k} = c\,\mu_{C_k}$$
  – The observed differential expression, $R_k = T_k/C_k$ (ratio of treatment and control intensity at gene k), has a sampling distribution dependent only on c. $R_k$ is approximately normally distributed.
  – Assume $\mu_{T_k} = m\,\mu_{C_k}$.
  – The density function of R becomes:
    $$f_R(r; c, m) = \frac{1}{m}\, f_R(r/m; c, 1)$$
• Use the maximum likelihood method to estimate the constant c,
  $$\hat{c} = \left[\frac{1}{n}\sum_{i=1}^{n}\frac{(R_i - 1)^2}{1 + R_i^2}\right]^{1/2}$$
  and use the EM algorithm to get the final estimates of c and m.
  [Equation: iterative update for $\hat{m}$; not legible in the source]
• Use the polynomial $y = a_3 c^3 + a_2 c^2 + a_1 c + a_0$ to get the CI.
• Measurement error depends on signal strength.
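As an illustration of the first estimation step, the sketch below computes the coefficient of variation from a list of ratios, assuming the estimator has the form c = sqrt(mean((R - 1)^2 / (1 + R^2))) with the ratios already centered so that m = 1. The iterative refinement of c and m and the polynomial for the confidence interval are not reproduced; the function name is hypothetical.

```python
import math

def estimate_cv(ratios):
    """Estimate c as sqrt(mean((R - 1)^2 / (1 + R^2))), assuming m = 1."""
    n = len(ratios)
    return math.sqrt(sum((r - 1) ** 2 / (1 + r ** 2) for r in ratios) / n)

# A 2-fold increase and a 2-fold decrease contribute equally to the estimate
print(estimate_cv([2.0, 0.5]))
print(estimate_cv([1.0, 1.0, 1.0]))  # 0.0: no variation around a ratio of 1
```

The term (R - 1)^2 / (1 + R^2) takes the same value for R and 1/R, so up- and down-regulation of the same fold size are treated symmetrically.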
Significant genes list of BaP/DMSO

Significant genes using PI:
Gene   dmso      bap       ratio
5H     27.693    44.965    1.62371
7L     42.753    84.959    1.9872
8B     58.043    94.026    1.61993
8F     32.951    57.004    1.72997
8I     50.003    102.417   2.04822
9C     124.219   216.932   1.74637
11O    12.758    19.051    1.49324
18E    53.169    131.328   2.46998
20C    106.946   549.492   5.13801
22H    127.946   66.701    0.52132
23J    31.097    48.815    1.56978
99% CI (0.581621, 1.681510)
95% CI (0.660326, 1.481089)

Significant genes using Yidong’s algorithm:
Gene   dmso      bap       ratio
7L     42.753    84.959    1.9872
8F     32.951    57.004    1.72997
8I     50.003    102.417   2.04822
9C     124.219   216.932   1.74637
18E    53.169    131.328   2.46998
20C    106.946   549.492   5.13801
22H    127.946   66.701    0.52132
99% CI (0.48995, 1.68639)
95% CI (0.55649, 1.68269)
Permutation test
• For gene i in each paired experiment, permute the data within each pair to get the permuted sample. Under the assumption that genes do not change their expression pattern under the two conditions of study, we can permute the data as follows:

Gene   T1    C1    T2    C2    T3    C3
1      X1j   Y1j   X2j   Y2j   X3j   Y3j
.      .     .     .     .     .     .
.      .     .     .     .     .     .
.      .     .     .     .     .     .
g      X1g   Y1g   X2g   Y2g   X3g   Y3g
Permutation test continued
• Get the normalized average log ratio for the original (ALR) and permuted data (ALR*).
• Calculate the p-value for gene i:
  $$p\text{-value}_i = \frac{\#\{j : |ALR^*_j| \ge |ALR_i|\}}{g}$$
  where g is the total number of genes.
• Permute the data n times and obtain n p-values for each gene. Then get the mean and sd of the p-values for each gene and calculate the 95% CI.
• If the lower bound is less than 0.05, claim the gene as significant.
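A sketch of the permutation scheme, assuming the input is already normalized log intensities stored as (treatment, control) pairs per gene, that a permutation swaps T and C within a pair with probability 1/2, and that the 95% CI is the mean p-value ± 1.96·sd/sqrt(n). The slides do not spell out these details, so they are assumptions, as are the function names.

```python
import math
import random

def average_log_ratio(pairs):
    """Mean of log(T) - log(C) over the replicate pairs of one gene."""
    return sum(t - c for t, c in pairs) / len(pairs)

def permuted_p_values(data, rng):
    """One permutation: swap T and C within each pair with probability 1/2."""
    alr = [average_log_ratio(gene) for gene in data]
    alr_star = [
        average_log_ratio([(c, t) if rng.random() < 0.5 else (t, c) for t, c in gene])
        for gene in data
    ]
    g = len(data)
    # p-value_i = #{j : |ALR*_j| >= |ALR_i|} / g
    return [sum(abs(s) >= abs(a) for s in alr_star) / g for a in alr]

def p_value_intervals(data, n_perm=200, seed=0):
    """Per gene: (lower bound, mean p-value, upper bound) over n_perm runs."""
    rng = random.Random(seed)
    runs = [permuted_p_values(data, rng) for _ in range(n_perm)]
    result = []
    for i in range(len(data)):
        ps = [run[i] for run in runs]
        m = sum(ps) / n_perm
        sd = math.sqrt(sum((p - m) ** 2 for p in ps) / (n_perm - 1))
        half = 1.96 * sd / math.sqrt(n_perm)  # assumed CI for the mean p-value
        result.append((m - half, m, m + half))
    return result
```

A gene would then be reported when the first element of its tuple is below 0.05, following the rule stated above.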
List of significant genes picked up by permutation test
Gene   LB.95       P.mean     UB.95      dmso1   dmso2   dmso3   bap1   bap2   bap3
6I     0.005937    0.00812    0.010302   18      20      15      28     39     87
14K    0.034691    0.040599   0.046506   50      124     7       48     126    127
18E    0.015475    0.01859    0.021705   43      72      38      85     280    75
20C    -4.51E-05   0.000641   0.001327   66      241     36      218    1413   200
22H    3.84E-02    0.044445   0.050442   55      107     224     43     86     101
Significance Analysis of Microarrays (SAM)
• Limitations of parametric tests:
  – Estimation of variance: limited sample size (= few replicates)
  – Normal distribution assumption: error model still not clear
  – Multiple testing
• Excel add-in that performs a robust method for differential analysis of microarray data. Method developed and implemented by the Tibshirani group at Stanford (free for academic use).
• Permutation technique: assuming no difference between conditions, all genes are from the same population.
• False discovery rate: number of falsely called genes divided by the number of differential genes in the original data.
• Needs a large number of replicates.
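SAM test statistic
$$d_i = \frac{r_i}{s_i + s_0}$$
• $d_i$ = score
• $s_i$ = standard deviation
• $s_0$ = fudge factor
$$r_i = \bar{x}_{i2} - \bar{x}_{i1}$$
$$s_i = \left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\frac{\sum_{j \in C_1}(x_{ij} - \bar{x}_{i1})^2 + \sum_{j \in C_2}(x_{ij} - \bar{x}_{i2})^2}{n_1 + n_2 - 2}\right]^{1/2}$$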
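The score for a single gene can be computed as in the sketch below. It takes r as the mean of group 2 minus the mean of group 1 (an assumption about the sign convention) and treats the fudge factor s0 as a given constant; choosing s0 and the permutation-based FDR are not covered. The function name and data are illustrative.

```python
import math

def sam_score(group1, group2, s0):
    """d = r / (s + s0) for one gene measured in two groups of samples."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    r = m2 - m1                                   # numerator
    ss = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return r / (s + s0)

# Hypothetical expression values for one gene
print(sam_score([3.0, 4.0, 5.0], [5.0, 6.0, 7.0], s0=0.0))  # ordinary t-like score
print(sam_score([3.0, 4.0, 5.0], [5.0, 6.0, 7.0], s0=0.5))  # shrunk by the fudge factor
```

A positive s0 damps the scores of genes whose standard deviation is very small, so that low-intensity genes with tiny variance do not dominate the ranking.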
The SAM process
• Perform permutations and compute the test statistics for each permutation
• Rank the test statistics in ascending order
• Compute the mean test statistic for each “rank” over all permutations
• Plot the original “ranked” test statistic versus the mean test statistic from the permutations
• Define the distance from the mean permuted value that you call significant
• Compute the false discovery rate for this value
• Iterate until you get an appropriate FDR
SAM analysis
Significant genes list
Row   Gene Name   Score(d)     Numerator(r)   Denominator(s+s0)   Fold Change   q-value (%)
201   20C         1.76581104   556            314.8694775         5.86297       8.870599
185   18E         1.64637644   95.66666667    58.10740719         2.87582       8.870599
51    6I          1.58325948   33.66666667    21.26414969         2.90566       8.870599
36    5H          1.22363840   30.66666667    25.06187011         2.31429       8.870599
124   11L         1.06476227   24.33333333    22.8533015          2.04286       8.870599
76    8F          0.96626427   41             42.43145579         2.30851       8.870599
68    7L          0.93100607   65.66666667    70.5330163          2.60163       8.870599
72    8B          0.92740987   61.66666667    66.49343301         2.10119       8.870599
79    8I          0.88594788   78.33333333    88.41754107         2.65493       8.870599
Other Methods and Software
• ANOVA
• Likelihood ratio test
• Bayesian analysis
• GeneSpring, GenePix etc.
• http://www.cs.tcd.ie/Nadia.Bolshakova/softwarelist.html
Conclusion
• Cutoff point determination: set up a critical point to eliminate genes whose intensity is less than this point.
• Statistically significant? There is no unique method to analyze data. Some methods are better for one data set but may not be good for other data sets. In practice, we have to try different approaches to see which methods work well.
• Biologically significant? For genes picked up by statistics, we have to be careful in drawing conclusions. Some genes shown to be significant may not be functionally meaningful. Conversely, genes that do not show up as significant may still be significant, especially genes at the borderline in the statistical test.
Acknowledgements
Dept. of Pharmacology & Therapeutics
  Dr. Shiverick
  Terry Medrano
  Renita Handayani
Dept. of Statistics
  Dr. Booth
Presentation download: http://www.stat.ufl.edu/~ycui