Data Analysis for Gene Chip Data
Part I: One-gene-at-a-time methods
Min-Te Chao
2002/10/28
Outline
• Simple description of gene chip data
• Earlier works
• Multiple t-tests and SAM
• Lee’s ANOVA
• Wong’s factor models
• Efron’s empirical Bayes
Remarks
• Most works are statistical analyses, not really of the machine-learning type
• Very small training samples – not to mention test samples
• Medical research needs scientific rigor whenever we can provide it
Arthritis and Rheumatism
• Guidelines for the submission and reviews of reports involving microarray technology
v.46, no. 4, 859-861
Reproducibility
• Should document the accuracy and precision of data, including run-to-run variability of each gene
• No arbitrary setting of threshold (e.g., 2-fold)
• Careful evaluation of false discovery rate
Statistical Analysis
• Statistical analysis is absolutely necessary to support claims of an increase or decrease of gene expression
• Such rigor requires multiple experiments and analysis with standard statistical instruments.
Sample Heterogeneity
• … Strongly recommends that investigators focus studies on homogeneous cell populations until other methodological and data-analysis problems can be resolved.
Independent Confirmation
• It is important that the findings be confirmed using an independent method, preferably with separate samples rather than re-testing the original mRNA.
Microarray
• Other terms:
DNA arrays
DNA chips
biochips
gene chips
• The underlying principle is the same for all microarrays, no matter how they are made
• Gene function is the key element researchers want to extract from the sequence
• DNA array is one of the most important tools
(Nature, v.416, April 2002 885-891)
2 types of microarray
• cDNA
• Oligonucleotides
• DIY type
Gene expression
• A microarray allows researchers to determine which genes are being expressed in a given cell type, at a particular time, and under particular conditions
Basic data form
• On each array there are p “spots” (p > 1000, sometimes 20000). Each spot has k probes (k = 20 or so). There are usually 2k measurements (expressions) per spot, and the k differences, or the differences of logs, are used.
• Sometimes they only give you a summary statistic per spot, e.g. the median or the mean
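As a small sketch of the per-spot summary just described, assume an Affymetrix-style layout where each of the k probe pairs gives two intensities (2k measurements per spot); all values and the PM/MM naming here are hypothetical, not from any real chip:

```python
import math

# One spot with k = 5 probe pairs; each pair gives two measurements
# (2k per spot). All intensity values are made up for illustration.
pm = [1200.0, 980.0, 1500.0, 1100.0, 1350.0]  # e.g. "perfect match" intensities
mm = [400.0, 350.0, 600.0, 420.0, 500.0]      # e.g. paired reference intensities

# Per-spot summary: the k differences of logs, then one summary statistic.
log_diffs = [math.log(p) - math.log(m) for p, m in zip(pm, mm)]
spot_expression = sum(log_diffs) / len(log_diffs)  # mean; a median is also common
```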
• Each spot corresponds to a “gene”
• For each study, we can arrange the chips so that the i-th spot represents the i-th gene (genes close in index may not be close physically at all)
• This means that when we read the i-th spot of all chips in one study, we know we get different measurements of the same i-th gene
• Data of one chip can be arranged in matrix form,
Y; X_1, X_2, …, X_p
just as in a regression setup. But in practice, n (the number of chips used) is small compared with p.
Y is the response: cell type, experimental condition, survival time, …
• For a spot with 20 probes, see Efron et al. (2001, JASA, p.1153).
Earlier works
• Cluster analysis
• Fold methods
• Multiple t with Bonferroni correction
Multiple t with Bonferroni correction
• It is too conservative
• Family-wise error rate:
Among G tests, the probability of at least one false rejection – under independence this is 1 − (1 − alpha)^G, which goes to 1 at an exponential rate in G
Sidak’s single-step adjusted p-value:
p’ = 1 − (1 − p)^G
Bonferroni’s single-step adjusted p-value:
p’ = min{Gp, 1}
All are very conservative
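The two single-step adjustments can be sketched in a few lines; the numbers below are hypothetical and only illustrate how harsh the corrections become for large G:

```python
def sidak(p, G):
    # Sidak single-step adjustment: p' = 1 - (1 - p)^G
    return 1.0 - (1.0 - p) ** G

def bonferroni(p, G):
    # Bonferroni single-step adjustment: p' = min(G * p, 1)
    return min(G * p, 1.0)

# With G = 1000 genes, a raw p-value of 0.001 is no longer impressive:
G = 1000
print(sidak(0.001, G))       # about 0.63
print(bonferroni(0.001, G))  # 1.0
```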
FDR –false discovery rate
• Roughly: among all rejected cases, how many are rejected wrongly?
(Benjamini and Hochberg, 1995, JRSSB, 289-300) – the “sequential p-method”
Sequential p-method
• Using the observed data, it estimates the rejection region so that
FDR < alpha
Order all p-values from small to large, and find the largest k such that p_(k) ≤ k·alpha/G; the first k hypotheses (those with the smallest k p-values) are rejected.
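A minimal sketch of the Benjamini–Hochberg step-up rule just described (the p-values are made up for illustration):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    # Step-up rule: order the p-values, find the largest k with
    # p_(k) <= k * alpha / G, and reject the k hypotheses with the smallest p-values.
    G = len(pvals)
    order = sorted(range(G), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / G:
            k = rank
    return set(order[:k])

# Hypothetical p-values for G = 10 genes:
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.500, 0.990]
rejected = benjamini_hochberg(pvals, alpha=0.05)  # indices of rejected hypotheses
```

Note that 0.039 is not rejected even though it is below 0.05: its rank-3 threshold is 3(0.05)/10 = 0.015, which is what controls the FDR rather than the per-test level.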
• Since we control a different definition of error, it will increase the “power”
• For modifications, see Storey (2002, JRSSB, 479-498)
• These are criteria specifically designed to handle risk assessment when G is large
Role of permutation
• For tests (multiple or not), it is important to use a null distribution
• It is generated by a well-designed permutation (of the columns of the data matrix) – columns refer to observations (arrays), not genes.
One simple example
• Let us say we look at the first gene, with n_1 arrays for treatment and n_2 arrays for control
• We use a t-statistic, t_1, say. What is the p-value corresponding to this observed t_1?
• Permute the n = n_1 + n_2 columns of the data matrix. Look at the first row (it corresponds to the first gene)
• Treat the first n_1 numbers as a fake “treatment” and the last n_2 numbers as a fake “control”, and compute a t-value; say we get s_1
• Permute again, do the same thing, and get s_2, …
• Do this B times to get s_1, s_2, …, s_B
• Treat these s’s as a sample from the null distribution of the t_1 statistic
• The p-value of the earlier t_1 is found from the ecdf of the s_j, j = 1, 2, …, B
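The whole permutation recipe above can be sketched as follows; the expression values are invented, and a pooled-variance t-statistic is assumed for simplicity:

```python
import random
from statistics import mean, stdev

random.seed(0)

def t_stat(x, y):
    # Two-sample pooled-variance t statistic (kept simple for the sketch).
    nx, ny = len(x), len(y)
    sp = (((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)) ** 0.5
    return (mean(x) - mean(y)) / (sp * (1 / nx + 1 / ny) ** 0.5)

# Hypothetical expressions of gene 1 on n1 = 4 treatment and n2 = 4 control arrays.
treatment = [2.1, 1.8, 2.5, 2.2]
control = [1.0, 1.2, 0.9, 1.1]
t1 = t_stat(treatment, control)

# Permute the n = n1 + n2 columns B times; each permutation yields a fake split.
pooled = treatment + control
B = 2000
s = []
for _ in range(B):
    random.shuffle(pooled)
    s.append(t_stat(pooled[:4], pooled[4:]))  # fake "treatment" vs. fake "control"

# Two-sided p-value of the observed t1 from the ecdf of s_1, ..., s_B.
p_value = sum(abs(sj) >= abs(t1) for sj in s) / B
```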
• Permutation plays a major role – providing a reference measure of variation in various situations
• For a well-designed microarray experiment, DOE techniques will play an important role in determining how to do proper permutations.
SAM – significance analysis of microarrays
• A standard method of microarray analysis, taught many times in Stanford short courses on data mining
• Modified multiple t-tests
• Uses permutations of certain data columns to evaluate the variation of the data in each gene
• The original paper is hard to read:
(Tusher, Tibshirani and Chu, 2001, PNAS, v.98, no.9, 5116-5121)
• But the SAM manual is a lot easier for statisticians to read (the software is free for academic use)
• D(i) = (X̄_treatment(i) − X̄_control(i)) / (s(i) + s_0), i = 1, 2, …, G
D(1) ≤ D(2) ≤ …
Used in SAM; s_0 is a carefully determined constant > 0.
• The D(i)* are computed under a certain group of permutations of the columns; the D(i)* are also ordered
• Plot D vs. D*; points that deviate from the 45-degree line by more than a threshold Delta are signals of significant expression change
• Control the value of Delta to get different FDRs.
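A rough sketch of the SAM-style procedure on simulated data; the fudge constant s_0, the threshold Delta, and the pooled-variance form are illustrative choices here, not SAM's actual tuning rules:

```python
import random
from statistics import mean, stdev

random.seed(1)

def sam_d(x, y, s0=0.2):
    # SAM-style modified t: mean difference over (gene-wise sd + fudge constant s0).
    # s0 = 0.2 is an illustrative choice, not SAM's data-driven tuning.
    nx, ny = len(x), len(y)
    sp = (((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)) ** 0.5
    si = sp * (1 / nx + 1 / ny) ** 0.5
    return (mean(x) - mean(y)) / (si + s0)

# Simulated data matrix: G genes by n1 + n2 arrays; gene 0 is truly affected.
G, n1, n2 = 100, 4, 4
data = [[random.gauss(0, 1) for _ in range(n1 + n2)] for _ in range(G)]
data[0] = [v + 3 for v in data[0][:n1]] + data[0][n1:]

# Observed ordered statistics D(1) <= D(2) <= ...
d_obs = sorted(sam_d(row[:n1], row[n1:]) for row in data)

# Expected order statistics D(i)*: average the ordered values over B column permutations.
B = 50
d_star = [0.0] * G
for _ in range(B):
    cols = list(range(n1 + n2))
    random.shuffle(cols)
    d_perm = sorted(sam_d([row[c] for c in cols[:n1]],
                          [row[c] for c in cols[n1:]]) for row in data)
    d_star = [a + b / B for a, b in zip(d_star, d_perm)]

# Plot D vs. D*: genes deviating from the 45-degree line by more than Delta are called.
delta = 1.0
called = [i for i in range(G) if abs(d_obs[i] - d_star[i]) > delta]
```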
Other model-based methods
• Wong’s model
PM_ij − MM_ij = θ_i φ_j + ε_ij
Outlier detection
Model validation
(Li and Wong, 2001, PNAS, v.98, no.1, 31-36)
Lee’s work
• ANOVA based
• Can handle unbalanced data – e.g., 7 microarray chips
(Lee et al. 2000, PNAS, v.97, 9834-9839)
Empirical Bayes
• (Efron et al. (2001) JASA, v.96, 1151-1160)
• Uses a mixture model
f(z) = p_0 f_0(z) + p_1 f_1(z)
with f_0, f_1 estimated from data.
p_1 = prior probability that a gene’s expression is affected (by a treatment)
• A key idea is to use (column-)permuted data to estimate f_0
• Uses a tricky logistic-regression method
• Eventually one finds
p_1(Z) = the a posteriori probability that a gene at expression level Z is affected
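A toy sketch of the mixture idea, with a plain kernel density estimate standing in for Efron's logistic-regression device; all data are simulated, and p_0 is simply assumed rather than estimated:

```python
import math
import random

random.seed(2)

# Scores from real data (z) and from column-permuted data (z0); here both simulated:
# 90% of genes null (N(0,1)) and 10% affected (N(3,1)).
z = [random.gauss(0, 1) for _ in range(900)] + [random.gauss(3, 1) for _ in range(100)]
z0 = [random.gauss(0, 1) for _ in range(1000)]  # permutation scores estimate f_0

def kde(sample, x, h=0.3):
    # Plain Gaussian kernel density estimate; a stand-in for Efron's
    # logistic-regression estimate of the density ratio f_0(z)/f(z).
    n = len(sample)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / (n * h * math.sqrt(2 * math.pi))

p0 = 0.9  # assumed prior probability of "unaffected"; Efron estimates a bound for it

def p1_post(x):
    # A posteriori probability that a gene with score x is affected:
    # p_1(x) = 1 - p_0 * f_0(x) / f(x), clamped at 0.
    return max(0.0, 1.0 - p0 * kde(z0, x) / kde(z, x))
```

With these simulated scores, p1_post(0.0) comes out near zero while p1_post(3.0) is well above one half, matching the intuition that only extreme scores point to affected genes.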
Part I conclusion
• Earlier methods are relatively easy to understand, but getting familiar with the biological vocabulary takes time
• More powerful data-analytic methods will continue to develop
• It is important to first understand the biologists’ basic problems before we jump in with fancy statistical methods
• We may be solving the wrong problem …
• But if the problem is relevant, even simple methods can earn good recognition
• All methods so far are “first moment only” – i.e., not too different from multiple t-tests; they are all one-gene-at-a-time methods
• We did not address issues of data cleaning, outlier detection, normalization, etc. Microarray data are highly noisy, and these problems are by no means trivial.
• As the cost per chip goes down, the number of chips per problem may grow. But well-designed experiments, e.g., fractional factorials, still have room to play in this game
• Statistical methods, as compared with machine-learning-based methods, will play a more important role for this type of data since, with a model, parametric or not, one can attach a measure of confidence to the claimed result. This is crucial for scientific development.
Quote:
• The statistical literature for microarrays, still in its infancy and with much of it unpublished, has tended to focus on frequentist data-analytical devices, such as cluster analysis, bootstrapping and linear models. (Efron, B. 2001)