Use of the Half-Normal Probability Plot to Identify Significant Effects for Microarray Data
C. F. Jeff Wu, University of Michigan
(joint work with G. Dyson)
Outline
• Current Methods
• Proposed Methodology
• Analysis Plan
• Example
• Conclusions
What are microarrays?
• Two major types
– Oligonucleotide gene chips
– Spotted glass arrays
• Perfect match (PM) and mismatch (MM) probes are spotted onto a gene chip
– ~20 probes make up a probe set (or gene)
– The MM probe for each gene has its middle base set to the complement of the corresponding base in the PM probe
– Hybridize labeled RNA corresponding to the PM probes
• Glass arrays involve the competitive hybridization of two RNA pools to cDNA spotted onto a glass slide
• Typically thousands of genes on a slide
Multiplicity Problem
• When we make more than one comparison in a hypothesis-testing situation, the usual interpretation of a single p-value no longer holds
• Control of the family-wise error rate is necessary to preserve the nominal type I error rate
• Various approaches correct the chance of making a type I error for multiplicity, including Tukey, Bonferroni and Holm
Microarray Analysis Techniques
• Westfall Young step down (WY)
• Significance Analysis of Microarrays (SAM)
• Empirical Bayes (EB)
• Bayesian (MCMC)
• Mixture Modeling
• Dimension reduction techniques
• Machine learning
Westfall Young (WY)
• Compute ranks $r_j$ of the original test statistics such that […]
• Construct $b$ balanced permutations of the samples, computing the same test statistic as above for each $b$
• Compute […]
• Repeat $B$ times and calculate the adjusted p-value as […]
• Less conservative than Bonferroni
Significance Analysis of Microarrays (SAM)
• Use a t-like statistic […]
• Use the balanced permutation method from the previous slide to estimate the null distribution, assuming all effects are null
• Call genes that fall outside the […] bars significant
Half-Normal Analysis
Microarray Specific Problem
Analysis Plan
• Robust measures of location and scale
• Summary statistic
• Two half-normal plots (for upward-regulated and downward-regulated genes)
• Segment determination
– Find $J$, $J_{NC}$
– Insignificant, borderline, significant
• Repeat the procedure, using $J_{NC}$ as base
Robust Measures of Location and Scale
• Perform transformation and suitable normalization
• Compute the median and Median Absolute Deviation (MAD) for each gene
– Reasonable estimates
– Less affected by outliers than mean and SD
– Interested in robustness rather than efficiency
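A minimal sketch of the per-gene robust summaries, using only the Python standard library. The function name `median_and_mad` and the sample values are illustrative assumptions. The MAD here is the unscaled median absolute deviation about the median; the slides do not say whether a consistency factor (such as 1.4826) is applied.

```python
from statistics import median


def median_and_mad(values):
    """Return (median, MAD) for one gene's replicate measurements.

    MAD is the median of absolute deviations from the median.
    ASSUMPTION: no consistency scaling factor is applied.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med, mad


# Hypothetical replicates for one gene; 100 plays the role of an outlier.
replicates = [1, 2, 3, 4, 100]
print(median_and_mad(replicates))
```

The single outlying replicate moves the mean of this sample to 22, while the median stays at 3 and the MAD at 1, which is the robustness the slide refers to.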
Summary Statistic
• Compute a quasi two-sample t-statistic using the robust values from above: […]
• $c$ is chosen to minimize […] for the middle 100*(1-2[…])% of the $ss_l$
• Tusher et al. (2001) chose $c$ to minimize the coefficient of variation
• Efron et al. (2001) used the 90th percentile of the gene standard error estimates for $c$
Two Half-Normal Plots
• Construct two half-normal plots: one for the $p$ positive and one for the $r$ negative $ss_l$
• Run the procedure separately on each set
• Denote the ordered $p$ positive effects by $abss_1 \le \dots \le abss_p$
• Plot $abss_i$ against half-normal distribution quantiles, i.e. the points $(\Phi^{-1}(0.5 + 0.5[i - 0.5]/p),\ abss_i)$, $i = 1, \dots, p$
• Goal: obtain a set of noise effects
• These yield a baseline against which to test the rest of the effects
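The plotting positions can be sketched as follows, assuming the standard half-normal coordinates $\Phi^{-1}(0.5 + 0.5[i - 0.5]/p)$ paired with the $i$-th smallest absolute effect. The function name `half_normal_points` and the sample effects are illustrative, not taken from the slides.

```python
from statistics import NormalDist


def half_normal_points(effects):
    """Pair each ordered absolute effect with its half-normal quantile.

    Returns a list of (quantile, abs_effect) tuples, smallest effect first.
    """
    ordered = sorted(abs(e) for e in effects)
    p = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf(0.5 + 0.5 * (i - 0.5) / p), a)
        for i, a in enumerate(ordered, start=1)
    ]


# Hypothetical effects of one sign (e.g. the positive summary statistics).
for quantile, effect in half_normal_points([0.3, 2.0, 1.1]):
    print(round(quantile, 4), effect)
```

If the effects are pure noise, these points should fall close to a straight line through the origin; points that rise well above the line are candidates for real effects.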
Segment Determination: $J$
• Given $k$, initialize the null set as the points $abss_1 : abss_k$
• Regress the null set on the first $k$ half-normal quantiles ($Q_1 : Q_k$)
• Produce predicted values at the remaining quantile values ($Q_h$: $h > k$)
• Compute predicted statistics […] with […]
• Find […]
Segment Determination: $J$ (cont)
• The initial null set of $k$ genes becomes $k + m$ (= $J_k$) null genes
• Now redo the segment determination procedure, using the $k + m$ genes as the base null set
• Continue until no new genes are added
• Do this for each $k$ less than $p - 1$
• Store the end point $J_k$
• Set the most frequent $J_k$ to $J$
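The iteration can be sketched roughly as below. The formula for the predicted statistic did not survive in the slides, so this sketch assumes a standard prediction-error statistic from simple linear regression (residual divided by its prediction standard error) and takes the critical value as a fixed parameter `t_crit`. All function names are hypothetical, and `abss` is assumed to be sorted in ascending order.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist, mean


def half_normal_quantiles(p):
    """Q_i = Phi^{-1}(0.5 + 0.5 (i - 0.5) / p) for i = 1..p."""
    nd = NormalDist()
    return [nd.inv_cdf(0.5 + 0.5 * (i - 0.5) / p) for i in range(1, p + 1)]


def extend_null_set(abss, q, k, t_crit):
    """One pass: fit a line to the first k points, return the new null-set size."""
    x, y = q[:k], abss[:k]
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s = sqrt(rss / (k - 2))  # residual standard deviation; needs k >= 3
    new_k = k
    for h in range(k, len(abss)):  # 0-based h is effect number h + 1
        # ASSUMPTION: prediction-error statistic; the slide's formula is lost.
        se = s * sqrt(1 + 1 / k + (q[h] - x_bar) ** 2 / sxx)
        resid = abss[h] - (intercept + slope * q[h])
        if se > 0 and abs(resid / se) < t_crit:
            new_k = h + 1  # largest effect still consistent with the null line
    return new_k


def find_end_point(abss, q, k, t_crit):
    """Repeat the pass until no new effects join the null set; returns J_k."""
    while True:
        new_k = extend_null_set(abss, q, k, t_crit)
        if new_k == k:
            return k
        k = new_k


def most_frequent_end_point(abss, t_crit, k_min=3):
    """Compute J_k for each starting k < p - 1 and return the most frequent one."""
    q = half_normal_quantiles(len(abss))
    ends = [find_end_point(abss, q, k, t_crit) for k in range(k_min, len(abss) - 1)]
    return Counter(ends).most_common(1)[0][0]
```

Passing `t_crit` in, rather than computing it from a t distribution, keeps the sketch free of assumptions about the degrees of freedom and significance level, which the slides do not state in recoverable form.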
Sample
• Let $k$ = 200, total effects = 500
– First 200 ordered positive effects regressed on first 200 half-normal quantiles
– Test ordered effects 201 to 500 using the absolute value of the predicted statistics
– For example, effect 239 is the largest $h$ whose statistic is less than the t critical value
– So $J_{200}$ would initially be 239
• Redo the above with $k$ = 239 effects, so we test effects 240 to 500
– Say statistic 242 is the largest $h$ less than the t critical value based on the new regression line
– So the new $J_{200}$ would be 242
• Redo the above again with $k$ = 242, testing effects 243 to 500
– No statistics are less than the t critical value
• So $J_{200}$ is 242
Example
[Figure: half-normal plot; the surviving labels read $J$ and 3116]
Find $J_{NC}$
• Will test all effects after $J$ using the same statistics
• To adjust for multiple testing, define $NC$ as the number of consecutive significant effects necessary to call all subsequent effects significant
• Use the Bonferroni adjustment (does not require independence): […]
• Instead of doing thousands of comparisons, only need to do $NC$ to determine significance
• Define […]
• Now we have identified the change points in the graph for segment detection
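One possible reading of the $NC$ rule, sketched under stated assumptions: each effect after $J$ has a p-value, an effect counts as significant when that p-value is at most $\alpha / NC$ (the usual Bonferroni per-comparison level, assumed here because the slide's formula is lost), and the change point is the start of the first run of $NC$ consecutive significant effects. The function name and the p-values are hypothetical.

```python
def first_run_of_significant(p_values, nc, alpha=0.05):
    """Return the 0-based index where the first run of `nc` consecutive
    significant p-values starts, or None if there is no such run.

    ASSUMPTION: significance is judged at the Bonferroni level alpha / nc.
    """
    level = alpha / nc
    run = 0
    for i, pv in enumerate(p_values):
        run = run + 1 if pv <= level else 0  # a non-significant effect resets the run
        if run == nc:
            return i - nc + 1
    return None


# Hypothetical p-values for the effects that follow J, in plot order.
p_values = [0.5, 0.001, 0.2, 0.001, 0.002, 0.003, 0.9]
print(first_run_of_significant(p_values, nc=3))
```

Only the $NC$ comparisons inside the run decide the call, which is how the procedure avoids a Bonferroni correction over thousands of genes.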
Example: Downward-Regulated Speed Mouse Data
Example: Downward-Regulated Speed Mouse Data (cont)
[Figure: plot annotated with $J$ and $J_{NC}$]
Error Rate Estimation: FDR
• False Discovery Rate (FDR) is the expected proportion of falsely rejected hypotheses among all rejected hypotheses
• Permute the condition labels, maintaining balance
– Example: 8 replicates in conditions A and B
– Each A’ and B’ will have 4 replicates from A and 4 from B
– Compute the robust statistics, keeping the same $c$ from the actual data
• Determine the average number of effects that fall above the positive or below the negative boundary of the significant sets
• Divide that number by the total number of effects called significant
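The balanced relabeling and the final ratio can be sketched as follows. The function names, the replicate indices, and the per-permutation counts are illustrative assumptions; computing the robust statistics for each relabeling is left out.

```python
from itertools import combinations


def balanced_relabelings(a_idx, b_idx):
    """Yield (new_a, new_b) index lists in which each new group takes
    half of its replicates from A and half from B."""
    half_a, half_b = len(a_idx) // 2, len(b_idx) // 2
    for from_a in combinations(a_idx, half_a):
        for from_b in combinations(b_idx, half_b):
            new_a = list(from_a) + list(from_b)
            new_b = [i for i in a_idx if i not in from_a] + [
                i for i in b_idx if i not in from_b
            ]
            yield new_a, new_b


def estimate_fdr(exceed_counts, n_called):
    """Average number of permuted effects beyond the significance boundaries,
    divided by the number of effects called significant in the real data."""
    return (sum(exceed_counts) / len(exceed_counts)) / n_called


# Small illustration: 4 replicates per condition instead of 8.
relabelings = list(balanced_relabelings([0, 1, 2, 3], [4, 5, 6, 7]))
print(len(relabelings))

# Hypothetical counts from three permutations, with 30 effects called significant.
print(estimate_fdr([2, 4, 3], n_called=30))
```

With 8 replicates per condition, as in the slide's example, full enumeration gives C(8,4) × C(8,4) = 4900 relabelings, so in practice one might sample from them rather than visit all.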
Speed Data: Analysis and Comparison
• WY found 8 genes significant, with Type I error = 0.05
Lemon Data: Analysis and Comparison
• WY found 253 genes significant, with Type I error = 0.05
Conclusions
• Proposed a new method for determining differential expression in genes
• Dealt with the multiplicity problem by using only a small subset of genes
• Can extend to other large data sets
• Allows scientists to play a role in sequential decision making
• Incorporates a priori knowledge of the experiment through the selection of $c$