1 Use of the Half-Normal Probability Plot to Identify Significant Effects for Microarray Data C. F....

Use of the Half-Normal Probability Plot to Identify Significant Effects for Microarray Data

C. F. Jeff Wu
University of Michigan

(joint work with G. Dyson)

Outline

• Current Methods

• Proposed Methodology

• Analysis Plan

• Example

• Conclusions

What are microarrays?

• Two major types
– Oligonucleotide gene chips
– Spotted glass arrays

• Perfect match (PM) and mismatch (MM) probes are spotted onto a gene chip
– ~20 probes make up a probe set (or gene)
– MM probe for each gene has the middle base set to the complement of its PM probe
– Hybridize labeled RNA corresponding to PM probes

• Glass arrays involve the competitive hybridization of two RNA pools to cDNA spotted onto a glass slide

• Typically thousands of genes on a slide

Multiplicity Problem

• When we make more than one comparison in a hypothesis testing situation, the usual interpretation of a p-value breaks down

• Control of the familywise error rate is necessary in order to preserve the nominal type I error rate

• Various approaches correct the chance of making a type I error for multiplicity, including Tukey, Bonferroni and Holm
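As a small illustration of two of the corrections named above, the sketch below computes Bonferroni and Holm adjusted p-values. The function names and the example p-values are hypothetical and chosen only for illustration; they do not come from the talk.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is multiplied by (m - rank).
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values
bonf = bonferroni(raw)
holm_adj = holm(raw)
print(bonf)
print(holm_adj)
```

Holm never gives a larger adjusted p-value than Bonferroni, which is why it is usually preferred when only familywise control is needed.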

Microarray Analysis Techniques

• Westfall Young step down (WY)

• Significance Analysis of Microarrays (SAM)

• Empirical Bayes (EB)

• Bayesian (MCMC)

• Mixture Modeling

• Dimension reduction techniques

• Machine learning

Westfall Young (WY)

• Compute ranks $r_j$ of the original test statistics such that $|t_{r_1}| \ge |t_{r_2}| \ge \dots \ge |t_{r_m}|$

• Construct balanced permutations of the samples, computing the same test statistics $t_{1,b}, \dots, t_{m,b}$ as above for each permutation $b$

• Compute the successive maxima $u_{m,b} = |t_{r_m,b}|$ and $u_{k,b} = \max(u_{k+1,b}, |t_{r_k,b}|)$ for $k = m-1, \dots, 1$

• Repeat $B$ times and calculate the adjusted p-value as $\tilde{p}_{r_k} = \#\{b : u_{k,b} \ge |t_{r_k}|\} / B$

• Less conservative than Bonferroni
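The step-down idea on this slide can be sketched as follows. This is a minimal illustration of the standard step-down maxT procedure, not necessarily the exact variant used in the talk. The names `t_stat`, `wy_maxt`, and the toy data are hypothetical, a Welch-type t statistic is assumed as the test statistic, and with only three samples per group the example enumerates all relabelings rather than balanced ones.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def t_stat(x, y):
    """Welch-type two-sample t statistic (an assumed choice of test statistic)."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def wy_maxt(data, group_a, group_b, relabelings):
    """Step-down maxT adjusted p-values, returned in gene order.

    data: list of genes, each a list of expression values (one per sample).
    group_a, group_b: sample indices of the two conditions.
    relabelings: list of (indices_a, indices_b) pairs, one per permutation.
    """
    m = len(data)
    observed = [abs(t_stat([g[i] for i in group_a], [g[i] for i in group_b]))
                for g in data]
    # r[0] is the gene with the largest |t|, r[m - 1] the smallest.
    r = sorted(range(m), key=lambda j: -observed[j])
    counts = [0] * m
    for perm_a, perm_b in relabelings:
        t_b = [abs(t_stat([g[i] for i in perm_a], [g[i] for i in perm_b]))
               for g in data]
        u = 0.0
        # Successive maxima, from the least to the most significant gene.
        for k in range(m - 1, -1, -1):
            u = max(u, t_b[r[k]])
            if u >= observed[r[k]]:
                counts[k] += 1
    adjusted = [0.0] * m
    running = 0.0
    for k in range(m):
        running = max(running, counts[k] / len(relabelings))  # enforce monotonicity
        adjusted[r[k]] = running
    return adjusted


data = [
    [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],   # shifted between conditions
    [2.0, 2.1, 1.9, 2.2, 1.8, 2.05],  # no clear shift
]
samples = list(range(6))
relabelings = [(list(a), [i for i in samples if i not in a])
               for a in combinations(samples, 3)]
adjusted = wy_maxt(data, [0, 1, 2], [3, 4, 5], relabelings)
print(adjusted)
```

With so few samples the smallest attainable adjusted p-value is coarse, which is one reason permutation methods need adequate replication.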

Significance Analysis of Microarrays (SAM)

• Use a t-like statistic

• Use the balanced permutation method from the previous slide to estimate the null distribution, assuming all effects are null

• Call genes that fall outside the bars significant
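For readers unfamiliar with the "t-like" statistic, the sketch below shows the usual SAM form from Tusher et al. (2001): a mean difference divided by a pooled standard error plus a small positive constant. The function name and the example values are hypothetical, and the constant `s0` is simply passed in rather than estimated.

```python
from math import sqrt
from statistics import mean


def sam_statistic(x, y, s0):
    """Mean difference over (pooled standard error + s0)."""
    n1, n2 = len(x), len(y)
    mx, my = mean(x), mean(y)
    pooled_ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    s = sqrt((1 / n1 + 1 / n2) * pooled_ss / (n1 + n2 - 2))
    return (mx - my) / (s + s0)


d_plain = sam_statistic([1, 2, 3], [4, 5, 6], s0=0.0)   # ordinary pooled t
d_damped = sam_statistic([1, 2, 3], [4, 5, 6], s0=1.0)  # shrunk toward zero
print(d_plain, d_damped)
```

The added constant keeps genes with very small variance from producing large statistics by chance.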

Half-Normal Analysis

Microarray Specific Problem

Analysis Plan

• Robust measures of location and scale

• Summary statistic

• Two half-normal plots (for upward-regulated and downward-regulated genes)

• Segment determination
– Find $J$, $J_{NC}$
– insignificant, borderline, significant

• Repeat the procedure, using $J_{NC}$ as base

Robust Measures of Location and Scale

• Perform transformation and suitable normalization

• Compute median and Median Absolute Deviation (MAD) for each gene
– Reasonable estimates
– Less affected by outliers than mean and SD
– Interested in robustness rather than efficiency
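A minimal sketch of the two robust summaries for a single gene is given below. The helper name `mad` and the example values are hypothetical, and the MAD is left unscaled (no consistency constant), which is an assumption rather than something stated on the slide.

```python
from statistics import median, stdev


def mad(values):
    """Median absolute deviation about the median (unscaled)."""
    center = median(values)
    return median([abs(v - center) for v in values])


gene = [1, 2, 3, 4, 100]  # one gross outlier
gene_median = median(gene)
gene_mad = mad(gene)
print(gene_median, gene_mad, stdev(gene))
```

The single outlier barely moves the median and MAD, while the standard deviation is dominated by it.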

Summary Statistic

• Compute a quasi two-sample t-statistic $ss_l$ for each gene $l$, using the robust values from above and a constant $c$

• $c$ is chosen by minimizing a criterion over the middle $100(1-2\alpha)\%$ of the $ss_l$

• Tusher et al. (2001) chose $c$ to minimize the coefficient of variation

• Efron et al. (2001) used the 90th percentile of the gene standard error estimates for $c$

Two Half-Normal Plots

• Construct two half-normal plots: one for the $p$ positive and one for the $r$ negative $ss_l$

• Run the procedure separately on each set

• Denote the ordered $p$ positive effects by $abss_{(1)} \le abss_{(2)} \le \dots \le abss_{(p)}$

• Plot $abss_{(i)}$ against half-normal distribution quantiles, i.e. the points $(\Phi^{-1}(0.5 + 0.5[i - 0.5]/p),\ abss_{(i)})$, $i = 1, \dots, p$

• Goal: obtain set of noise effects

• Yield a baseline against which to test the rest of the effects
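The plotting positions can be computed directly from the standard normal quantile function. The sketch below uses the half-normal quantile formula $\Phi^{-1}(0.5 + 0.5[i - 0.5]/p)$; the function names and example effects are hypothetical.

```python
from statistics import NormalDist


def half_normal_quantiles(p):
    """Quantiles Phi^{-1}(0.5 + 0.5 * (i - 0.5) / p) for i = 1, ..., p."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(0.5 + 0.5 * (i - 0.5) / p) for i in range(1, p + 1)]


def half_normal_points(effects):
    """Pair each ordered absolute effect with its half-normal quantile."""
    abss = sorted(abs(e) for e in effects)
    return list(zip(half_normal_quantiles(len(abss)), abss))


points = half_normal_points([-3.0, 1.0, 2.0])
for quantile, effect in points:
    print(round(quantile, 3), effect)
```

Noise effects should fall close to a straight line through the origin on this plot, and real effects should rise above it.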

Segment Determination: $J$

• Given $k$, initialize the null set as points $abss_{(1)}, \dots, abss_{(k)}$

• Regress the null set on the first $k$ half-normal quantiles ($Q_1, \dots, Q_k$)

• Produce predicted values $\hat{y}_h$ at the remaining quantile values ($Q_h$: $h > k$)

• Compute predicted statistics for each $h > k$

• Find $J_k$, the largest $h$ whose predicted statistic is less than the t critical value

Segment Determination: (cont)

• The initial null set of $k$ genes becomes $k + m$ (= $J_k$) null genes

• Now re-do the segment determination procedure, using the $k + m$ genes as base null set

• Continue until no new genes are added

• Do for each $k$ less than $p-1$

• Store the end point $J_k$

• Set the most frequent $J_k$ to $J$
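The iteration described on these two slides can be sketched as below. Several pieces are assumptions rather than details from the talk: the predicted statistic is taken to be the usual prediction-interval statistic from simple linear regression, the critical value is a fixed number (1.96) instead of a t quantile, and all function names and the synthetic data are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist, mean


def half_normal_quantiles(p):
    nd = NormalDist()
    return [nd.inv_cdf(0.5 + 0.5 * (i - 0.5) / p) for i in range(1, p + 1)]


def end_point(abss, q, k, critical):
    """Grow the null set from the first k ordered effects until it stops changing."""
    p = len(abss)
    while k < p:
        x, y = q[:k], abss[:k]
        x_bar, y_bar = mean(x), mean(y)
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
        intercept = y_bar - slope * x_bar
        rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
        s = sqrt(rss / (k - 2))
        new_k = k
        for h in range(k, p):  # 0-based index h is ordered effect h + 1
            se = s * sqrt(1 + 1 / k + (q[h] - x_bar) ** 2 / sxx)
            stat = (abss[h] - (intercept + slope * q[h])) / se
            if abs(stat) < critical:
                new_k = h + 1  # largest h that still looks like noise
        if new_k == k:
            break  # no new genes were added
        k = new_k
    return k


def find_j(effects, critical=1.96, k_min=3):
    """Most frequent end point over all starting sizes k < p - 1."""
    abss = sorted(abs(e) for e in effects)
    p = len(abss)
    q = half_normal_quantiles(p)
    ends = [end_point(abss, q, k, critical) for k in range(k_min, p - 1)]
    return Counter(ends).most_common(1)[0][0]


# Synthetic example: 60 near-perfect noise effects plus 5 large effects.
q65 = half_normal_quantiles(65)
noise = [q65[i] + 0.001 * (-1) ** (i + 1) for i in range(60)]
effects = noise + [10.0, 11.0, 12.0, 13.0, 14.0]
J = find_j(effects)
print(J)
```

Taking the most frequent end point makes the result less sensitive to any one starting value of `k`, which is the motivation given on the slide for repeating the procedure over all `k`.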

Sample

• Let $k$ = 200, total effects = 500
– First 200 ordered positive effects regressed on first 200 half-normal quantiles
– Test ordered effects 201 to 500 using absolute value of predicted statistics
– For example, effect 239 is the largest $h$ whose statistic is less than the t critical value
– So $J_{200}$ would initially be 239

• Redo the above, with $k$ = 239 effects; so we test effects 240 to 500
– Say statistic 242 is the largest $h$ less than the t critical value based on the new regression line
– So the new $J_{200}$ would be 242

• Redo the above again with $k$ = 242, test effects 243 to 500
– No statistics are less than the t critical value

• So $J_{200}$ is 242

Example

[Figure: plot annotated with $J$ and $J = 3116$]

Find $J_{NC}$

• Will test all effects after $J$ using the same statistics

• To adjust for multiple testing, define $NC$ as the number of consecutive significant effects necessary to call all subsequent effects significant

• Use the Bonferroni adjustment (does not require independence)

• Instead of doing thousands of comparisons, only need to do $NC$ to determine significance

• Define $J_{NC}$

• Now we have identified the change points in the graph for segment detection

Example: Downward-regulated Speed Mouse Data

Example: Downward-regulated Speed Mouse Data (cont)

[Figure: plot annotated with $J$ and $J_{NC}$]

Error Rate Estimation: FDR

• False Discovery Rate (FDR) is the expected proportion of falsely rejected hypotheses among all rejected hypotheses

• Permute the condition labels, maintaining balance
– Example: 8 replicates in conditions A and B
– Each A' and B' will have 4 replicates from A and 4 from B
– Compute the robust statistics, keeping the same $c$ from the actual data

• Determine the average number of effects that fall above the positive or below the negative boundary of the significant sets

• Divide that number by the total number of effects called significant
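The permutation estimate described above can be sketched in two parts: generating balanced relabelings and turning the permuted statistics into an FDR estimate. The function names, the boundaries, and all numbers below are hypothetical. The sketch uses 4 replicates per condition (36 balanced relabelings) to stay small, whereas the slide's example uses 8.

```python
from itertools import combinations


def balanced_relabelings(group_a, group_b):
    """Yield (A', B') where A' takes half its samples from A and half from B."""
    half = len(group_a) // 2
    for from_a in combinations(group_a, half):
        for from_b in combinations(group_b, half):
            new_a = list(from_a) + list(from_b)
            new_b = [i for i in group_a + group_b if i not in new_a]
            yield new_a, new_b


def estimate_fdr(observed, permuted, upper, lower):
    """Average number of permuted statistics beyond the boundaries, divided by
    the number of observed statistics called significant."""
    called = sum(1 for s in observed if s > upper or s < lower)
    if called == 0:
        return 0.0
    false_calls = [sum(1 for s in perm if s > upper or s < lower)
                   for perm in permuted]
    return (sum(false_calls) / len(false_calls)) / called


relabelings = list(balanced_relabelings([0, 1, 2, 3], [4, 5, 6, 7]))
print(len(relabelings))

observed = [5.0, -4.0, 0.5, 3.5]   # made-up statistics from the real labels
permuted = [[3.2, 0.1, -0.2, 0.4],  # made-up statistics from two relabelings
            [0.3, -0.1, 0.2, 0.0]]
fdr = estimate_fdr(observed, permuted, upper=3.0, lower=-3.0)
print(fdr)
```

In a full analysis the permuted statistics would be recomputed from the data for every relabeling, keeping the same `c` as the slide specifies.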

Speed Data: Analysis and Comparison

• WY found 8 genes significant, with Type I error = 0.05

Lemon Data: Analysis and Comparison

• WY found 253 genes significant, with Type I error = 0.05

Conclusions

• Proposed a new method for determining differential expression in genes

• Dealt with the multiplicity problem by using only a small subset of genes

• Can extend to other large data sets

• Allow scientists to play a role in sequential decision making

• Incorporate a priori knowledge of experiment with selection of $c$

Date post:	30-Dec-2015