
Microarray Data Normalization and Analysis

John Quackenbush
CAMDA

12 November 2003

Microarray Analysis at TIGR

Step 1: Experimental Design

Step 2: Data Collection

Step 3: Data Analysis

Step 4: Consulting with the ArraySW gang in the trailer

Step 5: Sharing data with our collaborators

TM4 Resources

TM4 Reference
Saeed, A.I., Sharov, V., White, J., Li, J., Liang, W., Bhagabati, N., Braisted, J., et al. 2003. TM4: A Free, Open-Source System for Microarray Data Management and Analysis. BioTechniques 34: 374-378

CD Contents
• All TM4 Applications
• User Manuals
• Supplementary Documentation
• Sample Data Sets

TM4 Website
• Application Downloads
• Documentation and FAQs
• And Much More!
http://www.tigr.org/software/tm4/

Microarray Data Flow

[Figure: data-flow diagram with the following elements: Printer, Scanner, .tiff Image File, Image Analysis, Raw Gene Expression Data, Gene Annotation, Normalization / Filtering, Normalized Data with Gene Annotation, Expression Analysis, Interpretation of Analysis Results, and databases (MAD, AGED, Others…)]

[The same diagram is repeated on the slides titled below.]

MAD Database Schema

MADAM

TIGR Spotfinder

MIDAS

MeV

The Experimental Design

The Experimental Design dictates a good deal of what you can do with the data

Good normalization and processing reflect the experimental design

The design also facilitates certain comparisons between samples and provides the statistical power you need for assigning confidence limits to individual measurements

The design must reflect experimental reality

The most straightforward designs compare expression in two classes of samples to look for patterns that distinguish them.

Sample Pairing for Co-Hybridization Experiments

Direct Comparison with Dye Swap:
• RNA sample is not limiting (e.g. plenty of sample)
• Flip dyes account for any gene-dye effects

Balanced Block Design:
• RNA sample is limiting
• Balanced blocking accounts for any gene-dye effects

[Figure: pairing diagrams for the two designs, using samples A1, B1, A2, B2, A3, B3, A4, B4]

Multiple Sample Pairings

Reference Design (Indirect Comparison):

Loop Design:

[Figure: pairing diagrams for the two designs; the reference design links samples A, B, C, D to a common reference R, and the loop design links samples A through F]

• More than two samples are compared (e.g. tumor classification, time course)
• Flip dyes are not necessary but can be done to increase precision
• Ratio values are inferred (indirect)
• Suited for cluster analysis – need common reference

Why perform flip-dye experiments?

Microarray experiments employing a single direction design:
Q1 C1   Q2 C2   Q3 C3

Flip-dye or dye-reversal design:
Q1 C1   Q2 C2   Q3 C3

Flip-dye experiments account for gene-dye effects

Loops and Reference Designs

S. Wang, K. Kerr, J. Quackenbush, G. Churchill

[Figure: four hybridization layouts for samples A, B, C, D, E and reference R, 23 hybs in total:
• Standard flip-dye expt: A, B, C, D, E each against R (10 hybs)
• Proposed loop expt: A, B, C, D, E (10 hybs)
• Proposed loop expt with reference to provide direct comparison with reference: A, B, C, D, R (0 new hybs)
• Proposed loop expt with reference to show invariance w.r.t. order: A, B, D, C, E (3 new hybs)]

Both approaches can give equivalent results

Loop vs. Reference Designs

Loop design
• Can provide direct measurements
• Gives more data on each experimental sample with the same number of hybs
• Requires more RNA per sample
• Can "unwind" with a bad sample or for a gene with bad data

Reference design
• Easily extensible
• Simple interpretation of all results
• Requires less RNA per sample
• Less sensitive to bad RNA samples and bad array elements

Experimental Design

[Figure: alternative pairings of samples A1, A2, A3, A4 with samples B1, B2, B3, B4 across hybridizations]

Keep it simple!

One Possible Experimental Paradigm: Examining Genotype, Phenotype, and Environment

[Figure: crossed design of Genotype (Parental, Derived) and Environment (stressed, unstressed), giving four sample classes: Parental - stressed, Parental - unstressed, Derived - stressed, Derived - unstressed. Additional labels: Reference Sample, Assay Variation]

Basic Design Principles

Biological replicas are more informative than correlated replicas (independent RNA, independent slides)

More replicas are better – higher statistical power

For loops, hybridizations of individual samples should be "balanced" (as many Cy3 as Cy5 labelings)

Self-self hybs add data on reproducibility and can be used to produce error models

At a minimum, should use dye swap replicates to compensate for any dye biases in labeling or detection

How Many Replicates?

\[ n = \frac{4\,(z_{\alpha/2} + z_{\beta})^{2}}{\left(\delta / 1.4\sigma\right)^{2}} \]

Where z_{α/2} and z_β are normal percentile values at significance level α and false negative rate β; δ represents the minimum detectable log2 ratio; and σ represents the SD of log ratio values.

For α = 0.001 and β = 0.05, z_{α/2} = -3.29 and z_β = -1.65.

Assume δ = 1.0 (2-fold change) and σ = 0.25.

Therefore n = 12 samples (6 query and 6 control).

(Simon et al., Genetic Epidemiology 23: 21-36, 2002)
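The arithmetic behind n = 12 can be checked directly. The sketch below plugs the slide's values into the replicate formula; the function name `replicates_needed` is only an illustrative label, not part of any TM4 tool.

```python
import math


def replicates_needed(z_alpha_half, z_beta, delta, sigma):
    """Sample size n = 4 (z_a/2 + z_b)^2 / (delta / (1.4 sigma))^2."""
    numerator = 4.0 * (z_alpha_half + z_beta) ** 2
    denominator = (delta / (1.4 * sigma)) ** 2
    return numerator / denominator


# Values quoted on the slide: alpha = 0.001, beta = 0.05, delta = 1.0, sigma = 0.25
n = replicates_needed(-3.29, -1.65, delta=1.0, sigma=0.25)
n_rounded_up = math.ceil(n)
print(f"n = {n:.2f}, rounded up to {n_rounded_up} samples")
```

The raw value is just under 12, so rounding up gives the 12 samples (6 query and 6 control) stated above.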

Some Concepts from Statistics

Probability distributions

The probability of an event is the likelihood of its occurring.

It is sometimes computed as a relative frequency (rf), where

\[ rf = \frac{\text{the number of ``favorable'' outcomes for an event}}{\text{the total number of possible outcomes for that event}} \]

The probability of an event can sometimes be inferred from a "theoretical" probability distribution, such as a normal distribution.

Normal distribution

[Figure: normal curve centered at X = µ (mean of the distribution); σ = standard deviation of the distribution]

[Figure: two distributions, Population 1 with Mean 1 and Population 2 with Mean 2, with the sample mean "s" marked]

Less than a 5% chance that the sample with mean s came from Population 1

s is significantly different from Mean 1 at the p < 0.05 significance level.

But we cannot reject the hypothesis that the sample came from Population 2

Probability and Expression Data

Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution.

But expression measurements? Probably not.

Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests.

Randomization / resampling based tests can be used to get around the violation of the normality assumption.

Even when parametric statistical tests (the ones that make use of normal and other distributions) are valid, randomization tests are still useful.

Outline of a randomization test

1. Compute the value of interest (i.e., the test-statistic s) from your data set.

2. Make "fake" data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion. Re-compute s from the "fake" data set.

3. Repeat step 2 many times (often several hundred to several thousand times) and record the "fake" s values from step 2

4. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized ("fake") s values

[Figure: original data set yielding s; randomized "fake" data sets each yielding a "fake" s; range of randomized s values]

Original s value could be significant as it exceeds most of the randomized s values
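The four steps above can be sketched as a small permutation test. This is a minimal illustration under stated assumptions: the test statistic s is taken to be the difference in group means, the "fake" data sets are made by reshuffling group labels, and the function name and sample values are hypothetical.

```python
import numpy as np


def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Randomization test for a difference in means (assumed statistic s)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)

    # Step 1: test statistic from the original data
    s_obs = a.mean() - b.mean()

    # Steps 2 and 3: reshuffle the pooled data many times, recompute s each time
    pooled = np.concatenate([a, b])
    fake_s = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(pooled)
        fake_s[i] = shuffled[: len(a)].mean() - shuffled[len(a):].mean()

    # Step 4: compare the original s with the distribution of "fake" s values
    p_value = (np.sum(np.abs(fake_s) >= abs(s_obs)) + 1) / (n_perm + 1)
    return s_obs, fake_s, p_value


# Hypothetical log2 ratios for one gene in two sample classes
query = [2.1, 1.9, 2.3, 2.0, 2.2, 1.8]
control = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
s_obs, fake_s, p_value = permutation_test(query, control)
print(f"s = {s_obs:.2f}, permutation p-value = {p_value:.4f}")
```

Adding 1 to both the count and the number of permutations keeps the estimated p-value from ever being exactly zero, which is a common convention rather than something the slides prescribe.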


Rationale

Ideally, we want to know the "behavior" of the larger population from which the sample is drawn, in order to make statistical inferences.

Here, we don't know that the larger population "behaves" like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand.

Our "fake" data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample)


The problem of multiple testing
(adapted from presentation by Anja von Heydebreck, Max–Planck–Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany; http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)

Let's imagine there are 10,000 genes on a chip, and none of them is differentially expressed.

Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.

The problem of multiple testing – 2

Let's say that applying this test to gene "G1" yields a p-value of p = 0.01

Remember that a p-value of 0.01 means that, if the gene were not differentially expressed, a result at least this extreme would turn up by chance only 1% of the time, i.e.,

Even though we conclude that the gene is differentially expressed (because p < 0.05), there is still a small chance that our conclusion is wrong.

We might be willing to live with such a low probability of being wrong

BUT .....

We are testing 10,000 genes, not just one!!!

Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to "live with" a p-value of 0.05

If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn't sound too good.
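The 500-gene figure is easy to reproduce by simulation. The sketch below assumes that, when no gene is differentially expressed, each gene's p-value is uniformly distributed between 0 and 1; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes = 10_000
alpha = 0.05

# Under the null hypothesis, p-values are (assumed) uniform on [0, 1]
p_values = rng.uniform(0.0, 1.0, size=n_genes)

expected_false_positives = n_genes * alpha
observed_false_positives = int(np.sum(p_values < alpha))

print(f"expected: {expected_false_positives:.0f}, "
      f"simulated: {observed_false_positives}")
```

The simulated count varies from run to run with the seed, but it stays close to the expected 500.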


There are "tricks" we can use to reduce the severity of this problem.

They all involve "slashing" the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value.

We'll go into some of these techniques later.


Don't get too hung up on p-values.

Ultimately, what matters is biological relevance.

P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance.

Statistical significance is not necessarily the same as biological significance.


You don't want to belong to "that group of people whose aim in life is to be wrong 5% of the time"!!! *

* Kempthorne, O., and T.E. Doerfler. 1969. The behaviour of some significance tests under experimental randomization. Biometrika 56:231-248, as cited in Manly, B.F.J. 1997. Randomization, bootstrap and Monte Carlo methods in biology: pg. 1. Chapman and Hall / CRC

Pearson correlation coefficient – r

Indicates the degree to which a linear relationship can be approximated between two variables.

Can range from (-1.0) to (+1.0).

Positive r between two variables x and y: as x increases, so does y on the whole.

[Figure: scatter plot of y against x with an upward trend]

Negative r: as x increases, y generally decreases.

[Figure: scatter plot of y against x with a downward trend]

The higher the magnitude of r (in the positive or negative direction), the more linear the relationship.

Pearson correlation coefficient – 2

Sometimes, a p-value is associated with the correlation coefficient r.

This p-value is computed from the "theoretical" distribution of the correlation coefficient, similar to the normal distribution.

[Figure: distribution of the sample correlation coefficient r around a population correlation coefficient = 0. Annotation: p < 0.05 range, i.e., reject the null hypothesis that the variables are not correlated, since the sample correlation coefficient is in the rejection range of the correlation coefficient distribution that has a mean = 0]

This is the p-value for the null hypothesis that the x and y data for our sample come from a population in which their correlation is zero, i.e., the null hypothesis is that there is no linear relationship between x and y.

If p is sufficiently small (often p < 0.05), we can reject the null hypothesis, i.e., we conclude that there is indeed a linear relationship between x and y.

Pearson correlation coefficient – r

The square of the Pearson correlation, r², also known as the "coefficient of determination," is a measure of the "strength" of the linear relationship between x and y.

It is the proportion of the total variation in x and y that is explained by a linear relationship.
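As a concrete illustration of r and r², the sketch below computes the Pearson correlation from its textbook definition (covariance divided by the product of the standard deviations). The helper name `pearson_r` and the data values are made up for the example.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.1, 3.9, 6.2, 8.1, 9.8]      # y rises with x: positive r
y_down = [9.8, 8.1, 6.2, 3.9, 2.1]    # y falls as x rises: negative r

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_squared = r_up ** 2                  # coefficient of determination

print(f"r (up) = {r_up:.4f}, r (down) = {r_down:.4f}, r^2 = {r_squared:.4f}")
```

The same value is returned by `np.corrcoef(x, y)[0, 1]`, which is usually the more convenient call in practice.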

Dealing with Data

Before any pattern analysis can be done, one must first normalize and filter the data.

Normalization facilitates comparisons between datasets.

Filtering transformations can eliminate questionable data and reduce complexity.

Expression Elements

Samples | Dyes | Measured Intensities
Exp.    | Cy5  | 216108
Ref.    | Cy3  | 77103

Ratio: 2.798
Log2 Ratio: 1.485

[Figure: spot images with a False Color Representation scale running from -3.0 through 0 to 3.0]

Ratio vs. log-ratio

Let A_i: Red intensity, B_i: Green intensity

\[ R_i = \frac{A_i}{B_i} \]

\[ \log_2 R_i = \log_2\!\left(\frac{A_i}{B_i}\right) \]

Gene1: R_1 = 4, log2 R_1 = 2

Gene2: R_2 = 1/4, log2 R_2 = -2

[Figure: R plotted against A*B, with Gene1 at R = 4 and Gene2 near 0; log2 R plotted against log2(A*B), with Gene1 at 2 and Gene2 at -2]

Advantages of log transformation:

Treat up-regulated and down-regulated genes symmetrically!

Transfer multiplication operations to addition operations! Because:

\[ \log_2\!\left(\frac{A}{B}\right) = \log_2 A - \log_2 B \]
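Both advantages can be seen numerically. The short sketch below uses the Gene1 and Gene2 ratios from the slide and the measured intensities from the Expression Elements slide; it relies only on the standard library.

```python
import math

# Gene1 is 4-fold up, Gene2 is 4-fold down
ratio_gene1 = 4.0
ratio_gene2 = 1.0 / 4.0

log_gene1 = math.log2(ratio_gene1)
log_gene2 = math.log2(ratio_gene2)

# On the ratio scale the two genes sit at unequal distances from "no change" (1.0);
# on the log2 scale they are symmetric about 0.
print(ratio_gene1 - 1.0, 1.0 - ratio_gene2)   # 3.0 versus 0.75
print(log_gene1, log_gene2)                   # 2.0 versus -2.0

# Division of intensities becomes subtraction of logs
red, green = 216108.0, 77103.0
log_of_ratio = math.log2(red / green)
difference_of_logs = math.log2(red) - math.log2(green)
print(round(log_of_ratio, 3), round(difference_of_logs, 3))
```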

Expression Vectors

Gene Expression Vectors represent the expression of a gene over a set of experimental conditions or sample types.

log2(cy5/cy3): -0.8  0.8  1.5  1.8  0.5  -1.3  -0.4  1.5

Why Normalize Data?

Goal is to measure ratios of gene expression levels

(ratio)_i = R_i/G_i

where R_i and G_i are the measured intensities in the two channels for the ith spot.

In a self-self hybridization, we would expect all ratios to be equal to one:

R_i/G_i = 1 for all i. But they may not be.

Why not?
• Unequal labeling efficiencies for Cy3/Cy5
• Noise in the system
• Differential expression

Normalization brings (appropriate) ratios back to one.

The Starting Point: The Ratio

[Figure: Ratio Histogram. x-axis: Ratio (0 to 2); y-axis: Frequency (0 to 5000)]

[Figure: Log(ratio) Histogram. x-axis: Log(ratio) (-2 to 2); y-axis: Frequency (0 to 3000)]

Log2(ratio) measures treat up- and down-regulated genes equally

log2(1) = 0    log2(2) = 1    log2(1/2) = -1

Normalization Approaches: A variety exist
• Total Intensity
• Linear Regression
• Ratio statistics described by Chen, Dougherty, & Bittner, J. Biomed. Optics (1997) 2(4) 364-374
• Iterative log(ratio) mean centering
• Lowess Correction
• And others

Any of these using:
• Entire Data Set
• User-defined Data Set/Controls

Normalization Approaches

Using the Entire Data Set
• Probe Quantification less important
• No assumption on which genes constitute "housekeeping" set
• Uses all the data
• No independent confirmation

User-defined Data Set/Controls
• Requires definition of "housekeeping" set or good added controls
• Requires good RNA quantitation
• Ignores much data

Normalization Approaches

The Solution(?)
• The best technique is experiment dependent
• A good approach is to use a combination of techniques
• All analysis methods depend on an intelligent Experimental design

Resource: A. thaliana DNA Clones for Spiking

• chlorophyll a/b binding protein (Cab)
• RUBISCO activase (RCA)
• ribulose-1,5-bisphosphate carboxylase/oxygenase (RbcL)
• lipid transfer protein 4 (LTP4)
• lipid transfer protein 6 (LTP6)
• papain-type cysteine endopeptidase (XCP2)
• root cap 1 (RCP1)
• NAC1
• triosephosphate isomerase (TIM)
• ribulose-5-phosphate kinase (PRKase)

[Figure: map of the pSP64 Poly(A) Vector (3030 bp) showing M13R, the SP6 Promoter, the SP6 Transcription Start, and the restriction sites HindIII, PstI, SalI, AccI, HincII, XbaI, EcoRI, SacI, AvaI, SmaI, BamHI]

5’…ATTTA GGTGA CACTA TAGAA TACAA GCTTG GGCTG CAGGT CGACT CTAGA

GGATC CCCGG GCGAG CTCCC AAAAA AAAAA AAAAA AAAAA AAAAA AAAAA CCGAA TTC…3’

Clone set available at: <http://pga.tigr.org>

Resource: B. subtilis DNA Clones for Spiking

• pGIBS-lys ATCC 87482
• pGIBS-phe ATCC 87483
• pGIBS-thr ATCC 87484
• pGIBS-trp ATCC 87485
• pGIBS-dap ATCC 87486

Artificial polyA added to the 3’ end

[Figure: map of pBluescript II KS+ (3.0 kb) showing the f1(+) origin, lacZ, MCS, P lac, pUC origin, ampicillin, and the SacI and KpnI sites]

Clone set available at: <http://www.atcc.org>

Normalization Approaches: Total Intensity

Conceptually, this is the simplest approach

Assumption: Total RNA (mass) used is same for both samples.

So, averaged across thousands of genes, total hybridization should be the same for both samples

Normalization Factor:

\[ N = \frac{\sum_{k=1}^{N_{array}} R_k}{\sum_{k=1}^{N_{array}} G_k} \]

Normalization: \( G'_k = N\,G_k \) and \( R'_k = R_k \).
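A minimal sketch of total intensity normalization, following the normalization factor and rescaling rule above. The function name and the four-spot intensity arrays are hypothetical, chosen so the factor comes out to a round number.

```python
import numpy as np


def total_intensity_normalize(red, green):
    """Rescale the green channel so both channels have the same total intensity."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    factor = red.sum() / green.sum()       # N = sum(R_k) / sum(G_k)
    return red.copy(), factor * green, factor


# Hypothetical intensities for four spots
red = np.array([1200.0, 800.0, 3000.0, 500.0])
green = np.array([600.0, 500.0, 1400.0, 250.0])

red_norm, green_norm, n_factor = total_intensity_normalize(red, green)
print("normalization factor:", n_factor)
print("normalized ratios:", red_norm / green_norm)
```

After rescaling, the two channels have the same total intensity, so the ratios are centered around one if the equal-RNA assumption holds.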

Before and After Normalization

The Starting Point: The R-I Plot

Data exhibits an intensity-dependent structure

Uncertainty in measurements is greater at lower intensities

Uncertainty in ratio measurements generally greater at lower intensities

Plot log2(R/G) vs. log2(R*G) [variation: Terry Speed's M-A plot with (½)*log2(R*G)]
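The plot coordinates are simple to compute. The sketch below returns the R-I coordinates and the M-A variant for arrays of red and green intensities; the function names and the two-spot example are illustrative only, and plotting is left out.

```python
import numpy as np


def ri_coordinates(red, green):
    """Return (log2(R/G), log2(R*G)) for an R-I plot."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    return np.log2(red / green), np.log2(red * green)


def ma_coordinates(red, green):
    """M-A variant: same log ratio, with A = 0.5 * log2(R*G)."""
    log_ratio, log_product = ri_coordinates(red, green)
    return log_ratio, 0.5 * log_product


# Two hypothetical spots with the same overall intensity but opposite ratios
red = [1024.0, 256.0]
green = [256.0, 1024.0]

log_ratio, log_product = ri_coordinates(red, green)
m_values, a_values = ma_coordinates(red, green)
print(log_ratio, log_product, a_values)
```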

[Figure: R-I plot labeled "Good Data"]

[Figure: R-I plot labeled "Bad Data from Parts Unknown"; credited to Gary Churchill. Each "pen group" is colored differently]

Lowess Normalization

Why LOWESS?

[Figure: panel A, R-I plot with y-axis from -3 to 3 and x-axis log(Cy3*Cy5) from 7 to 14; SD = 0.346]

Observations
1. Intensity-dependent structure
2. Data not mean centered at log2(ratio) = 0

LOWESS (Cont'd)

Local linear regression model

\[ \hat{y}_i = x_i \cdot \hat{\beta} \]

Tri-cube weight function

Least Squares

\[ \frac{\partial}{\partial \beta} \sum_i w(x_i)\,(y_i - x_i \cdot \beta)^2 = 0 \]

\[ \hat{B} = (X'WX)^{-1} X'WY \]

Estimated values of log2(Cy5/Cy3) as function of log2(Cy3*Cy5)

[Figure: panel A, R-I plot with y-axis from -3 to 3 and x-axis log(Cy3*Cy5) from 7 to 14; SD = 0.346]
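The local regression described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the MIDAS implementation: it fits a weighted straight line around each point using tri-cube weights and the weighted least-squares solution (X'WX)^-1 X'WY, with no robustness iterations. The function name, the `frac` smoothing parameter, and the synthetic data are assumptions made for the example, and the x values are assumed to be distinct.

```python
import numpy as np


def lowess_fit(x, y, frac=0.4):
    """Locally weighted linear regression with tri-cube weights (one pass)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(3, int(np.ceil(frac * n))))   # points in each local window
    design = np.column_stack([np.ones(n), x])    # X: intercept and slope columns
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]                 # distance to the k-th nearest point
        w = np.clip(1.0 - (dist / h) ** 3, 0.0, None) ** 3   # tri-cube weights
        xtw = design.T * w                       # X'W
        beta = np.linalg.lstsq(xtw @ design, xtw @ y, rcond=None)[0]
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted


# Synthetic R-I data: log ratios with an intensity-dependent offset
log_intensity = np.linspace(7.0, 14.0, 40)
log_ratio = 0.3 + 0.05 * (log_intensity - 10.5) ** 2

trend = lowess_fit(log_intensity, log_ratio)
normalized = log_ratio - trend                   # subtract the fitted trend
print(f"mean before: {log_ratio.mean():.3f}, mean after: {normalized.mean():.3f}")
```

Subtracting the fitted curve from each log ratio removes the intensity-dependent structure and recenters the data near log2(ratio) = 0, which addresses the two observations listed on the "Why LOWESS?" slide.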

LOWESS Results

Replicates: Filtering Questionable Data

In every array, there are questionable – or bad – data for some elements

Replicates can help identify those elements

We can use an unbiased filter to eliminate those from future consideration

Consider two replicates with dyes swapped: A1/B1 and B2/A2

Replicates: Applied to Filtering Data

We can calculate σ_R and eliminate spots with the greatest uncertainty: σ_R > 2

We expect to see (A1/B1) * (B2/A2) = 1*
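One possible reading of the flip-dye consistency filter is sketched below: for each spot, take log2 of the product (A1/B1)*(B2/A2), which should be near 0, and drop spots lying more than two standard deviations from the mean of that quantity. Interpreting the "σ_R > 2" cut as "more than 2 SD from the mean" is an assumption, as are the function name and the sample intensities.

```python
import math
import statistics

def flip_dye_filter(a1, b1, b2, a2, n_sd=2.0):
    """Return (log2 products, keep flags) for paired flip-dye replicate spots."""
    logs = [math.log2((x1 / y1) * (y2 / x2))
            for x1, y1, y2, x2 in zip(a1, b1, b2, a2)]
    mu = statistics.mean(logs)
    sd = statistics.stdev(logs)
    keep = [abs(v - mu) <= n_sd * sd for v in logs]
    return logs, keep

# Hypothetical intensities: nine consistent spots and one inconsistent spot.
a1 = [200.0] * 9 + [800.0]
b1 = [100.0] * 10
b2 = [100.0] * 9 + [200.0]
a2 = [200.0] * 9 + [100.0]
logs, keep = flip_dye_filter(a1, b1, b2, a2)
print(keep)
```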

The Effects of flip-dye replicate trim

Red data are eliminated as inconsistent

The uncertainty in measurements increases as intensity decreases

Measurements close to the detection limit are the most uncertain

Fold-change measurements ignore these effects

We can calculate an intensity-dependent Z-score that measures the ratio relative to the standard deviation in the data:

$$Z = \frac{\log_2(R_i/G_i) - \mu}{\sigma^{\mathrm{local}}_{\log_2(R/G)}}$$

where µ is the mean log(ratio)

Significance: Z-scores

Normality

y = R_i/G_i, f(y) is frequency of R_i/G_i, µ ~ 1

y = log2(R_i/G_i), f(y) is frequency of log2(R_i/G_i), µ ~ 0

Standard deviation (σ)

“Slice Analysis” (Intensity-dependent Z-score)

|Z| > 2 is at the 95.5% confidence level
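A rough sketch of the slice analysis: sort spots by intensity, estimate the local mean and standard deviation of log2(R/G) in a sliding window around each spot, and report the Z-score relative to that local spread. The window size, the use of the population standard deviation, and the function name are assumptions made for illustration.

```python
import statistics

def local_z_scores(log_intensity, log_ratio, window=50):
    """Z-score of each log ratio relative to spots of similar intensity."""
    n = len(log_ratio)
    order = sorted(range(n), key=lambda i: log_intensity[i])
    half = window // 2
    z = [0.0] * n
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - half), min(n, rank + half + 1)
        vals = [log_ratio[order[k]] for k in range(lo, hi)]
        mu = statistics.mean(vals)
        sd = statistics.pstdev(vals)
        z[i] = (log_ratio[i] - mu) / sd if sd > 0 else 0.0
    return z

# Hypothetical data: small alternating noise plus one strong outlier at index 10.
intensity = [float(i) for i in range(20)]
ratio = [0.1 if i % 2 == 0 else -0.1 for i in range(20)]
ratio[10] = 3.0
z = local_z_scores(intensity, ratio, window=10)
significant = [i for i, v in enumerate(z) if abs(v) > 2]
print(significant)
```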

Problem is to estimate the variability in the data based on empirical measurement

This requires a number of self-self hybridizations to create an estimate of the inherent variability in the assay

This can be done as a function of intensity or as an estimate of the variability for individual genes

Genes failing to meet the significance criteria

Error Models

Self-self Hybridizations Estimate Variability

This is then used to construct an error model

[Figure: two R-I plots; x-axis log10(R*G), 7 to 14; y-axis log2(R/G), -3 to 3]

The error model is used to exclude nonsignificant genes

Measurements of expression vary between any two assays

This can be affected by changes in the mean expression level, but normalization can help reduce those differences

However, the variance, or spread in the data, can be quite different between replicates (or pen groups)

Variance stabilization can rescale the data for each experiment to make these more comparable

Variance stabilization/regularization

A Box Plot can show the difference in variance between replicates

Standard Deviation Regularization

Let a_ij be the raw log ratio for the jth spot in the ith block (or slide):

$$a_{ij} = \log_2\frac{Cy5}{Cy3}$$

$$a'_{ij} = a_{ij} \cdot \sqrt{\frac{\sqrt[M]{\prod_{k=1}^{M}\sum_{j=1}^{N_k}(a_{kj}-\bar{a}_k)^2}}{\sum_{j=1}^{N_i}(a_{ij}-\bar{a}_i)^2}}$$

where N_i denotes the number of genes in the ith block (or ith slide), M denotes the number of blocks (or slides), ā_i denotes the log-ratio mean of the ith block (or ith slide), and a'_ij is the scaled log ratio for the jth spot in the ith block (or slide).

MIDAS Normalization Methods
(Standard deviation regularization)

Assumption: log-ratio standard deviations within each block or slide are the same.
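Under that assumption, one way to sketch the rescaling is to multiply each block's log ratios by the geometric mean of all block standard deviations divided by the block's own standard deviation. The use of the sample standard deviation and the function name are assumptions; MIDAS may differ in detail.

```python
import math
import statistics

def sd_regularize(blocks):
    """Rescale each block (list of log ratios) to a common standard deviation."""
    sds = [statistics.stdev(b) for b in blocks]
    # Geometric mean of the block standard deviations (all must be > 0).
    geo_mean_sd = math.exp(sum(math.log(s) for s in sds) / len(sds))
    return [[a * geo_mean_sd / s for a in b] for b, s in zip(blocks, sds)]

# Hypothetical blocks with standard deviations 1 and 4.
blocks = [[-1.0, 0.0, 1.0], [-4.0, 0.0, 4.0]]
scaled = sd_regularize(blocks)
print(scaled)
```

After rescaling, both hypothetical blocks share the same spread, so their box plots would be directly comparable.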

Standard deviation regularization

Variance regularization can remove the bias

There are Limits to what you can Measure

The Limits of log-ratios: The space we explore



Microarray Overview

MAD

Collect Hybridization Data

MAD

Explore patterns of expression

Normalization and Filtering

Normalize data and reduce complexity

Assume we will compare two conditions with multiple replicates for each class

Our goal is to find genes that are significantly different between these classes

These are the genes that we will use for later data mining

Finding Significant Genes

Average Fold Change Difference for each gene
suffers from being arbitrary and not taking into account systematic variation in the data



t-test for each gene
Tests whether the means of the query and reference groups are the same
Essentially measures signal-to-noise
Calculate p-value (permutations or distributions)
May suffer from intensity-dependent effects

t = signal / noise = (difference between means) / (variability of groups) = (<Xq> - <Xc>) / SE(Xq - Xc)

$$t = \frac{\bar{X}_q - \bar{X}_c}{\sqrt{\dfrac{\sigma_q^2}{n_q} + \dfrac{\sigma_c^2}{n_c}}}$$
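A minimal sketch of this statistic for one gene, assuming the sample variance is used for each group; the function name and the example values are illustrative.

```python
import math
import statistics

def t_statistic(query, control):
    """(mean_q - mean_c) / sqrt(var_q/n_q + var_c/n_c) for one gene."""
    mean_diff = statistics.mean(query) - statistics.mean(control)
    se = math.sqrt(statistics.variance(query) / len(query)
                   + statistics.variance(control) / len(control))
    return mean_diff / se

# Hypothetical log ratios for one gene in the query and control groups.
t = t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(round(t, 4))
```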

A significant difference

Probably not

t-tests

1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.

[Figure: expression matrix, Genes 1-6 (rows) by Exp 1-6 (columns)]

2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

[Figure: the same expression matrix with its columns assigned to Group A and Group B]

T-Tests (TTEST) - 1

3. Calculate t-statistic for each gene

4. Calculate probability value of the t-statistic for each gene either from:

A. Theoretical t-distribution

OR

B. Permutation tests.

T-TEST - 2

Permutation tests

i) For each gene, compute t-statistic

ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B.

[Figure: Gene 1, original grouping vs. randomized grouping (Exp 1, Exp 4, Exp 5 | Exp 2, Exp 3, Exp 6)]

T-TEST - 3

Permutation tests - continued

iii) Compute t-statistic for the randomized gene

iv) Repeat steps i-iii n times (where n is specified by the user).

v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute value of the randomized t-statistic over n randomizations.

vi) Then, the p-value associated with the gene = 1 - (x/n)
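Steps i) through vi) can be sketched as below for a single gene. The seeded random generator, the default n, and the function names are assumptions for illustration; the t-statistic uses sample variances, so every shuffled group must contain at least two distinct values.

```python
import math
import random
import statistics

def t_statistic(a, b):
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def permutation_p_value(group_a, group_b, n=1000, seed=0):
    """p = 1 - x/n, where x counts shuffles with |t_original| > |t_randomized|."""
    rng = random.Random(seed)
    t_orig = abs(t_statistic(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    x = 0
    for _ in range(n):
        rng.shuffle(pooled)                      # step ii: reshuffle, same group sizes
        t_rand = abs(t_statistic(pooled[:len(group_a)], pooled[len(group_a):]))
        if t_orig > t_rand:                      # step v
            x += 1
    return 1.0 - x / n                           # step vi

# Hypothetical values for one gene: groups A and B are clearly separated.
p = permutation_p_value([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
print(p)
```

With only three replicates per group there are just 20 distinct regroupings, so the smallest attainable p-value is coarse; more replicates give finer resolution.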

T-TEST - 4

5. Determine whether a gene's expression levels are significantly different between the two groups by one of three methods:

A) “Just alpha” (α significance level): If the calculated p-value for a gene is less than or equal to the user-input α (critical p-value), the gene is considered significant.

OR
Use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant.

B) Standard Bonferroni correction: The user-input alpha is divided by the total number of genes to give a critical p-value that is used as above: p_critical = α/N.

T-TEST - 5

5C) Adjusted Bonferroni:

i) The t-values for all the genes are ranked in descending order.

ii) For the gene with the highest t-value, the critical p-value becomes (α/N), where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be (α/[N-1]), and so on.
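The two corrections can be compared with a short sketch. Ranking by the absolute value of t is an assumption (the text says only that t-values are ranked in descending order), and the function names and example numbers are hypothetical.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Standard Bonferroni: every gene is tested against alpha / N."""
    critical = alpha / len(p_values)
    return [p <= critical for p in p_values]

def adjusted_bonferroni_flags(t_values, p_values, alpha=0.05):
    """Adjusted Bonferroni: the gene ranked r (0-based) is tested against alpha / (N - r)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: abs(t_values[i]), reverse=True)
    flags = [False] * n
    for rank, i in enumerate(order):
        flags[i] = p_values[i] <= alpha / (n - rank)
    return flags

# Hypothetical t-values and p-values for three genes.
t_vals = [5.0, 3.0, 1.0]
p_vals = [0.01, 0.02, 0.5]
print(bonferroni_flags(p_vals))                   # critical p = 0.05/3 for all genes
print(adjusted_bonferroni_flags(t_vals, p_vals))  # critical p = 0.05/3, 0.05/2, 0.05/1
```

In this example the second gene passes only under the adjusted correction, which is less conservative for lower-ranked genes.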

T-TEST - 6

Significance Analysis of Microarrays (SAM)
Uses a modified t-test by estimating and adding a small positive constant to the denominator
Significant genes are those which exceed the expected values from permutation analysis.

SAM test Statistic

$$d_i = \frac{r_i}{s_i + s_0}$$

• d_i = Score
• s_i = Standard Deviation
• s_0 = “Safety” Factor

$$r_i = \bar{x}_{i1} - \bar{x}_{i2}$$

$$s_i = \sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\frac{\sum_{j \in C_1}(x_{ij} - \bar{x}_{i1})^2 + \sum_{j \in C_2}(x_{ij} - \bar{x}_{i2})^2}{n_1 + n_2 - 2}}$$

SAM : Variance Estimate
Gene by gene variance estimate + “safety” factor
Variance equal in the two conditions
s_0 term is here to deal with cases when variance estimates get too close to zero
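A small sketch of the SAM score for one gene, using the pooled (equal-variance) standard deviation plus the safety factor s_0. Here s_0 is simply passed in as a number; how SAM chooses it is a separate step, and the function name is an assumption.

```python
import math

def sam_d(x1, x2, s0=0.0):
    """d = (mean1 - mean2) / (s + s0), with s the pooled gene-specific scatter."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

# Hypothetical values for one gene in the two classes.
print(sam_d([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))           # s0 = 0: ordinary pooled t
print(sam_d([2.0, 4.0, 6.0], [1.0, 2.0, 3.0], s0=1.0))   # s0 > 0 shrinks the score
print(sam_d([1.0, 1.0], [1.0, 1.0], s0=0.5))             # zero variance stays finite
```

With s_0 = 0 and zero variance the score would divide by zero, which is the case the safety factor is meant to handle.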

How to choose s_0?
Test statistics are binned in 100 different groups depending on the s_i value
s_0 is chosen so that the dispersion of the test statistic does not vary from bin to bin
avoids aberrant values when variance estimates are close to 0

SAM : Hypothesis Testing

Permutation technique
Multiple testing adjustment technique
False Discovery Rate

Confidence Level: False Discovery Rate

1. Fix a threshold DELTA for differentially expressed genes

2. For each permutation, count how many genes you declare differentially expressed
NB: In a permutation you should find 0 genes.

3. Compute median number of falsely called genes in permutations

False Discovery Rate is number of falsely called genes divided by number of differentially expressed genes in original data

FDR: percentage of NON-significant genes you can expect to find in your result list
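The three steps and the FDR ratio can be sketched as follows. For simplicity this sketch calls a gene "differentially expressed" when |d| is at least delta, which is a stand-in assumption for SAM's actual observed-versus-expected rule; the function name and all d-values are hypothetical.

```python
import statistics

def estimate_fdr(observed_d, permuted_d, delta):
    """Return (FDR estimate, genes called in original data, median false calls)."""
    called = sum(1 for d in observed_d if abs(d) >= delta)
    false_counts = [sum(1 for d in perm if abs(d) >= delta) for perm in permuted_d]
    median_false = statistics.median(false_counts)
    fdr = median_false / called if called else 0.0
    return fdr, called, median_false

# Hypothetical d-values: one observed set and three permuted sets.
observed = [5.0, 4.0, 0.5, 0.2]
permuted = [[3.5, 0.1, 0.2, 0.3],
            [0.1, 0.2, 0.3, 0.4],
            [3.2, 3.1, 0.1, 0.2]]
fdr, called, median_false = estimate_fdr(observed, permuted, delta=3.0)
print(fdr, called, median_false)
```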

SAM
SAM gives estimates of the False Discovery Rate (FDR), which is the proportion of genes likely to have been wrongly identified by chance as being significant.

It is a very interactive algorithm: it allows users to dynamically change thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic.

The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, helps make the data-mining process sensitive.

1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.

[Figure: expression matrix, Genes 1-6]

2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

[Figure: expression matrix, Genes 1-6, with experiments assigned to the two groups]

SAM Two-Class - 1

Permutation tests
i) For each gene, compute d-value (analogous to t-statistic). This is the observed d-value for that gene.
ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene

[Figure: Gene 1, original grouping vs. randomized grouping (Exp 1, Exp 4, Exp 5 | Exp 2, Exp 3, Exp 6)]

SAM Two-Class - 3
iii) Repeat step (ii) many times, so that each gene has many randomized d-values. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.

iv) Plot the observed d-values vs. the expected d-values

SAM Two-Class - 4

Significant positive genes (i.e., mean expression of group B > mean expression of group A) in red

Significant negative genes (i.e., mean expression of group A > mean expression of group B) in green

“Observed d = expected d” line

The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Any gene beyond the first gene in the +ve or -ve direction on the x-axis (including the first gene), whose observed exceeds the expected by at least delta, is considered significant.

Tuning parameter “delta” limits can be dynamically changed by using the slider bar or entering a value in the text field.


For each permutation of the data, compute the number of positive and negative significant genes for a given delta as explained in the previous slide. The median number of significant genes from these permutations is the median number of falsely called genes, from which the median False Discovery Rate is estimated.

The rationale behind this is, any genes designated as significant from the randomized data are being picked up purely by chance (i.e., “falsely” discovered). Therefore, the median number picked up over many randomizations is a good estimate of the number of false discoveries.

Analysis of Variance (ANOVA)
Which genes are most significant for separating classes of samples?
Calculate p-value (permutations or distributions)
Reduces to a t-test for 2 samples
May suffer from intensity-dependent effects



One Way Analysis of Variance (ANOVA)
1. Assign experiments to > 2 groups

2. Question: Is mean expression level of a gene the same across all groups?

[Figure: expression matrix, Genes 1-7 (rows) by Ex 1-9 (columns), then regrouped as Group 1 (Ex 3, Ex 6, Ex 8), Group 2 (Ex 4, Ex 5, Ex 9), Group 3 (Ex 2, Ex 1, Ex 7)]

ANOVA - 2
3. Calculate an F-ratio for each gene, where

F = Mean square (groups) / Mean square (error),

which is a measure of

Between groups variability / Within groups variability

The larger the value of F, the greater the difference among the group means relative to the sampling error variability (which is the within groups variability).

i.e., the larger the value of F, the more likely it is that the differences among the group means reflect “real” differences among the means of the populations they are drawn from, rather than being due to random sampling error.
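The F-ratio for one gene can be sketched directly from its definition as the between-groups mean square divided by the within-groups mean square. The function name and the example values are illustrative assumptions.

```python
def f_ratio(groups):
    """One-way ANOVA F = MS(between groups) / MS(within groups) for one gene."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical expression values for one gene in three groups of three experiments.
print(f_ratio([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]))
```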

ANOVA - 3

4. The p-value associated with an F-value is the probability that an F-value that large would be obtained if there were no differences among group means (i.e., given the null hypothesis).

Therefore, the smaller the p-value, the less likely it is that the null hypothesis is valid, i.e., the differences among group means are more likely to reflect real population differences as p-values decrease in magnitude.

ANOVA - 4

5. P-values can be obtained for the F-values from a theoretical F-distribution, assuming that the populations from which the data are obtained

• are normally distributed, and

• have homogeneous variances.

The test is considered robust to violations of these assumptions, provided sample sizes are relatively large and similar across groups.

ANOVA - 5

6. P-values can be obtained from permutation tests (just like in t-tests), if one does not want to rely on the assumptions needed for using the F-distribution.

P-values can also be corrected for multiple comparisons (using Bonferroni or other procedures).

Volcano Plots
Combines p-values and fold change measures
Significant genes appear in upper corners

[Figure: volcano plot; x-axis Mean log(ratio), y-axis log10(p-value)]

Goal is to identify genes (or experiments) which have “similar” patterns of expression

This is a problem in data mining

“Clustering Algorithms” are most widely used

Types
Agglomerative: Hierarchical
Divisive: k-means, SOMs
Nonclustering: Principal Component Analysis (PCA)

All depend on how one measures distance

Multiple Experiments?

Similar expression

Crucial concept for understanding clustering

Each gene is represented by a vector where coordinates are its log(ratio) values in each experiment

x = log(ratio)expt1
y = log(ratio)expt2
z = log(ratio)expt3
etc.

Expression Vectors

[Figure: expression vector drawn on axes x, y, z]


For example, if we do six experiments,
Gene1 = (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
Gene2 = (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
Gene3 = (1.2, 0.5, 0, -0.25, -0.75, -1.4)
etc.

Expression Vectors

These gene expression vectors of log(ratio) values can be used to construct an expression matrix

Expression Matrix

         Expt 1   Expt 2   Expt 3   Expt 4   Expt 5   Expt 6
Gene1     -1.2     -0.5      0       0.25     0.75     1.4
Gene2      0.2     -0.5      1.2    -0.25    -1.0      1.5
Gene3      1.2      0.5      0      -0.25    -0.75    -1.4
etc.

This is often represented as a red/green colored matrix

The Expression Matrix is a representation of data from multiple microarray experiments.

Each element is a log ratio, usually log2(Cy5/Cy3)

Red indicates a positive log ratio, i.e., Cy5 > Cy3

Green indicates a negative log ratio, i.e., Cy5 < Cy3

Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value

Gray indicates missing data

[Figure: red/green expression matrix, Genes 1-6 (rows) by Exp 1-6 (columns)]

Expression Matrix

Expression Vectors As Points in ‘Expression Space’

[Figure: genes G1-G5 from a three-experiment matrix (Exp 1, Exp 2, Exp 3) plotted as points on axes x (Experiment 1), y (Experiment 2), z (Experiment 3); genes with similar expression lie close together]

Distances are measured “between” expression vectors

Distance metrics define the way we measure distances

Many different ways to measure distance:
Euclidean distance
Pearson correlation coefficient(s)
Manhattan distance
Mutual information
Kendall's Tau
etc.

Each has different properties and can reveal different features of the data

Distance metrics

Distance and Similarity
The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms

Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression

Selection of a distance metric defines the concept of distance for a particular experiment

Distance

Similarity

Distance: a measure of similarity between genes

Gene A:  x_1A  x_2A  x_3A  x_4A  x_5A  x_6A
Gene B:  x_1B  x_2B  x_3B  x_4B  x_5B  x_6B

Some distances: (MeV provides 11 metrics)

1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$

2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$

3. Pearson correlation
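The three measures listed here can be sketched in a few lines and applied to the six-experiment example vectors Gene1 and Gene3 given earlier. The function names are illustrative; MeV's own implementations may differ (for example, in how a correlation is converted into a distance).

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient (a similarity, from -1 to +1)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

gene1 = [-1.2, -0.5, 0.0, 0.25, 0.75, 1.4]
gene3 = [1.2, 0.5, 0.0, -0.25, -0.75, -1.4]
print(euclidean(gene1, gene3), manhattan(gene1, gene3), pearson(gene1, gene3))
```

Gene3 is the mirror image of Gene1, so the two are far apart in Euclidean and Manhattan terms while being perfectly anti-correlated, which illustrates how the choice of metric changes what "similar" means.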

Distance Is Defined by a Metric

[Figure: two pairs of expression profiles, y-axis log2(Cy5/Cy3) from -3 to 3, with the distance D for each pair under two metrics. Euclidean: 6.0 and 1.4. Pearson*: +1.00 and -0.05]

Once a distance metric has been selected, the starting point for all clustering methods is a “distance matrix”

Distance Matrix

         Gene1   Gene2   Gene3   Gene4   Gene5   Gene6
Gene1     0       1.5     1.2     0.25    0.75    1.4
Gene2     1.5     0       1.3     0.55    2.0     1.5
Gene3     1.2     1.3     0       1.3     0.75    0.3
Gene4     0.25    0.55    1.3     0       0.25    0.4
Gene5     0.75    2.0     0.75    0.25    0       1.2
Gene6     1.4     1.5     0.3     0.4     1.2     0

The elements of this matrix are the pair-wise distances. Note that the matrix is symmetric about the diagonal.

MeV: Data Mining Tools
Alexander Saeed
Alexander Sturn
Nirmal Bhagabati
John Braisted

Syntek Inc.
Datanaut, Inc.

☺ Available as OSI-certified open source

Hierarchical Clustering

1. Calculate the distance between all genes. Find the smallest distance. If several pairs share the same smallest distance, use a predetermined rule to decide between alternatives.
2. Fuse the two selected clusters to produce a new cluster that now contains at least two objects. Calculate the distance between the new cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single cluster remains.
4. Draw a tree representing the results.

[Figure: genes G1 to G6 fused step by step into clusters]
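The steps above can be sketched as follows, using the six-gene distance matrix shown earlier. The tie rule (keep the first smallest pair encountered) and the default of taking the minimum cross-cluster distance are assumptions made for this sketch; the function name is hypothetical.

```python
# Distance matrix from the earlier slide (rows/columns are Gene 1 .. Gene 6).
DIST = [
    [0.0, 1.5, 1.2, 0.25, 0.75, 1.4],
    [1.5, 0.0, 1.3, 0.55, 2.0, 1.5],
    [1.2, 1.3, 0.0, 1.3, 0.75, 0.3],
    [0.25, 0.55, 1.3, 0.0, 0.25, 0.4],
    [0.75, 2.0, 0.75, 0.25, 0.0, 1.2],
    [1.4, 1.5, 0.3, 0.4, 1.2, 0.0],
]

def hierarchical_clustering(dist, linkage=min):
    """Agglomerative clustering; returns the list of fusions in order.

    Each fusion is (left_members, right_members, distance), with genes
    given as 0-based indices. `linkage` reduces the cross-cluster
    distances to a single cluster-to-cluster distance.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Step 1: find the closest pair of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                # Predetermined tie rule: the first pair found wins.
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 2: fuse the pair into a new cluster.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    # Step 3 is the loop itself; `merges` holds what a tree would draw.
    return merges

merges = hierarchical_clustering(DIST)
```

With this matrix and tie rule, the first fusion joins Gene 1 and Gene 4 at distance 0.25, and Gene 2 is the last to join, at distance 0.55. Passing `linkage=max` switches the same loop to the maximum cross-cluster distance.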

Hierarchical Clustering

[Figure: genes g1 to g8]

• g1 is most like g8
• g4 is most like {g1, g8}
• g5 is most like g7
• {g5, g7} is most like {g1, g4, g8}

Hierarchical Tree

[Figure: the resulting hierarchical tree]

Agglomerative Linkage Methods

Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.

Three linkage methods that are commonly used are:

• Single Linkage
• Average Linkage
• Complete Linkage

Single Linkage

Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.

$$D_{AB} = \min_{i,j} \, d(u_i, v_j)$$

where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$.

[Figure: the distance D_AB between two clusters under single linkage]

Average Linkage

Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.

$$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$$

where $u_i \in A$ and $v_j \in B$.

[Figure: the distance D_AB between two clusters under average linkage]

Complete Linkage

Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.

$$D_{AB} = \max_{i,j} \, d(u_i, v_j)$$

where $u_i \in A$ and $v_j \in B$.

[Figure: the distance D_AB between two clusters under complete linkage]
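The three linkage rules can be written directly from their definitions. This is a small sketch; the function names, the one-dimensional example values, and the absolute-difference metric are chosen here only for illustration.

```python
def single_linkage(A, B, d):
    """Minimum distance between any member of A and any member of B."""
    return min(d(u, v) for u in A for v in B)

def average_linkage(A, B, d):
    """Mean of all N_A * N_B cross-cluster distances."""
    return sum(d(u, v) for u in A for v in B) / (len(A) * len(B))

def complete_linkage(A, B, d):
    """Maximum distance between any member of A and any member of B."""
    return max(d(u, v) for u in A for v in B)

def abs_diff(u, v):
    """Toy one-dimensional distance used for the example below."""
    return abs(u - v)

# Two hypothetical clusters of one-dimensional values.
A = [0.0, 1.0]
B = [3.0, 5.0]
d_single = single_linkage(A, B, abs_diff)      # cross distances are 3, 5, 2, 4
d_average = average_linkage(A, B, abs_diff)
d_complete = complete_linkage(A, B, abs_diff)
```

For these two clusters the cross distances are 3, 5, 2 and 4, so the single, average and complete linkage distances are 2.0, 3.5 and 5.0 respectively.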

Comparison of Linkage Methods

[Figure: clustering results under single, average, and complete linkage]

Bootstrapping

Bootstrapping: resampling with replacement.

[Figure: the original expression matrix (Gene 1 to Gene 6 by experiment) and various bootstrapped matrices (resampled by experiments), in which an experiment column such as Exp 4 can appear more than once]

Jackknifing

Jackknifing: resampling without replacement.

[Figure: the original expression matrix and various jackknifed matrices (by experiments), e.g., one keeping Exp 1, 3, 4, 5, 6 and another keeping Exp 1, 2, 3, 4, 6]


Analysis of bootstrapped and jackknifed support trees

• Bootstrapped or jackknifed expression matrices are created many times by randomly resampling the original expression matrix, using either the bootstrap or jackknife procedure.
• Each time, hierarchical trees are created from the resampled matrices.
• The trees are compared to the tree obtained from the original data set.
• The more frequently a given cluster from the original tree is found in the resampled trees, the stronger the support for the cluster.
• As each resampled matrix lacks some of the original data, high support for a cluster means that the clustering is not biased by a small subset of the data.
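A rough sketch of the resampling and support calculation described above, resampling by experiments (columns). The function names are hypothetical, the jackknife here leaves out a single randomly chosen experiment, and the clusterings passed to `support` are assumed to come from whatever tree-building routine is in use.

```python
import random

def bootstrap_columns(n_experiments, rng):
    """Draw experiment indices with replacement (repeats are allowed)."""
    return [rng.randrange(n_experiments) for _ in range(n_experiments)]

def jackknife_columns(n_experiments, rng):
    """Leave one randomly chosen experiment out (no column is repeated)."""
    left_out = rng.randrange(n_experiments)
    return [j for j in range(n_experiments) if j != left_out]

def resample(matrix, columns):
    """Build a resampled expression matrix from the chosen columns."""
    return [[row[j] for j in columns] for row in matrix]

def support(original_cluster, resampled_clusterings):
    """Fraction of resampled clusterings that contain the original cluster."""
    target = frozenset(original_cluster)
    hits = sum(
        1 for clusters in resampled_clusterings
        if target in {frozenset(c) for c in clusters}
    )
    return hits / len(resampled_clusterings)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
boot = bootstrap_columns(6, rng)
jack = jackknife_columns(6, rng)
```

In use, each resampled matrix would be clustered again, and `support` would be called once per cluster of the original tree.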

Self Organizing Tree Algorithm

Dopazo, J., and J.M. Carazo. Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44:226-233, 1997.

Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2):126-136, 2001.

SOTA Characteristics

• Divisive clustering, allowing high level hierarchical structure to be revealed without having to completely partition the data set down to single gene vectors
• Data set is reduced to clusters arranged in a binary tree topology
• The number of resulting clusters is not fixed before clustering
• Neural network approach which has advantages similar to SOMs, such as handling large data sets that have large amounts of ‘noise’

SOTA Topology

[Figure: a parent node (migration factor $\alpha_p$) with two offspring, the winning cell ($\alpha_w$) and the sister cell ($\alpha_s$); each cell has a centroid vector and members]

$\alpha_*$ = migration factor ($\alpha_s < \alpha_p < \alpha_w$)

Adaptation Overview

• Each gene vector associated with the parent is compared to the centroid vector of its offspring cells.
• The most similar cell’s centroid and its neighboring cells are adapted using the appropriate migration weights.
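A minimal sketch of one adaptation step, under several assumptions not stated on the slides: similarity is measured by squared Euclidean distance, each centroid moves a fraction alpha of the way toward the presented gene, the neighboring cells are the parent and the sister, and the three migration factors are illustrative values chosen only to respect the ordering sister < parent < winner.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def adapt(centroid, gene, alpha):
    """Move a centroid a fraction alpha of the way toward a gene vector."""
    return [c + alpha * (g - c) for c, g in zip(centroid, gene)]

def present_gene(gene, parent, cell_a, cell_b,
                 alpha_w=0.01, alpha_p=0.005, alpha_s=0.001):
    """Adapt the winning cell, its parent, and its sister for one gene.

    The offspring cell closer to the gene is the winner and gets alpha_w;
    the other offspring is the sister and gets alpha_s.
    Returns the updated (parent, cell_a, cell_b) centroids.
    """
    a_wins = sq_dist(gene, cell_a) <= sq_dist(gene, cell_b)
    alpha_a, alpha_b = (alpha_w, alpha_s) if a_wins else (alpha_s, alpha_w)
    return (
        adapt(parent, gene, alpha_p),
        adapt(cell_a, gene, alpha_a),
        adapt(cell_b, gene, alpha_b),
    )

# Hypothetical two-experiment example: cell_a is closer to the gene, so it wins.
parent, cell_a, cell_b = present_gene(
    gene=[1.0, 1.0], parent=[2.0, 2.0], cell_a=[0.0, 0.0], cell_b=[4.0, 4.0]
)
```

A full training epoch would call `present_gene` once for every gene associated with the parent.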

• Following the presentation of all genes to the system, a measure of system diversity is used to determine if training has found an optimal position for the offspring.
• If the system diversity improves (decreases), then another training epoch is started; otherwise training ends and a new cycle starts with a cell division.
• The most ‘diverse’ cell is selected for division at the start of the next training cycle.

Growth Termination

Expansion stops when the most diverse cell’s diversity falls below a threshold.

[Figure: tree diversity (0 to 0.2) plotted against adaptation epoch number (0 to 500)]

Each training cycle ends when the overall tree diversity ‘stabilizes’. This triggers a cell division and possibly a new training cycle.

K-Means/Medians Clustering

1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
3. Calculate mean/median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean expression profile (calculated in step 3) is the closest to that gene’s expression profile.
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.

[Figure: genes G1 to G13 assigned to clusters]

k-means is most useful when the user has an a priori hypothesis about the number of clusters the genes should belong to.
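The five steps can be sketched in a few lines. This version uses means and squared Euclidean distance, and it makes the random initial assignment by shuffling the genes and dealing them out in turn so that no cluster starts empty; those choices, the function names, and the example data are assumptions of the sketch.

```python
import random

def mean_profile(rows):
    """Column-wise mean of a list of expression profiles."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(genes, k, max_iter=100, seed=0):
    """Return (labels, centroids); requires k <= len(genes)."""
    rng = random.Random(seed)
    order = list(range(len(genes)))
    rng.shuffle(order)
    labels = [0] * len(genes)
    for position, gene_index in enumerate(order):
        labels[gene_index] = position % k        # step 2: random assignment
    centroids = [None] * k
    for _ in range(max_iter):                    # step 5: iteration limit
        for c in range(k):                       # step 3: cluster means
            members = [g for g, lab in zip(genes, labels) if lab == c]
            if members:                          # an emptied cluster keeps its old mean
                centroids[c] = mean_profile(members)
        new_labels = [                           # step 4: move genes to closest mean
            min(range(k), key=lambda c: sq_dist(g, centroids[c]))
            for g in genes
        ]
        if new_labels == labels:                 # step 5: nothing moved, stop
            break
        labels = new_labels
    return labels, centroids

# Hypothetical data: two well-separated groups of three genes each.
genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = k_means(genes, k=2)
```

For k-medians, `mean_profile` would be replaced by a column-wise median.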

Self-organizing maps (SOMs)

1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal.
2. Choose a random gene, e.g., G9.
3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The farther away the node is from N2, the less it is moved.
4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased.
5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.

[Figure: genes G1 to G29 (G = Genes) and nodes N1 to N6 (N = Nodes), shown before training, after the nodes are moved toward G9, and after training with each node nestled among a cluster of genes]

SOM Neighborhood Options

• Bubble neighborhood: only the nodes within the neighborhood radius move, and alpha is constant.
• Gaussian neighborhood: all nodes move, and alpha is scaled.

[Figure: genes G7 to G11 and nodes N1 to N6 under each neighborhood option]
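A compact sketch of SOM training with both neighborhood options. It assumes a rectangular grid, squared Euclidean distance, nodes initialized at randomly chosen genes, a linearly decaying alpha, and a fixed radius; these choices and all parameter names are assumptions of the sketch rather than settings taken from the slides.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(genes, rows, cols, iterations=2000, alpha0=0.05,
              radius=1.0, neighborhood="gaussian", seed=0):
    """Return the trained node vectors of a rows x cols rectangular SOM."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    nodes = [list(rng.choice(genes)) for _ in grid]
    for t in range(iterations):
        alpha = alpha0 * (1 - t / iterations)    # allowed movement shrinks
        gene = rng.choice(genes)                 # choose a random gene
        winner = min(range(len(nodes)), key=lambda n: sq_dist(nodes[n], gene))
        wr, wc = grid[winner]
        for n, (r, c) in enumerate(grid):
            grid_d2 = (r - wr) ** 2 + (c - wc) ** 2
            if neighborhood == "bubble":
                # Only nodes inside the radius move; alpha is constant.
                scale = 1.0 if grid_d2 <= radius ** 2 else 0.0
            else:
                # All nodes move; alpha is scaled by grid distance.
                scale = math.exp(-grid_d2 / (2 * radius ** 2))
            step = alpha * scale
            nodes[n] = [w + step * (g - w) for w, g in zip(nodes[n], gene)]
    return nodes

def assign(genes, nodes):
    """Put each gene in the cluster of its closest node."""
    return [min(range(len(nodes)), key=lambda n: sq_dist(nodes[n], g))
            for g in genes]

# Hypothetical two-experiment profiles.
genes = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0], [0.5, 0.5]]
nodes = train_som(genes, rows=2, cols=3)
clusters = assign(genes, nodes)
```

Passing `neighborhood="bubble"` switches to the constant-alpha option; the winner always moves by the full alpha in both cases because its grid distance to itself is zero.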

Gene Shaving

1. Compute the first principal component of the expression matrix.
2. Shave off α% (default 10%) of genes with the lowest values of dot product with the 1st principal component.
3. Repeat until only one gene remains. This results in a series of nested clusters.
4. Choose a cluster of appropriate size as determined by the gap statistic calculation.
5. Orthogonalize the expression matrix with respect to the average gene in the cluster and repeat the shaving procedure.

Gap statistic calculation (choosing cluster size)

• Create random permutations of the expression matrix and calculate $R^2$ for each.
• Compare $R^2$ of each cluster to that of the entire expression matrix.
• Choose the cluster whose $R^2$ is furthest from the average $R^2$ of the permuted expression matrices.

Quality measure for clusters:

$$R^2 = \frac{\text{between variance}}{\text{within variance}}$$

• between variance: variance of the mean gene across experiments
• within variance: variance of each gene about the cluster average

A large $R^2$ implies a tight cluster of coherent genes.
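As an illustration, the quality measure can be computed as below, assuming R² is read as the ratio of between variance to within variance and that both are simple averages of squared deviations. Other formulations normalize differently, so this is a sketch of one reading; the function name and example clusters are made up.

```python
def cluster_r2(cluster):
    """Between variance divided by within variance for a cluster of genes.

    `cluster` is a list of gene profiles (one value per experiment).
    """
    n_genes = len(cluster)
    n_exp = len(cluster[0])
    # The "mean gene": average expression in each experiment.
    mean_gene = [sum(col) / n_genes for col in zip(*cluster)]
    grand_mean = sum(mean_gene) / n_exp
    # Variance of the mean gene across experiments.
    between = sum((m - grand_mean) ** 2 for m in mean_gene) / n_exp
    # Variance of each gene about the cluster average.
    within = sum(
        (x - m) ** 2 for row in cluster for x, m in zip(row, mean_gene)
    ) / (n_genes * n_exp)
    if within == 0:
        return float("inf")  # identical genes: perfectly tight cluster
    return between / within

# Hypothetical clusters: two coherent genes versus two opposing genes.
tight = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9]]
loose = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
r2_tight = cluster_r2(tight)
r2_loose = cluster_r2(loose)
```

The coherent pair gives a large ratio, while the opposing pair averages to a flat mean gene and gives a ratio of zero.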

Gene Shaving

The final cluster contains a set of genes that are greatly affected by the experimental conditions in a similar way.

Principal Components Analysis

1. PCA simplifies the “views” of the data.
2. Suppose we have measurements for each gene on multiple experiments.
3. Suppose some of the experiments are correlated.
4. PCA will ignore the redundant experiments, and will take a weighted average of some of the experiments, thus possibly making the trends in the data more interpretable.
5. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
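A small sketch of the idea, finding only the first principal component by power iteration on the covariance matrix of the experiments. The data are made up so that the second experiment is exactly twice the first (a redundant, correlated experiment); the function names and the choice of power iteration are assumptions of this sketch.

```python
import math

def covariance_matrix(rows):
    """Return (covariance of the columns, column-centered rows)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    p = len(means)
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    return cov, centered

def first_component(cov, iterations=500):
    """Leading eigenvector of cov by power iteration (unit length)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical genes (rows) measured in three experiments (columns).
data = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.1],
    [3.0, 6.0, -0.1],
    [4.0, 8.0, 0.0],
]
cov, centered = covariance_matrix(data)
pc1 = first_component(cov)
# Project each gene onto the first component: one number per gene.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
```

The first component weights the second experiment twice as heavily as the first and almost ignores the third, so the two correlated experiments collapse into a single axis.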

Principal Components Analysis

[Figure: “cloud” of data points (e.g., genes) in N-dimensional space, N = # hybridizations, and the same data points resolved along 3 principal component axes x, y, z]

In this example:

• The x-axis could mean a continuum from over- to under-expression.
• The y-axis could mean that “blue” genes are over-expressed in the first five experiments and under-expressed in the remaining experiments, while “brown” genes are under-expressed in the first five experiments and over-expressed in the remaining experiments.
• The z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in odd-numbered experiments and under-expressed in even-numbered ones, whereas the opposite is true for “purple” genes.

Interpretation of components is somewhat subjective.

PCA in pictures

Principal components pick out the directions in the data that capture the greatest variability.

[Figure: data cloud on the original axes x, y, z with the new axes x', y', z' overlaid]

The “new” axes are linear combinations of the old axes, typically combinations of genes or experiments:

$$x' = a_1 x + b_1 y + c_1 z$$

Projecting the data into a lower dimensional space can help visualize relationships.

[Figure: the data projected onto the x' and y' axes]

Support Vector Machines (SVM)

• Supervised learning technique.
• Uses supplied information, such as presumptive biological relationships between a set of elements, and the expression profiles of elements to produce a binary classification of elements.

[Figure: elements in the classification and elements out of the classification]

SVM Process Overview

[Figure: flow diagram. Initial classification + data -> SVM training -> weights; weights + data -> SVM classification]

SVM Classification

SVM attempts to find an optimal separating hyperplane between members of the two initial classifications.

[Figure: separating hyperplane]

SVM Kernel Construction

The expression data can be transformed to a higher dimensional space (feature space) by applying a kernel function. This transformation can have the effect of allowing a separating hyperplane to be found.
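A toy illustration of why a kernel can help, using a degree-2 polynomial kernel on made-up two-dimensional points. The kernel form, the explicit feature map, and the XOR-style data are assumptions chosen for this sketch; no SVM training is performed.

```python
import math

def poly_kernel(x, y, degree=2):
    """Homogeneous polynomial kernel K(x, y) = (x . y) ** degree."""
    return sum(a * b for a, b in zip(x, y)) ** degree

def feature_map(x):
    """Explicit degree-2 feature space for 2-D input.

    Chosen so that feature_map(x) . feature_map(y) == poly_kernel(x, y).
    """
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

# XOR-style data: no straight line in the input plane separates the classes.
positives = [(1.0, 1.0), (-1.0, -1.0)]
negatives = [(1.0, -1.0), (-1.0, 1.0)]

# In feature space the middle coordinate alone separates them:
# it is positive for one class and negative for the other.
pos_side = [feature_map(p)[1] for p in positives]
neg_side = [feature_map(n)[1] for n in negatives]
```

Here the hyperplane "middle coordinate = 0" separates the two classes in feature space, even though no line does so in the original space. The kernel gives the same inner products as the explicit map without ever building the higher dimensional vectors.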

Practical SVM Issues

• Results depend heavily on the input parameters.
• Using a high degree kernel function risks artificial separation of the data.
• An iterative approach to increasing the kernel power is advisable.

SVM Results

Two classes are produced:

• Positive class: contains elements with expression patterns similar to those in the positive examples in the training set.
• Negative class: contains all other members of the input set.

Each of these classes has elements that fall in two groups:

• Those initially in the class (true positives and true negatives)
• Those recruited into the class (false positives and false negatives)

"A theory has only the possibility of being right or wrong. A model has a third possibility; it may be right but irrelevant."

– Manfred Eigen

The TIGR Gene Index Team

Foo Cheung, Svetlana Karamycheva, Yudan Lee, Babak Parvizi, Geo Pertea, Razvan Sultana, Jennifer Tsai, John Quackenbush, Joseph White

Funding provided by the Department of Energy and the National Science Foundation

TIGR Human/Mouse/Arabidopsis Expression Team

Emily Chen, Bryan Frank, Renee Gaspard, Jeremy Hasseman, Heenam Kim, Lara Linford, Simon Kwong, John Quackenbush, Shuibang Wang, Yonghong Wang, Ivana Yang, Yan Yu

Array Software Hit Team

Nirmal Bhagabati, John Braisted, Tracey Currier, Jerry Li, Wei Liang, John Quackenbush, Alexander I. Saeed, Vasily Sharov, Mathangi Thaiagarjian, Joseph White

A: Sue Mineo

Funding provided by the National Cancer Institute, the National Heart, Lung, Blood Institute, and the National Science Foundation

H. Lee Moffitt Center/USF

Timothy J. Yeatman, Greg Bloom

TIGR PGA Collaborators

Norman Lee, Renae Malek, Hong-Ying Wang, Truong Luu, Bobby Behbahani

TIGR Faculty, IT Group, and Staff

<[email protected]>

Acknowledgments

PGA Collaborators

Gary Churchill (TJL), Greg Evans (NHLBI), Harry Gavaras (BU), Howard Jacob (MCW), Anne Kwitek (MCW), Allan Pack (Penn), Beverly Paigen (TJL), Luanne Peters (TJL), David Schwartz (Duke)

Emeritus

Jennifer Cho (TGI), Ingeborg Holt (TGI), Feng Liang (TGI), Kristie Abernathy (mA), Sonia Dharap (mA), Julie Earle-Hughes (mA), Cheryl Gay (mA), Priti Hegde (mA), Rong Qi (mA), Erik Snesrud (mA)

TM4 Software: http://www.tigr.org/software/tm4/

Normalization working group

Web page: www.dnachip.org/mged/normalization.html
Mailing list: [email protected] (an @ebi.ac.uk address)