Microarray Data Normalization and Analysis
John Quackenbush
CAMDA
12 November 2003
Microarray Analysis at TIGR
Step 1: Experimental Design
Step 2: Data Collection
Step 3: Data Analysis
Step 4: Consulting with the ArraySW gang in the trailer
Step 5: Sharing data with our collaborators
TM4 Resources
TM4 Reference
Saeed, A.I., Sharov, V., White, J., Li, J., Liang, W., Bhagabati, N., Braisted, J., et al. 2003. TM4: A Free, Open-Source System for Microarray Data Management and Analysis. BioTechniques 34: 374-378
CD Contents
• All TM4 Applications
• User Manuals
• Supplementary Documentation
• Sample Data Sets
TM4 Website
• Application Downloads
• Documentation and FAQs
• And Much More!
http://www.tigr.org/software/tm4/
Microarray Data Flow
[Figure: data-flow diagram. Components: Printer, Scanner, .tiff Image File, Image Analysis, Raw Gene Expression Data, Gene Annotation, Normalization / Filtering, Normalized Data with Gene Annotation, Expression Analysis, Interpretation of Analysis Results. Databases: MAD, AGED, Others…]
MAD Database Schema
MADAM
TIGR Spotfinder
MIDAS
MeV
The Experimental Design
The Experimental Design dictates a good deal of what you can do with the data
Good normalization and processing reflects the experimental design
The design also facilitates certain comparisons between samples and provides the statistical power you need for assigning confidence limits to individual measurements
The design must reflect experimental reality
The most straightforward designs compare expression in two classes of samples to look for patterns that distinguish them.
Sample Pairing for Co-Hybridization Experiments
Direct Comparison with Dye Swap:
[Figure: pairing diagram for samples A1, B1, A2, B2]
• RNA sample is not limiting (e.g. plenty of sample)
• Flip dyes account for any gene-dye effects
Balanced Block Design:
[Figure: pairing diagram for samples A3, B3, A4, B4]
• RNA sample is limiting
• Balanced blocking accounts for any gene-dye effects
Multiple Sample Pairings
Reference Design (Indirect Comparison):
[Figure: samples A, B, C, D, each paired with reference R]
Loop Design:
[Figure: samples A, B, C, D, E, F paired in a loop]
• More than two samples are compared (e.g. tumor classification, time course)
• Flip dyes are not necessary but can be done to increase precision
• Ratio values are inferred (indirect)
• Suited for cluster analysis (need common reference)
Why perform flip-dye experiments?
Microarray experiments employing a single direction design:
[Figure: pairings Q1/C1, Q2/C2, Q3/C3]
Flip-dye or dye-reversal design:
[Figure: pairings Q1/C1, Q2/C2, Q3/C3]
Flip-dye experiments account for gene-dye effects
Loops and Reference Designs
S. Wang, K. Kerr, J. Quackenbush, G. Churchill
[Figure: four design panels over samples A to E and reference R. Panel labels: standard flip-dye expt (10 hybs); proposed loop expt (10 hybs); proposed loop expt with reference, to provide direct comparison with reference; proposed loop expt with reference, to show invariance w.r.t. order. Further labels: 0 new hybs, 3 new hybs. Total: 23 hybs.]
Both approaches can give equivalent results
Loop vs. Reference Designs
Loop design
• Can provide direct measurements
• Give more data on each experimental sample with the same number of hybs
• Require more RNA per sample
• Can "unwind" with a bad sample or for a gene with bad data
Reference design
• Easily extensible
• Simple interpretation of all results
• Requires less RNA per sample
• Less sensitive to bad RNA samples and bad array elements
Experimental Design
Keep it simple!
[Figure: alternative pairings of samples A1 to A4 with samples B1 to B4]
One Possible Experimental Paradigm: Examining Genotype, Phenotype, and Environment
[Figure: design grid with axes Genotype and Environment. Samples: Parental - stressed, Parental - unstressed, Derived - stressed, Derived - unstressed. Further labels: Reference Sample, Assay Variation.]
Basic Design Principles
Biological replicas are more informative than correlated replicas (independent RNA, independent slides)
More replicas are better: higher statistical power
For loops, hybridizations of individual samples should be "balanced" (as many Cy3 as Cy5 labelings)
Self-self hybs add data on reproducibility and can be used to produce error models
At a minimum, should use dye swap replicates to compensate for any dye biases in labeling or detection
How Many Replicates?
$$n = \frac{4\,(z_{\alpha/2} + z_{\beta})^2}{\left(\delta / (1.4\,\sigma)\right)^2}$$
where $z_{\alpha/2}$ and $z_{\beta}$ are normal percentile values at significance level $\alpha$ and false negative rate $\beta$; $\delta$ represents the minimum detectable log2 ratio; and $\sigma$ represents the SD of log ratio values.
For $\alpha$ = 0.001 and $\beta$ = 0.05, $z_{\alpha/2}$ = -3.29 and $z_{\beta}$ = -1.65.
Assume $\delta$ = 1.0 (2-fold change) and $\sigma$ = 0.25,
Therefore n = 12 samples (6 query and 6 control).
(Simon et al., Genetic Epidemiology 23: 21-36, 2002)
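A minimal sketch of the sample-size calculation on this slide, in Python. The function name `replicates_needed` is illustrative and not from the original talk. The z values are taken from the standard library's `NormalDist` instead of a table, so they carry more digits than the rounded -3.29 and -1.65 quoted on the slide.

```python
import math
from statistics import NormalDist


def replicates_needed(alpha, beta, delta, sigma):
    """Total number of samples n from the slide's formula.

    alpha: significance level, beta: false negative rate,
    delta: minimum detectable log2 ratio, sigma: SD of log ratio values.
    """
    z_alpha = NormalDist().inv_cdf(alpha / 2)  # about -3.29 for alpha = 0.001
    z_beta = NormalDist().inv_cdf(beta)        # about -1.65 for beta = 0.05
    return 4 * (z_alpha + z_beta) ** 2 / (delta / (1.4 * sigma)) ** 2


# Values used on the slide: alpha = 0.001, beta = 0.05, delta = 1.0, sigma = 0.25
n_exact = replicates_needed(0.001, 0.05, 1.0, 0.25)
n_samples = math.ceil(n_exact)  # round up to whole samples
print(n_exact, n_samples)
```

Rounding up gives the 12 samples (6 query and 6 control) stated on the slide.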
Some Concepts from Statistics
Probability distributions
The probability of an event is the likelihood of its occurring.
It is sometimes computed as a relative frequency (rf), where
$$rf = \frac{\text{the number of "favorable" outcomes for an event}}{\text{the total number of possible outcomes for that event}}$$
The probability of an event can sometimes be inferred from a "theoretical" probability distribution, such as a normal distribution.
Normal distribution
[Figure: normal curve. X = µ (mean of the distribution); σ = standard deviation of the distribution]
[Figure: distributions of Population 1 (Mean 1) and Population 2 (Mean 2), with sample mean "s" marked]
Less than a 5% chance that the sample with mean s came from Population 1
s is significantly different from Mean 1 at the p < 0.05 significance level.
But we cannot reject the hypothesis that the sample came from Population 2
Probability and Expression Data
Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution.
But expression measurements? Probably not.
Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests.
Randomization / resampling based tests can be used to get around the violation of the normality assumption.
Even when parametric statistical tests (the ones that make use of normal and other distributions) are valid, randomization tests are still useful.
Outline of a randomization test - 1
1. Compute the value of interest (i.e., the test-statistic s) from your data set.
[Figure: original data set, yielding s]
2. Make "fake" data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion. Re-compute s from the "fake" data set.
[Figure: randomized "fake" data sets, each yielding a "fake" s]
Outline of a randomization test - 2
3. Repeat step 2 many times (often several hundred to several thousand times) and record the "fake" s values from step 2
4. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized ("fake") s values
[Figure: range of randomized s values. The original s value could be significant as it exceeds most of the randomized s values]
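The four steps can be sketched in Python as follows. This is one possible instance, not the talk's own implementation: it assumes the test statistic s is the difference in mean log ratio between two groups, and that "fake" data sets are made by reshuffling the pooled values. All names and sample values are hypothetical.

```python
import random


def mean_diff(a, b):
    """Test statistic s: difference between the two group means."""
    return sum(a) / len(a) - sum(b) / len(b)


def randomization_test(a, b, n_perm=2000, seed=0):
    rng = random.Random(seed)
    s_obs = mean_diff(a, b)                      # step 1: s from the real data
    pooled = list(a) + list(b)
    n_extreme = 0
    for _ in range(n_perm):                      # step 3: repeat many times
        rng.shuffle(pooled)                      # step 2: make a "fake" data set
        s_fake = mean_diff(pooled[:len(a)], pooled[len(a):])
        if abs(s_fake) >= abs(s_obs):
            n_extreme += 1
    # step 4: fraction of "fake" s values at least as extreme as the original;
    # the +1 counts the original arrangement itself, so p is never exactly 0
    p_value = (n_extreme + 1) / (n_perm + 1)
    return s_obs, p_value


# Hypothetical log2 ratios for one gene in two classes of samples
group_a = [2.1, 1.9, 2.3, 2.0, 2.2, 1.8]
group_b = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
s_obs, p_value = randomization_test(group_a, group_b)
print(s_obs, p_value)
```

Here the original s exceeds nearly all of the randomized s values, so the resulting p-value is small.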
Outline of a randomization test - 3
Rationale
Ideally, we want to know the "behavior" of the larger population from which the sample is drawn, in order to make statistical inferences.
Here, we don't know that the larger population "behaves" like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand.
Our "fake" data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample)
The problem of multiple testing
(adapted from presentation by Anja von Heydebreck, Max-Planck-Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany, http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)
Let's imagine there are 10,000 genes on a chip, and
none of them is differentially expressed.
Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.
The problem of multiple testing - 2
Let's say that applying this test to gene "G1" yields a p-value of p = 0.01
Remember that a p-value of 0.01 means that a gene that is not differentially expressed would give a result at least this extreme only 1% of the time, i.e.,
Even though we conclude that the gene is differentially expressed (because p < 0.05), there is still a small chance that our conclusion is wrong.
We might be willing to live with such a low probability of being wrong
BUT .....
The problem of multiple testing - 3
We are testing 10,000 genes, not just one!!!
Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to "live with" a p-value of 0.05
If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn't sound too good.
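The "about 500 genes" figure can be checked with a small simulation. The sketch below makes assumptions that are not in the talk: six replicates per class, expression values drawn from a standard normal distribution for every gene (so no gene is truly differentially expressed), and a two-sided z-test with known unit variance. The function name is hypothetical.

```python
import random
from statistics import NormalDist


def count_false_positives(n_genes=10000, n_rep=6, alpha=0.05, seed=1):
    """Count genes called 'differentially expressed' when none really is."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    std_err = (2.0 / n_rep) ** 0.5   # SE of a difference of two means, sigma = 1
    hits = 0
    for _ in range(n_genes):
        query = [rng.gauss(0.0, 1.0) for _ in range(n_rep)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_rep)]
        z = (sum(query) / n_rep - sum(control) / n_rep) / std_err
        p = 2.0 * (1.0 - std_normal.cdf(abs(z)))   # two-sided p-value
        if p < alpha:
            hits += 1
    return hits


false_positives = count_false_positives()
print(false_positives)   # expected to land near 0.05 * 10000 = 500
```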
The problem of multiple testing - 4
There are "tricks" we can use to reduce the severity of this problem.
They all involve "slashing" the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value.
We'll go into some of these techniques later.
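As one illustration of "slashing" the per-gene p-value, the sketch below uses the Bonferroni correction, which divides the critical p-value for the whole data set by the number of tests. This technique is chosen here only as a well-known example; the slide does not say which corrections the talk goes on to cover. The function names and p-values are hypothetical.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-gene critical p-value so the whole data set is tested at alpha."""
    return alpha / n_tests


def significant_genes(p_values, alpha=0.05):
    """Indices of genes whose p-value passes the Bonferroni-adjusted cutoff."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < cutoff]


# With 10,000 genes, each gene is evaluated at 0.05 / 10000 = 0.000005
per_gene_cutoff = bonferroni_cutoff(0.05, 10000)

# Four hypothetical genes: cutoff is 0.05 / 4 = 0.0125
passing = significant_genes([1e-7, 0.01, 0.04, 0.2])
print(per_gene_cutoff, passing)
```

Bonferroni is conservative: it controls the chance of even one false positive, at the cost of missing real but modest changes.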
The problem of multiple testing - 5
Don't get too hung up on p-values.
Ultimately, what matters is biological relevance.
P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance.
Statistical significance is not necessarily the same as biological significance.
You don't want to belong to "that group of people whose aim in life is to be wrong 5% of the time"!!! *
* Kempthorne, O., and T.E. Doerfler. 1969. The behaviour of some significance tests under experimental randomization. Biometrika 56: 231-248, as cited in Manly, B.F.J. 1997. Randomization, bootstrap and Monte Carlo methods in biology: pg. 1. Chapman and Hall / CRC
Pearson correlation coefficient - r
Indicates the degree to which a linear relationship can be approximated between two variables.
Can range from (-1.0) to (+1.0).
Positive r between two variables x and y: as x increases, so does y on the whole.
[Figure: scatter plot of y against x]
Negative r: as x increases, y generally decreases.
[Figure: scatter plot of y against x]
The higher the magnitude of r (in the positive or negative direction), the more linear the relationship.
Pearson correlation coefficient - 2
Sometimes, a p-value is associated with the correlation coefficient r.
This p-value is computed from the "theoretical" distribution of the correlation coefficient, similar to the normal distribution.
[Figure: distribution of the correlation coefficient, centered on population correlation coefficient = 0, with the sample correlation coefficient r marked. p < 0.05 range, i.e., reject the null hypothesis that the variables are not correlated, since the sample correlation coefficient is in the rejection range of the correlation coefficient distribution that has a mean = 0]
This is the p-value for the null hypothesis that the x and y data for our sample come from a population in which their correlation is zero, i.e., the null hypothesis is that there is no linear relationship between x and y.
If p is sufficiently small (often p < 0.05), we can reject the null hypothesis, i.e., we conclude that there is indeed a linear relationship between x and y.
Pearson correlation coefficient - r
The square of the Pearson correlation, $r^2$, also known as the "coefficient of determination," is a measure of the "strength" of the linear relationship between x and y.
It is the proportion of the total variation in x and y that is explained by a linear relationship.
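A short sketch of how r and its square can be computed, using the standard definition of the Pearson coefficient (covariance divided by the product of the standard deviations). The function name and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])      # perfectly linear, increasing
r_down = pearson_r(x, [10, 8, 6, 4, 2])    # perfectly linear, decreasing
r_noisy = pearson_r(x, [1, 3, 2, 5, 4])    # increasing on the whole
r_squared = r_noisy ** 2                   # coefficient of determination
print(r_up, r_down, r_noisy, r_squared)
```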
Dealing with Data
Before any pattern analysis can be done, one must first normalize and filter the data.
Normalization facilitates comparisons between datasets.
Filtering transformations can eliminate questionable data and reduce complexity.
Expression Elements
Samples: Exp., Ref.
Dyes: Cy5, Cy3
Images
Measured Intensities: 216108, 77103
Ratio: 2.798
Log2 Ratio: 1.485
False Color Representation (scale: -3.0, 0, 3.0)
Ratio vs. log-ratio
Let $A_i$: Red intensity, $B_i$: Green intensity
$$R_i = \frac{A_i}{B_i}$$
$$\log_2 R_i = \log_2 \frac{A_i}{B_i}$$
Gene1: $R_1$ = 4, $\log_2 R_1$ = 2
Gene2: $R_2$ = 1/4, $\log_2 R_2$ = -2
[Figure: R plotted against A*B, with Gene1 and Gene2 marked]
[Figure: log2 R plotted against log2(A*B), with Gene1 at 2 and Gene2 at -2]
Advantages of log transformation:
Treat up-regulated and down-regulated genes symmetrically!
Transfer multiplication operations to addition operations! Because:
$$\log_2 \frac{A}{B} = \log_2 A - \log_2 B$$
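The symmetry argument can be verified numerically. The sketch below uses the Gene1 and Gene2 ratios from the slide; the function name and the intensity values chosen to produce those ratios are hypothetical.

```python
import math


def log2_ratio(red, green):
    """log2 of the red/green intensity ratio for one spot."""
    return math.log2(red / green)


# Gene1 is 4-fold up, Gene2 is 4-fold down (intensities are made up)
gene1 = log2_ratio(4000.0, 1000.0)
gene2 = log2_ratio(1000.0, 4000.0)
print(gene1, gene2)   # equal magnitude, opposite sign

# Division of intensities becomes subtraction of logs
lhs = math.log2(4000.0 / 1000.0)
rhs = math.log2(4000.0) - math.log2(1000.0)
print(math.isclose(lhs, rhs))
```

On the raw ratio scale the two genes sit at 4 and 0.25, which are not equidistant from 1; on the log2 scale they sit at +2 and -2.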
Expression Vectors
Gene Expression Vectors represent the expression of a gene over a set of experimental conditions or sample types.
log2(cy5/cy3): -0.8, 0.8, 1.5, 1.8, 0.5, -1.3, -0.4, 1.5
Why Normalize Data?
Goal is to measure ratios of gene expression levels
$(\text{ratio})_i = R_i/G_i$
where $R_i$ and $G_i$ are the measured intensities of the two channels for the ith spot.
In a self-self hybridization, we would expect all ratios to be equal to one:
$R_i/G_i = 1$ for all i. But they may not be.
Why not?
• Unequal labeling efficiencies for Cy3/Cy5
• Noise in the system
• Differential expression
Normalization brings (appropriate) ratios back to one.
The Starting Point: The Ratio
[Figure: Ratio Histogram. x-axis: Ratio (0 to 2); y-axis: Frequency (0 to 5000)]
[Figure: Log(ratio) Histogram. x-axis: Log(ratio) (-2 to 2); y-axis: Frequency (0 to 3000)]
Log2(ratio) measures treat up- and down-regulated genes equally
log2(1) = 0, log2(2) = 1, log2(1/2) = -1
Normalization Approaches
A variety exist:
• Total Intensity
• Linear Regression
• Ratio statistics described by Chen, Dougherty, & Bittner, J. Biomed. Optics (1997) 2(4) 364-374
• Iterative log(ratio) mean centering
• Lowess Correction
• And others
Any of these using:
• Entire Data Set
• User-defined Data Set/Controls
Normalization Approaches
Using the Entire Data Set
• Probe Quantification less important
• No assumption on which genes constitute "housekeeping" set
• Uses all the data
• No independent confirmation
User-defined Data Set/Controls
• Requires definition of "housekeeping" set or good added controls
• Requires good RNA quantitation
• Ignores much data
The Solution(?)
The best technique is experiment dependent
A good approach is to use a combination of techniques
All analysis methods depend on an intelligent experimental design
Resource: A. thaliana DNA Clones for Spiking
• chlorophyll a/b binding protein (Cab)
• RUBISCO activase (RCA)
• ribulose-1,5-bisphosphate carboxylase/oxygenase (RbcL)
• lipid transfer protein 4 (LTP4)
• lipid transfer protein 6 (LTP6)
• papain-type cysteine endopeptidase (XCP2)
• root cap 1 (RCP1)
• NAC1
• triosephosphate isomerase (TIM)
• ribulose-5-phosphate kinase (PRKase)
[Figure: pSP64 Poly(A) Vector (3030 bp) map. Labels: M13R, SP6 Promoter, SP6 Transcription Start, and restriction sites HindIII, PstI, SalI, AccI, HincII, XbaI, BamHI, AvaI, SmaI, SacI, EcoRI]
5’…ATTTA GGTGA CACTA TAGAA TACAA GCTTG GGCTG CAGGT CGACT CTAGA
GGATC CCCGG GCGAG CTCCC AAAAA AAAAA AAAAA AAAAA AAAAA AAAAA CCGAA TTC…3’
Clone set available at: http://pga.tigr.org
Resource: B. subtilis DNA Clones for Spiking
• pGIBS-lys ATCC 87482
• pGIBS-phe ATCC 87483
• pGIBS-thr ATCC 87484
• pGIBS-trp ATCC 87485
• pGIBS-dap ATCC 87486
Artificial polyA added to the 3' end
[Figure: pBluescript II KS+ (3.0 kb) vector map. Labels: f1(+) origin, lacZ, SacI, MCS, KpnI, P lac, pUC origin, ampicillin]
Clone set available at: http://www.atcc.org
Normalization Approaches: Total Intensity
Conceptually, this is the simplest approach
Assumption: Total RNA (mass) used is same for both samples.
So, averaged across thousands of genes, total hybridization should be the same for both samples
Normalization Factor:
$$N = \frac{\sum_{k=1}^{N_{array}} R_k}{\sum_{k=1}^{N_{array}} G_k}$$
Normalization:
$$G'_k = N\,G_k \quad \text{and} \quad R'_k = R_k$$
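A minimal sketch of total intensity normalization as defined on this slide: the green channel is rescaled by the ratio of summed intensities and the red channel is left unchanged. The function name and the intensity values are hypothetical.

```python
def total_intensity_normalize(red, green):
    """Return (N, R', G') with N = sum(R) / sum(G), R' = R and G' = N * G."""
    factor = sum(red) / sum(green)
    green_scaled = [factor * g for g in green]
    return factor, list(red), green_scaled


# Hypothetical array where the red channel is uniformly twice as bright
red = [200.0, 400.0, 600.0, 800.0]
green = [100.0, 200.0, 300.0, 400.0]
factor, red_norm, green_norm = total_intensity_normalize(red, green)
ratios = [r / g for r, g in zip(red_norm, green_norm)]
print(factor, ratios)   # after normalization the ratios are back to one
```

A single global factor like this cannot remove intensity-dependent structure, which is what motivates the R-I plot and LOWESS later in the talk.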
Before and After Normalization
The Starting Point: The R-I Plot
Data exhibits an intensity-dependent structure
Uncertainty in measurements is greater at lower intensities
Uncertainty in ratio measurements generally greater at lower intensities
Plot log2(R/G) vs. log2(R*G) [variation: Terry Speed's M-A plot with (½)*log2(R*G)]
[Figure: R-I plot labeled "Good Data"]
[Figure: R-I plot labeled "Bad Data from Parts Unknown" (Gary Churchill). Each "pen group" is colored differently]
Lowess Normalization
Why LOWESS?
[Figure: R-I plot, panel A, SD = 0.346. x-axis: log(Cy3*Cy5) (7 to 14); y-axis range: -3 to 3]
Observations
1. Intensity-dependent structure
2. Data not mean centered at log2(ratio) = 0
LOWESS (Cont'd)
Local linear regression model:
$$\hat{y}_i = x_i \cdot \beta$$
Tri-cube weight function: $w(x_i)$
Least Squares:
$$\frac{\partial}{\partial \beta} \sum_i w(x_i)\,(y_i - x_i \cdot \beta)^2 = 0$$
$$\hat{B} = (X'WX)^{-1} X'WY$$
Estimated values of log2(Cy5/Cy3) as function of log2(Cy3*Cy5)
[Figure: R-I plot, panel A, SD = 0.346. x-axis: log(Cy3*Cy5) (7 to 14); y-axis range: -3 to 3]
[Figure: LOWESS Results]
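A compact pure-Python sketch of LOWESS normalization, under several assumptions that are not stated on the slides: the tri-cube weight is taken in its usual form (1 - |u|^3)^3, the local window is set by the distance to the k-th nearest neighbour with k = frac * n, the local model includes an intercept, and no robustness iterations are performed. All function names and the `frac` default are illustrative. It is quadratic in the number of spots, so it is meant for reading rather than for full arrays.

```python
import math


def tricube(u):
    """Tri-cube weight: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def lowess_fit(x, y, frac=0.4):
    """Locally weighted linear fit of y on x, evaluated at every x."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))   # neighbours per local fit
    fitted = []
    for x0 in x:
        h = sorted(abs(xi - x0) for xi in x)[k - 1]   # local bandwidth
        if h == 0.0:
            h = 1.0
        w = [tricube((xi - x0) / h) for xi in x]
        # Weighted least squares for y = b0 + b1 * (x - x0); b0 is the fit at x0
        sw = sum(w)
        swx = sum(wi * (xi - x0) for wi, xi in zip(w, x))
        swxx = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
        swy = sum(wi * yi for wi, yi in zip(w, y))
        swxy = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, x, y))
        det = sw * swxx - swx ** 2
        if abs(det) < 1e-12:
            fitted.append(swy / sw)           # degenerate window: weighted mean
        else:
            fitted.append((swxx * swy - swx * swxy) / det)
    return fitted


def lowess_normalize(red, green, frac=0.4):
    """Subtract the fitted trend of log2(R/G) against log2(R*G)."""
    a = [math.log2(r * g) for r, g in zip(red, green)]
    m = [math.log2(r / g) for r, g in zip(red, green)]
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# Synthetic array whose log ratio drifts linearly with intensity and has no
# real differential expression, so normalized log ratios should be ~0
log_g = [4.0 + 0.05 * i for i in range(60)]
green = [2.0 ** g for g in log_g]
red = [2.0 ** (1.5 * g - 1.25) for g in log_g]
normalized = lowess_normalize(red, green)
print(max(abs(v) for v in normalized))
```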
Replicates: Filtering Questionable Data
In every array, there are questionable (or bad) data for some elements
Replicates can help identify those elements
We can use an unbiased filter to eliminate those from future consideration
Replicates: Applied to Filtering Data
Consider two replicates with dyes swapped:
$$\frac{A_1}{B_1} \quad \text{and} \quad \frac{B_2}{A_2}$$
We expect to see
$$\frac{A_1}{B_1} \cdot \frac{B_2}{A_2} = 1$$
We can calculate $\sigma_R$ and eliminate spots with the greatest uncertainty: $\sigma_R > 2$
The Effects of flip-dye replicate trim
Red data are eliminated as inconsistent
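The flip-dye consistency check can be sketched as follows. The idea from the slides is that (A1/B1) * (B2/A2) should equal 1, so its log2 should be near 0. Reading the cutoff as "more than two standard deviations from the mean" is an assumption, as are the function and parameter names.

```python
import math
import statistics

def flip_dye_filter(a1, b1, b2, a2, n_sd=2.0):
    """Return (log2 products, keep flags) for dye-swap replicate pairs.

    a1/b1 are the two channels of the first slide, b2/a2 those of the
    dye-swapped slide. A consistent spot has log2((a1/b1)*(b2/a2)) near 0.
    """
    prod = [math.log2((x1 / y1) * (y2 / x2))
            for x1, y1, y2, x2 in zip(a1, b1, b2, a2)]
    mu = statistics.mean(prod)
    sd = statistics.stdev(prod)
    # Keep spots whose product lies within n_sd standard deviations (assumed rule).
    keep = [abs(p - mu) <= n_sd * sd for p in prod]
    return prod, keep
```

Spots flagged `False` correspond to the red points that the slide describes as eliminated.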
The uncertainty in measurements increases as intensity decreases
Measurements close to the detection limit are the most uncertain
Fold-change measurements ignore these effects
We can calculate an intensity-dependent Z-score that measures the ratio relative to the standard deviation in the data:
$$Z = \frac{\log_2(R_i/G_i) - \mu}{\sigma^{\mathrm{local}}_{\log_2(R/G)}}$$
where µ is the mean log(ratio)
Significance: Z-scores
Normality
y = R_i/G_i, f(y) is frequency of R_i/G_i, µ ~ 1
y = log2(R_i/G_i), f(y) is frequency of log2(R_i/G_i), µ ~ 0
Standard deviation (σ)
“Slice Analysis” (Intensity-dependent Z-score)
Z > 2 is at the 95.5% confidence level
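A rough sketch of the "slice analysis" idea: sort spots by intensity, then compute each spot's Z-score against the mean and standard deviation of its neighbours in a sliding window. The window size, the use of log10(R*G) as the intensity axis, and the function name are assumptions for illustration only.

```python
import math
import statistics

def slice_z_scores(r, g, window=50):
    """Intensity-dependent Z-score of log2(R/G) within a sliding intensity window."""
    log_ratio = [math.log2(ri / gi) for ri, gi in zip(r, g)]
    log_int = [math.log10(ri * gi) for ri, gi in zip(r, g)]
    order = sorted(range(len(r)), key=lambda i: log_int[i])
    z = [0.0] * len(r)
    half = window // 2
    for rank, i in enumerate(order):
        # Neighbours of spot i along the intensity axis.
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        local = [log_ratio[j] for j in order[lo:hi]]
        mu = statistics.mean(local)
        sd = statistics.pstdev(local)
        z[i] = (log_ratio[i] - mu) / sd if sd > 0 else 0.0
    return z
```

Spots with |Z| > 2 would then be the candidates for differential expression at roughly the 95.5% level quoted above.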
Problem is to estimate the variability in the data based on empirical measurement
This requires a number of self-self hybridizations to create an estimate of the inherent variability in the assay
This can be done as a function of intensity or as an estimate of the variability for individual genes
Genes failing to meet the significance criteria
Error Models
Self-self Hybridizations Estimate Variability
This is then used to construct an error model
[Figure: two R-I plots, log2(R/G) vs. log10(R*G); x-axis 7 to 14, y-axis -3 to 3]
The error model is used to exclude nonsignificant genes
Measurements of expression vary between any two assays
This can be affected by changes in the mean expression level, but normalization can help reduce those differences
However, the variance, or spread in the data, can be quite different between replicates (or pen groups)
Variance stabilization can rescale the data for each experiment to make these more comparable
Variance stabilization/regularization
A Box Plot can show the difference in variance between replicates
Standard Deviation Regularization
Let a_ij be the raw log ratio for the jth spot in the ith block (or slide):
$$a_{ij} = \log_2\left(\frac{Cy5}{Cy3}\right)$$
Let a'_ij be the scaled log ratio for the jth spot in the ith block (or slide):
$$a'_{ij} = a_{ij} \cdot \frac{\left[\prod_{k=1}^{M} \sqrt{\sum_{j=1}^{N_k} (a_{kj} - \bar{a}_k)^2}\right]^{1/M}}{\sqrt{\sum_{j=1}^{N_i} (a_{ij} - \bar{a}_i)^2}}$$
where N_i denotes the number of genes in the ith block (or ith slide), M denotes the number of blocks or slides, and ā_i denotes the log ratio mean of the ith block (or ith slide)
MIDAS Normalization Methods (Standard deviation regularization)
Assumption: log-ratio standard deviations within each block or slide are the same.
Standard deviation regularization
Variance regularization can remove the bias
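A minimal sketch of standard deviation regularization: each block's log ratios are rescaled so that every block ends up with the same standard deviation, namely the geometric mean of the original block standard deviations. The function name is hypothetical, and using the sample standard deviation (dividing by N - 1) is an assumption; this is not taken from the MIDAS source code.

```python
import math

def sd_regularize(blocks):
    """Rescale log ratios so all blocks (or slides) share one standard deviation.

    blocks: list of lists, one inner list of log ratios per block or slide.
    """
    sds = []
    for block in blocks:
        mean = sum(block) / len(block)
        var = sum((a - mean) ** 2 for a in block) / (len(block) - 1)
        sds.append(math.sqrt(var))
    # Geometric mean of the block standard deviations.
    geo = math.exp(sum(math.log(s) for s in sds) / len(sds))
    return [[a * geo / s for a in block] for block, s in zip(blocks, sds)]
```

For two blocks with standard deviations 1 and 2, both are rescaled to the geometric mean, sqrt(2).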
There are Limits to what you can Measure
The Limits of log-ratios: The space we explore
Microarray Overview
MAD
Collect Hybridization Data
MAD
Explore patterns of expression
Normalization and Filtering
Normalize data and reduce complexity
Assume we will compare two conditions with multiple replicates for each class
Our goal is to find genes that are significantly different between these classes
These are the genes that we will use for later data mining
Finding Significant Genes
Average Fold Change Difference for each gene
suffers from being arbitrary and not taking into account systematic variation in the data
Finding Significant Genes
???
t-test for each gene
Tests whether the means of the query and reference groups are the same
Essentially measures signal-to-noise
Calculate p-value (permutations or distributions)
May suffer from intensity-dependent effects
Finding Significant Genes
t = signal / noise = (difference between means) / (variability of groups) = (<Xq> – <Xc>) / SE(Xq - Xc)
$$t = \frac{\bar{X}_q - \bar{X}_c}{\sqrt{\frac{\sigma_q^2}{n_q} + \frac{\sigma_c^2}{n_c}}}$$
A significant difference
Probably not
t-tests
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.
[Figure: expression matrix, Gene 1 to Gene 6 (rows) by Exp 1 to Exp 6 (columns)]
2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?
[Figure: the same matrix with columns arranged into Group A and Group B]
T-Tests (TTEST) - 1
3. Calculate t-statistic for each gene
4. Calculate probability value of the t-statistic for each gene either from:
A. Theoretical t-distribution
OR
B. Permutation tests.
T-TEST - 2
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B.
[Figure: Gene 1 values under the original grouping into Group A and Group B]
[Figure: Gene 1 values under a randomized grouping; columns in the order Exp 1, Exp 4, Exp 5, Exp 2, Exp 3, Exp 6]
T-TEST - 3
Permutation tests - continued
iii) Compute t-statistic for the randomized gene
iv) Repeat steps i-iii n times (where n is specified by the user).
v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute values of the randomized t-statistic over n randomizations.
vi) Then, the p-value associated with the gene = 1 – (x/n)
T-TEST - 4
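Steps i to vi of the permutation test can be sketched for a single gene as below. The t-statistic follows the unequal-variance form shown earlier; the function names, the `seed` argument, and the zero-variance guard are assumptions added so the example is reproducible and does not divide by zero.

```python
import math
import random

def t_statistic(a, b):
    """t = (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)
    if se == 0.0:
        # Guard for constant groups (an assumption, not from the slides).
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / se

def permutation_p_value(a, b, n_perm=1000, seed=0):
    """p = 1 - x/n, where x counts permutations with |t_original| > |t_randomized|."""
    rng = random.Random(seed)
    t_obs = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    x = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reshuffle values, keeping group sizes fixed
        t_rand = abs(t_statistic(pooled[:len(a)], pooled[len(a):]))
        if t_obs > t_rand:
            x += 1
    return 1.0 - x / n_perm
```

With three experiments per group, as in the example matrix, there are only 20 distinct ways to split six values, so permutation p-values are coarse; more replicates give finer resolution.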
5. Determine whether a gene's expression levels are significantly different between the two groups by one of three methods:
A) “Just alpha” (α significance level): If the calculated p-value for a gene is less than or equal to the user-input α (critical p-value), the gene is considered significant.
OR
Use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant.
B) Standard Bonferroni correction: The user-input alpha is divided by the total number of genes to give a critical p-value that is used as above -> p_critical = α/N.
T-TEST - 5
5C) Adjusted Bonferroni:
i) The t-values for all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the critical p-value becomes (α/N), where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be (α/[N-1]), and so on.
T-TEST – 6
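The three significance rules (just alpha, standard Bonferroni, adjusted Bonferroni) can be compared side by side in a short sketch. Function names are hypothetical, and ranking by the absolute value of t is an assumption; the slides only say the t-values are ranked in descending order. Unlike Holm's step-down procedure, this sketch does not stop at the first non-significant gene, since the slides do not describe a stopping rule.

```python
def just_alpha(p_values, alpha=0.05):
    """Significant if p <= alpha, with no multiple-testing correction."""
    return [p <= alpha for p in p_values]

def standard_bonferroni(p_values, alpha=0.05):
    """Significant if p <= alpha / N, where N is the number of genes."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def adjusted_bonferroni(p_values, t_values, alpha=0.05):
    """Critical p-value alpha/N for the top-ranked gene, alpha/(N-1) for the next, ..."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: abs(t_values[i]), reverse=True)
    significant = [False] * n
    for rank, i in enumerate(order):
        significant[i] = p_values[i] <= alpha / (n - rank)
    return significant
```

For four genes with p-values 0.001, 0.012, 0.02 and 0.5, the standard correction (critical p = 0.0125) rejects the third gene, while the adjusted version (critical p = 0.025 for the third-ranked gene) keeps it.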
Significance Analysis of Microarrays (SAM)
Uses a modified t-test by estimating and adding a small positive constant to the denominator
Significant genes are those which exceed the expected values from permutation analysis.
Finding Significant Genes
SAM test Statistic
$$d_i = \frac{r_i}{s_i + s_0}$$
• d_i = Score
• s_i = Standard Deviation
• s_0 = “Safety” Factor
$$r_i = \bar{x}_{i1} - \bar{x}_{i2}$$
$$s_i = \sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \frac{\sum_{j \in C_1} (x_{ij} - \bar{x}_{i1})^2 + \sum_{j \in C_2} (x_{ij} - \bar{x}_{i2})^2}{n_1 + n_2 - 2}}$$
SAM : Variance Estimate
Gene by gene variance estimate + “safety” factor
Variance equal in the two conditions
s_0 term is here to deal with cases when variance estimates get too close to zero
How to choose s_0?
Test statistics are binned in 100 different groups depending on the s_i value
s_0 is chosen so that the dispersion of the test statistic does not vary from bin to bin
avoids aberrant values when variance estimates are close to 0
SAM : Hypothesis Testing
Permutation technique
Multiple testing adjustment technique
False Discovery Rate
Confidence Level: False Discovery Rate
1. Fix a threshold DELTA for differentially expressed genes
2. For each permutation, count how many genes you declare differentially expressed
NB: In a permutation you should find 0 genes.
3. Compute median number of falsely called genes in permutations
False Discovery Rate is number of falsely called genes divided by number of differentially expressed genes in original data
FDR: percentage of NON-significant genes you can expect to find in your result list
SAM
SAM gives estimates of the False Discovery Rate (FDR), which is the proportion of genes likely to have been wrongly identified by chance as being significant.
It is a very interactive algorithm that allows users to dynamically change thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic.
The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, helps make the data-mining process sensitive.
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.
[Figure: expression matrix, Gene 1 to Gene 6 (rows) by Exp 1 to Exp 6 (columns)]
2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?
[Figure: the same matrix with columns arranged into Group A and Group B]
SAM Two-Class – 1
Permutation tests
i) For each gene, compute d-value (analogous to t-statistic). This is the observed d-value for that gene.
ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene
[Figure: Gene 1 values under the original grouping into Group A and Group B]
[Figure: Gene 1 values under a randomized grouping; columns in the order Exp 1, Exp 4, Exp 5, Exp 2, Exp 3, Exp 6]
SAM Two-Class – 2
SAM Two-Class – 3
iii) Repeat step (ii) many times, so that each gene has many randomized d-values. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.
iv) Plot the observed d-values vs. the expected d-values
SAM Two-Class – 4
Significant positive genes (i.e., mean expression of group B > mean expression of group A) in red
Significant negative genes (i.e., mean expression of group A > mean expression of group B) in green
“Observed d = expected d” line
The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Any gene beyond the first gene in the +ve or –ve direction on the x-axis (including the first gene), whose observed exceeds the expected by at least delta, is considered significant.
Tuning parameter “delta” limits, can be dynamically changed by using the slider bar or entering a value in the text field.
SAM Two-Class – 5
For each permutation of the data, compute the number of positive and negative significant genes for a given delta as explained in the previous slide. The median number of significant genes from these permutations is the median False Discovery Rate.
The rationale behind this is, any genes designated as significant from the randomized data are being picked up purely by chance (i.e., “falsely” discovered). Therefore, the median number picked up over many randomizations is a good estimate of false discovery rate.
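The two-class SAM procedure can be sketched as below. This is an illustrative simplification, not the MeV or original SAM code. Assumptions: `s0` is passed in as a fixed constant rather than chosen by the binning procedure; expected d-values are computed as averages of the sorted (ranked) permuted d-values, which is the order-statistic form of the published SAM method and one reading of the "expected d-value" described above; the sign of d is class 1 minus class 2; all function and parameter names are hypothetical.

```python
import math
import random
import statistics

def sam_d(class1, class2, s0):
    """d = r / (s + s0): difference of class means over pooled SD term plus s0."""
    n1, n2 = len(class1), len(class2)
    m1, m2 = sum(class1) / n1, sum(class2) / n2
    ss = sum((x - m1) ** 2 for x in class1) + sum((x - m2) ** 2 for x in class2)
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

def sam_two_class(matrix, cols1, cols2, delta, s0=0.1, n_perm=100, seed=0):
    """Return (indices of genes called significant, estimated FDR)."""
    rng = random.Random(seed)
    n1 = len(cols1)

    def d_values(cols):
        return [sam_d([row[j] for j in cols[:n1]],
                      [row[j] for j in cols[n1:]], s0) for row in matrix]

    observed = d_values(list(cols1) + list(cols2))
    permuted = []
    for _ in range(n_perm):
        cols = list(cols1) + list(cols2)
        rng.shuffle(cols)  # shuffle experiment labels, group sizes unchanged
        permuted.append(sorted(d_values(cols)))
    n_genes = len(matrix)
    # Expected d for each rank: average of that rank over all permutations.
    expected = [statistics.mean(p[k] for p in permuted) for k in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: observed[g])
    up = [observed[g] for k, g in enumerate(ranked)
          if observed[g] - expected[k] > delta]
    down = [observed[g] for k, g in enumerate(ranked)
            if expected[k] - observed[g] > delta]
    cut_up = min(up) if up else math.inf
    cut_down = max(down) if down else -math.inf
    called = [g for g in range(n_genes)
              if observed[g] >= cut_up or observed[g] <= cut_down]
    # Genes passing the same cutoffs in permuted data are false calls.
    false_calls = [sum(1 for d in p if d >= cut_up or d <= cut_down)
                   for p in permuted]
    fdr = statistics.median(false_calls) / len(called) if called else 0.0
    return called, fdr
```

Raising `delta` shrinks the list of called genes and usually lowers the estimated FDR, which mirrors the slider interaction described for the SAM plot.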
Analysis of Variance (ANOVA)
Which genes are most significant for separating classes of samples?
Calculate p-value (permutations or distributions)
Reduces to a t-test for 2 samples
May suffer from intensity-dependent effects
Finding Significant Genes
???
One Way Analysis of Variance (ANOVA)
1. Assign experiments to > 2 groups
[Figure: expression matrix, Gene 1 to Gene 7 (rows) by Ex 1 to Ex 9 (columns)]
2. Question: Is mean expression level of a gene the same across all groups?
[Figure: the same matrix with columns regrouped into Group 1, Group 2 and Group 3]
ANOVA - 2
3. Calculate an F-ratio for each gene, where
$$F = \frac{\text{Mean square (groups)}}{\text{Mean square (error)}}$$
which is a measure of
$$\frac{\text{Between groups variability}}{\text{Within groups variability}}$$
The larger the value of F, the greater the difference among the group means relative to the sampling error variability (which is the within groups variability).
i.e., the larger the value of F, the more likely it is that the differences among the group means reflect “real” differences among the means of the populations they are drawn from, rather than being due to random sampling error.
ANOVA - 3
4. The p-value associated with an F-value is the probability that an F-value that large would be obtained if there were no differences among group means (i.e., given the null hypothesis).
Therefore, the smaller the p-value, the less likely it is that the null hypothesis is valid, i.e., the differences among group means are more likely to reflect real population differences as p-values decrease in magnitude.
ANOVA - 4
5. P-values can be obtained for the F-values from a theoretical F-distribution, assuming that the populations from which the data are obtained
• are normally distributed, and
• have homogeneous variances.
The test is considered robust to violations of these assumptions, provided sample sizes are relatively large and similar across groups.
ANOVA – 5
6. P-values can be obtained from permutation tests (just like in t-tests), if one does not want to rely on the assumptions needed for using the F-distribution.
P-values can also be corrected for multiple comparisons (using Bonferroni or other procedures).
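The F-ratio and its permutation p-value can be sketched for one gene as follows. The function names, the `seed` argument, and the zero-variance guard are assumptions for the example; the F-ratio itself is the standard one-way form, mean square between groups over mean square within groups.

```python
import math
import random

def f_ratio(groups):
    """One-way ANOVA F = mean square (groups) / mean square (error)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    if ms_within == 0.0:
        # Guard for zero within-group variability (an assumption).
        return math.inf if ms_between > 0 else 0.0
    return ms_between / ms_within

def anova_permutation_p(groups, n_perm=1000, seed=0):
    """Fraction of label shuffles whose F is at least the observed F."""
    rng = random.Random(seed)
    f_obs = f_ratio(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:  # rebuild groups with the original sizes
            shuffled.append(pooled[start:start + size])
            start += size
        if f_ratio(shuffled) >= f_obs:
            count += 1
    return count / n_perm
```

For the groups (1, 2, 3), (2, 3, 4) and (7, 8, 9), the mean square between groups is 31 and the mean square within groups is 1, so F = 31.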
Volcano Plots
Combines p-values and fold change measures
Significant genes appear in upper corners
Finding Significant Genes
[Figure: volcano plot; x-axis Mean log(ratio), y-axis log10(p-value)]
Goal is to identify genes (or experiments) which have “similar” patterns of expression
This is a problem in data mining
“Clustering Algorithms” are most widely used
Types
Agglomerative: Hierarchical
Divisive: k-means, SOMs
Nonclustering: Principal Component Analysis (PCA)
All depend on how one measures distance
Multiple Experiments?
Similar expression
Crucial concept for understanding clustering
Each gene is represented by a vector where coordinates are its values log(ratio) in each experiment
x = log(ratio)expt1
y = log(ratio)expt2
z = log(ratio)expt3
etc.
For example, if we do six experiments,
Gene1 = (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
Gene2 = (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
Gene3 = (1.2, 0.5, 0, -0.25, -0.75, -1.4)
etc.
Expression Vectors
These gene expression vectors of log(ratio) values can be used to construct an expression matrix
Expression Matrix
        Expt 1  Expt 2  Expt 3  Expt 4  Expt 5  Expt 6
Gene1   -1.2    -0.5    0       0.25    0.75    1.4
Gene2   0.2     -0.5    1.2     -0.25   -1.0    1.5
Gene3   1.2     0.5     0       -0.25   -0.75   -1.4
etc.
This is often represented as a red/green colored matrix
The Expression Matrix is a representation of data from multiple microarray experiments.
Each element is a log ratio, usually log2(Cy5/Cy3)
Red indicates a positive log ratio, i.e., Cy5 > Cy3
Green indicates a negative log-ratio, i.e., Cy5 < Cy3
Black indicates a log-ratio of zero, i.e., Cy5 and Cy3 are very close in value
Gray indicates missing data
[Figure: color-coded matrix, Gene 1 to Gene 6 (rows) by Exp 1 to Exp 6 (columns)]
Expression Matrix
Expression Vectors As Points in ‘Expression Space’
[Figure: genes G1 to G5 with values for Exp 1, Exp 2 and Exp 3, plotted as points on axes x (Experiment 1), y (Experiment 2) and z (Experiment 3); nearby points are labeled Similar Expression]
Distances are measured “between” expression vectors
Distance metrics define the way we measure distances
Many different ways to measure distance:
Euclidean distance
Pearson correlation coefficient(s)
Manhattan distance
Mutual information
Kendall's Tau
etc.
Each has different properties and can reveal different features of the data
Distance metrics
Distance and Similarity
The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms
Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression
Selection of a distance metric defines the concept of distance for a particular experiment
Distance
Similarity
Distance: a measure of similarity between genes
          Exp 1  Exp 2  Exp 3  Exp 4  Exp 5  Exp 6
Gene A    x1A    x2A    x3A    x4A    x5A    x6A
Gene B    x1B    x2B    x3B    x4B    x5B    x6B
Some distances: (MeV provides 11 metrics)
1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$
2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$
3. Pearson correlation
[Figure: expression vectors drawn as points pA and pB]
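The three metrics named above can be written out directly. This is a small sketch with hypothetical function names; the two vectors are Gene1 and Gene3 from the six-experiment example earlier in the slides.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

gene1 = [-1.2, -0.5, 0.0, 0.25, 0.75, 1.4]
gene3 = [1.2, 0.5, 0.0, -0.25, -0.75, -1.4]

print(euclidean(gene1, gene3))
print(manhattan(gene1, gene3))
print(pearson(gene1, gene3))
```

Because Gene3 is the mirror image of Gene1, the Pearson correlation is -1 while the Euclidean and Manhattan distances are large: the metrics agree here that the two profiles are dissimilar, but they would disagree for two profiles with the same shape at different magnitudes.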
Distance Is Defined by a Metric
Distance Metric:   Euclidean   Pearson*
D                  6.0         +1.00
D                  1.4         -0.05
[Figure: expression profiles, y-axis log2(cy5/cy3) from -3 to 3]
Once a distance metric has been selected, the starting point for all clustering methods is a “distance matrix”
Distance Matrix
        Gene 1  Gene 2  Gene 3  Gene 4  Gene 5  Gene 6
Gene1   0       1.5     1.2     0.25    0.75    1.4
Gene2   1.5     0       1.3     0.55    2.0     1.5
Gene3   1.2     1.3     0       1.3     0.75    0.3
Gene4   0.25    0.55    1.3     0       0.25    0.4
Gene5   0.75    2.0     0.75    0.25    0       1.2
Gene6   1.4     1.5     0.3     0.4     1.2     0
The elements of this matrix are the pair-wise distances. Note that the matrix is symmetric about the diagonal.
MeV: Data Mining Tools
Alexander Saeed
Alexander Sturn
Nirmal Bhagabati
John Braisted
Syntek Inc.
Datanaut, Inc.
Available as OSI-certified open source
Hierarchical Clustering
1. Calculate the distance between all genes. Find the smallest distance. If several pairs share the same similarity, use a predetermined rule to decide between alternatives.
2. Fuse the two selected clusters to produce a new cluster that now contains at least two objects. Calculate the distance between the new cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single cluster remains.
4. Draw a tree representing the results.
[Figure: genes G1 to G6 shown as points and then grouped into clusters.]
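The four steps can be sketched as a short loop over a precomputed distance matrix. This is an illustrative version, not the MeV code: the function name `agglomerate`, the "first pair found wins" tie rule, and the four-point matrix are all assumptions made for the example. The returned merge history is the information needed to draw the tree.

```python
def average(distances):
    """Mean of an iterable of distances (usable as a linkage rule)."""
    values = list(distances)
    return sum(values) / len(values)

def agglomerate(dist, linkage=min):
    """Merge clusters pairwise until one remains.

    dist    -- symmetric matrix of pair-wise distances
    linkage -- reduces the member-to-member distances of two clusters to one
               value (min, max, or average)
    Returns a list of (cluster_a, cluster_b, distance) merges in order.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                # Tie rule (an assumption): keep the first pair found.
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        fused = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [fused]
    return merges

# Hypothetical distances for four genes lying at positions 0, 1, 5, 6 on a line.
example = [
    [0, 1, 5, 6],
    [1, 0, 4, 5],
    [5, 4, 0, 1],
    [6, 5, 1, 0],
]
merges_single = agglomerate(example, linkage=min)
merges_complete = agglomerate(example, linkage=max)
```

Both runs fuse genes 0 and 1 first, then genes 2 and 3. They differ only in the height at which the two pairs are finally joined.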
Hierarchical Clustering
Initial order: g1 g2 g3 g4 g5 g6 g7 g8
g1 is most like g8: g1 g8 g2 g3 g4 g5 g6 g7
g4 is most like {g1, g8}: g1 g8 g4 g2 g3 g5 g6 g7
(HCL-2)
Hierarchical Clustering
g5 is most like g7: g1 g8 g4 g2 g3 g5 g7 g6
{g5, g7} is most like {g1, g4, g8}: g1 g8 g4 g5 g7 g2 g3 g6
(HCL-3)
Hierarchical Tree
[Figure: tree drawn over the final order g1 g8 g4 g5 g7 g2 g3 g6.]
(HCL-4)
Agglomerative Linkage Methods
Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.
Three linkage methods that are commonly used are:
• Single Linkage
• Average Linkage
• Complete Linkage
(HCL-6)
Single Linkage
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.
$D_{AB} = \min_{i,j} d(u_i, v_j)$
where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
[Figure: two clusters with D_AB marked.]
(HCL-7)
Average Linkage
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$
where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
[Figure: two clusters with D_AB marked.]
(HCL-8)
Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
$D_{AB} = \max_{i,j} d(u_i, v_j)$
where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
[Figure: two clusters with D_AB marked.]
(HCL-9)
Comparison of Linkage Methods
[Figure: three panels labeled Single, Average, and Complete.]
(HCL-10)
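To make the three definitions of D_AB concrete, the sketch below evaluates each one on two small clusters. The clusters `A` and `B` and the one-dimensional distance `d` are invented for the example. Real expression profiles would use one of the metrics discussed earlier.

```python
def single_linkage(A, B, d):
    """Minimum distance between any member of A and any member of B."""
    return min(d(u, v) for u in A for v in B)

def average_linkage(A, B, d):
    """Mean of all N_A * N_B member-to-member distances."""
    return sum(d(u, v) for u in A for v in B) / (len(A) * len(B))

def complete_linkage(A, B, d):
    """Maximum distance between any member of A and any member of B."""
    return max(d(u, v) for u in A for v in B)

def d(u, v):
    """Hypothetical distance between two one-dimensional 'genes'."""
    return abs(u - v)

A = [0.0, 1.0]
B = [4.0, 6.0]

d_single = single_linkage(A, B, d)      # closest members: 1.0 and 4.0
d_average = average_linkage(A, B, d)    # mean of 4, 6, 3, 5
d_complete = complete_linkage(A, B, d)  # farthest members: 0.0 and 6.0
```

For any pair of clusters the three values are ordered single ≤ average ≤ complete, which is why single linkage merges most readily and tends to chain genes together.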
Bootstrapping
Bootstrapping: resampling with replacement
Original expression matrix: Gene 1 to Gene 6 (rows) by Exp 1 to Exp 6 (columns).
Various bootstrapped matrices (by experiments):
[Figure: matrices with the same six genes whose experiment columns are drawn from the original ones, so that some columns repeat (for example Exp 2, Exp 4, Exp 4) while others are missing.]
Jackknifing
Jackknifing: resampling without replacement
Original expression matrix: Gene 1 to Gene 6 (rows) by Exp 1 to Exp 6 (columns).
Various jackknifed matrices (by experiments):
• Exp 1, Exp 3, Exp 4, Exp 5, Exp 6 (Exp 2 left out)
• Exp 1, Exp 2, Exp 3, Exp 4, Exp 6 (Exp 5 left out)
Analysis of bootstrapped and jackknifed support trees
• Bootstrapped or jackknifed expression matrices are created many times by randomly resampling the original expression matrix, using either the bootstrap or jackknife procedure.
• Each time, hierarchical trees are created from the resampled matrices.
• The trees are compared to the tree obtained from the original data set.
• The more frequently a given cluster from the original tree is found in the resampled trees, the stronger the support for the cluster.
• As each resampled matrix lacks some of the original data, high support for a cluster means that the clustering is not biased by a small subset of the data.
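The resampling and support calculation can be sketched as follows. This is an illustration under stated assumptions rather than MeV's procedure: resampling is done over experiment columns only, the jackknife leaves out exactly one experiment as in the slide's examples, and the clusterings passed to `cluster_support` are supplied by hand instead of being produced by a real tree-building step.

```python
import random

def bootstrap_columns(n_cols, rng):
    """Draw n_cols experiment indices with replacement."""
    return [rng.randrange(n_cols) for _ in range(n_cols)]

def jackknife_columns(n_cols, rng):
    """Leave one randomly chosen experiment out (no repeats)."""
    left_out = rng.randrange(n_cols)
    return [c for c in range(n_cols) if c != left_out]

def resample(matrix, cols):
    """Build a new expression matrix from the chosen experiment columns."""
    return [[row[c] for c in cols] for row in matrix]

def cluster_support(original_cluster, resampled_clusterings):
    """Fraction of resampled clusterings that contain the original cluster."""
    target = frozenset(original_cluster)
    hits = sum(
        1 for clusters in resampled_clusterings
        if target in {frozenset(c) for c in clusters}
    )
    return hits / len(resampled_clusterings)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
boot = bootstrap_columns(6, rng)
jack = jackknife_columns(6, rng)

# Hand-made clusterings of genes 0..3 standing in for two resampled trees.
support = cluster_support({0, 3}, [[(0, 3), (1, 2)], [(0, 1), (2, 3)]])
```

Here the cluster {0, 3} appears in one of the two stand-in clusterings, giving a support of 0.5. In practice the list would hold the clusters from many resampled trees.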
Self Organizing Tree Algorithm
Dopazo, J., and J.M. Carazo. Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44:226-233, 1997.
Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 17(2):126-136, 2001.
SOTA - 1
SOTA Characteristics
• Divisive clustering, allowing high-level hierarchical structure to be revealed without having to completely partition the data set down to single gene vectors
• Data set is reduced to clusters arranged in a binary tree topology
• The number of resulting clusters is not fixed before clustering
• Neural network approach, which has advantages similar to SOMs, such as handling large data sets that have large amounts of ‘noise’
SOTA - 2
SOTA Topology
[Figure: a parent node ($\alpha_p$) with a winning cell ($\alpha_w$) and a sister cell ($\alpha_s$). Each cell has members and a centroid vector.]
$\alpha_*$ = migration factor ($\alpha_s < \alpha_p < \alpha_w$)
SOTA - 3
Adaptation Overview
• Each gene vector associated with the parent is compared to the centroid vector of its offspring cells.
• The most similar cell’s centroid and its neighboring cells are adapted using the appropriate migration weights.
SOTA - 4
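One presentation of a gene to a parent and its two offspring cells might look like the sketch below. The update rule (move a centroid a fraction alpha of the way toward the gene) is the usual form for this kind of network and is an assumption here, as are the Euclidean comparison and the migration factor values. They were chosen only to satisfy alpha_s < alpha_p < alpha_w.

```python
import math

# Illustrative migration factors (assumed values): sister < parent < winner.
ALPHA_W, ALPHA_P, ALPHA_S = 0.1, 0.05, 0.01

def migrate(centroid, gene, alpha):
    """Move a centroid a fraction alpha of the way toward the gene vector."""
    return [c + alpha * (g - c) for c, g in zip(centroid, gene)]

def present_gene(gene, parent, cell_1, cell_2):
    """Pick the offspring cell most similar to the gene and adapt all three nodes."""
    if math.dist(gene, cell_1) <= math.dist(gene, cell_2):
        winner, sister = cell_1, cell_2
    else:
        winner, sister = cell_2, cell_1
    return {
        "winner": migrate(winner, gene, ALPHA_W),
        "parent": migrate(parent, gene, ALPHA_P),
        "sister": migrate(sister, gene, ALPHA_S),
    }

# Hypothetical two-experiment vectors.
updated = present_gene(
    gene=[1.0, 1.0],
    parent=[0.0, 0.0],
    cell_1=[0.5, 0.5],
    cell_2=[-0.5, -0.5],
)
```

The winning cell moves the most, the parent less, and the sister least, which is the ordering the migration factors are meant to enforce.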
• Following the presentation of all genes to the system, a measure of system diversity is used to determine if training has found an optimal position for the offspring.
• If the system diversity improves (decreases), then another training epoch is started; otherwise training ends and a new cycle starts with a cell division.
SOTA - 5
The most ‘diverse’ cell is selected for division at the start of the next training cycle.
SOTA - 6
Growth Termination
Expansion stops when the most diverse cell’s diversity falls below a threshold.
SOTA - 7
[Figure: tree diversity (0 to 0.2) plotted against adaptation epoch number (0 to 500).]
Each training cycle ends when the overall tree diversity ‘stabilizes’. This triggers a cell division and possibly a new training cycle.
SOTA - 8
K-Means/Medians Clustering - 1
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
[Figure: genes G1 to G13 distributed among the clusters.]
K-Means/Medians Clustering - 2
3. Calculate mean/median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean/median expression profile (calculated in step 3) is the closest to that gene’s expression profile.
[Figure: genes G1 to G13 regrouped among the clusters.]
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.
k-means is most useful when the user has an a priori hypothesis about the number of clusters the genes should belong to.
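Steps 1 to 5 translate fairly directly into code. The sketch below is a plain k-means (means only, Euclidean distance) written for illustration. It is not MeV's implementation, and its handling of an empty cluster (re-seeding it with a randomly chosen gene) is an assumption, since the slides do not say what happens in that case. The six two-experiment profiles are invented.

```python
import math
import random

def mean_profile(profiles):
    """Average expression profile of a list of genes."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def kmeans(genes, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly assign genes to clusters.
    assignment = [rng.randrange(k) for _ in genes]
    for _ in range(max_iterations):
        # Step 3: mean profile of each cluster.
        centers = []
        for cluster in range(k):
            members = [g for g, a in zip(genes, assignment) if a == cluster]
            # Empty cluster: re-seed with a random gene (an assumption).
            centers.append(mean_profile(members) if members else list(rng.choice(genes)))
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(g, centers[c])) for g in genes
        ]
        # Step 5: stop when no gene changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment

# Hypothetical profiles forming two well-separated groups.
genes = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels = kmeans(genes, k=2)
```

The cluster numbers themselves are arbitrary. What matters is which genes share a label. A k-medians variant would replace `mean_profile` with a per-experiment median.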
Self-organizing maps (SOMs) - 1
1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal.
[Figure: genes G1 to G29 (G = Genes) with six nodes N1 to N6 (N = Nodes) laid out in a grid.]
SOMs - 2
2. Choose a random gene, e.g., G9.
3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The farther away the node is from N2, the less it is moved.
[Figure: nodes N1 to N6 shifted toward G9 among genes G1 to G29.]
SOMs - 3
4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased.
5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.
[Figure: nodes N1 to N6 each positioned within a group of genes.]
SOM Neighborhood Options
• Bubble neighborhood: some nodes move, alpha is constant.
• Gaussian neighborhood: all nodes move, alpha is scaled.
[Figure: nodes N1 to N6 and genes G7 to G11 under each neighborhood option, with a radius marked.]
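The SOM steps and the two neighborhood options can be sketched together. This is a simplified illustration, not MeV's SOM: nodes start at randomly chosen genes, the learning rate alpha shrinks linearly to zero, the radius is held fixed, and the Gaussian width is set equal to the radius. All of those choices, and the parameter names, are assumptions.

```python
import math
import random

def neighborhood_scale(kind, grid_distance, radius):
    """How strongly a node follows the winner, given its distance on the node grid."""
    if kind == "bubble":
        # Nodes inside the radius move with the full alpha; the rest stay put.
        return 1.0 if grid_distance <= radius else 0.0
    # Gaussian: every node moves, with alpha scaled down by grid distance.
    return math.exp(-(grid_distance ** 2) / (2.0 * radius ** 2))

def train_som(genes, rows, cols, kind="gaussian",
              iterations=2000, alpha0=0.5, radius=1.0, seed=0):
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    nodes = [list(rng.choice(genes)) for _ in grid]  # assumed initialisation
    for t in range(iterations):
        alpha = alpha0 * (1.0 - t / iterations)  # movement shrinks each iteration
        gene = rng.choice(genes)                 # step 2: pick a random gene
        winner = min(range(len(nodes)), key=lambda k: math.dist(gene, nodes[k]))
        for k, position in enumerate(grid):      # step 3: move nodes toward it
            scale = neighborhood_scale(kind, math.dist(position, grid[winner]), radius)
            nodes[k] = [n + alpha * scale * (x - n) for n, x in zip(nodes[k], gene)]
    return nodes

def assign_genes(genes, nodes):
    """Step 5: each gene belongs to the node it is closest to."""
    return [min(range(len(nodes)), key=lambda k: math.dist(g, nodes[k])) for g in genes]

# Hypothetical two-experiment profiles.
genes = [[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9], [0.0, 4.0], [0.1, 4.2]]
nodes = train_som(genes, rows=3, cols=2, kind="bubble")
labels = assign_genes(genes, nodes)
```

Switching `kind` between "bubble" and "gaussian" changes only `neighborhood_scale`, which mirrors the slide's distinction between a constant alpha applied to some nodes and a scaled alpha applied to all of them.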
Gene Shaving
• Compute the first principal component of the expression matrix.
• Shave off α% (default 10%) of genes with the lowest values of dot product with the 1st principal component.
• Orthogonalize the expression matrix with respect to the average gene in the cluster and repeat the shaving procedure.
• Repeat until only one gene remains.
• Results in a series of nested clusters.
• Choose the cluster of appropriate size as determined by the gap statistic calculation.
Gap statistic calculation (choosing cluster size)
• Create random permutations of the expression matrix and calculate $R^2$ for each.
• Compare $R^2$ of each cluster to that of the entire expression matrix.
• Choose the cluster whose $R^2$ is furthest from the average $R^2$ of the permuted expression matrices.
Quality measure for clusters:
$R^2 = \frac{\text{between variance}}{\text{within variance}}$
• between variance: variance of the mean gene across experiments
• within variance: variance of each gene about the cluster average
Large $R^2$ implies a tight cluster of coherent genes.
Gene Shaving
The final cluster contains a set of genes that are greatly affected by the experimental conditions in a similar way.
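The cluster quality measure can be illustrated with a small calculation. The sketch assumes that the measure is the ratio of between variance to within variance, with the two terms computed as population variances as described on the slide. The exact normalisation used by MeV is not stated, and the two clusters are invented.

```python
def variance_terms(cluster):
    """Return (between, within) variance for a list of gene profiles."""
    k, p = len(cluster), len(cluster[0])
    mean_gene = [sum(column) / k for column in zip(*cluster)]
    grand_mean = sum(mean_gene) / p
    # Between: variance of the mean gene across experiments.
    between = sum((m - grand_mean) ** 2 for m in mean_gene) / p
    # Within: variance of each gene about the cluster average.
    within = sum(
        (x - m) ** 2 for row in cluster for x, m in zip(row, mean_gene)
    ) / (k * p)
    return between, within

def quality(cluster):
    """Between/within ratio; infinite when the genes are identical."""
    between, within = variance_terms(cluster)
    return between / within if within > 0 else float("inf")

# Hypothetical clusters over three experiments.
coherent = [[-1.0, 0.0, 1.0], [-1.2, 0.0, 1.2]]   # genes rise together
incoherent = [[-1.0, 0.0, 1.0], [1.0, 0.0, -1.0]]  # genes move in opposite directions

q_coherent = quality(coherent)
q_incoherent = quality(incoherent)
```

The coherent cluster scores far higher because its mean gene varies strongly across experiments while the individual genes stay close to that mean.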
Principal Components Analysis
1. PCA simplifies the “views” of the data.
2. Suppose we have measurements for each gene on multiple experiments.
3. Suppose some of the experiments are correlated.
4. PCA will ignore the redundant experiments, and will take a weighted average of some of the experiments, thus possibly making the trends in the data more interpretable.
5. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
Principal Components Analysis
“Cloud” of data points (e.g., genes) in N-dimensional space, N = # hybridizations
Data points resolved along 3 principal component axes.
In this example:
• x-axis could mean a continuum from over- to under-expression
• y-axis could mean that “blue” genes are over-expressed in first five expts and under-expressed in the remaining expts, while “brown” genes are under-expressed in the first five expts, and over-expressed in the remaining expts.
• z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for “purple” genes.
Interpretation of components is somewhat subjective.
[Figure: data cloud shown on x, y, and z axes.]
PCA in pictures
Principal Components pick out the directions in the data that capture the greatest variability.
[Figure: data cloud on x, y, and z axes.]
PCA in pictures
The “new” axes are linear combinations of the old axes, typically combinations of genes or experiments.
$x' = a_1 x + b_1 y + c_1 z$
$y' = a_2 x + b_2 y + c_2 z$
$z' = a_3 x + b_3 y + c_3 z$
PCA in pictures
Projecting the data into a lower dimensional space can help visualize relationships.
[Figure: data plotted on the x' and y' axes.]
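A first principal component can be found without any numerical library by power iteration on the covariance matrix. The sketch below is a teaching example rather than the method MeV uses. It assumes the starting vector is not orthogonal to the leading direction, and the four data points are invented so that all variation lies along the direction (1, 2).

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the direction of greatest variance."""
    n, p = len(rows), len(rows[0])
    means = [sum(column) / n for column in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the p old axes.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    v = [1.0] * p  # assumed not orthogonal to the leading eigenvector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # data has no variance along v; give up
            break
        v = [x / norm for x in w]
    # Each score is a linear combination of the old coordinates.
    scores = [sum(x * a for x, a in zip(row, v)) for row in centered]
    return v, scores

# Hypothetical points that all lie on the line y = 2x.
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = first_principal_component(data)
score_variance = sum(s * s for s in scores) / (len(scores) - 1)
```

The entries of `direction` play the role of the coefficients a_1, b_1, ... in the new-axis equations. Here one new axis captures all of the variance, so projecting onto it loses nothing.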
Support Vector Machines (SVM)
Supervised learning technique
Uses supplied information such as presumptive biological relationships between a set of elements, and the expression profiles of elements, to produce a binary classification of elements.
[Figure: elements divided into “Elements In Classification” and “Elements Out of Classification”.]
SVM Process Overview
[Flow diagram with boxes: Initial Classification, Data, SVM Training, Weights, SVM Classification, Data.]
SVM Classification
SVM attempts to find an optimal separating hyperplane between members of the two initial classifications.
[Figure: separating hyperplane.]
SVM Kernel Construction
The expression data can be transformed to a higher dimensional space (feature space) by applying a kernel function. This transformation can have the effect of allowing a separating hyperplane to be found.
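A toy example shows why moving to a feature space can help. The one-dimensional values below are invented: the positive examples sit on both sides of the negatives, so no single threshold on x separates them. After mapping x to (x, x²), a hyperplane does. The hyperplane here is chosen by hand for illustration and is not the result of SVM training, and the feature map and kernel are one simple polynomial choice among many.

```python
def feature_map(x):
    """Map a 1-D value into a 2-D feature space."""
    return (x, x * x)

def kernel(x, y):
    """Inner product in feature space, computed without building the map."""
    return x * y + (x * y) ** 2

def decision(x, w=(0.0, 1.0), b=-2.0):
    """Signed side of a hand-picked hyperplane w . phi(x) + b in feature space."""
    fx = feature_map(x)
    return w[0] * fx[0] + w[1] * fx[1] + b

# Hypothetical training values.
positives = [-2.0, 2.0]
negatives = [-1.0, 0.0, 1.0]

separated = (
    all(decision(x) > 0 for x in positives)
    and all(decision(x) < 0 for x in negatives)
)

# The kernel gives the same number as the explicit feature-space dot product.
explicit = sum(a * b for a, b in zip(feature_map(3.0), feature_map(2.0)))
via_kernel = kernel(3.0, 2.0)
```

Higher-degree kernels add more such coordinates, which makes separation easier to achieve but also easier to achieve by accident.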
Practical SVM Issues
• Results depend heavily on the input parameters.
• Using a high degree kernel function risks artificial separation of the data.
• An iterative approach to increasing the kernel power is advisable.
SVM Results
Two classes are produced:
• Positive Class: contains elements with expression patterns similar to those in the positive examples in the training set.
• Negative Class: contains all other members of the input set.
Each of these classes has elements that fall in two groups:
• Those initially in the class (true positives and true negatives)
• Those recruited into the class (false positives and false negatives)
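The four groups can be tallied by comparing each element's initial classification with the class the SVM places it in. The function name `tally` and the five example elements are hypothetical.

```python
def tally(initially_positive, classified_positive):
    """Count the four result groups from paired True/False flags."""
    counts = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for was_in, now_in in zip(initially_positive, classified_positive):
        if now_in:
            # In the positive class: kept there, or recruited into it.
            counts["true_pos" if was_in else "false_pos"] += 1
        else:
            # In the negative class: kept there, or recruited into it.
            counts["false_neg" if was_in else "true_neg"] += 1
    return counts

# Hypothetical flags for five elements.
initial = [True, True, False, False, True]
result = [True, False, False, True, True]
groups = tally(initial, result)
```

In this framing a "false positive" is an element recruited into the positive class, which can be a candidate for follow-up rather than simply an error.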
A theory has only the possibility of being right or wrong. A model has a third possibility; it may be right but irrelevant.
- Manfred Eigen
Acknowledgments
<[email protected]>
The TIGR Gene Index Team
Foo Cheung, Svetlana Karamycheva, Yudan Lee, Babak Parvizi, Geo Pertea, Razvan Sultana, Jennifer Tsai, John Quackenbush, Joseph White
Funding provided by the Department of Energy and the National Science Foundation
TIGR Human/Mouse/Arabidopsis Expression Team
Emily Chen, Bryan Frank, Renee Gaspard, Jeremy Hasseman, Heenam Kim, Lara Linford, Simon Kwong, John Quackenbush, Shuibang Wang, Yonghong Wang, Ivana Yang, Yan Yu
Array Software Hit Team
Nirmal Bhagabati, John Braisted, Tracey Currier, Jerry Li, Wei Liang, John Quackenbush, Alexander I. Saeed, Vasily Sharov, Mathangi Thaiagarjian, Joseph White
Assistant: Sue Mineo
Funding provided by the National Cancer Institute, the National Heart, Lung, Blood Institute, and the National Science Foundation
H. Lee Moffitt Center/USF
Timothy J. Yeatman, Greg Bloom
TIGR PGA Collaborators
Norman Lee, Renae Malek, Hong-Ying Wang, Truong Luu, Bobby Behbahani
TIGR Faculty, IT Group, and Staff
PGA Collaborators
Gary Churchill (TJL), Greg Evans (NHLBI), Harry Gavaras (BU), Howard Jacob (MCW), Anne Kwitek (MCW), Allan Pack (Penn), Beverly Paigen (TJL), Luanne Peters (TJL), David Schwartz (Duke)
Emeritus
Jennifer Cho (TGI), Ingeborg Holt (TGI), Feng Liang (TGI), Kristie Abernathy (mA), Sonia Dharap (mA), Julie Earle-Hughes (mA), Cheryl Gay (mA), Priti Hegde (mA), Rong Qi (mA), Erik Snesrud (mA)
TM4 Software: http://www.tigr.org/software/tm4/
Normalization working group
Web page: www.dnachip.org/mged/normalization.html
Mailing list: [email protected]