DNA Microarray Bioinformatics - #27612
Normalization
Getting the numbers comparable
The DNA Array Analysis Pipeline
• Question / Experimental Design
• Array design / Probe design, or Buy Chip/Array
• Sample Preparation / Hybridization
• Image analysis
• Normalization
• Expression Index Calculation
• Comparable Gene Expression Data
• Statistical Analysis / Fit to Model (time series)
• Advanced Data Analysis: Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network
Intensities are not just mRNA concentrations
• Tissue contamination
• RNA degradation
• RNA purification
• Reverse transcription
• Amplification efficiency
• Dye effect (cy3/cy5)
• Spotting
• DNA-support binding
• Other issues related to array manufacturing
• ‘Background’ correction
• Image segmentation
• Hybridization efficiency and specificity
• Spatial effects
Example of spatial effects on microarrays
[Figure panels: Raw data; Spatial bias estimate]
The distribution of solvents and temperature over the array surface, and the washing procedure, may result in spatial effects
Two kinds of variation
Systematic: global variation
• Amount of RNA in the biopsy
• Efficiencies of:
– RNA extraction
– Reverse transcription
– Amplification
– Labeling
– Photodetection
Stochastic: gene-specific variation
• Spotting efficiency
– Spot size
– Spot shape
• Cross-/unspecific hybridization
• Biological variation
– Effect
– Noise
Stochastic noise: we use statistics to deal with it
PCA Plot of 34 patients, 8973 dimensions (genes) reduced to 2
...like we will see tomorrow
PCA for 100 most significant genes reduced to 2 dimensions
Sources of variation
Systematic: array-specific variation
• Similar effect on many measurements
• Corrections can be estimated from data
• Handled by normalization
Stochastic: gene-specific variation
• Too random to be explicitly accounted for
• “Noise”
• Handled by statistical testing
Calibration = Normalization = Scaling
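The simplest reading of "scaling" is a linear one: multiply every intensity on an array by one constant so that all arrays share a common summary value. The sketch below is a minimal illustration under assumptions not stated on the slide: the median is used as the summary, the target is the mean of the per-array medians, and the name `scale_normalize` is hypothetical.

```python
from statistics import median

def scale_normalize(arrays, target=None):
    """Rescale each array by one constant so that all medians match.

    arrays: list of arrays, each a list of positive intensities.
    target: common median to scale to (assumption: defaults to the
            mean of the per-array medians).
    """
    medians = [median(a) for a in arrays]
    if target is None:
        target = sum(medians) / len(medians)
    # One multiplicative factor per array: a purely linear correction.
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]

# Toy data: the second array is uniformly half as bright as the first.
arrays = [[100.0, 200.0, 400.0], [50.0, 100.0, 200.0]]
scaled = scale_normalize(arrays)
```

A single factor per array can only remove a global intensity difference; it cannot correct a bias that depends on intensity.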
Nonlinear normalization
The Qspline method
• From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black)
• A QQ-plot is made and a normalization curve is constructed by fitting a cubic spline function
• As reference one can use an artificial “median array” for a set of arrays, or a log-normal distribution, which is a good approximation
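The quantile-to-quantile idea can be sketched in a few lines. This is not the Qspline implementation: to stay within the standard library, the cubic spline is replaced by piecewise-linear interpolation between the quantile pairs, and values outside the quantile range are clamped. The function names and the number of quantiles are assumptions made for illustration.

```python
def quantiles(values, n):
    """n evenly spaced empirical quantiles (n >= 2), from min to max."""
    s = sorted(values)
    out = []
    for k in range(n):
        pos = k * (len(s) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        out.append(s[lo] + (pos - lo) * (s[hi] - s[lo]))
    return out

def interpolate(x, xs, ys):
    """Piecewise-linear curve through (xs, ys); flat outside the range.

    Stand-in for the cubic spline named in the method description.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def quantile_curve_normalize(channel, reference, n_quantiles=4):
    """Map channel intensities onto the reference distribution."""
    qx = quantiles(channel, n_quantiles)    # x-axis of the QQ-plot
    qy = quantiles(reference, n_quantiles)  # y-axis of the QQ-plot
    return [interpolate(v, qx, qy) for v in channel]

# Toy data: the channel is the reference with a twofold intensity bias.
reference = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
channel = [2.0 * r for r in reference]
normalized = quantile_curve_normalize(channel, reference)
```

With real data one would use many more quantiles and a smooth curve, so that the correction can follow a nonlinear bias.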
Once again…qspline
When many microarrays are to be normalized to each other, an average array can be used as target
Accumulating quantiles
Lowess Normalization
One of the most commonly utilized normalization techniques is the LOcally WEighted Scatterplot Smoothing (LOWESS) algorithm.
[Figure: scatter plot of M versus A]
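A minimal sketch of LOWESS-style normalization of a two-colour array, assuming the usual definitions M = log2(R/G) and A = ½·log2(R·G). The smoother below is a plain locally weighted linear regression with tricube weights; the robustness iterations of full LOWESS are omitted, and the function names and the `frac` default are choices made for this example.

```python
import math

def ma_values(red, green):
    """M (log ratio) and A (mean log intensity) for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def lowess_fit(x, y, frac=0.5):
    """Locally weighted linear fit of y on x, evaluated at each x."""
    n = len(x)
    k = max(2, math.ceil(frac * n))  # number of neighbours per local fit
    fitted = []
    for x0 in x:
        # Bandwidth: distance to the k-th nearest point (x0 included).
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(red, green, frac=0.5):
    """Subtract the intensity-dependent trend from M."""
    m, a = ma_values(red, green)
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)], a

# Toy data: the red channel is uniformly twice as bright (pure dye bias).
green = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0]
red = [2.0 * g for g in green]
m_norm, a = lowess_normalize(red, green)
```

After normalization the M values are centred on zero along the whole intensity range, which is the assumption that most genes are not differentially expressed.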
Invariant set normalization (Li and Wong)
An invariant set of probes is used
- Probes that do not change intensity rank between arrays
- A piecewise linear median line is calculated
- This curve is used for normalization
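The three steps can be sketched as follows. This is a simplified, single-pass illustration and not the published Li and Wong procedure, which selects the invariant set iteratively. The rank-shift threshold, the running-median window, the flat extrapolation outside the curve, and all function names are assumptions.

```python
from statistics import median

def ranks(values):
    """Rank of each element (0 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def invariant_set(array, baseline, max_rank_shift=0.2):
    """Indices of probes whose rank differs little between the arrays."""
    n = len(array)
    ra, rb = ranks(array), ranks(baseline)
    return [i for i in range(n) if abs(ra[i] - rb[i]) <= max_rank_shift * n]

def median_curve(array, baseline, idx, window=3):
    """Running-median curve through the invariant probes."""
    pts = sorted((array[i], baseline[i]) for i in idx)
    xs = [p[0] for p in pts]
    ys = []
    for k in range(len(pts)):
        h = min(window // 2, k, len(pts) - 1 - k)  # shrink at the edges
        ys.append(median(p[1] for p in pts[k - h:k + h + 1]))
    return xs, ys

def apply_curve(x, xs, ys):
    """Piecewise-linear evaluation; flat outside the invariant range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for k in range(1, len(xs)):
        if x <= xs[k]:
            t = (x - xs[k - 1]) / (xs[k] - xs[k - 1])
            return ys[k - 1] + t * (ys[k] - ys[k - 1])

# Toy data: the array is 1.5x brighter than the baseline, and the probe
# at index 2 is strongly up-regulated, so it changes rank.
baseline = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
array = [15.0, 30.0, 200.0, 60.0, 75.0, 90.0, 105.0, 120.0]
idx = invariant_set(array, baseline)
xs, ys = median_curve(array, baseline, idx)
normalized = [apply_curve(v, xs, ys) for v in array]
```

Because the up-regulated probe is excluded from the invariant set, it does not pull the normalization curve towards itself.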
Spatial normalization
[Figure panels: Raw data; After intensity normalization; Spatial bias estimate; After spatial normalization]
Expression index value
Some microarrays have multiple probes addressing the expression of the same gene
– Affymetrix chips have 11-20 probe pairs per gene
- Perfect Match (PM)
- MisMatch (MM)
PM: CGATCAATTGCACTATGTCATTTCT
MM: CGATCAATTGCAGTATGTCATTTCT
However, for downstream analysis we often want to deal with only one value per gene. Therefore we want to collapse the intensities from many probes into one value: a gene expression index value
Expression index calculation
Simplest method? Median
But more sophisticated methods exist: dChip, RMA and MAS 5 (from Affymetrix)
dChip (Li & Wong)
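Model: PM_ij = θ_i φ_j + ε_ij
(i: array, j: probe; θ_i is the expression index on array i, φ_j the affinity of probe j)
Outlier removal:
– Identify extreme residuals
– Remove
– Re-fit
– Iterate
Distribution of errors ε_ij assumed independent of signal strength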
(Li and Wong, 2001)
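One way to fit a multiplicative model of this form is alternating least squares: hold the probe affinities fixed and solve for the expression indices, then the reverse, and repeat. The sketch below is an illustration of that idea and not the dChip code. The outlier-removal loop is left out, and the function name, the iteration count, and the scaling constraint (sum of squared affinities equal to the number of probes) are assumptions of this example.

```python
def fit_multiplicative_model(pm, n_iter=20):
    """Fit pm[i][j] ~ theta[i] * phi[j] by alternating least squares.

    pm[i][j]: PM intensity of probe j on array i.
    Returns (theta, phi) with sum(phi**2) == number of probes.
    """
    n_arrays, n_probes = len(pm), len(pm[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_arrays
    for _ in range(n_iter):
        # Least-squares theta for fixed phi.
        phi_sq = sum(p * p for p in phi)
        theta = [sum(pm[i][j] * phi[j] for j in range(n_probes)) / phi_sq
                 for i in range(n_arrays)]
        # Least-squares phi for fixed theta.
        theta_sq = sum(t * t for t in theta)
        phi = [sum(pm[i][j] * theta[i] for i in range(n_arrays)) / theta_sq
               for j in range(n_probes)]
        # Fix the scale, since theta*c and phi/c fit equally well.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
        theta = [t * scale for t in theta]
    return theta, phi

# Toy data: three arrays, three probes, generated without noise.
true_theta = [100.0, 200.0, 400.0]
true_phi = [0.5, 1.0, 1.5]
pm = [[t * p for p in true_phi] for t in true_theta]
theta, phi = fit_multiplicative_model(pm)
```

The residuals pm[i][j] − theta[i]·phi[j] from such a fit are what an outlier-removal step would inspect before re-fitting.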
RMA
Robust Multi-array Average (RMA) expression measure (Irizarry et al., Biostatistics, 2003)
For each probe set, re-write PM_ij = θ_i φ_j as:
log(PM_ij) = log(θ_i) + log(φ_j)
Fit this additive model by iteratively re-weighted least-squares or median polish
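Median polish fits the additive model by alternately subtracting row medians and column medians from the log-intensity table. The sketch below covers only this summarization step for a single probe set; any background correction or between-array normalization done beforehand is outside its scope. Base-2 logarithms, the fixed iteration count, and the function names are assumptions.

```python
import math
from statistics import median

def median_polish(table, n_iter=10):
    """Tukey's median polish of table[i][j].

    Returns (overall, row_effects, col_effects).
    """
    rows, cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall = 0.0
    row_eff = [0.0] * rows
    col_eff = [0.0] * cols
    for _ in range(n_iter):
        # Sweep row medians into the row effects.
        for i in range(rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians into the column effects.
        for j in range(cols):
            m = median(resid[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff

def summarize_probe_set(pm):
    """Log-scale expression value per array (rows = arrays)."""
    log_pm = [[math.log2(v) for v in row] for row in pm]
    overall, array_eff, _probe_eff = median_polish(log_pm)
    return [overall + a for a in array_eff]

# Toy data: arrays with expression 100, 200, 400 and three probes
# with affinities 0.5, 1 and 2.
pm = [[t * p for p in (0.5, 1.0, 2.0)] for t in (100.0, 200.0, 400.0)]
expr = summarize_probe_set(pm)
```

Because medians are used instead of means, a single aberrant probe has little influence on the fitted array effects.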
MAS 5
MicroArray Suite version 5 uses
signal = TukeyBiweight{log(PM_j − MM*_j)}
MM* is an adjusted MM that is never bigger than PM
Tukey biweight is a robust average procedure with weights and outlier rejection
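A one-step Tukey biweight can be sketched as below: values are weighted by their distance from the median, measured in units of the median absolute deviation, and values far away get weight zero. The tuning constants `c=5` and `eps=1e-4` and the base-2 logarithm are assumptions of this example, as are the function names. How MM* is derived from MM is not covered; the adjusted values are taken as given and must be strictly smaller than PM.

```python
import math
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a robust weighted average."""
    values = list(values)
    m = median(values)
    s = median(abs(v - m) for v in values)  # median absolute deviation
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)
        # Smooth down-weighting; outliers (|u| >= 1) are rejected.
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        num += w * v
        den += w
    return num / den

def probe_set_signal(pm, mm_star):
    """Tukey biweight of log2(PM_j - MM*_j) over the probe pairs."""
    return tukey_biweight(math.log2(p - m) for p, m in zip(pm, mm_star))

# Toy data: three probe pairs with PM - MM* equal to 100, 200 and 400.
pm = [110.0, 210.0, 410.0]
mm_star = [10.0, 10.0, 10.0]
signal = probe_set_signal(pm, mm_star)

# A gross outlier receives zero weight.
robust = tukey_biweight([1.0, 1.0, 1.0, 1.0, 100.0])
```

The result is on the log scale, as in the formula on the slide.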
Methods compared on expression variance
[Figure: Std Dev of gene measures from 20 replicate arrays, plotted against expression level. Blue and Red: RMA; Black: dChip; Green: MAS 5.0]
From Terry Speed
Robustness: MAS 5.0
[Figure: log fold change estimate from 20 µg cRNA (y-axis) versus log fold change estimate from 1.25 µg cRNA (x-axis)]
(Irizarry et al., Biostatistics, 2003)
Robustness: dChip
[Figure: log fold change estimate from 20 µg cRNA (y-axis) versus log fold change estimate from 1.25 µg cRNA (x-axis)]
(Irizarry et al., Biostatistics, 2003)
Robustness: RMA
[Figure: log fold change estimate from 20 µg cRNA (y-axis) versus log fold change estimate from 1.25 µg cRNA (x-axis)]
(Irizarry et al., Biostatistics, 2003)
All of this is implemented in…
R, in the Bioconductor package ‘affy’
(Gautier et al., 2003).
References
Li and Wong (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2:1–11.
Irizarry, Bolstad, Collin, Cope, Hobbs and Speed (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31(4):e15.
Affymetrix (2001). Affymetrix Microarray Suite User Guide, version 5 edition. Affymetrix, Santa Clara, CA.
Gautier, Cope, Bolstad and Irizarry (2003). affy - an R package for the analysis of Affymetrix GeneChip data at the probe level. Bioinformatics