NASC
Normalisation and Analysis of the Affymetrix Data
David J Craigon
What I am not going to talk about
• General microarray topics
• Biology
The introduction
Affymetrix workflow
• Biological sample of some sort
• Extract mRNA
• Amplify
• Label and fragment
• Hybridise to a chip
• Scan chip
• Find features in scan
• Analyse down to one number per gene
What do we want to find out?
• We want to find out how much mRNA of each type was in the original sample
Each of these steps needs to be proportional: what comes out of each step should scale with the amount of mRNA in the original sample.
This talk is about the final analysis step: getting down to one number per gene.
Affymetrix Chips
• On an Affymetrix chip each oligo takes up a “square”.
• The RNA extracted from the plant is first amplified, then labelled. The label allows the scanner to see it.
• The RNA is then hybridised to the array. RNA matching the oligo in a square sticks to that square, and can be seen by the scanner.
• By observing the intensity of a square, the amount of RNA bound to that oligo can be calculated.
Design of the oligos
• A series of oligos is designed for one gene
• Each oligo comes in two versions…
[Figure: gene drawn 5’ to 3’ with its oligos]
Match and mismatch
• The exact match is a section of the mRNA sequence you wish to probe for.
• The mismatch is identical to its exact match counterpart except for one base, and is used to calculate a background.
• There are typically 11 “probe pairs” scattered around the chip; together they are called a probe set.
• By combining the expression values for a probe set, a value for the expression of the mRNA can be found.
EXP, DAT, CEL, CHP files
• EXP file - experiment file
• DAT file - the picture, like a TIFF
• CEL file - an unnormalised number for each probe
• CHP file - one number for each probe set
What do you think of it so far?
So far…
• What we want to find out is the amount of each mRNA in the starting sample.
• The mRNA hybridises to a series of probes.
• We can get a number for each probe from the CEL file.
The rest of this talk
We are going to go through four distinct ways of determining “Signal” values from CEL file data:
• MAS 4
• MAS 5
• MBEI (dChip)
• RMA
Mismatch probes in detail
All about mismatch probes
Mismatch probe:       ATGCTGTACAATCGCTTGATACTGG
Target sequence:      ATGCTGTACAATAGCTTGATACTGG
Perfect match probe:  ATGCTGTACAATAGCTTGATACTGG
Why do we have mismatch probes?
• Mismatch probes (MM) are trying to detect background.
• The mismatch probes are supposed to detect things that are close but not an exact match.
• It is assumed that these things also bind to the perfect match (PM), erroneously.
Yes folks, it’s Expression Method No 1!
• The original method, used by MAS 4
MAS 4 Algorithm
For a probe set:
$$\mathrm{AvDiff} = \frac{1}{\#A} \sum_{j \in A} (PM_j - MM_j)$$
• A is the set of probe pairs you haven’t thrown away due to being outliers, and #A is the number of pairs in it
• j indexes the probe pairs in the probe set
• In English, the formula is very simple: throw away the outliers, then simply average the differences between PM and MM of the probes you’ve got left.
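To make the AvDiff formula concrete, here is a minimal Python sketch. The slide does not say how outliers are chosen, so the rule used here (drop any pair whose difference is more than `n_sd` standard deviations from the mean difference) and the names `avdiff` and `n_sd` are assumptions for illustration, not the MAS 4 implementation.

```python
from statistics import mean, stdev


def avdiff(pm, mm, n_sd=3.0):
    """Average PM - MM difference over the probe pairs kept in the set A.

    pm, mm: intensities for the probe pairs of one probe set.
    n_sd:   assumed outlier cut-off in standard deviations (hypothetical).
    """
    diffs = [p - m for p, m in zip(pm, mm)]
    if len(diffs) < 3:
        # Too few pairs to judge outliers, so average everything.
        return mean(diffs)
    centre, spread = mean(diffs), stdev(diffs)
    # A = the pairs that survive the outlier filter.
    kept = [d for d in diffs if abs(d - centre) <= n_sd * spread]
    return mean(kept or diffs)
```

With a tight cut-off, for example `avdiff([120, 150, 130, 1000], [100, 110, 105, 100], n_sd=1.0)`, the pair with the difference of 900 is dropped and the remaining three differences are averaged.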
Problems with the MAS4 algorithm
• The data fit better on the log scale, so working with log(PM) is preferred
Expression Method No 2!
• MAS 5 method.
• Still used by GCOS - the current Affymetrix supplied method.
Normalisation Procedure
• Before any work is done with the “CEL” data, the CEL file is normalised.
• Corrects for intra-chip differences
Normalisation Procedure
• Divides the chip into K zones (by default, 16 zones)
• Select the lowest 2% of probes (of any description) in each zone
• Assume these are “switched off”
Normalisation Procedure
• Calculate the mean and SD of these “switched off” probes for each zone.
• These are used as the background estimate for the zone.
• Each point’s local background is a weighted average of the zone backgrounds, weighted by how close the point is to each zone.
• Subtract the local background from each probe.
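The zone procedure can be sketched in a few lines of Python. This is a simplified illustration, not the Affymetrix code: the inverse-squared-distance weighting with a `smooth` constant, the use of zone centres, and the function name are assumptions, and the noise (SD) handling is left out.

```python
from statistics import mean


def zone_background_subtract(grid, zones_per_side=4, fraction=0.02, smooth=100.0):
    """Subtract a smoothly varying local background from a chip image.

    grid: list of rows of probe intensities; it needs at least
          zones_per_side rows and columns so that no zone is empty.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    zones = []  # one (centre_row, centre_col, background) tuple per zone
    for zr in range(zones_per_side):
        r0 = zr * n_rows // zones_per_side
        r1 = (zr + 1) * n_rows // zones_per_side
        for zc in range(zones_per_side):
            c0 = zc * n_cols // zones_per_side
            c1 = (zc + 1) * n_cols // zones_per_side
            cells = sorted(grid[r][c] for r in range(r0, r1) for c in range(c0, c1))
            # The lowest 2% of probes are assumed to be "switched off".
            n_low = max(1, int(len(cells) * fraction))
            zones.append(((r0 + r1 - 1) / 2, (c0 + c1 - 1) / 2, mean(cells[:n_low])))

    corrected = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Closer zones get more weight (assumed weighting scheme).
            weights = [1.0 / ((r - zr) ** 2 + (c - zc) ** 2 + smooth)
                       for zr, zc, _ in zones]
            local_bg = sum(w * bg for w, (_, _, bg) in zip(weights, zones)) / sum(weights)
            row.append(grid[r][c] - local_bg)
        corrected.append(row)
    return corrected
```

Because the background at each point blends all zones, there is no sharp step in the correction at a zone boundary.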
MAS 5 Algorithm
For a probe set:
$$\mathrm{Signal} = \mathrm{TukeyBiweight}\{\log(PM_j - IM_j)\}$$
• Tukey’s Biweight is an average that minimises the effect of outliers.
• IM is the “ideal mismatch”. This is the same as the MM intensity, except in the case where the MM is greater than the PM, in which case a new MM value is calculated from the other probe pairs in the probe set.
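As an illustration of how a Tukey biweight average downweights outliers, here is a minimal one-step version in Python. The tuning constants `c=5` and `epsilon=1e-4`, the use of log base 2, and the function names are assumptions for this sketch; the ideal mismatch values are taken as given rather than computed.

```python
from math import log2
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a weighted mean centred on the median."""
    values = list(values)
    m = median(values)
    mad = median([abs(v - m) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * mad + epsilon)
        # Points far from the median get weight 0, nearby points close to 1.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def mas5_style_signal(pm, im):
    """Log-scale signal for one probe set; assumes PM > IM for every pair."""
    return tukey_biweight([log2(p - i) for p, i in zip(pm, im)])
```

For example, `tukey_biweight([10, 10.1, 9.9, 10, 50])` ignores the 50 entirely, where a plain mean would be pulled up to 18.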
MAS4 to MAS5 comparison
MAS 5: $$\mathrm{Signal} = \mathrm{TukeyBiweight}\{\log(PM_j - IM_j)\}$$
MAS 4: $$\mathrm{AvDiff} = \frac{1}{\#A} \sum_{j \in A} (PM_j - MM_j)$$
Signal Normalisation
• To try to eliminate chip-to-chip variability.
• Sort the signal values and remove the top and bottom 2%
• Calculate a scaling factor to adjust this middle 96%’s mean to 100 (configurable, and variable)
• Multiply all signal values by the scaling factor
• Affymetrix state that scaling factors should be similar for arrays to be comparable
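The scaling step is simple enough to write out directly. A minimal Python sketch, with the function and parameter names chosen here for illustration:

```python
from statistics import mean


def scale_signals(signals, target=100.0, trim=0.02):
    """Scale signals so the mean of the middle 96% equals the target."""
    ordered = sorted(signals)
    k = int(len(ordered) * trim)            # how many to drop at each end
    middle = ordered[k:len(ordered) - k]    # the middle 96% by default
    scaling_factor = target / mean(middle)
    # Every signal is scaled, including the ones left out of the mean.
    return [s * scaling_factor for s in signals], scaling_factor
```

Returning the scaling factor alongside the scaled values makes it easy to compare factors between arrays, which is the check Affymetrix recommend.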
Expression Method No 3!
• The MBEI method of Li and Wong.
• Found in dChip, so often known as the dChip method.
Observation
• The probes are vastly variable in effectiveness
• Li and Wong point out that the difference between probes is much greater than the difference between arrays!
• They contend that any proper model should take this into account.
MBEI model
$$MM_{ij} = \nu_j + \theta_i \alpha_j + \varepsilon$$
$$PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \varepsilon$$
$$\therefore\; PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon$$
Here i indexes the array and j the probe pair.
• $\nu_j$: baseline response due to noise
• $\theta_i$: expression value (the thing we are interested in)
• $\phi_j$: additional rate of increase of the PM probe, over its MM partner, as signal increases (separate for each probe)
• $\alpha_j$: rate of increase of the MM probe as signal increases (really? See later)
• $\varepsilon$: error term
Model is fitted over all chips
• Processes an entire experiment at once
• Model is fitted by minimising the residual sum of squares
• In their paper on the subject they talk a lot about how you can use this model to detect outliers, scratches on the array, etc. I’m not going to talk about that.
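One way to minimise the residual sum of squares for PM − MM = θφ + ε is to alternate between solving for the expression values and the probe sensitivities. The Python sketch below does that for one probe set across several arrays. The function name, the fixed iteration count and the scaling constraint (sum of φ² equal to the number of probe pairs) are assumptions for illustration, and none of dChip’s outlier handling is included.

```python
def fit_mbei(diffs, n_iter=50):
    """Alternating least squares for diffs[i][j] ~ theta[i] * phi[j].

    diffs[i][j]: PM - MM for array i and probe pair j of one probe set.
    Assumes the data are not all zero.
    """
    n_arrays, n_probes = len(diffs), len(diffs[0])
    phi = [1.0] * n_probes

    def solve_theta(phi):
        # Least-squares expression value for each array, given phi.
        phi_sq = sum(p * p for p in phi)
        return [sum(d * p for d, p in zip(row, phi)) / phi_sq for row in diffs]

    for _ in range(n_iter):
        theta = solve_theta(phi)
        # Least-squares sensitivity for each probe pair, given theta.
        theta_sq = sum(t * t for t in theta)
        phi = [sum(diffs[i][j] * theta[i] for i in range(n_arrays)) / theta_sq
               for j in range(n_probes)]
        # Fix the scale: theta and phi are otherwise only known up to a factor.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]

    return solve_theta(phi), phi
```

Note that the fit needs several arrays at once: with a single array there is no way to separate θ from φ, which is why the model processes an entire experiment together.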
RMA paper observations
A spiked-in experiment from the RMA paper
• It would be useful if we had an experiment where we “knew the answer”
• Run a series of experiments with a fixed background, but spike in some artificial RNA for a series of probes, at different concentrations.
Mismatch probes
• Mismatch probes are supposed to measure how much similar, non-target material hybridises to the probes, to detect background for PM probes.
• The background should be at a relatively low level most of the time…
Yikes!
• Actually MM > PM between 33% and 40% of the time!
Mismatch probes
• Mismatch probes are supposed to measure how much similar, non-target material hybridises to the probes, to detect background for PM probes.
• The amount of this stuff shouldn’t depend on how much “interesting” RNA there is about…
Man the lifeboats!
Some observations from the RMA paper
• … the effects of perfect match probes appear to be additive (on the log scale)
• The amount of signal does affect mismatch probes.
• Clearly some of the useful mRNA is hybridising to the MM probes.
• This kind of shock has led to some people abandoning the use of MM probes altogether!
What’s going on?
Perfect match probes
• … in RMA, probe effects are assumed to be additive on the log scale
How RMA (roughly) works
RMA process
• Normalise array
• Fit model
Normalisation procedure involves adjusting distributions
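The distribution adjustment RMA uses is quantile normalisation: every array is forced to have the same distribution of intensities. A minimal Python sketch follows; the function name is chosen for illustration and ties are simply ranked in the order they appear.

```python
def quantile_normalise(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equal-length lists of probe intensities, one per chip.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: the mean across chips at each rank.
    reference = [sum(s[rank] for s in sorted_arrays) / len(arrays)
                 for rank in range(n)]
    normalised = []
    for a in arrays:
        order = sorted(range(n), key=lambda idx: a[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Each probe keeps its rank but takes the reference value.
            out[idx] = reference[rank]
        normalised.append(out)
    return normalised
```

For example, `quantile_normalise([[5, 2, 3], [4, 1, 6]])` gives `[[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]`: the ordering within each chip is preserved, but both chips now contain exactly the same set of values.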
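Fit model
• Correct background using an estimate from all mismatch probes for each array.
• Fit model:
$$\log(PM_{ij}) = \theta_i + \alpha_j + \varepsilon$$
• $PM_{ij}$: background corrected PM value for probe j on array i
• $\theta_i$: log scale expression value
• $\alpha_j$: additive probe affinity effect for this probe over all slides
• $\varepsilon$: error term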
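RMA fits this additive model robustly using median polish. The Python sketch below shows the idea for one probe set; the function name, the fixed number of sweeps and the choice of log base 2 for the input are assumptions for illustration.

```python
from statistics import median


def median_polish(log_pm, n_iter=10):
    """Fit log_pm[i][j] ~ theta[i] + alpha[j] by Tukey's median polish.

    log_pm[i][j]: log2 of the background corrected, normalised PM value
                  for array i and probe j of one probe set.
    Returns (theta, alpha): per-array expression and per-probe affinity.
    """
    n_arrays, n_probes = len(log_pm), len(log_pm[0])
    resid = [list(row) for row in log_pm]
    overall = 0.0
    row_eff = [0.0] * n_arrays
    col_eff = [0.0] * n_probes
    for _ in range(n_iter):
        # Sweep the median out of each row (array).
        for i in range(n_arrays):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep the median out of each column (probe).
        for j in range(n_probes):
            m = median([resid[i][j] for i in range(n_arrays)])
            col_eff[j] += m
            for i in range(n_arrays):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    theta = [overall + r for r in row_eff]
    return theta, col_eff
```

Because medians are used instead of means, a single badly behaved probe on one array has little effect on the fitted expression values.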
In summary then…
• There are various ways you can get from a CEL file to expression estimates.
• These models are derived by considering the behaviour of PM and MM probes
• Both dChip and RMA show better results than the standard Affy algorithm
• MM probes in particular behave contrary to how you would expect.
Enough theory - how do you actually do these things?
• The MAS5 algorithm can be performed using (erm) MAS5!
• dChip is a piece of software that will be making an appearance later this afternoon, and can do the MBEI algorithm
• The RMA authors have a piece of software called RMAExpress, which does RMA on Windows.
• All of these algorithms can be done using Bioconductor packages in R.