Post on 16-Feb-2018
1
Data Normalization and Standardization: the benefits of pre-processing microarray data
Ben Bolstad
Statistics, University of California, Berkeley
bmb@bmbolstad.com
http://bmbolstad.com
2
Outline
• Introduction
• Pre-processing methodologies as they relate to
  - Two channel arrays
  - Affymetrix GeneChips (a popular single channel array)
3
Workflow for a typical microarray experiment
Biological Question → Experimental Design → Microarray Experiment → Pre-processing (Low-level analysis) → High-level analysis → Biological verification and interpretation
Pre-processing (Low-level analysis): Image Quantification, Normalization, Summarization, Background Adjustment, Quality Assessment
High-level analysis: Estimation, Testing, Annotation, ….., Clustering, Discrimination
Images → Expression Values

         Array 1   Array 2   Array 3
Gene 1   10.05     9.58      9.76
Gene 2    4.12     4.16      4.05
Gene 3    6.05     6.04      6.08
4
Introduction to preprocessing
• Pre-processing typically constitutes the initial (and possibly most important) step in the analysis of data from any microarray experiment
• Often ignored or treated like a black box (but it shouldn’t be)
• Consists of:
  - Data exploration
  - Background correction, normalization, summarization
  - Quality Assessment
• These are interlinked steps
5
Background Correction/Signal Adjustment
• A method which does some or all of the following:
  - Corrects for background noise and processing effects on the array
  - Adjusts for cross hybridization (non-specific binding)
  - Adjusts estimated expression values to fall across an appropriate range
6
Normalization
“Non-biological factors can contribute to the variability of data ... In order to reliably compare data from multiple probe arrays, differences of non-biological origin must be minimized.” [1]
• Normalization is the process of reducing unwanted variation either within or between arrays. It may use information from multiple chips.
• Typical assumptions of most major normalization methods are (one or both of the following):
  - Only a minority of genes are expected to be differentially expressed between conditions
  - Any differential expression is as likely to be up-regulation as down-regulation (i.e. about as many genes going up in expression as are going down between conditions)
[1] GeneChip 3.1 Expression Analysis Algorithm Tutorial, Affymetrix technical support
7
A brief word on the term “Normalization”
• Many use the term “normalization” to refer to everything being discussed in this session. In other words they treat “normalization” and “pre-processing” as being synonymous with each other.
• I view normalization as just one of the steps in the process (although a very important one).
8
Summarization
• Reducing multiple measurements on the same gene down to a single measurement by combining them in some manner.
• Most relevant to Affymetrix arrays, as we will see a little later ….
9
Quality Assessment
• Need to be able to differentiate between good and bad data.
• Bad data could be caused by poor hybridization, artifacts on the arrays, inconsistent sample handling, …..
• An admirable goal would be to reduce systematic differences with data analysis techniques.
• Sometimes there is no option but to completely discard an array from further analysis. How to decide …..
10
Two-channel arrays
11
Image analysis for two color arrays
• The raw data from a cDNA microarray experiment consist of pairs of image files, 16-bit TIFFs, one for each of the dyes.
• Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.
12
Image analysis
1. Addressing. Estimate location of spot centers.
2. Segmentation. Classify pixels as foreground (signal) or background.
3. Information extraction. For each spot on the array and each dye:
  • signal intensities;
  • background intensities;
  • quality measures.
R and G for each spot on the array.
13
Red/Green overlay images
Good: low bg, lots of d.e. Bad: high bg, ghost spots, little d.e.
Co-registration and overlay offer a quick visualization, revealing information on colour balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches.
14
Signal/Noise = log2(spot intensity/background intensity)
Histograms
15
Slide 3 of the swirl data: used in all that follows.
16
Tools for exploring the data
R vs G
Important: Always log, always rotate
Bad
17
Tools for exploring the data
log2R vs log2G
Important: Always log, always rotate
Better
18
Tools for exploring the data
M = log2(R/G) vs A = log2 √(RG)
Important: Always log, always rotate
Best
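The "always log, always rotate" advice amounts to replacing each (R, G) pair with (M, A). A minimal sketch of that transform, assuming positive, already background-corrected intensities; the function name `ma_transform` and the three example spots are made up for illustration:

```python
import math

def ma_transform(red, green):
    """Return (M, A) lists for paired, positive red/green spot intensities."""
    m_values, a_values = [], []
    for r, g in zip(red, green):
        m_values.append(math.log2(r / g))        # M = log2(R/G), the log ratio
        a_values.append(0.5 * math.log2(r * g))  # A = log2(sqrt(R*G)), the average log intensity
    return m_values, a_values

# Made-up intensities for three spots (not from the swirl data)
red = [1024.0, 200.0, 50.0]
green = [256.0, 200.0, 200.0]
M, A = ma_transform(red, green)
```

Plotting M against A is the original log2 R vs log2 G scatter rotated by 45 degrees, so intensity-dependent dye bias shows up as a trend away from the line M = 0.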
19
MA-plot
20
Spatial plots: background
21
Spatial plots: log ratios (M)
No reason to constrain yourself to red/green when visualizing
22
Boxplots
23
Background correction
• Normally this is just a matter of subtracting the background value in the Red channel from the foreground Red intensity, and the same for the Green channel intensities, for each spot, i.e. R’ = R – Rb, G’ = G – Gb, where R, Rb, G, Gb are all from the output of the image analysis stage (there are some who use models based on these to derive corrections)
• From here on in we will assume that background correction has taken place.
24
Background Correction
• Note that the image analysis program you use can have quite an impact at this stage by drastically increasing variability, particularly at low intensities.
Note: this is not swirl.3
[Figure panels: GenePix, Spot] Same array, different image analysis and background correction
25
Normalization for two color arrays
• Why? To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples.
• How do we know it is necessary? By examining self-self hybridizations, where no true differential expression is occurring. We find dye biases which vary with overall spot intensity, location on the array, plate origin, pins, scanning parameters, ….
26
Levels of Normalization for two color arrays
• Within-slides
  - Which genes to use?
  - Location normalization
  - Scale normalization
• Paired-slides (dye-swap)
  - Self-normalization
• Between-slides
27
False color overlay Boxplots within Grid plots MA-plots
Self-self hybridizations
28
Scaling Normalization
log2 R/G → log2 R/G - c = log2 R/(kG)
Standard practice (in most software): c is a constant such as the mean or median log ratio.
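A minimal sketch of this global adjustment, taking c to be the median log ratio; the function name `median_normalize` and the five M values are invented for illustration:

```python
import statistics

def median_normalize(m_values):
    """Subtract a single constant c (here the median log ratio) from every M value."""
    c = statistics.median(m_values)
    return [m - c for m in m_values], c

# Made-up log ratios with a dye bias of about 1 on the log2 scale
M = [0.9, 1.1, 1.0, 3.0, 0.8]
M_norm, c = median_normalize(M)
k = 2 ** c  # the equivalent multiplicative factor applied to G, since c = log2(k)
```

Using the median rather than the mean keeps a handful of strongly differentially expressed spots (the 3.0 above) from dragging the constant.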
29
MA-plot after scaling
Before Scaling After Scaling
30
Intensity dependent adjustment
log2 R/G → log2 R/G - c(A) = log2 R/(k(A)G)
• Compute c(A) by robust locally weighted regression of M on A.
• We typically use a loess curve for this purpose.
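To make the idea concrete, here is a small pure-Python sketch of a locally weighted linear fit with tricube weights and a few bisquare robustness iterations. It is a simplified stand-in for a production loess/lowess routine, not the implementation used for the slides; the function name `lowess_fit`, the span `frac=0.4` and the synthetic data are all assumptions.

```python
import statistics

def lowess_fit(x, y, frac=0.4, robust_iters=2):
    """Fitted values of a robust local linear regression of y on x (simplified sketch)."""
    n = len(x)
    k = max(3, int(round(frac * n)))   # number of neighbours in each local window
    robust_w = [1.0] * n               # robustness weights, updated from residuals
    fitted = list(y)
    for _ in range(robust_iters + 1):
        new_fit = []
        for i in range(n):
            dists = [abs(xj - x[i]) for xj in x]
            h = sorted(dists)[k - 1] or 1e-12          # window half-width
            # tricube distance weight times current robustness weight
            w = [robust_w[j] * max(0.0, 1 - (dists[j] / h) ** 3) ** 3 for j in range(n)]
            sw = sum(w)
            if sw <= 0:
                new_fit.append(fitted[i])              # keep previous value if window is empty
                continue
            mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
            my = sum(wj * yj for wj, yj in zip(w, y)) / sw
            sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
            sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
            slope = sxy / sxx if sxx > 0 else 0.0
            new_fit.append(my + slope * (x[i] - mx))   # weighted line evaluated at x[i]
        fitted = new_fit
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        s = statistics.median(abs(r) for r in resid)
        if s == 0:
            break
        # bisquare weights: points with large residuals get little or no influence
        robust_w = [max(0.0, 1 - (r / (6 * s)) ** 2) ** 2 for r in resid]
    return fitted

# Synthetic example: an intensity-dependent trend plus a small alternating wiggle
A = [6 + 0.2 * i for i in range(41)]
M = [0.3 * (a - 10) + (0.05 if i % 2 == 0 else -0.05) for i, a in enumerate(A)]
c_of_A = lowess_fit(A, M)
M_norm = [m - c for m, c in zip(M, c_of_A)]   # log2 R/G - c(A)
```

The sketch is quadratic in the number of spots, so it is only meant to show the mechanics; real arrays call for an optimized smoother.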
31
MA-plot after loess normalization
After global loess normalization
32
Boxplot: print-tip effects remain after global loess normalization
33
Within print-tip group normalization
• In addition to intensity-dependent variation in log ratios, spatial bias can also be a significant source of systematic error. Most normalization methods do not correct for spatial effects produced by hybridization artifacts or print-tip or plate effects during the construction of the microarrays.
• It is possible to correct for both print-tip and intensity-dependent bias by performing LOWESS fits to the data within print-tip groups, i.e.
log2 R/G → log2 R/G - ci(A) = log2 R/(ki(A)G),
where ci(A) is the LOWESS fit to the MA-plot for the ith grid only.
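The grouping logic can be sketched separately from the smoother. In the sketch below the per-group fit is passed in as a function; to keep the example short, the fit used here is a constant equal to the group median, which is a deliberately simpler stand-in and not the LOWESS curve ci(A). All names (`normalize_within_groups`, `median_fit`) and the data are hypothetical.

```python
import statistics
from collections import defaultdict

def normalize_within_groups(m_values, a_values, groups, fit):
    """Subtract a separately fitted curve from M within each print-tip group.

    fit(a_list, m_list) must return one fitted value per spot in the group.
    """
    index_by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        index_by_group[g].append(idx)
    normalized = list(m_values)
    for idxs in index_by_group.values():
        a_sub = [a_values[i] for i in idxs]
        m_sub = [m_values[i] for i in idxs]
        c_sub = fit(a_sub, m_sub)                 # c_i(A) for this group only
        for i, c in zip(idxs, c_sub):
            normalized[i] = m_values[i] - c
    return normalized

def median_fit(a_sub, m_sub):
    """Stand-in for the per-group LOWESS fit: a constant equal to the group median."""
    c = statistics.median(m_sub)
    return [c] * len(m_sub)

# Made-up spots from two print-tip groups with different offsets
M = [0.4, 0.5, 0.6, -0.3, -0.2, -0.1]
A = [8.0, 9.0, 10.0, 8.0, 9.0, 10.0]
tip = [1, 1, 1, 2, 2, 2]
M_tip = normalize_within_groups(M, A, tip, median_fit)
```

Swapping `median_fit` for a lowess-style smoother with the same signature gives the intensity-dependent version described above.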
34
Print-tip normalized data: MA-plot
35
Print-tip normalized data:boxplot
36
Smoothed histograms of M values
Black: unnormalized; red: global median; green: global lowess; blue: print-tip lowess
37
MSP titration series (Microarray Sample Pool)
Control set to aid intensity-dependent normalization
Different concentrations
Spotted evenly spread across the slide
Pool the whole library
38
MSP normalization compared to other methods
Yellow: GAPDH, tubulin. Light blue: MSP pool / titration. Orange: Schadt-Wong rank invariant set. Red line: lowess smooth.
39
Composite normalization
Before and after composite normalization
Legend: MSP lowess curve; Global lowess curve; Composite lowess curve (other colours: control spots)
ci(A) = αA g(A) + (1 - αA) fi(A)
40
Paired-slides: dye-swap
• Slide 1, M = log2 (R/G) - c
• Slide 2, M’ = log2 (R’/G’) - c’
Combine by subtracting the normalized log-ratios:
[ (log2 (R/G) - c) - (log2 (R’/G’) - c’) ] / 2
≈ [ log2 (R/G) + log2 (G’/R’) ] / 2
≈ [ log2 (RG’/GR’) ] / 2
provided c = c’.
Assumption: the normalization functions are the same for the two slides.
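A small numeric check of the cancellation, assuming both slides share the same bias c. The function name `self_normalize` and the single made-up spot (true ratio 4, bias 0.5 on the log2 scale) are illustrative only:

```python
import math

def self_normalize(r1, g1, r2, g2):
    """Combine a dye-swap pair spot by spot: ( log2(R/G) - log2(R'/G') ) / 2."""
    return [0.5 * (math.log2(a / b) - math.log2(c / d))
            for a, b, c, d in zip(r1, g1, r2, g2)]

# One made-up spot: true ratio 4 (log2 = 2), same dye bias c = 0.5 on both slides
bias = 2 ** 0.5
slide1_R, slide1_G = [400.0 * bias], [100.0]   # sample labelled red on slide 1
slide2_R, slide2_G = [100.0 * bias], [400.0]   # dyes swapped on slide 2
combined = self_normalize(slide1_R, slide1_G, slide2_R, slide2_G)
```

The bias never has to be estimated: it appears with the same sign in both log ratios and drops out in the difference, which is why the method only works when c = c’.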
41
Checking the assumption
MA plot for slides 1 and 2: it isn’t always like this.
42
Result of self-normalization
(M - M’)/2 vs. (A + A’)/2
43
One way of taking scale into account
Assumption: All slides have the same spread in M
True log ratio is mij where i represents different slides and j represents different spots.
Observed is Mij, where Mij = ai mij
Robust estimate of ai is
$$\hat{a}_i = \frac{\mathrm{MAD}_i}{\sqrt[I]{\prod_{i=1}^{I} \mathrm{MAD}_i}}$$
where
$$\mathrm{MAD}_i = \operatorname{median}_j \left\{ \left| M_{ij} - \operatorname{median}_j(M_{ij}) \right| \right\}$$
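A minimal sketch of this scale adjustment: each slide's MAD is divided by the geometric mean of all the MADs to estimate ai, and the slide's log ratios are then divided by that estimate. It assumes location normalization has already been done and that no slide has a MAD of zero; the function names and the two toy slides are hypothetical.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def scale_factors(slides):
    """Estimate a_i = MAD_i / (geometric mean of the MADs); assumes every MAD > 0."""
    mads = [mad(m) for m in slides]
    log_geo_mean = sum(math.log(d) for d in mads) / len(mads)
    return [d / math.exp(log_geo_mean) for d in mads]

def scale_normalize(slides):
    """Divide each slide's log ratios by its estimated scale factor."""
    return [[m / a for m in slide] for slide, a in zip(slides, scale_factors(slides))]

# Two made-up slides of log ratios with the same shape but different spread
slides = [[-2.0, -1.0, 0.0, 1.0, 2.0], [-8.0, -4.0, 0.0, 4.0, 8.0]]
factors = scale_factors(slides)
normalized = scale_normalize(slides)
```

Because the factors are defined relative to their geometric mean, their product is 1, so the adjustment equalizes spread across slides without changing the overall scale.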
44
Scale normalization: between slides
Boxplots of log ratios from 3 replicate self-self hybridizations.
Before normalization After location normalization After scale normalization
45
Scale normalization: swirl dataset
Before normalization After location normalization After scale normalization
46
Other between slide normalizations
• Quantile normalization applied separately to R and G channels (after within chip normalization)
47
Two Channel Summary
• Background Correction
  - Taking too much off can greatly increase variability
• Normalization
  - Reduces systematic (not random) effects
  - Makes it possible to compare several arrays
  - Use log ratios (M vs A plots)
  - Lowess normalization (dye bias)
  - MSP titration series – composite normalization
  - Pin-group location normalization
  - Pin-group scale normalization
  - Between slide scale normalization
48
Single-channel arrays
49
Affymetrix GeneChip
• Commercial mass-produced high-density oligonucleotide array technology developed by Affymetrix http://www.affymetrix.com
• Single channel microarray
Image courtesy of Affymetrix.
50
Probes and Probesets
Typically 11 probe (pairs) in a probeset
Latest GeneChips have as many as:
  - 54,000 probesets
  - 1.3 million probes
Counts for HG-U133 Plus 2.0 arrays
51
Two Probe Types
Reference Sequence:    TAGGTCTGTATGACAGACACAAAGAAGATG
PM: the Perfect Match: CAGACATAGTGTCTGTGTTTCTTCT
MM: the Mismatch:      CAGACATAGTGTGTGTGTTTCTTCT
52
Image Analysis
53
Chip dat file – checkered board – close up pixel selection
54
Chip cel file – checkered board
Courtesy: F. Collin
55
Boxplot raw intensities
Array 1 Array 2 Array 3 Array 4
56
Density plots
57
Pairwise MA plots
[Figure: pairwise MA plots for Array 1, Array 2, Array 3, Array 4]
M = log2(array_i / array_j), A = 1/2 * log2(array_i * array_j)
58
Boxplots comparing M
Array 1 Array 2 Array 3 Array 4
M
59
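RMA Background Approach
• Convolution Model: Observed PM = Signal S + Noise N, with S ~ Exp(α) and N ~ N(μ, σ²)
$$E(S \mid PM = pm) = a + b\,\frac{\phi\left(\frac{a}{b}\right) - \phi\left(\frac{pm - a}{b}\right)}{\Phi\left(\frac{a}{b}\right) + \Phi\left(\frac{pm - a}{b}\right) - 1}$$
$$a = pm - \mu - \sigma^2 \alpha, \qquad b = \sigma$$
where φ and Φ are the standard normal density and distribution function.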
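The adjusted signal under the convolution model (exponential signal plus normal noise) is the conditional expectation E(S | PM = pm) = a + b [φ(a/b) - φ((pm - a)/b)] / [Φ(a/b) + Φ((pm - a)/b) - 1], with a = pm - μ - σ²α and b = σ. A minimal sketch of evaluating it for given parameters follows; the parameter values are made up, and estimating μ, σ and α from the observed intensities is not shown.

```python
import math

def normal_pdf(x):
    """Standard normal density (phi)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Standard normal distribution function (Phi)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_signal(pm, mu, sigma, alpha):
    """E(S | PM = pm) for S ~ Exp(alpha), N ~ N(mu, sigma^2), PM = S + N."""
    a = pm - mu - sigma ** 2 * alpha
    b = sigma
    num = normal_pdf(a / b) - normal_pdf((pm - a) / b)
    den = normal_cdf(a / b) + normal_cdf((pm - a) / b) - 1.0
    return a + b * num / den

# Hypothetical parameter values, chosen only to illustrate the behaviour
mu, sigma, alpha = 100.0, 20.0, 0.01
low = expected_signal(50.0, mu, sigma, alpha)     # PM below the noise mean
high = expected_signal(1000.0, mu, sigma, alpha)  # PM far above the noise
```

For bright probes the correction is essentially a subtraction of μ + σ²α, while for dim probes the adjusted value stays positive, so logs can still be taken afterwards.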
60
GCRMA Background Approach
• PM = Opm + Npm + S
• MM = Omm + Nmm
• O – Optical noise
• N – non-specific binding
• S – Signal
• Assume O is distributed Normal
• log(Npm) and log(Nmm) are assumed bivariate normal with correlation 0.7
• log(S) assumed exponential(1)
61
GCRMA continued
• An experiment was carried out where yeast RNA was hybridized to human chips, so all binding is expected to be non-specific.
• Fitting a model to predict log intensity from sequence composition gives base and position effects.
• These effects are used to predict an affinity for any given sequence; call this A. The means of the distributions for the Npm, Nmm terms are functions of the affinities.
62
Non-Biological variability is a problem for single channel arrays
[Figure: 5 scanners for 6 dilution groups; y-axis: Log2 PM intensity]
63
Normalization
• In the case of single channel microarray data this is carried out only across arrays.
• Could generalize methods we applied to two color arrays, but several problems:
  - Typically several orders of magnitude more probes on an Affymetrix array than spots on a two channel array
  - With single channel arrays we are dealing with absolute intensities rather than relative intensities.
• Need something fast
64
Quantile Normalization
• Normalize so that the quantiles of each chip are equal. Simple and fast algorithm. Goal is to give same distribution to each chip.
[Figure: Original Distribution mapped to Target Distribution]
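A minimal sketch of the idea: sort each chip, average across chips at each rank to get the target distribution, then give every probe the target value for its rank. Ties are simply broken by position here, which is a simplification; the function name and the three tiny "chips" are made up.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length intensity lists, one list per chip."""
    n = len(arrays[0])
    sorted_cols = [sorted(col) for col in arrays]
    # Target distribution: mean of the k-th smallest value across chips
    target = [sum(col[k] for col in sorted_cols) / len(arrays) for k in range(n)]
    normalized = []
    for col in arrays:
        order = sorted(range(n), key=lambda i: col[i])  # probe indices, smallest to largest
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = target[rank]                       # probe keeps its rank, gets target value
        normalized.append(out)
    return normalized

# Three made-up chips with four probes each
chips = [[5.0, 2.0, 3.0, 4.0],
         [4.0, 1.0, 4.0, 2.0],
         [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
```

After this step every chip has exactly the same set of values; only the assignment of values to probes differs. The cost is one sort per chip, which is what makes the method fast enough for arrays with very many probes.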
65
It works!!
[Figure panels: Unnormalized, Scaling, Quantile Normalization]
66
It Reduces Variability
[Figure panels: Expression Values, Fold change; each compares Unnormalized, Quantile, Scaling]
Also no serious bias effects. For more see Bolstad et al (2003)
67
Summarization
• Problem: Calculating gene expression values.
• How do we reduce the 11-20 probe intensities for each probeset down to a single gene expression value?
• Our Approach
  - RMA – a robust multi-chip linear model fit on the log scale
• Some Other Approaches
  - Single chip
    · AvDiff (Affymetrix) – no longer recommended for use due to many flaws
    · MAS 5.0 (Affymetrix) – uses a 1-step Tukey biweight to combine the probe intensities on the log scale
  - Multiple chip
    · MBEI (Li-Wong dChip) – a multiplicative model on the natural scale
68
General Probe Level Model
$$y_{kij} = f(\mathbf{X}) + \varepsilon_{kij}$$
• Where f(X) is function of factor (and possibly covariate) variables (our interest will be in linear functions)
• $y_{kij}$ is a pre-processed probe intensity (usually log scale)
• Assume that $E[\varepsilon_{kij}] = 0$ and $\mathrm{Var}[\varepsilon_{kij}] = \sigma_k^2$
69
Parallel Behavior Suggests Multi-chip Model
[Figure: PM probe intensity vs. Array; panels: Differentially expressing, Non Differential]
70
Probe Pattern Suggests Including Probe-Effect
[Figure: PM probe intensity vs. Probe Number; panels: Differentially expressing, Non Differential]
71
Also Want Robustness
[Figure: PM probe intensity; panels: Differentially expressing, Non Differential]
72
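The RMA model
$$y_{kij} = \log_2 N\left(B\left(PM_{kij}\right)\right)$$
(B: background correction, N: normalization)
$$y_{kij} = m_k + \alpha_{ki} + \beta_{kj} + \varepsilon_{kij}$$
where
$\alpha_{ki}$ is a probe-effect, i = 1,…,I
$\beta_{kj}$ is a chip-effect ($m_k + \beta_{kj}$ is log2 gene expression on array j), j = 1,…,J
k = 1,…,K indexes the probesets (K is the number of probesets)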
73
Median Polish Algorithm
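$$\begin{pmatrix} y_{11} & \cdots & y_{1J} & 0 \\ \vdots & \ddots & \vdots & \vdots \\ y_{I1} & \cdots & y_{IJ} & 0 \\ 0 & \cdots & 0 & 0 \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} \hat{\varepsilon}_{11} & \cdots & \hat{\varepsilon}_{1J} & \hat{\alpha}_1 \\ \vdots & \ddots & \vdots & \vdots \\ \hat{\varepsilon}_{I1} & \cdots & \hat{\varepsilon}_{IJ} & \hat{\alpha}_I \\ \hat{\beta}_1 & \cdots & \hat{\beta}_J & \hat{m} \end{pmatrix}$$
Iterate: Sweep Rows, Sweep Columns
Imposes constraints:
$$\operatorname{median}_i \alpha_i = \operatorname{median}_j \beta_j = 0$$
$$\operatorname{median}_i \varepsilon_{ij} = \operatorname{median}_j \varepsilon_{ij} = 0$$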
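A compact sketch of the sweep-rows / sweep-columns iteration for one probeset, with rows as probes and columns as arrays. It follows the usual Tukey median polish; the function name, the iteration cap and the toy data are assumptions, not taken from the RMA software.

```python
import statistics

def median_polish(y, max_iter=10, tol=1e-8):
    """y: I x J list of lists (rows = probes, columns = arrays).

    Returns (overall, row_effects, column_effects, residuals).
    """
    n_rows, n_cols = len(y), len(y[0])
    resid = [row[:] for row in y]
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    overall = 0.0
    for _ in range(max_iter):
        change = 0.0
        # Sweep rows: move each row median into the row effect
        for i in range(n_rows):
            med = statistics.median(resid[i])
            resid[i] = [v - med for v in resid[i]]
            row_eff[i] += med
            change += abs(med)
        med = statistics.median(col_eff)
        col_eff = [c - med for c in col_eff]
        overall += med
        # Sweep columns: move each column median into the column effect
        for j in range(n_cols):
            med = statistics.median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= med
            col_eff[j] += med
            change += abs(med)
        med = statistics.median(row_eff)
        row_eff = [r - med for r in row_eff]
        overall += med
        if change < tol:
            break
    return overall, row_eff, col_eff, resid

# Toy probeset: exactly additive, m = 8, probe effects (-1, 0, 1), array effects (-0.5, 0, 0.5)
alpha = [-1.0, 0.0, 1.0]
beta = [-0.5, 0.0, 0.5]
y = [[8.0 + a + b for b in beta] for a in alpha]
m_hat, alpha_hat, beta_hat, resid = median_polish(y)
expression = [m_hat + b for b in beta_hat]   # log2 expression value per array
```

At every step the decomposition y = overall + row effect + column effect + residual holds exactly; the sweeps only move medians from the residuals into the effects.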
74
RMA mostly does well in practice
[Figure: RMA compared with MAS 5.0] Detecting Differential Expression. Not noisy in low intensities.
75
One Drawback
[Figure: log2 Expression Value vs. Concentration; panels: RMA, MAS 5.0]
Linearity across concentration. GCRMA fixes this problem.
76
GCRMA improves linearity
77
An Alternative Method for Fitting a PLM
• Robust regression using M-estimation
• In this talk, we will use Huber’s influence function. The software handles many more.
• Fitting algorithm is IRLS with weights dependent on current residuals: $w_{kij} = \psi(r_{kij}) / r_{kij}$
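To show how IRLS with weights ψ(r)/r behaves, here is a sketch on a deliberately simple straight-line model rather than a full probe level model. The tuning constant 1.345, the MAD-based scale estimate and the toy data are conventional choices assumed for the illustration, not details taken from the talk.

```python
import statistics

def huber_weight(r, k=1.345):
    """psi(r)/r for Huber's psi: 1 inside [-k, k], k/|r| outside."""
    return 1.0 if abs(r) <= k else k / abs(r)

def irls_line(x, y, iterations=20):
    """Fit y = intercept + slope * x by iteratively reweighted least squares."""
    w = [1.0] * len(x)                      # first pass is ordinary least squares
    intercept = slope = 0.0
    for _ in range(iterations):
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx
        intercept = my - slope * mx
        resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        scale = statistics.median(abs(r) for r in resid) / 0.6745   # robust scale (MAD)
        if scale == 0:
            break
        w = [huber_weight(r / scale) for r in resid]                # weights from current residuals
    return intercept, slope

# Toy data: the line y = 2 + 0.5 x with one gross outlier at the last point
x = [float(i) for i in range(10)]
y = [2.0 + 0.5 * xi for xi in x]
y[9] += 10.0
intercept, slope = irls_line(x, y)
```

With equal weights the outlier would roughly double the slope; as the iterations proceed its weight shrinks and the fit returns to the bulk of the data.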
78
Variance Covariance Estimates
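• Suppose model is
$$Y = X\beta + \varepsilon$$
• Huber (1981) gives three forms for estimating variance covariance matrix
$$\frac{1}{\kappa}\,\frac{1}{n-p}\sum_i \psi(r_i)^2 \; W^{-1}\left(X^T X\right)W^{-1}$$
$$\kappa^2\,\frac{\frac{1}{n-p}\sum_i \psi(r_i)^2}{\left[\frac{1}{n}\sum_i \psi'(r_i)\right]^2}\,\left(X^T X\right)^{-1}$$
$$\kappa\,\frac{\frac{1}{n-p}\sum_i \psi(r_i)^2}{\frac{1}{n}\sum_i \psi'(r_i)}\,W^{-1}$$
We will use this form
where $W = X^T \Psi' X$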
79
We Will Focus on the Summarization PLM
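• Array effect model
$$y_{kij} = \alpha_{ki} + \beta_{kj} + \varepsilon_{kij}$$
With constraint
$$\sum_{i=1}^{I} \alpha_{ki} = 0$$
where $y_{kij}$ is the pre-processed log PM intensity, $\alpha_{ki}$ is the probe effect and $\beta_{kj}$ is the array effect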
80
Quality Assessment
• Problem: Judge quality of chip data
• Question: Can we do this with the output of the Probe Level Modeling procedures?
• Answer: Yes. Use weights, residuals, standard errors and expression values.
81
Chip pseudo-images
82
An Image Gallery
http://PLMImageGallery.bmbolstad.com
“Tricolor”
“Crop Circles”
“Ring of Fire”
83
NUSE Plots
Normalized Unscaled Standard Errors
84
RLE Plots
Relative Log Expression
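As a sketch of what goes into an RLE plot: for each probeset, subtract the median expression across arrays from each array's value, then look at the distribution of these differences array by array. The function name is hypothetical, and the three-gene table from the workflow slide is reused purely as toy input.

```python
import statistics

def rle(expr):
    """expr: dict mapping probeset -> list of log2 expression values, one per array.

    Returns one list of relative log expression values per array.
    """
    n_arrays = len(next(iter(expr.values())))
    per_array = [[] for _ in range(n_arrays)]
    for values in expr.values():
        med = statistics.median(values)          # reference level for this probeset
        for j, v in enumerate(values):
            per_array[j].append(v - med)
    return per_array

# Toy input: the expression values table shown on the workflow slide
expr = {
    "Gene 1": [10.05, 9.58, 9.76],
    "Gene 2": [4.12, 4.16, 4.05],
    "Gene 3": [6.05, 6.04, 6.08],
}
rle_by_array = rle(expr)
array_medians = [statistics.median(v) for v in rle_by_array]
```

Boxplots of each array's RLE values should sit near zero with similar spread; an array whose box is shifted or much wider than the others is a candidate for closer inspection.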
85
Summary of One Channel Arrays
• Background correction
  - RMA model
  - GCRMA model
• Normalization
  - Quantile normalization
• Summarization
  - Robust multi-chip probe level modeling
• Quality Assessment
86
Acknowledgements
• Terry Speed
• Rafael Irizarry
• Julia Brettschneider
• Francois Collin
• Jean Yang
• Zhijin (Jean) Wu
• Gordon Smyth
• James Wettenhall
• Anyone else …
87
References
• Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002 Feb 15;30(4):e15.
• Yang, Y. H., Buckley, M. J., Dudoit, S., and Speed, T. P. (2002). Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics, 11 (1), 108-136.
• Smyth, G. K., Thorne, N. P. and Wettenhall J. (2004) limma: Linear Models for Microarray Data User's Guide. The Walter and Eliza Hall Institute of Medical Research.
• Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P., A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics, 19, 185 (2003).
• Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, and Speed TP. Summaries of Affymetrix GeneChip Probe Level Data. Nucleic Acids Research, 31(4):e15, 2003.
• Bolstad BM, Collin F, Brettschneider J, Simpson K, Cope L, Irizarry RA, and Speed TP. (2005) Quality Assessment of Affymetrix GeneChip Data in Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Gentleman R, Carey V, Huber W, Irizarry R, and Dudoit S. (Eds.), Springer
• Wu, Z., Irizarry, R., Gentleman, R., Martinez Murillo, F. Spencer, F. A Model Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of American Statistical Association 99, 909-917 (2004)