Data Normalization and Standardization - bmbolstad.combmbolstad.com/talks/Bolstad - Data...


Data Normalization and Standardization: the benefits of pre-processing microarray data

Ben Bolstad
Statistics, University of California, Berkeley

bmb@bmbolstad.com
http://bmbolstad.com

Outline
• Introduction
• Pre-processing methodologies as they relate to
  Two channel arrays
  Affymetrix GeneChips (a popular single channel array)

Workflow for a typical microarray experiment

Biological Question → Experimental Design → Microarray Experiment → Pre-processing (low-level analysis) → High-level analysis → Biological verification and interpretation

Pre-processing (low-level analysis): Image Quantification, Normalization, Summarization, Background Adjustment, Quality Assessment

High-level analysis: Estimation, Testing, Annotation, ….., Clustering, Discrimination

Images → Expression Values

         Array 1   Array 2   Array 3
Gene 1   10.05     9.58      9.76
Gene 2   4.12      4.16      4.05
Gene 3   6.05      6.04      6.08

Introduction to preprocessing
• Pre-processing typically constitutes the initial (and possibly most important) step in the analysis of data from any microarray experiment
• Often ignored or treated like a black box (but it shouldn’t be)
• Consists of:
  Data exploration
  Background correction, normalization, summarization
  Quality Assessment
• These are interlinked steps

Background Correction/Signal Adjustment
• A method which does some or all of the following:
  Corrects for background noise and processing effects on the array
  Adjusts for cross hybridization (non-specific binding)
  Adjusts estimated expression values to fall across an appropriate range

Normalization
“Non-biological factors can contribute to the variability of data ... In order to reliably compare data from multiple probe arrays, differences of non-biological origin must be minimized.”¹
• Normalization is the process of reducing unwanted variation either within or between arrays. It may use information from multiple chips.
• Typical assumptions of most major normalization methods are (one or both of the following):
  Only a minority of genes are expected to be differentially expressed between conditions
  Any differential expression is as likely to be up-regulation as down-regulation (i.e. about as many genes going up in expression as are going down between conditions)

¹ GeneChip 3.1 Expression Analysis Algorithm Tutorial, Affymetrix technical support

A brief word on the term “Normalization”

• Many use the term “normalization” to refer to everything being discussed in this session. In other words they treat “normalization” and “pre-processing” as being synonymous with each other.

• I view normalization as just one of the steps in the process (although a very important one).

Summarization
• Reducing multiple measurements on the same gene down to a single measurement by combining them in some manner.
• Most relevant to Affymetrix arrays, as we will see a little later ….

Quality Assessment
• Need to be able to differentiate between good and bad data.
• Bad data could be caused by poor hybridization, artifacts on the arrays, inconsistent sample handling, …..
• An admirable goal would be to reduce systematic differences with data analysis techniques.
• Sometimes there is no option but to completely discard an array from further analysis. How to decide …..

Two-channel arrays

Image analysis for two color arrays

• The raw data from a cDNA microarray experiment consist of pairs of image files, 16-bit TIFFs, one for each of the dyes.

• Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

Image analysis
1. Addressing. Estimate location of spot centers.
2. Segmentation. Classify pixels as foreground (signal) or background.
3. Information extraction. For each spot on the array and each dye:
   • signal intensities;
   • background intensities;
   • quality measures.

R and G for each spot on the array.

Red/Green overlay images

Co-registration and overlay offer a quick visualization, revealing information on colour balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches.

Good: low background (bg), lots of differential expression (d.e.). Bad: high bg, ghost spots, little d.e.

Histograms

Signal/Noise = log2(spot intensity / background intensity)

Slide 3 of the swirl data: used in all that follows.

Tools for exploring the data

R vs G

Important: Always log, always rotate

log2 R vs log2 G

Better

M = log2 (R/G) vs A = log2 √(RG)

MA-plot
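As a minimal sketch of the transformation behind the MA-plot, the snippet below computes M = log2 (R/G) and A = log2 √(RG) for each spot. The function name `ma_values` and the intensities are made up for illustration; they are not from the swirl data.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red/green spot intensities.

    M = log2(R/G) is the log ratio; A = log2(sqrt(R*G)) is the average
    log intensity. Intensities must be strictly positive.
    """
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            raise ValueError("intensities must be positive to take logs")
        m_vals.append(math.log2(r) - math.log2(g))
        a_vals.append(0.5 * (math.log2(r) + math.log2(g)))
    return m_vals, a_vals

# Hypothetical intensities for three spots (illustrative only).
red = [1024.0, 512.0, 256.0]
green = [256.0, 512.0, 1024.0]
M, A = ma_values(red, green)
```

Here the three spots have the same average intensity (A = 9) but log ratios M of 2, 0 and -2, which is the 45-degree rotation of the log2 R vs log2 G plot.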

Spatial plots: background

Spatial plots: log ratios (M)

No reason to constrain yourself to red/green when visualizing

Boxplots

Background correction
• Normally this is just a matter of subtracting the background value in the Red channel from the foreground Red intensity, and likewise for the Green channel, for each spot,
  i.e. R’ = R - Rb, G’ = G - Gb
  where R, Rb, G, Gb are all from the output of the image analysis stage (some people use models based on these values to derive corrections instead).
• From here on we will assume that background correction has taken place.
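The subtraction can be sketched as follows. The intensities are invented, and the final step (flagging spots whose corrected intensity is not positive, since they cannot be log-transformed) is an illustrative addition rather than something prescribed in the talk.

```python
def background_correct(foreground, background):
    """Subtract the per-spot background from the per-spot foreground."""
    if len(foreground) != len(background):
        raise ValueError("foreground and background must have equal length")
    return [f - b for f, b in zip(foreground, background)]

# Hypothetical image-analysis output for three spots.
R, Rb = [1200.0, 300.0, 95.0], [100.0, 110.0, 105.0]
G, Gb = [800.0, 350.0, 150.0], [90.0, 100.0, 100.0]

R_corr = background_correct(R, Rb)   # R' = R - Rb
G_corr = background_correct(G, Gb)   # G' = G - Gb

# Assumption for illustration: flag spots that cannot be logged afterwards.
flagged = [i for i, (r, g) in enumerate(zip(R_corr, G_corr)) if r <= 0 or g <= 0]
```

The third spot ends up with a negative corrected red intensity, which hints at why low-intensity spots are the ones most affected by the choice of background estimate.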

Background Correction
• Note that the image analysis program you use can have quite an impact at this stage by drastically increasing variability, particularly at low intensities.

Panels: GenePix, Spot. Same array, different image analysis and background correction. Note this is not swirl.3.

Normalization for two color arrays

• Why? To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples.

• How do we know it is necessary? By examining self-self hybridizations, where no true differential expression is occurring. We find dye biases which vary with overall spot intensity, location on the array, plate origin, pins, scanning parameters, ….

Levels of Normalization for two color arrays

• Within-slides
  Which genes to use?
  Location normalization
  Scale normalization
• Paired-slides (dye-swap)
  Self-normalization
• Between-slides

False color overlay Boxplots within Grid plots MA-plots

Self-self hybridizations

log2 (R/G) → log2 (R/G) - c = log2 (R/(kG)), so k = 2^c

Standard practice (in most software): c is a constant such as the mean or median log ratio.

Scaling Normalization
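A minimal sketch of this global adjustment, taking c to be the median log ratio (one of the two choices mentioned). The function name `median_center` and the M values are hypothetical.

```python
import statistics

def median_center(m_values):
    """Subtract the median log ratio c from every M; return (M - c, c)."""
    c = statistics.median(m_values)
    return [m - c for m in m_values], c

# Hypothetical log ratios from one slide with an overall red bias.
M = [0.9, 0.4, 0.6, 0.5, -0.1]
M_norm, c = median_center(M)

# Equivalent multiplicative factor on the green channel: log2(R/(kG)).
k = 2 ** c
```

After the shift the log ratios are centred on zero, but any dependence of the bias on intensity A is left untouched.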

MA-plot after scaling

Before Scaling After Scaling

Intensity dependent adjustment

log2 (R/G) → log2 (R/G) - c(A) = log2 (R/(k(A)G))
• Compute c(A) by robust locally weighted regression of M on A.
• We typically use a loess curve for this purpose.
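The idea can be sketched in plain Python as below. This is a simplified lowess-style smoother (tricube-weighted local linear fits with bisquare robustness iterations), written for illustration only; it is not the implementation used for the plots in this talk, and the names `lowess_fit`, `span` and `robust_iters` as well as the data are assumptions. It is quadratic in the number of spots, so it suits small examples rather than whole arrays.

```python
import math
import statistics

def lowess_fit(x, y, span=0.4, robust_iters=2):
    """Fitted values of a robust local linear regression of y on x."""
    n = len(x)
    k = max(3, min(n, math.ceil(span * n)))      # points per local window
    robust_w = [1.0] * n
    fitted = list(y)
    for _ in range(robust_iters + 1):
        for i in range(n):
            dist = [abs(xj - x[i]) for xj in x]
            h = sorted(dist)[k - 1]              # k-th nearest distance
            sw = swd = swy = swdd = swdy = 0.0
            for j in range(n):
                if h > 0:
                    u = dist[j] / h
                else:
                    u = 0.0 if dist[j] == 0 else 2.0
                if u >= 1.0:
                    continue
                w = (1.0 - u ** 3) ** 3 * robust_w[j]   # tricube x robustness
                d = x[j] - x[i]                         # centre on x[i]
                sw += w
                swd += w * d
                swy += w * y[j]
                swdd += w * d * d
                swdy += w * d * y[j]
            if sw <= 0.0:
                continue                         # keep the previous fitted value
            denom = sw * swdd - swd * swd
            if abs(denom) < 1e-12:
                fitted[i] = swy / sw             # fall back to a weighted mean
            else:
                slope = (sw * swdy - swd * swy) / denom
                fitted[i] = (swy - slope * swd) / sw    # intercept at x[i]
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        s = statistics.median([abs(r) for r in resid])
        if s < 1e-12:
            break                                # fit is already essentially exact
        robust_w = [(1 - (r / (6 * s)) ** 2) ** 2 if abs(r) < 6 * s else 0.0
                    for r in resid]
    return fitted

# Hypothetical MA values with a curved, intensity-dependent dye bias.
A = [6.0 + 0.5 * i for i in range(20)]
M = [0.02 * (a - 10.0) ** 2 - 0.3 for a in A]

c_of_A = lowess_fit(A, M)                        # estimate of c(A)
M_norm = [m - c for m, c in zip(M, c_of_A)]      # log2(R/G) - c(A)
```

Unlike a single constant c, the fitted curve follows the bias as it changes with A, so the normalized log ratios are pulled toward zero across the whole intensity range.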

MA-plot after loess normalization

After global loess normalization

Boxplot: print-tip effects remain after global loess normalization

Within print-tip group normalization

• In addition to intensity-dependent variation in log ratios, spatial bias can also be a significant source of systematic error. Most normalization methods do not correct for spatial effects produced by hybridization artifacts or print-tip or plate effects during the construction of the microarrays.

• It is possible to correct for both print-tip and intensity-dependent bias by performing LOWESS fits to the data within print-tip groups, i.e.
  log2 (R/G) → log2 (R/G) - c_i(A) = log2 (R/(k_i(A)G)),
  where c_i(A) is the LOWESS fit to the MA-plot for the ith grid only.

Print-tip normalized data: MA-plot

Print-tip normalized data: boxplot

Smoothed histograms of M values

Black: unnormalized; red: global median; green: global lowess; blue: print-tip lowess

MSP titration series (Microarray Sample Pool)

Control set to aid intensity-dependent normalization

Different concentrations

Spotted evenly spread across the slide

Pool the whole library

Yellow: GAPDH, tubulin; Light blue: MSP pool / titration; Orange: Schadt-Wong rank invariant set; Red line: lowess smooth

MSP normalization compared to other methods

Composite normalization

Before and after composite normalization

- MSP lowess curve
- Global lowess curve
- Composite lowess curve
(Other colours: control spots)

c_i(A) = α_A g(A) + (1 - α_A) f_i(A)

Paired-slides: dye-swap
• Slide 1, M = log2 (R/G) - c
• Slide 2, M’ = log2 (R’/G’) - c’

Combine by subtracting the normalized log-ratios:
[ (log2 (R/G) - c) - (log2 (R’/G’) - c’) ] / 2
≈ [ log2 (R/G) + log2 (G’/R’) ] / 2
= [ log2 (RG’/(GR’)) ] / 2
provided c = c’.
Assumption: the normalization functions are the same for the two slides.
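A small sketch of why the dye-swap combination removes a shared dye bias. The numbers are simulated under stated assumptions: a constant red-channel bias factor of 2 on both slides (so c = c’ = 1) and true fold changes of 4, 1 and 0.5. The function name `dye_swap_combine` is hypothetical.

```python
import math

def dye_swap_combine(r1, g1, r2, g2):
    """Per-spot self-normalized log ratio: [log2(R/G) - log2(R'/G')] / 2."""
    return [
        (math.log2(a / b) - math.log2(c / d)) / 2.0
        for a, b, c, d in zip(r1, g1, r2, g2)
    ]

# Slide 1: treated sample in red (bias 2 x fold change), reference in green.
R1, G1 = [800.0, 200.0, 100.0], [100.0, 100.0, 100.0]
# Slide 2 (dyes swapped): reference in red (bias 2), treated sample in green.
R2, G2 = [200.0, 200.0, 200.0], [400.0, 100.0, 50.0]

combined = dye_swap_combine(R1, G1, R2, G2)
```

The combined values are 2, 0 and -1, i.e. the log2 of the simulated fold changes, even though neither slide was normalized on its own. This only works when the bias really is the same on both slides, which is the assumption stated above.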

Checking the assumption

MA plot for slides 1 and 2: it isn’t always like this.

Result of self-normalization: (M - M’)/2 vs. (A + A’)/2

One way of taking scale into account

Assumption: All slides have the same spread in M

True log ratio is m_ij, where i represents different slides and j represents different spots.

Observed is M_ij, where
M_ij = a_i m_ij

Robust estimate of a_i is
MAD_i / (MAD_1 × MAD_2 × … × MAD_I)^(1/I), for slides i = 1, …, I,

where
MAD_i = median_j { |M_ij - median_j(M_ij)| }
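A minimal sketch of MAD-based scale normalization between slides. It assumes the scale factor for each slide is its MAD divided by the geometric mean of the MADs over all slides, so that the factors multiply to one; that choice, the function names and the data are assumptions made for illustration.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def scale_normalize(slides):
    """Divide each slide's log ratios by its estimated scale factor a_i."""
    mads = [mad(s) for s in slides]
    if any(m <= 0 for m in mads):
        raise ValueError("every slide needs a positive MAD")
    geo_mean = math.exp(sum(math.log(m) for m in mads) / len(mads))
    factors = [m / geo_mean for m in mads]           # estimates of a_i
    scaled = [[v / a for v in s] for s, a in zip(slides, factors)]
    return scaled, factors

# Hypothetical log ratios: the same spots measured with three different spreads.
base = [-1.0, -0.5, 0.0, 0.5, 1.0]
slides = [base, [2 * v for v in base], [4 * v for v in base]]

scaled, factors = scale_normalize(slides)
```

With these inputs the estimated factors are 0.5, 1 and 2, and after rescaling all three slides have the same MAD. Location normalization is assumed to have been done beforehand, since dividing by a_i does not recentre the log ratios.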

Scale normalization: between slides

Boxplots of log ratios from 3 replicate self-self hybridizations.

Before normalization After location normalization After scale normalization

Scale normalization: swirl dataset
