
Structure and Analysis of Affymetrix Arrays

Monnie McGee

Department of Statistical Science

Southern Methodist University

UTSW Microarray Analysis Course, October 28, 2005

Outline

Brief Review of Spotted Array Technology

Structure of Affymetrix Arrays

Exploratory Data Analysis

Affymetrix Data Files

Obtaining Gene Expression Values

Software


Microarray Measurements

All raw measurements are fluorescence intensities

Target cDNA (or mRNA) is fluorescently labeled

Molecules in dye are excited using a laser

Measurement is a count of the photons emitted

Entire slide or chip is scanned, and the result is a digital image

Image is processed to locate probes and assign intensity measurements to each probe


Microarray Technologies

Two-Channel Spotted Arrays

Robotic microspotting

Probes are 300 to 3000 base pairs in length

Long-oligo arrays: probes are uniformly 60 to 90 bp

Commercial arrays using inkjet technology

Single-Channel Arrays

High-density short oligo (25 bp) arrays (Affymetrix, Nimblegen)


Spotted Arrays

Diagram courtesy of Columbia Department of Computer Science


The Affymetrix Chip

Human Genome U133 Plus 2.0 Array

Courtesy of Affymetrix

Some Definitions

Probes = 25 bp sequences

Probe sets = 11 to 20 probes corresponding to a particular gene or EST

Chip contains 54K probe sets


In situ Synthesis of Probes

Image Courtesy of Affymetrix


Probe Selection: HG-U133 Plus 2.0

Sequence data for new content obtained from dbEST, GenBank, and RefSeq.

Draft assembly of Human Genome (NCBI Build 31) used to assess sequence orientation and quality.

Probes selected from the 600 bases most proximal to the 3′ end of each transcript.

Probe selection regions defined by the following:

3′ ends of RefSeq and complete CDS mRNA sequences

Eight or more 3′ EST reads terminating at the same position (evidence for polyadenylation)

3′ end of the assembly (consensus end).

Details found in Affymetrix Technical Note (2003).


Types of Probe Sets

No suffix: predicted to perfectly match a single transcript

“_a” suffix: recognize multiple alternative transcripts from the same gene

“_s” suffix: common probes among multiple transcripts from separate genes

“_x” suffix: contain some probes that are identical or highly similar to other sequences.


mRNA Hybridizes to Probes



Sizes of Various GeneChips

Arrays for 27 organisms

Arabidopsis (2), Drosophila (2), Mouse (5), Human (8), Yeast (2)

Arabidopsis: 24K genes, 11 pairs per probe set

C. elegans: 22.5K genes, 11 pairs per probe set

Drosophila: 13.5K genes, 14 pairs per probe set

Human HG-U133 Plus 2.0: 54K genes, 11-20 pairs per probe set.

Source: http://www.affymetrix.com/support/technical/datasheets.affx



Perfect Match vs. Mismatch

PM Probe = 25 bp probe perfectly complementary to a specific region of a gene

MM Probe = 25 bp probe agreeing with a PM apart from the middle base.

The middle (13th) base is replaced by its complement (A ⇐⇒ T, C ⇐⇒ G)



Riddle of the Mismatches

Mismatches were designed to capture non-specific hybridization

Hypothesized True Signal = PM - MM

Problem: Approximately 30% of the mismatches are greater than their corresponding perfect matches.

WHY?


PM and MM Example

Target Transcript for Human recA gene:

ctcagcttaagtcatggaattctagaggatgtatctcacaagtaggatcaag

c t c a g c t t a a g t c a t g g a a t t c t a g PM1

c t c a g c t t a a g t g a t g g a a t t c t a g MM1

t c a g c t t a a g t c a t g g a a t t c t a g a PM2

t c a g c t t a a g t c t t g g a a t t c t a g a MM2

a t t c t a g a g g a t g t a t c t c a c a a g t PM3

a t t c t a g a g g a t c t a t c t c a c a a g t MM3

a g g a t g t a t c t c a c a a g t a g g a t c a PM4

a g g a t g t a t c t c t c a a g t a g g a t c a MM4

Morals: Large Overlap of sequences and variable GC content
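The probe pairs in this example can be reproduced in a few lines. The sketch below assumes only what the slides state (25 bp probes, mismatch at the middle base, which is swapped for its complement); the function names `mismatch_probe` and `gc_content` are illustrative and not part of any Affymetrix tool.

```python
# Complement table for lowercase DNA bases.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def mismatch_probe(pm):
    """Return the MM probe: the PM probe with its middle base complemented."""
    middle = len(pm) // 2              # index 12 for a 25-mer (the 13th base)
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

def gc_content(probe):
    """Fraction of bases in the probe that are g or c."""
    return sum(base in "gc" for base in probe) / len(probe)

# The four PM probes from the example above.
pm_probes = [
    "ctcagcttaagtcatggaattctag",
    "tcagcttaagtcatggaattctaga",
    "attctagaggatgtatctcacaagt",
    "aggatgtatctcacaagtaggatca",
]

for pm in pm_probes:
    print(pm, mismatch_probe(pm), round(gc_content(pm), 2))
```

Printing the GC fraction next to each pair makes the second moral visible: probes within one probe set differ in GC content.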


Other Sources of Variation

Systematic

Amount of RNA in biopsy extraction, efficiencies of RNA extraction, reverse transcription, labeling, photodetection, GC content of probes

Similar effect on many measurements

Corrections can be estimated from data

Calibration corrections

Stochastic

PCR yield, DNA quality, spotting efficiency, spot size, non-specific hybridization, stray signal

Too random to be explicitly accounted for in a model

Noise components & “Schmutz”


Quality Control

We wish to find and eliminate problem probes before analyzing the data

Problems may be local (scratch on the array, inadequate washing) or global (background set too high)

Look at image plots, histograms, MA plots, boxplots, etc.


Contaminated Image

Image courtesy of http://www.biostat.harvard.edu/complab/dchip



Why Normalize ?

Ensure that differences in intensities are truly due to differential expression, not printing, hybridization, or scanning artifacts

Must be done before an analysis which involves comparison of intensities within or between slides

Procedures depend on the array technology


Dilution Data

Human liver tissue hybridized to human array HGU95A

Large range of proportions and dilutions

Our data hybridized at 10.0 and 20.0 µg

Two replicate arrays for each generated cRNA

Each array replicate was processed in a different scanner

For more information, see http://qlotus02.genelogic.com/datasets.nsf/



Histograms from Dilution Study

[Figure: density histograms for the dilution study arrays. x-axis: log intensity (about 6 to 14); y-axis: density (0.0 to 0.6).]


Boxplots

[Figure: boxplots of log intensity (about 6 to 14) for arrays X20A, X20B, X10A, and X10B. Title: Small part of dilution study.]


M-A Plots

Plot of log fold change for gene $j$ ($M_j$) versus the average log intensity for that gene ($A_j$).

[Figure: M-A plot, 10B vs pseudo-median reference chip. x-axis: A (about 6 to 14); y-axis: M (about −1 to 4). Median: −0.535, IQR: 0.207.]
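As a rough illustration of how the two axes are computed, the sketch below derives M and A for one array against a reference. It assumes the "pseudo-median reference chip" is the probe-wise median across arrays, which the slides do not spell out; the function names are hypothetical.

```python
import numpy as np

def pseudo_median_reference(X):
    """Probe-wise median across arrays (rows = probes, columns = arrays).
    Assumption: this is what the reference chip in the plot refers to."""
    return np.median(X, axis=1)

def ma_values(x, ref):
    """Return (M, A): log2 fold change and average log2 intensity."""
    log_x, log_ref = np.log2(x), np.log2(ref)
    return log_x - log_ref, (log_x + log_ref) / 2.0

# Toy intensities for three arrays and four probes.
X = np.array([[120.0, 100.0, 90.0],
              [800.0, 900.0, 850.0],
              [60.0, 45.0, 50.0],
              [3000.0, 2500.0, 2800.0]])
M, A = ma_values(X[:, 0], pseudo_median_reference(X))
print(np.median(M), np.percentile(M, 75) - np.percentile(M, 25))  # median and IQR of M
```

A median of M far from zero, as in the plot above, suggests the array needs normalization against the reference.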


Exploratory Data Analysis


Exploratory Data Analysis (cont’d)


Affymetrix Files

CDF file: Chip description file, describes which probes go into which probe sets

DAT file: TIFF image file, $10^7$ pixels, ∼ 50 MB

CEL file: Probe intensities, ∼ 600,000 numbers

CHP file: Gene expression values as calculated by GeneChip Operating Software (GCOS)

Probe sets correspond to genes, gene fragments, or ESTs


Affymetrix DAT file

Scan of whole chip (left) and top left-hand corner (right) of Arabidopsis thaliana Genome Array.

Images courtesy of NASCA Arrays Help.


From DAT to CEL

CEL files contain fluorescence intensity values for all probe pairs and all probe sets.

Use gridding to estimate location of probe cell centers

Remove outer 36 pixels → 8 × 8 pixels

PM (MM) intensity is the 75th percentile of the 8 × 8 pixel values

Background: Average of the lowest 2% of probe cells is subtracted
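The steps above can be sketched as follows. The sketch assumes each probe cell is a 10 × 10 pixel block (so that dropping the 36 border pixels leaves 8 × 8), which is inferred from the numbers on the slide; `cell_intensity` and `global_background` are illustrative names, not GCOS functions.

```python
import numpy as np

def cell_intensity(cell_pixels):
    """75th percentile of the pixels left after dropping the one-pixel border."""
    inner = cell_pixels[1:-1, 1:-1]          # 10 x 10 -> 8 x 8
    return np.percentile(inner, 75)

def global_background(cell_values):
    """Average of the lowest 2% of probe cell intensities."""
    values = np.sort(np.asarray(cell_values, dtype=float).ravel())
    n_low = max(1, values.size // 50)        # 2% of the cells, at least one
    return values[:n_low].mean()

# Toy chip: 100 probe cells, each a 10 x 10 pixel block.
rng = np.random.default_rng(0)
pixels = rng.gamma(shape=2.0, scale=100.0, size=(100, 10, 10))
cells = np.array([cell_intensity(block) for block in pixels])
corrected = cells - global_background(cells)
```

Using a percentile of the inner pixels rather than the mean keeps bright or dark pixels bleeding in from neighboring cells from dominating the cell value.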


Analysis Tasks

Identify up- and down-regulated genes.

Find groups of genes with similar expression profiles.

Find groups of experiments (tissues) with similar expression profiles.

Find genes that explain observed differences among tissues (feature selection).


From CEL to Gene Expression

Computing expression values for each probe set requires three steps which begin with probe level data:

Central Dogma of Microarray Analysis:

Background correction (local vs. global)

Normalization (baseline array vs. complete data)

Summarization (single vs. multiple chips)


From CEL to Gene Expression

The “Big Four” algorithms for correcting, normalizing, and summarizing probe level data.

Microarray Analysis Suite 5.0 (MAS5 - Affymetrix, 2001, 2003)

Model Based Expression Index (MBEI - Li and Wong, 2001a,b)

Robust Multichip Analysis (RMA - Irizarry et al., 2003)

Significance Analysis of Microarrays (SAM - Tusher, Tibshirani, and Chu, 2001)


Background Correction in MAS 5.0

Affymetrix proposed two methods: location specific adjustment and ideal mismatch.

Location Specific Adjustment:

Array is split into $K$ rectangular zones, denoted $Z_k$, $k = 1, \ldots, K$. The default for $K$ is 16.

Control cells and masked cells are not used in the calculation

Intensities within zones are ranked and the lowest 2% is chosen as the background $b$ for that zone ($b_{Z_k}$)

Standard deviation of $b_{Z_k}$ is calculated as an estimate of the background variability $n$ for each zone ($n_{Z_k}$).


Background Correction in MAS 5.0

Result is smoothed via the following formula

$$w_k(x, y) = \frac{1}{d_k^2(x, y) + \psi}$$

The background is given by

$$b(x, y) = \frac{1}{\sum_{k=1}^{K} w_k(x, y)} \sum_{k=1}^{K} w_k(x, y)\, b_{Z_k}$$

where $d_k(x, y)$ is the Euclidean distance between chip coordinate $(x, y)$ and the center of the $k$th zone and $\psi$ is a smoothing parameter (100 by default).


LSA Continued

Calculate a local noise background $n$ based on the standard deviation of the lowest 2% of the background in that zone ($n_{Z_k}$).

Weight $n_{Z_k}$ for background values using the same formula as for smoothing of the background correction, giving $n(x, y)$

Set threshold and floor such that no value is adjusted below that threshold.

Compute the Adjusted Intensity, $A(x, y)$, via

$$A(x, y) = \max\bigl(I'(x, y) - b(x, y),\; f\, n(x, y)\bigr)$$

where $I'(x, y) = \max(I(x, y), 0.5)$, $I(x, y)$ is the cell intensity at chip coordinates $(x, y)$, and $f$ (default 0.5) is the threshold.
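The location specific adjustment can be sketched in NumPy as below. This is a minimal illustration of the formulas on these slides, not the GCOS implementation: it takes the mean of the lowest 2% of cells as the zone background (an assumption consistent with the "From DAT to CEL" slide), ignores control and masked cells, and uses cell indices as chip coordinates. The function name and the `grid` parameter (4 × 4 = 16 zones) are illustrative.

```python
import numpy as np

def mas5_background_adjust(intensity, grid=4, psi=100.0, f=0.5):
    """Zone-based background adjustment; returns A(x, y) for every cell."""
    intensity = np.asarray(intensity, dtype=float)
    rows, cols = intensity.shape
    row_edges = np.linspace(0, rows, grid + 1).astype(int)
    col_edges = np.linspace(0, cols, grid + 1).astype(int)

    centers, b_zone, n_zone = [], [], []
    for r0, r1 in zip(row_edges[:-1], row_edges[1:]):
        for c0, c1 in zip(col_edges[:-1], col_edges[1:]):
            cells = np.sort(intensity[r0:r1, c0:c1].ravel())
            lowest = cells[: max(1, cells.size // 50)]   # lowest 2% of the zone
            b_zone.append(lowest.mean())                 # zone background b_Zk
            n_zone.append(lowest.std())                  # zone noise n_Zk
            centers.append(((r0 + r1) / 2.0, (c0 + c1) / 2.0))
    centers = np.array(centers)
    b_zone, n_zone = np.array(b_zone), np.array(n_zone)

    # Squared distance from every cell to every zone center: shape (K, rows, cols).
    yy, xx = np.mgrid[0:rows, 0:cols]
    d2 = ((yy[None] - centers[:, 0, None, None]) ** 2
          + (xx[None] - centers[:, 1, None, None]) ** 2)
    w = 1.0 / (d2 + psi)                                 # w_k(x, y)
    w_sum = w.sum(axis=0)
    b = np.tensordot(b_zone, w, axes=1) / w_sum          # smoothed background b(x, y)
    n = np.tensordot(n_zone, w, axes=1) / w_sum          # smoothed noise n(x, y)

    floored = np.maximum(intensity, 0.5)                 # I'(x, y)
    return np.maximum(floored - b, f * n)                # A(x, y)
```

The weights fall off with squared distance, so each cell's background is dominated by its own zone but changes smoothly across zone boundaries.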


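Affy Method 2: Ideal Mismatch

$$IM_{ij} = \begin{cases} MM_{ij}, & MM_{ij} < PM_{ij} \\[4pt] \dfrac{PM_{ij}}{2^{SB_i}}, & MM_{ij} \ge PM_{ij} \text{ and } SB_i > \tau_c \\[4pt] \dfrac{PM_{ij}}{2^{\tau_c / \left(1 + \frac{\tau_c - SB_i}{\tau_s}\right)}}, & MM_{ij} \ge PM_{ij} \text{ and } SB_i \le \tau_c \end{cases}$$

where $\tau_s$ is a cutoff describing the variability of the probe pairs within the probe set, and $\tau_c$ is some tolerance level.

Defaults: $\tau_c = 0.03$, $\tau_s = 10$.

Now $\text{Signal} = T_{bi}(PV_{i,1}, \ldots, PV_{i,n_i})$, where

$PV_{i,j} = \log_2(\max(PM_{ij} - IM_{ij}, \delta))$, for $\delta$ small.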
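A compact sketch of the ideal mismatch and signal computation for one probe set follows. The one-step Tukey biweight uses tuning constants `c = 5` and `eps = 1e-4`, and `delta = 2**-20`; these are assumed defaults recalled from the Affymetrix algorithm description, not values given on the slides. $SB_i$ is the biweight of the log ratios, as defined on the summarization slide.

```python
import numpy as np

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a robust weighted mean around the median."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    s = np.median(np.abs(x - m))                  # median absolute deviation
    u = (x - m) / (c * s + eps)
    w = np.where(np.abs(u) <= 1, (1 - u ** 2) ** 2, 0.0)
    return np.sum(w * x) / np.sum(w)

def ideal_mismatch(pm, mm, tau_c=0.03, tau_s=10.0):
    """Return (IM, SB) for one probe set, following the three cases above."""
    pm, mm = np.asarray(pm, dtype=float), np.asarray(mm, dtype=float)
    sb = tukey_biweight(np.log2(pm) - np.log2(mm))
    if sb > tau_c:
        fallback = pm / 2 ** sb
    else:
        fallback = pm / 2 ** (tau_c / (1 + (tau_c - sb) / tau_s))
    return np.where(mm < pm, mm, fallback), sb

def mas5_log_signal(pm, mm, delta=2.0 ** -20):
    """Tukey biweight of the probe values PV = log2(max(PM - IM, delta))."""
    im, _ = ideal_mismatch(pm, mm)
    pv = np.log2(np.maximum(np.asarray(pm, dtype=float) - im, delta))
    return tukey_biweight(pv)

pm = [200.0, 300.0, 150.0, 400.0]
mm = [100.0, 150.0, 200.0, 100.0]   # third pair has MM > PM
im, sb = ideal_mismatch(pm, mm)
print(im, sb, mas5_log_signal(pm, mm))
```

Because every fallback value divides PM by a number greater than one, IM stays below PM and the difference PM − IM is always positive.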


Normalization in MAS 5.0

Let $X$ be a $p \times n$ matrix with columns representing arrays and rows representing probes or probe sets.

Pick a column of $X = \log(X)$ to serve as baseline array, say column $j$.

1. Compute (trimmed) mean of column $j$. Call this $\tilde{X}_j$.

2. Compute (trimmed) mean of column $i$. Call this $\tilde{X}_i$.

3. Compute $\beta_i = \tilde{X}_j / \tilde{X}_i$.

4. Multiply elements of column $i$ by $\beta_i$.

Repeat 2 – 4 for all columns.
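The scaling steps can be written directly in NumPy. This sketch applies them to whatever matrix it is given; the 2% trimming fraction at each end is an illustrative choice rather than a value stated on the slide, and the function names are hypothetical.

```python
import numpy as np

def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(trim * v.size)
    return v[k: v.size - k].mean()

def scale_to_baseline(X, baseline=0, trim=0.02):
    """Scale every column so its trimmed mean matches the baseline column's."""
    X = np.asarray(X, dtype=float)
    target = trimmed_mean(X[:, baseline], trim)          # X~_j
    out = X.copy()
    for i in range(X.shape[1]):
        beta_i = target / trimmed_mean(X[:, i], trim)    # beta_i = X~_j / X~_i
        out[:, i] *= beta_i
    return out
```

This is a single multiplicative constant per array, so it can remove an overall brightness difference but not an intensity-dependent one.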


Summarization in MAS 5.0

A signal (expression) value is calculated by combining the probe intensities for each probe pair within a probe set.

Find a typical log ratio of PM to MM for probe pair $j$ in probe set $i$, known as Specific Background

$$SB_i = T_{bi}\bigl(\log_2(PM_{ij}) - \log_2(MM_{ij}) : j = 1, \ldots, n_i\bigr)$$

where $T_{bi}$ is the Tukey Biweight.

If $SB_i$ is large, values from the probe set are used directly to construct the ideal mismatch (IM) for a probe pair.

If $SB_i$ is small (as defined by $\tau_c$), smooth MM to use more of the PM value as IM.


Content of the CHP file

Data analysis output for a Single Array Analysis includes the following:

List of probes (transcripts)

Stat Pairs: Number of probe pairs to interrogate each gene

Stat Pairs Used: Number of pairs used to calculate signal

Signal: Raw Adjusted Intensity

Detection Call: presence or absence of transcript

Detection P-value: p-value used to determine presence or absence of transcript


What is a P-value?

The probability that a test statistic as extreme or more extreme will be obtained assuming that the null hypothesis of the test is true.

For probe pairs, the null hypothesis is that there is no significant difference in intensity between PM and MM values for the same probe pair.


Absolute Analysis of One Array

Four steps to calculating presence/absence of transcripts:

1. Remove saturated probe pairs and ignore probe pairs where $PM \approx MM + \tau$ (default: $\tau = 0.015$).

2. Calculate discrimination scores ($R_i$) for each probe pair

$$R_i = \frac{PM_i - MM_i}{PM_i + MM_i}$$

3. Use Wilcoxon’s signed-rank test to calculate a p-value for each pair

4. Compare the p-value with preset significance levels as follows:

Present if $p < \alpha_1$ (default: $\alpha_1 = 0.04$).

Marginal if $\alpha_1 \le p < \alpha_2$

Absent if $p \ge \alpha_2$ (default: $\alpha_2 = 0.06$).


Comparisons of Multiple Arrays

Let $\gamma_1$ and $\gamma_2$ be user defined thresholds for change calls such that $0 < \gamma_1 < \gamma_2 < 1$.

$p$ = Change p-value, calculated using signed rank test comparing PM and MM differences for each probe pair in a probe set present on both arrays being compared.

Possible Outcomes:

Increase ($p < \gamma_1$)

Marginal Increase ($\gamma_1 \le p < \gamma_2$)

No Change ($\gamma_2 \le p \le 1 - \gamma_2$)

Marginal Decrease ($1 - \gamma_2 < p \le 1 - \gamma_1$)

Decrease ($p > 1 - \gamma_1$)

Source: http://www.wadsworth.org/genomics/microarray/
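The five outcomes partition the interval from 0 to 1, which a short function makes explicit. The threshold values used in the example call are purely illustrative; the slide does not give defaults for $\gamma_1$ and $\gamma_2$.

```python
def change_call(p, gamma1, gamma2):
    """Map a change p-value to one of the five comparison outcomes."""
    if p < gamma1:
        return "Increase"
    if p < gamma2:
        return "Marginal Increase"
    if p <= 1 - gamma2:
        return "No Change"
    if p <= 1 - gamma1:
        return "Marginal Decrease"
    return "Decrease"

# Illustrative thresholds only.
for p in (0.005, 0.03, 0.5, 0.97, 0.995):
    print(p, change_call(p, gamma1=0.01, gamma2=0.05))
```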



Marginal Calls

What do I do with Marginal Calls?

Ignore them (treat them as absent)

Include them (treat them as present)

Include them with some probability (detection filter - McClintick et al., 2003)

Examine literature

Examine other arrays for the call of that same transcript

Rules for the inclusion of marginal calls seem to be an open research question.


Problem: Multiple Comparisons

The Type I Error is the probability of rejecting the null hypothesis when it is true (1 - specificity).

$\alpha_1$ & $\gamma_1$ are meant to control P(Type I Error).

If $\alpha_1 = 0.04$, there are 4 chances in 100 that we will obtain a false positive result.

For absolute analysis, approximately 600,000 statistical tests are done for each array.

At $\alpha = 0.04$, we expect $600{,}000 \times 0.04 = 24{,}000$ false positive results!

Solutions: Bonferroni Adjustment, False Discovery Rate, etc.
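To make the two named remedies concrete, here is a minimal sketch of the Bonferroni adjustment and of the Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate. The slides name the approaches without specifying a procedure, so the choice of Benjamini-Hochberg and the function names are assumptions.

```python
import numpy as np

def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha

def bonferroni(p_values, alpha=0.05):
    """Reject only where p <= alpha / m, controlling the family-wise error rate."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(p_values, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= q*k/m."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    k = np.nonzero(passed)[0].max() + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bonferroni(p_values).sum(), benjamini_hochberg(p_values).sum())
```

Bonferroni is the stricter of the two; the false discovery rate approach tolerates a controlled fraction of false calls among the rejected hypotheses and so usually rejects more.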


Model Based Expression Index

Fit the following model using multiple chips for one gene:

$$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$$

where

$\theta_i$ is the expression index in chip $i$

$\phi_j$ is a scaling factor characterizing probe pair $j$

$\epsilon_{ij}$ are normal errors

Least squares estimates for parameters are carried out by iteratively fitting the set of $\theta$s and $\phi$s, treating the other set as known.

Standard errors of $\theta$ used to identify array outliers

Standard errors of $\phi$ used to identify probe outliers

MBEI model can also be based on PM only value
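The alternating least squares fit can be sketched as below for one probe set. The sketch fixes the scale with the constraint $\sum_j \phi_j^2 = J$, an identifiability convention assumed here because the slide does not state one, and it omits the outlier detection based on standard errors. The function name `fit_mbei` is illustrative.

```python
import numpy as np

def fit_mbei(Y, n_iter=50):
    """Fit y_ij = theta_i * phi_j by alternating least squares.

    Y has one row per chip i and one column per probe pair j.
    """
    Y = np.asarray(Y, dtype=float)
    n_probes = Y.shape[1]
    phi = np.ones(n_probes)
    for _ in range(n_iter):
        theta = Y @ phi / (phi @ phi)                 # update theta with phi held fixed
        phi = theta @ Y / (theta @ theta)             # update phi with theta held fixed
        phi *= np.sqrt(n_probes / (phi @ phi))        # enforce sum(phi^2) = J
    theta = Y @ phi / (phi @ phi)
    return theta, phi

# Noise-free toy data: 3 chips, 4 probe pairs.
theta_true = np.array([100.0, 200.0, 400.0])
phi_true = np.array([0.5, 1.0, 1.5, 1.0])
phi_true *= np.sqrt(phi_true.size / (phi_true @ phi_true))
theta_hat, phi_hat = fit_mbei(np.outer(theta_true, phi_true))
```

Each half-step is an ordinary least squares problem with a closed-form solution, which is why the iteration is cheap.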


Normalization in MBEI

Non–linear, baseline array method

1. Pick a column of $X$ to serve as baseline array, say column $j$. For MBEI, the common baseline array is one having median overall brightness.

2. Fit a smooth non-linear relationship mapping column $i$ to the baseline. Call this $\hat{f}_i$.

3. Normalized values for column $i$ are given by $\hat{f}_i(X_i)$.

4. Repeat 2 and 3 for all columns of $X$.

Various non-linear relationships are possible: cross-validated splines, running median lines, loess smoothers, etc.


Summarization in MBEI

For each probe set $n = 1, \ldots, N_P$, fit the model

$$\log_2\left(y_{ij}^{(n)}\right) = \beta_j^{(n)} + \alpha_i^{(n)} + \epsilon_{ij}^{(n)}$$

where $\alpha_i^{(n)}$ is a probe effect and $\epsilon_{ij}^{(n)}$ are errors.

Use standard linear regression techniques to fit the model.

The estimated $\beta_j^{(n)}$ are the base 2 log expression values.

Outlier arrays, probes, and individual intensities are removed prior to summarization.


Background Correction in RMA

Assumption: $X = S + Y$

where

$X$ = observed probe-level intensity

$S \sim \mathrm{Exp}(\alpha)$ = true signal

$Y \sim TN(\mu, \sigma^2)$ = background noise (TN denotes a truncated normal distribution)

Reference: Irizarry et al., Biostatistics, 2003


RMA for the Right–Brained ...

Image courtesy of Terry Speed


Parameter Estimation

Background corrected intensity is $E_{ij} = E(S_{ij} \mid X_{ij})$, where $i = 1, \ldots, G$ and $j = 1, \ldots, J$.

We need to estimate $\mu$, $\sigma$, and $\alpha$.

How does RMA estimate the parameters?

$\mu$ = Mode of observations to the left of the overall mode

$\sigma$ = Sample standard deviation for observations to left of overall mode

$\alpha$ = Mode of observations to the right of the overall mode
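Once the three parameters are in hand, the conditional expectation has a closed form. The sketch below uses the expression as recalled from Bolstad (2004), with $a = x - \mu - \sigma^2 \alpha$ and $b = \sigma$, and treats $\alpha$ as the rate of the exponential signal; both points are assumptions not spelled out on the slides. The parameters are passed in rather than estimated from modes, and the function name is illustrative.

```python
from math import erf, sqrt

import numpy as np

def _norm_pdf(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def _norm_cdf(z):
    return 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))

def rma_background_correct(x, mu, sigma, alpha):
    """E(S | X = x) under X = S + Y, S ~ Exp(alpha), Y ~ truncated N(mu, sigma^2)."""
    x = np.asarray(x, dtype=float)
    a = x - mu - sigma ** 2 * alpha
    b = sigma
    numerator = _norm_pdf(a / b) - _norm_pdf((x - a) / b)
    denominator = _norm_cdf(a / b) + _norm_cdf((x - a) / b) - 1.0
    return a + b * numerator / denominator

# Illustrative parameter values only.
x = np.array([50.0, 100.0, 200.0, 1000.0])
print(rma_background_correct(x, mu=100.0, sigma=20.0, alpha=0.01))
```

Unlike subtracting a constant, this correction always returns a positive value for a positive intensity, so the later log transform is well defined.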


Normalization in RMA

Quantile Normalization Algorithm

Given $n$ arrays of length $p$, form matrix $X$ of dimension $p \times n$ where each array is a column.

Sort each column of $X$ to give $X_{\text{sort}}$.

Take the mean across rows of $X_{\text{sort}}$.

Assign this mean to each element in the row to get quantile equalized $X'_{\text{sort}}$.

Rearrange each column of $X'_{\text{sort}}$ to have the same ordering as the original matrix $X$ to obtain $X_{\text{normalized}}$.
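The five steps translate almost line for line into NumPy. This is a minimal sketch; ties within a column are broken by sort order rather than averaged, which is a simplification.

```python
import numpy as np

def quantile_normalize(X):
    """Give every column (array) of X the same empirical distribution."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=0)                        # sort order of each column
    X_sort = np.take_along_axis(X, order, axis=0)        # X_sort
    row_means = X_sort.mean(axis=1)                      # mean across each sorted row
    X_norm = np.empty_like(X)
    # Put the row means back in each column's original ordering.
    np.put_along_axis(X_norm, order, row_means[:, None], axis=0)
    return X_norm

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
print(quantile_normalize(X))
```

After normalization every column contains exactly the same set of values, only in a different order, so boxplots of the arrays coincide.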


Summarization in RMA

Median Polish Algorithm (Tukey 1977, Bolstad 2004)

Fits the following model

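$$\log_2\left(y_{ij}^{(n)}\right) = \mu^{(n)} + \theta_j^{(n)} + \alpha_i^{(n)} + \epsilon_{ij}^{(n)}$$

with constraints

$$\mathrm{median}(\theta_j) = \mathrm{median}(\alpha_i) = 0$$

$$\mathrm{median}_i(\epsilon_{ij}) = \mathrm{median}_j(\epsilon_{ij}) = 0.$$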


Median Polish Algorithm

Form a matrix for each probe set $n$ such that the probes are in rows and the arrays are in columns.

Add a row and a column to give matrix of the form:

$$\begin{pmatrix} e_{11} & \cdots & e_{1 N_A} & a_1 \\ \vdots & \ddots & \vdots & \vdots \\ e_{I_n 1} & \cdots & e_{I_n N_A} & a_{I_n} \\ b_1 & \cdots & b_{N_A} & m \end{pmatrix}$$

where, initially, $e_{ij} = y_{ij}^{(n)}$ and $a_i = b_j = m = 0$.


Median Polish (continued)

Take the median across columns, subtracting results from each element in that row and adding it to the final column

Take medians across rows, subtracting results from each element in that column and adding them to the final row.

Continue until the changes become small or zero

In conclusion: $\hat{\mu} = m$, $\hat{\theta}_j = b_j$, and $\hat{\alpha}_i = a_i$.
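A minimal sketch of the sweep is given below, with probes in rows and arrays in columns as on the slides. The iteration cap, the tolerance, and the re-centering of the effect vectors after each sweep (which moves their medians into `m`) are implementation choices assumed here. The input should already be on the log2 scale.

```python
import numpy as np

def median_polish(Y, max_iter=10, tol=1e-6):
    """Decompose Y into m + a_i + b_j + e_ij by alternating median sweeps."""
    e = np.asarray(Y, dtype=float).copy()
    a = np.zeros(e.shape[0])          # row (probe) effects
    b = np.zeros(e.shape[1])          # column (array) effects
    m = 0.0                           # overall level
    for _ in range(max_iter):
        row_med = np.median(e, axis=1)
        e -= row_med[:, None]
        a += row_med
        shift = np.median(b)          # keep median(b) at zero
        b -= shift
        m += shift

        col_med = np.median(e, axis=0)
        e -= col_med[None, :]
        b += col_med
        shift = np.median(a)          # keep median(a) at zero
        a -= shift
        m += shift

        if max(np.abs(row_med).max(), np.abs(col_med).max()) < tol:
            break
    return m, a, b, e

# Toy log2 intensities: 3 probes x 3 arrays, exactly additive.
Y = 5.0 + np.array([-1.0, 0.0, 2.0])[:, None] + np.array([-3.0, 0.0, 1.0])[None, :]
m, a, b, e = median_polish(Y)
expression = m + b                    # one log2 expression value per array
```

The expression value reported for array $j$ is $\hat{\mu} + \hat{\theta}_j$, that is, `m + b[j]` in this sketch.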


Significance Analysis of Microarrays

Algorithm to determine “significantly” expressed genes

Original article mentions use of GeneChip Analysis Suite software for background correction, normalization and summarization.

Assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements.

If the score exceeds a threshold, use permutations of repeated measurements to estimate the percentage of genes identified by chance.

More Information: http://www-stat.stanford.edu/~tibs/SAM/.
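The score and the permutation idea can be sketched as follows. The score is the difference in group means divided by a pooled standard deviation plus a small constant `s0`. The permutation step is a simplified stand-in that counts genes exceeding a fixed threshold under shuffled labels; it is not the full SAM procedure. All function names, the fixed threshold, and the default `s0 = 0` are assumptions of this sketch.

```python
import numpy as np

def sam_scores(x1, x2, s0=0.0):
    """Relative difference d = (mean1 - mean2) / (s + s0) for each gene (row)."""
    n1, n2 = x1.shape[1], x2.shape[1]
    diff = x1.mean(axis=1) - x2.mean(axis=1)
    ss = (((x1 - x1.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
          + ((x2 - x2.mean(axis=1, keepdims=True)) ** 2).sum(axis=1))
    s = np.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return diff / (s + s0)

def expected_false_calls(x1, x2, threshold, s0=0.0, n_perm=200, seed=0):
    """Average number of genes with |d| > threshold when sample labels are shuffled."""
    rng = np.random.default_rng(seed)
    data = np.hstack([x1, x2])
    n1 = x1.shape[1]
    counts = []
    for _ in range(n_perm):
        perm = rng.permutation(data.shape[1])
        d = sam_scores(data[:, perm[:n1]], data[:, perm[n1:]], s0)
        counts.append(np.sum(np.abs(d) > threshold))
    return float(np.mean(counts))
```

Dividing the expected number of chance calls by the number of genes actually called gives a rough estimate of the percentage identified by chance.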



Microarray Software

Open Source

Bioconductor: Calculates RMA, MBEI, MAS5, http://www.bioconductor.org

dChip (MBEI only, http://biosun1.harvard.edu/complab/dchip/)

Significance Analysis of Microarrays (SAM)

Generalized Probe Model (GPM - Fan et al., 2005, http://qge.fhcrc.org/probeplus)

Commercial

GCOS, MAS 5.0 (Affymetrix)

S-Plus ArrayAnalyzer: Calculates RMA, MBEI, MAS5*

Iobion GeneTraffic: RMA, MBEI, MAS5*



References

1. Affymetrix, Inc. (2001). "Statistical Algorithms Reference". Data Analysis Fundamentals Technical Manual, Chapter 5. www.affymetrix.com.

2. Affymetrix Technical Note (2003). Design and Performance of the GeneChip Human Genome U133 Plus 2.0 and Human Genome U133A Plus 2.0 Arrays. www.affymetrix.com.

3. Affymetrix, Inc. (2002). Statistical Algorithms Description Document. www.affymetrix.com.

4. Bolstad, Ben (2004). Low Level Analysis of High-density Oligonucleotide Array Data: Background, Normalization and Summarization. Dissertation. University of California, Berkeley.

5. Fan W, Pritchard JI, Olson JM, Khalid N, and Zhao LP (2005). A class of models for analyzing gene expression analysis array data. BMC Genomics, 6:16, http://www.biomedcentral.com/1471-2164/6/16/.

6. Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31 (4): e15.

7. Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4: 249–264.



References Continued

8. Li, C. and Wong, W. H. (2001). Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences, 98 (1): 31-36.

9. Li, C. and Wong, W. H. (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology, 2 (8): research0032.1-0032.11.

10. McClintick JN, Jerome RE, Nicholson CR, Crabb DW, Edenberg HJ (2003). Reproducibility of oligonucleotide arrays using small samples. BMC Genomics, 4:4, http://www.biomedcentral.com/1471-2164/4/4.

11. Naef, F and Magnasco (2003). Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays. Physical Review, 68.

12. Tukey JW (1977). Exploratory Data Analysis. Addison-Wesley, Reading, Massachusetts.

13. Tusher VG, Tibshirani R and Chu G (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 98: 5116-5121 (Apr 24).

