+ All Categories
Home > Documents > Analysis of Affymetrix GeneChip Data

Analysis of Affymetrix GeneChip Data

Date post: 14-Jan-2016
Category:
Upload: henrik
View: 29 times
Download: 0 times
Share this document with a friend
Description:
Analysis of Affymetrix GeneChip Data. EPP 245/298 Statistical Analysis of Laboratory Data. Basic Design of Expression Arrays. For each gene that is a target for the array, we have a known DNA sequence. - PowerPoint PPT Presentation
Popular Tags:
31
1 Analysis of Affymetrix GeneChip Data EPP 245/298 Statistical Analysis of Laboratory Data
Transcript
Page 1: Analysis of Affymetrix GeneChip Data

1

Analysis of Affymetrix GeneChip Data

EPP 245/298

Statistical Analysis of

Laboratory Data

Page 2: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

2

Basic Design of Expression Arrays

• For each gene that is a target for the array, we have a known DNA sequence.

• mRNA is reverse transcribed to DNA, and if a complementary sequence is on the on a chip, the DNA will be more likely to stick

• The DNA is labeled with a dye that will fluoresce and generate a signal that is monotonic in the amount in the sample

Page 3: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

3

TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACGATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGCTATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATGC

Exon Intron

Probe Sequence

•cDNA arrays use variable length probes•Long oligoarrays use 60-70mers•Affymetrix GeneChips use multiple 25-mers

•For each exon, 8-20 distinct probes •May overlap •May cover more than one exon

•Affymetrix chips also use mismatch (MM) probes that have the same sequence as perfect match probes except for the middle base which is changed to inhibitbinding.•This is supposed to act as a control, but often instead binds to another mRNA species, so many analysts do not use them

Page 4: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

4

Expression Indices

• A key issue with Affymetrix chips is how to summarize the multiple data values on a chip for each probe set (aka gene).

• There have been a large number of suggested methods.

• Generally, the worst ones are those from Affy, by a long way; worse means less able to detect real differences

Page 5: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

5

Usable Methods

• Li and Wong’s dCHIP and follow on work is demonstrably better than MAS 4.0 and MAS 5.0, but not as good as RMA and GLA

• The RMA method of Irizarry et al. is available in Bioconductor.

• The GLA method (Durbin, Rocke, Zhou) is also available in Bioconductor

Page 6: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

6

How to Install R and Bioconductor

• Download R install file from website and run installation

• Download Bioconductor as given below

>source("http://www.bioconductor.org/biocLite.R")>biocLite()

Page 7: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

7

Bioconductor Documentation> library(affy)Welcome to Bioconductor Vignettes contain introductory material. To view, simply type: openVignette() For details on reading vignettes, see the openVignette help page.> openVignette()Please select (by number) a vignette 1: affy primer 2: affy: Built-in Processing Methods 3: affy: Custom Processing Methods (HowTo) 4: affy: Automatic downloading of cdfenvs (HowTo): 5: affy: Import Methods (HowTo) 6: Biobase Primer 7: Howto Bioconductor 8: HowTo HowTo 9: esApply Introduction Selection: 6[1] TRUE

Page 8: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

8

The exprSet class

• An object of class exprSet has several slots– exprs A matrix of expression levels with

chips as columns and “genes” as rows.– se.exprs A matrix of standard errors for exprs if available.

– phenoData An object of class phenoData containing phenotype or experimental data.

Page 9: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

9

– description A description of the experiment, object of class MIAME

– annotation A character string indicating the base name for the associated annotation

– notes Notes describing features or aspects of the data, experiment, etc.

• The phenoData class has slots– pData A dataframe where rows are cases

and columns are variables– varLabels A list of labels and descriptions

for the variables– The number of rows in pData is the same as

the number of columns in exprs.

Page 10: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

10

> library(affy)> data(geneData)> class(geneData)[1] "matrix"> dim(geneData)[1] 500 26> data(geneCov)> dim(geneCov)[1] 26 3> covN <- list(cov1 = "Covariate 1; 2 levels",cov2 = "Covariate 2; 2 levels",+ cov3 <- "Covariate 3; 3 levels")> pD <- new("phenoData",pData=geneCov, varLabels = covN)> eSet <- new("exprSet",exprs=geneData,phenoData=pD)> dim(eSet@exprs)[1] 500 26

Page 11: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

11

Reading Affy Data into R

• The CEL files contain the data from an array. We will look at data from an older type of array, the U95A which contains 12,625 probe sets and 409,600 probes.

• The CDF file contains information relating probe pair sets to locations on the array. These are built into the affy package for standard types.

Page 12: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

12

The ReadAffy function

• ReadAffy() function reads all of the CEL files in the current working directory into an object of class AffyBatch

• ReadAffy(widget=T) does so in a GUI that allows entry of other characteristics of the dataset

• You can also specify filenames, phenotype or experimental data, and MIAME information

Page 13: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

13

> getClass("exprSet")

Slots: Name: exprs se.exprs phenoData descriptionClass: exprMatrix exprMatrix phenoData characterORMIAME

annotation notes character character

> getClass("AffyBatch")

Slots: Name: cdfName nrow ncol exprsClass: character numeric numeric exprMatrix

se.exprs phenoData description annotation notes exprMatrix phenoData characterORMIAME character character

Extends: "exprSet"

Page 14: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

14

A Sample Experiment

• This consists of 10 samples one each from 10 cell lines.

• The cell lines are of 5 types

• We have 10 Affymetrix U95A arrays

Page 15: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

15

group <- data.frame(as.factor(rep(1:5,each=2)))covN <- list(group = "Type of Cell Line")pD <- new("phenoData",pData=group,varLabels=covN)myMIAME <- new("MIAME",name="John Post, PhD", lab="Jane Q. Investigator, MD", contact="916-734-0000", title="Sample Experiment for EPP 298")

Data <- ReadAffy(filenames=c("LN0A.CEL","LN0B.CEL","LN1A.CEL","LN1B.CEL", "LN2A.CEL","LN2B.CEL","LN3A.CEL","LN3B.CEL","LN4A.CEL","LN4B.CEL"), phenoData=pD,description=myMIAME)

> dim(Data@exprs)[1] 409600 10> Data@exprs[1:5,] LN0A.CEL LN0B.CEL LN1A.CEL LN1B.CEL LN2A.CEL LN2B.CEL LN3A.CEL LN3B.CEL LN4A.CEL LN4B.CEL[1,] 137.0 103.0 120.0 133.0 124.0 188.0 183.0 143.0 117.0 93[2,] 2484.0 2040.8 2065.3 1687.5 9454.5 7732.8 7813.5 9873.8 2286.8 1930[3,] 274.0 146.0 145.0 133.0 159.0 190.0 185.0 153.0 133.0 141[4,] 1966.8 2387.0 2465.0 1961.0 9753.0 8088.0 7047.0 14705.0 2593.0 2125[5,] 71.5 51.5 65.0 58.5 73.3 68.3 66.3 87.3 58.8 53

Page 16: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

16

> Data@phenoData phenoData object with 1 variables and 10 cases varLabels group: Type of Cell Line

> Data@phenoData@pData as.factor.rep.1.5..each...2..1 12 13 24 25 36 37 48 49 510 5> Data@descriptionExperimenter name: John Post, PhD Laboratory: Jane Q. Investigator, MD Contact information: 916-734-0000 Title: Sample Experiment for EPP 298 URL: No abstract available.

Information is available on: preprocessing

Page 17: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

17

Expression Indices

• The 409,600 rows of the expression matrix in the AffyBatch object Data each correspond to a probe (25-mer)

• Ordinarily to use this we need to combine the probe level data for each probe set into a single expression number

• This has conceptually several steps

Page 18: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

18

Steps in Expression Index Construction

• Background correction is the process of adjusting the signals so that the zero point is similar on all parts of all arrays.

• We like to manage this so that zero signal after background correction corresponds approximately to zero amount of the mRNA species that is the target of the probe set.

Page 19: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

19

• Data transformation is the process of changing the scale of the data so that it is more comparable from high to low.

• Common transformations are the logarithm and generalized logarithm

• Normalization is the process of adjusting for systematic differences from one array to another.

• Normalization may be done before or after transformation, and before or after probe set summarization.

Page 20: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

20

• One may use only the perfect match (PM) probes, or may subtract or otherwise use the mismatch (MM) probes

• There are many ways to summarize 20 PM probes and 20 MM probes on 10 arrays (total of 200 numbers) into 10 expression index numbers

Page 21: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

21

The RMA Method

• Background correction that does not make 0 signal correspond to 0 amount

• Quantile normalization

• Log2 transform

• Median polish summary of PM probes

Page 22: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

22

> eidata <- rma(Data)Background correctingNormalizingCalculating Expression> class(eidata)[1] "exprSet"attr(,"package")[1] "Biobase"> dim(eidata@exprs)[1] 12625 10

Page 23: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

23

> summary(eidata@exprs) LN0A.CEL LN0B.CEL LN1A.CEL LN1B.CEL Min. : 2.724 Min. : 2.598 Min. : 2.634 Min. : 2.655 1st Qu.: 4.489 1st Qu.: 4.469 1st Qu.: 4.471 1st Qu.: 4.482 Median : 6.090 Median : 6.071 Median : 6.076 Median : 6.086 Mean : 6.135 Mean : 6.139 Mean : 6.135 Mean : 6.142 3rd Qu.: 7.450 3rd Qu.: 7.485 3rd Qu.: 7.474 3rd Qu.: 7.476 Max. :12.175 Max. :12.258 Max. :12.140 Max. :11.999 LN2A.CEL LN2B.CEL LN3A.CEL LN3B.CEL Min. : 2.622 Min. : 2.743 Min. : 2.687 Min. : 2.655 1st Qu.: 4.461 1st Qu.: 4.474 1st Qu.: 4.433 1st Qu.: 4.447 Median : 6.016 Median : 6.067 Median : 6.021 Median : 6.034 Mean : 6.124 Mean : 6.139 Mean : 6.130 Mean : 6.132 3rd Qu.: 7.438 3rd Qu.: 7.434 3rd Qu.: 7.456 3rd Qu.: 7.465 Max. :13.277 Max. :13.346 Max. :13.267 Max. :13.287 LN4A.CEL LN4B.CEL Min. : 2.775 Min. : 2.667 1st Qu.: 4.483 1st Qu.: 4.449 Median : 6.077 Median : 6.053 Mean : 6.136 Mean : 6.134 3rd Qu.: 7.469 3rd Qu.: 7.488 Max. :12.058 Max. :12.153

Page 24: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

24

The GLA Method

• The Glog Average (GLA) method is simpler than the RMA method, though it can require estimation of a parameter

• Background correction is intended to make a measured value of zero correspond to a zero quantity in the sample

• Transformation uses the glog ~ ln for large values

• Normalization via lowess• Summary is a simple average of PM probes

Page 25: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

25

> source("gla.r")> emat.gla <- gla(Data)> class(emat.gla)[1] "matrix"> dim(emat.gla)[1] 12625 10

Page 26: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

26

> summary(emat.gla) X1 X2 X3 X4 Min. :3.859 Min. :3.801 Min. :3.811 Min. :3.816 1st Qu.:4.948 1st Qu.:4.939 1st Qu.:4.942 1st Qu.:4.942 Median :5.456 Median :5.449 Median :5.446 Median :5.453 Mean :5.569 Mean :5.573 Mean :5.569 Mean :5.566 3rd Qu.:6.032 3rd Qu.:6.052 3rd Qu.:6.038 3rd Qu.:6.034 Max. :8.794 Max. :8.724 Max. :8.733 Max. :8.731 X5 X6 X7 X8 Min. : 3.800 Min. : 3.896 Min. :3.792 Min. :3.866 1st Qu.: 4.950 1st Qu.: 4.952 1st Qu.:4.945 1st Qu.:4.936 Median : 5.455 Median : 5.459 Median :5.458 Median :5.441 Mean : 5.595 Mean : 5.578 Mean :5.597 Mean :5.578 3rd Qu.: 6.058 3rd Qu.: 6.030 3rd Qu.:6.062 3rd Qu.:6.036 Max. :10.083 Max. :10.270 Max. :9.948 Max. :9.976 X9 X10 Min. :3.907 Min. :3.835 1st Qu.:4.949 1st Qu.:4.937 Median :5.452 Median :5.453 Mean :5.569 Mean :5.579 3rd Qu.:6.033 3rd Qu.:6.053 Max. :8.772 Max. :8.732

Page 27: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

27

glog <- function(y,lambda){ yt <- log(y+sqrt(y^2+lambda)) return(yt)}

Page 28: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

28

lnorm<- function(mat1,span=.1){print ("Start Lowess Normalization")

mat2 <- as.matrix(mat1) p <- dim(mat2)[1] n <- dim(mat2)[2] rmeans <- apply(mat2,1,mean)

r<-rnorm(p,0,1e-7)rmeans<-r+rmeans rranks <- rank(rmeans) matsort <- mat2[order(rranks),]

r0 <- 1:p lcol <- function(x) {

lx <- lowess(r0,x,f=span)$y # returns vector of length p

} lmeans <- apply(matsort,2,lcol) # column lowess lgrand <- apply(lmeans,1,mean) # grand lowess lgrand <- matrix(rep(lgrand,n),byrow=F,ncol=n) # grand lowess matnorm0 <- matsort-lmeans+lgrand # normalized matrix matnorm1 <- matnorm0[rranks,] # returns row order to original print ("End Lowess Normalization") return(matnorm1)

}

Page 29: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

29

gla <- function(AB,lambda=1000,alpha=50){ gn <- geneNames(AB) I <- length(gn) ## ind will be a vector with i repeated as many# times as gn[i] has PM probes# This shows which gene number is associated with each probe# ind<-vector() for (i in 1:I) { ind <- c(ind, rep(i, dim(pm(AB,gn[i]))[1])) }

mat <- pm(AB) EI <- matrix(0, I, dim(mat)[2]) mat.glog<-lnorm(glog(mat-alpha,lambda))

for (i in 1:I) { tmp <- apply(mat.glog[ind==i,],2,mean) EI[i,] <- t(tmp) } return(EI)}

Page 30: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

30

Probe Sets not Genes

• It is unavoidable to refer to a probe set as measuring a “gene”, but nevertheless it can be deceptive

• The annotation of a probe set may be based on homology with a gene of possibly known function in a different organism

• Only a relatively few probe sets correspond to genes with known function and known structure in the organism being studied

Page 31: Analysis of Affymetrix GeneChip Data

November 10, 2005 EPP 245 Statistical Analysis of Laboratory Data

31

Exercise (Optional)

• Download the ten arrays from the web site, and the gla.r file

• Load the arrays into R using Read.Affy and construct the RMA expression indices

• Source the gla.r functions, and construct the GLA expression indices


Recommended