
Multiple testing, correlation and regression, and clustering in R: Multtest package, Anscombe dataset and stats package, Cluster package

Date post: 22-Dec-2015
Page 1: Multiple testing, correlation and regression, and clustering in R Multtest package Anscombe dataset and stats package Cluster package.

Multiple testing, correlation and regression, and clustering in R

Multtest package

Anscombe dataset and stats package

Cluster package


Multtest package

The multtest package contains a collection of functions for multiple hypothesis testing.

These functions can be used to identify differentially expressed genes in microarray experiments, i.e., genes whose expression levels are associated with a response or covariate of interest.


Multtest package

These procedures are implemented for tests based on t–statistics, F–statistics, paired t–statistics, and Wilcoxon statistics.

The multtest package implements multiple testing procedures for controlling different Type I error rates. It includes procedures for controlling the family–wise Type I error rate (FWER): Bonferroni, Hochberg (1988), Holm (1979).

It also includes procedures for controlling the false discovery rate (FDR): Benjamini and Hochberg (1995) and Benjamini and Yekutieli (2001) step–up procedures.
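The difference between the FWER and FDR adjustments can be seen on a handful of p-values. The sketch below is only an illustration: it uses p.adjust() from the base stats package rather than multtest, and the raw p-values are made up.

```r
# Illustrative raw p-values (made up for this sketch, not from the Golub data)
rawp <- c(0.0001, 0.004, 0.019, 0.03, 0.2, 0.6)

# FWER-controlling adjustments
fwer <- sapply(c("bonferroni", "holm", "hochberg"),
               function(m) p.adjust(rawp, method = m))

# FDR-controlling adjustments
fdr <- sapply(c("BH", "BY"),
              function(m) p.adjust(rawp, method = m))

adjusted <- cbind(rawp = rawp, fwer, fdr)
print(round(adjusted, 4))
```

For every hypothesis the Bonferroni value is at least as large as the Holm value, which is at least as large as the Hochberg value; the BH values are never larger than the BY values.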


Data Analysis

Leukemia Data by Golub et al. (1999)

> library(multtest, verbose = FALSE)

> data(golub)


Golub Data

Golub et al. (1999) were interested in identifying genes that are differentially expressed in patients with two types of leukemia, acute lymphoblastic leukemia (ALL, class 0) and acute myeloid leukemia (AML, class 1).

Gene expression levels were measured using Affymetrix high–density oligonucleotide chips containing p = 6,817 human genes.

The learning set comprises n = 38 samples, 27 ALL cases and 11 AML cases (data available at http://www.genome.wi.mit.edu/MPR).


Golub Data

Three preprocessing steps were applied to the normalized matrix of intensity values available on the website:

(i) thresholding: floor of 100 and ceiling of 16,000;

(ii) filtering: exclusion of genes with max / min <= 5 or (max − min) <= 500, where max and min refer respectively to the maximum and minimum intensities for a particular gene across mRNA samples;

(iii) base 10 logarithmic transformation.
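The three steps can be sketched in a few lines of base R. This is an illustration on a simulated intensity matrix, not the code used to build the golub dataset; the matrix size and the names raw, x and keep are assumptions of the sketch.

```r
set.seed(1)
# Toy intensity matrix: 200 hypothetical genes (rows) x 6 samples (columns)
raw <- matrix(rexp(200 * 6, rate = 1 / 2000), nrow = 200)

# (i) thresholding: floor of 100, ceiling of 16,000
x <- pmin(pmax(raw, 100), 16000)

# (ii) filtering: drop genes with max/min <= 5 or (max - min) <= 500
gene_max <- apply(x, 1, max)
gene_min <- apply(x, 1, min)
keep <- (gene_max / gene_min > 5) & (gene_max - gene_min > 500)
x <- x[keep, , drop = FALSE]

# (iii) base-10 logarithmic transformation
x <- log10(x)
dim(x)
```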

Boxplots of the expression levels for each of the 38 samples revealed the need to standardize the expression levels within arrays before combining data across samples. The data were then summarized by a 3,051 × 38 matrix X = (x_ji), where x_ji denotes the expression level for gene j in tumor mRNA sample i.

Golub Dataset

The dataset golub contains the gene expression data for the 38 training set tumor mRNA samples and 3,051 genes retained after pre–processing. The dataset includes

golub: a 3,051 × 38 matrix of expression levels;

golub.gnames: a 3,051 × 3 matrix of gene identifiers;

golub.cl: a vector of tumor class labels (0 for ALL, 1 for AML).

Golub Dataset

> dim(golub)

[1] 3051   38


The mt.teststat and mt.teststat.num.denum functions

Used to compute test statistics for each row of a data frame, e.g., two–sample Welch t–statistics, Wilcoxon statistics, F–statistics, paired t–statistics, block F–statistics.
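To make the row-wise computation concrete, here is a minimal base-R sketch of a two-sample Welch t-statistic per row, on simulated data. The helper welch_t and the sign convention (class 1 minus class 0) are assumptions of the sketch, not taken from the multtest source.

```r
set.seed(42)
# Toy data: 5 hypothetical genes (rows) x 10 samples (columns)
X <- matrix(rnorm(50), nrow = 5)
classlabel <- rep(c(0, 1), times = c(6, 4))

# Welch t-statistic for one row: difference in means over its standard error
welch_t <- function(x, cl) {
  x0 <- x[cl == 0]
  x1 <- x[cl == 1]
  num <- mean(x1) - mean(x0)
  denum <- sqrt(var(x0) / length(x0) + var(x1) / length(x1))
  num / denum
}

tstat <- apply(X, 1, welch_t, cl = classlabel)

# Cross-check against t.test(), which uses the Welch form by default
ref <- apply(X, 1, function(x)
  unname(t.test(x[classlabel == 1], x[classlabel == 0])$statistic))
all.equal(tstat, ref)
```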


usage

mt.teststat(X,classlabel,test="t",na=.mt.naNUM,nonpara="n")

mt.teststat.num.denum(X,classlabel,test="t",na=.mt.naNUM,nonpara="n")

Other values of the test argument include test="wilcoxon", test="pairt", and test="f".

two–sample t–statistics

comparing, for each gene, expression in the ALL cases to expression in the AML cases

> teststat <- mt.teststat(golub, golub.cl)


QQ plot

First create an empty .pdf file to plot the QQ plot.

> pdf("mtQQ.pdf")

Plot the QQ plot

> qqnorm(teststat)

Add a reference line (through the first and third quartiles)

> qqline(teststat)

Shut down the graphics device (e.g., the pdf file)

> dev.off()


What is a QQ plot?

The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution (e.g. normal distribution).

A q-q plot is a plot of the quantiles of the first data set (expected) against the quantiles of the second data set (observed).


What is a QQ plot?

By a quantile, we mean the value below which a given fraction (or percent) of the points fall. That is, the 0.3 (or 30%) quantile is the point at which 30% of the data fall below and 70% fall above that value.

A 45-degree reference line is also plotted. If the two sets come from a population with the same distribution, the points should fall approximately along this reference line.
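Both ideas can be checked numerically. The sketch below uses a simulated normal sample (an assumption, not the Golub test statistics): quantile() returns the 0.3 quantile, and qqnorm(plot.it = FALSE) returns the q-q coordinates without drawing them.

```r
set.seed(7)
x <- rnorm(500, mean = 2, sd = 3)   # simulated sample

# The 0.3 quantile: about 30% of the data fall below this value
q30 <- quantile(x, probs = 0.3)
mean(x < q30)

# Coordinates of a normal q-q plot without drawing it
qq <- qqnorm(x, plot.it = FALSE)

# For normal data the points lie close to a straight line,
# so theoretical and sample quantiles are almost perfectly correlated
cor(qq$x, qq$y)
```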


Store the numerator and denominator of the test stat.

Create a variable tmp in which you store the numerators and denominators of the test statistics

> tmp <- mt.teststat.num.denum(golub, golub.cl, test = "t")

Name the numerator as num (teststat.num component of the tmp object)

> num <- tmp$teststat.num

Name the denominator as denum (teststat.denum component of the tmp object).

> denum <- tmp$teststat.denum


Plot the num to denum

Create a pdf device

> pdf("mtNumDen.pdf")

Plot

> plot(sqrt(denum), num)

Shut off the graphics device

> dev.off()

[Figure: scatterplot of num (y-axis) against sqrt(denum) (x-axis).]

mt.rawp2adjp function

This function computes adjusted p–values for simple multiple testing procedures from a vector of raw (unadjusted) p–values.

The procedures include the Bonferroni, Holm (1979), Hochberg (1988), and Sidak procedures for strong control of the family–wise Type I error rate (FWER), and the Benjamini and Hochberg (1995) and Benjamini and Yekutieli (2001) procedures for (strong) control of the false discovery rate (FDR).


Raw p-values (unadjusted)

As a first approximation, compute raw nominal two–sided p–values for the 3,051 test statistics using the standard Gaussian distribution. Create a variable named rawp0 that contains the p–values of the 3,051 test statistics.

> rawp0 <- 2 * (1 - pnorm(abs(teststat)))

> rawp0[1:5]

[1] 0.07854436 0.36289759 0.92191171 0.73463771 0.17063542
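One practical caveat about the formula above: for very large test statistics, 1 - pnorm(...) rounds to exactly zero in floating point. The sketch below (with made-up statistics) compares it with the equivalent lower-tail form 2 * pnorm(-abs(t)), which keeps tiny p-values distinct from zero.

```r
tstat <- c(0.5, 2, 10)   # illustrative test statistics

naive  <- 2 * (1 - pnorm(abs(tstat)))   # form used above
stable <- 2 * pnorm(-abs(tstat))        # same quantity, lower tail directly

cbind(tstat, naive, stable)
```

The two forms agree for moderate statistics; only the last row differs, where the naive form underflows to 0.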


The order function

> aa <- c(3, 5, 2, 1, 9)

> ia <- order(aa) # index order

> ia

[1] 4 3 1 2 5

> aa[ia] # order aa according to ia

[1] 1 2 3 5 9


Adjusted p-values

Create a vector of character strings containing the names of the multiple testing procedures for which adjusted p-values are to be computed. This vector should include any of the following: '"Bonferroni"', '"Holm"', '"Hochberg"', '"SidakSS"', '"SidakSD"', '"BH"', '"BY"'.

> procs <- c("Bonferroni", "Holm",
+            "Hochberg", "SidakSS", "SidakSD",
+            "BH", "BY")

Adjusted p-values in the order of the gene list

Adjusted p–values can be stored in the original gene order in adjp using order(res$index)

> res <- mt.rawp2adjp(rawp0, procs)

> adjp <- res$adjp[order(res$index), ]
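Why order(res$index) restores the original order can be seen on a small vector. The sketch assumes, as the slide above implies, that index holds the positions of the sorted p-values in the original vector; the p-values themselves are made up.

```r
rawp <- c(0.20, 0.01, 0.60, 0.03)   # illustrative p-values in "gene order"

index  <- order(rawp)    # positions of the sorted values in the original vector
sorted <- rawp[index]    # p-values in order of significance

# order(index) is the inverse permutation: it maps back to gene order
restored <- sorted[order(index)]
identical(restored, rawp)
```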

Adjusted p-values in the order of significance

Adjusted p–values can be stored in the order of significance

> adjp <- res$adjp

> adjp[1:5,]


mt.reject function to list number of genes rejected
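The slide content for this step did not survive extraction. As a rough base-R illustration of the kind of table such a function reports (the number of hypotheses rejected by each procedure at several levels alpha), assuming a hypothesis is rejected when its adjusted p-value is at most alpha, and using made-up p-values:

```r
# Illustrative adjusted p-values (rows = hypotheses, columns = procedures)
rawp <- c(0.0001, 0.004, 0.019, 0.03, 0.2, 0.6)
adjp <- cbind(rawp = rawp,
              Bonferroni = p.adjust(rawp, "bonferroni"),
              BH = p.adjust(rawp, "BH"))

alpha <- c(0.01, 0.05, 0.10)

# Number of hypotheses with adjusted p-value <= alpha, per procedure
r <- t(sapply(alpha, function(a) colSums(adjp <= a)))
rownames(r) <- alpha
r
```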


Find the genes most significant


Get the data of significantly modulated genes

> which <- mt.reject(cbind(rawp0, adjp), 0.000001)$which[, 2]

Here adjp must be in the original gene order (res$adjp[order(res$index), ]) so that its rows line up with rawp0 and with the rows of golub.

> sum(which)

[1] 143

> gsignificant <- golub[which, ]

> dim(gsignificant)

[1] 143  38

> gsignificant[1:5, 1:5]

         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] -1.45769 -0.32639 -1.46227 -0.18983 -0.12402
[2,]  0.86019 -0.14271  0.67037  0.70706  0.87697
[3,] -1.00702 -0.89365 -1.21154 -1.40715 -1.42668
[4,] -1.27427 -0.66834 -0.58252 -1.40715 -0.03531
[5,] -0.45670  0.48916 -0.48474  0.02261  0.00704


mt.plot function

The mt.plot function produces a number of graphical summaries for the results of multiple testing procedures and their corresponding adjusted p–values.


mt.plot function

To produce plots of adjusted p–values for the Bonferroni, maxT, Benjamini and Hochberg (1995), and Benjamini and Yekutieli (2001) procedures, use the commands below.

Plot adjp values over number of rejected hypotheses

> res <- mt.rawp2adjp(rawp, c("Bonferroni", "BH", "BY"))

> adjp <- res$adjp[order(res$index), ]

> allp <- cbind(adjp, maxT)

> dimnames(allp)[[2]] <- c(dimnames(adjp)[[2]], "maxT")

> procs <- dimnames(allp)[[2]]

> procs <- procs[c(1, 2, 5, 3, 4)]

> cols <- c(1, 2, 3, 5, 6)

> ltypes <- c(1, 2, 2, 3, 3)

> mt.plot(adjp, teststat, plottype = "pvsr")

Note: rawp and maxT are not defined on the preceding slides; they are presumably the raw and maxT–adjusted p–values from a permutation analysis such as mt.maxT.

Correlation and Regression in R

Anscombe dataset

> data(anscombe)

> ?anscombe


Get summaries
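The slide content for this step did not survive extraction. A minimal sketch of what getting summaries for this dataset could look like, using only base R and the built-in anscombe data frame:

```r
data(anscombe)

summary(anscombe)   # five-number summary and mean of every column

# The four x (and four y) columns share nearly identical means and variances
round(colMeans(anscombe), 2)
round(sapply(anscombe, var), 2)
```

The near-identical summaries across the four x-y pairs are the point of Anscombe's quartet: the summaries alone do not reveal how different the four datasets look when plotted.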


cor function


Find correlation coefficient of all pairwise combinations
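The slide content for this step did not survive extraction. One way to obtain all pairwise coefficients is to pass the whole data frame to cor(); the sketch below also pulls out the correlation of each x column with its matching y column (the names in xy are assumptions of the sketch).

```r
data(anscombe)

# Full 8 x 8 matrix of pairwise Pearson correlations
round(cor(anscombe), 3)

# Correlation of each x column with its matching y column
xy <- diag(cor(anscombe[, 1:4], anscombe[, 5:8]))
names(xy) <- paste0("x", 1:4, "-y", 1:4)
round(xy, 3)
```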


Find correlation coefficient of specific pairs

> cor(anscombe[,1], anscombe[,2])

[1] 1

> cor(anscombe[,1], anscombe[,4])

[1] -0.5

cor.test function


cor.test

> cor.test(anscombe[,1], anscombe[,4])

        Pearson's product-moment correlation

data:  anscombe[, 1] and anscombe[, 4]
t = -1.7321, df = 9, p-value = 0.1173
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8460984  0.1426659
sample estimates:
 cor
-0.5
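The numbers in this output can be reproduced by hand from the correlation coefficient. The sketch below uses the standard formulas for the Pearson test (t statistic with n - 2 degrees of freedom, confidence interval via Fisher's z-transformation); note that columns 1 and 4 of anscombe are x1 and x4.

```r
data(anscombe)
x <- anscombe$x1
y <- anscombe$x4

r  <- cor(x, y)
n  <- length(x)
df <- n - 2

t_stat <- r * sqrt(df) / sqrt(1 - r^2)   # t statistic under H0: rho = 0
p_val  <- 2 * pt(-abs(t_stat), df)       # two-sided p-value

# 95% confidence interval via Fisher's z-transformation
z  <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

c(t = t_stat, df = df, p = p_val)
ci
```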


Plot scatterplots

> plot(anscombe[,1],anscombe[,4])

[Figure: scatterplot of anscombe[, 4] (y-axis) against anscombe[, 1] (x-axis).]

Application to Golub dataset

Calculate the linear regression coefficients and get a summary

> ll <- lm(anscombe[,4] ~ anscombe[,1])

> summary(ll)

Get the anova table


Plot the regression line

> plot(anscombe[,1],anscombe[,4])

> abline(lm(anscombe[,4] ~ anscombe[,1]))

[Figure: scatterplot of anscombe[, 4] against anscombe[, 1] with the fitted regression line.]

ll$fitted.values


ll$coefficients, ll$residuals
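The output for the anova table and the model components did not survive extraction. The sketch below refits the same regression with the formula/data interface (the object name fit is an assumption) and shows the accessor functions that correspond to the components named above.

```r
data(anscombe)

# Same model as ll, written with the formula/data interface
fit <- lm(x4 ~ x1, data = anscombe)

coef(fit)            # intercept and slope (same as fit$coefficients)
anova(fit)           # analysis-of-variance table
head(fitted(fit))    # fitted values (same as fit$fitted.values)
head(resid(fit))     # residuals (same as fit$residuals)

# Fitted values plus residuals reproduce the response
all.equal(unname(fitted(fit) + resid(fit)), anscombe$x4)
```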


hclust function
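The calls behind the dendrograms on the following slides are not shown, but their captions name the distance and linkage used. A minimal sketch of the same combinations on simulated data (the matrix x and its group structure are assumptions) is given below; by analogy, the object clust.euclid.single used later was presumably created with hclust(dist(gsignificant), method = "single").

```r
set.seed(3)
# Toy expression matrix: 100 hypothetical genes (rows) x 8 samples (columns)
x <- matrix(rnorm(800), nrow = 100,
            dimnames = list(NULL, paste0("S", 1:8)))
x[1:50, 5:8]   <- x[1:50, 5:8] + 2     # genes 1-50 higher in samples 5-8
x[51:100, 1:4] <- x[51:100, 1:4] + 2   # genes 51-100 higher in samples 1-4

# Correlation distance between samples, average linkage
hc_cor <- hclust(as.dist(1 - cor(x)), method = "average")

# Euclidean distance between samples (rows of t(x)), single linkage
hc_euc <- hclust(dist(t(x)), method = "single")

# plot(hc_cor) draws the dendrogram; cutree() extracts cluster labels
groups <- cutree(hc_cor, k = 2)
groups
```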


hclust for golub data (correlation distance with average linkage)

[Figure: cluster dendrogram of the 38 samples from hclust (*, "average") on as.dist(1 - cor(golub)); vertical axis: Height.]

hclust for golub data (correlation distance with complete linkage)

[Figure: cluster dendrogram of the 38 samples from hclust (*, "complete") on as.dist(1 - cor(golub)); vertical axis: Height.]

hclust for golub data (euclidean distance with single linkage)

[Figure: cluster dendrogram of the 38 samples from hclust (*, "single") on dist(t(golub)); vertical axis: Height.]

hclust for significant golub data (euclidean distance with single linkage)

[Figure: cluster dendrogram of the 143 significant genes from hclust (*, "single") on dist(gsignificant); vertical axis: Height.]

heatmap function


heatmap for significant genes

> attributes(clust.euclid.single)

$names
[1] "merge"       "height"      "order"       "labels"      "method"
[6] "call"        "dist.method"

$class
[1] "hclust"

> clust.euclid.single$order

  [1] 129  10 124 125  54  35  28  59  44  85 110  90 106  15 137 138 139  37
 [19]  47 126  63  88  27 134  79  80  76 121 131  81   3  82  11   1 104  78
 [37] 140  31  34  98  89   8  26 123  92 113  61 127  16  95  29  96  39 142
 [55]   7  57 128  14  32   6  67   9  71 133  52  19  74  72  23  73 132  99
 [73]  55 135  56 136   5  97  24  18  25   4 107  84  94 109 114  51  68  69
 [91]  45  65  46  41  75  49 102 122  50 100  83  12  86 141  17 116 143   2
[109]  77 105 112  22  36 101 111  13  62  21  60  87  38  58 118 115 119  93
[127]  30  33 108  43 120  91  70 130  53  40  66  64  20  42 103  48 117

> gsignificantordered <- gsignificant[clust.euclid.single$order, ]

> heatmap(gsignificantordered)
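A point worth keeping in mind here: by default heatmap() runs its own hierarchical clustering on rows and columns and reorders both, so a row order prepared beforehand is only preserved when Rowv = NA is passed. The sketch below illustrates this on a small simulated matrix (an assumption; it does not use the Golub data) and draws to a null pdf device so that no file is written.

```r
set.seed(5)
m <- matrix(rnorm(60), nrow = 12,
            dimnames = list(paste0("g", 1:12), paste0("s", 1:5)))

# Order rows by a single-linkage clustering, as on the slide above
hc <- hclust(dist(m), method = "single")
m_ordered <- m[hc$order, ]

pdf(file = NULL)   # null device: the sketch writes no file
# Rowv = NA keeps the rows in the order supplied (no row re-clustering)
hm <- heatmap(m_ordered, Rowv = NA)
invisible(dev.off())

hm$rowInd   # row indices in the order they were drawn
```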


Heatmap for significant genes

[Figure: heatmap of the 143 significant genes (rows) across the 38 samples (columns), with row and column labels reordered by clustering.]

kmeans function


kmeans for samples based on significant genes
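The kmeans call itself is not shown on the slides. A minimal sketch of clustering samples with kmeans(), on a simulated stand-in for gsignificant (the matrix g, the group sizes and the shift are assumptions), could look like this:

```r
set.seed(11)
# Toy stand-in for gsignificant: 50 hypothetical genes x 38 samples
truth <- rep(c(1, 2), times = c(27, 11))
g <- matrix(rnorm(50 * 38), nrow = 50)
g[, truth == 2] <- g[, truth == 2] + 3   # second group shifted upward

# kmeans() clusters rows, so transpose to cluster samples
km <- kmeans(t(g), centers = 2, nstart = 10)

table(truth, cluster = km$cluster)

# Scatterplot of the samples on the first two genes, colored by cluster:
# plot(t(g)[, 1], t(g)[, 2], col = km$cluster)
```

Because kmeans() starts from random centers, nstart = 10 is used here to keep the best of several starts; the cluster numbers themselves are arbitrary labels.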

[Figure: scatterplot of t(gsignificant)[,2] (y-axis) against t(gsignificant)[,1] (x-axis).]

pam function


Pam on samples based on significant genes

Silhouette plot

[Figure: silhouette plot of pam(x = as.dist(1 - cor(gsignificant)), k = 2, diss = TRUE); horizontal axis: silhouette width s_i.]

n = 38, 2 clusters C_j. Average silhouette width: 0.56

j : n_j | average silhouette width in C_j

1 : 25 | 0.5

2 : 13 | 0.66

> plot(clust.pam.2)
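The plot title above suggests that clust.pam.2 was created with pam() from the cluster package on a correlation distance between samples. A minimal sketch of that call on simulated data (the matrix g and its group structure are assumptions; the real analysis uses gsignificant) is:

```r
library(cluster)   # provides pam() and silhouette()

set.seed(13)
# Toy stand-in for gsignificant: 60 hypothetical genes x 20 samples
g <- matrix(rnorm(60 * 20), nrow = 60)
g[1:30, 11:20] <- g[1:30, 11:20] + 2   # genes 1-30 higher in samples 11-20
g[31:60, 1:10] <- g[31:60, 1:10] + 2   # genes 31-60 higher in samples 1-10

# Correlation distance between samples (columns), two medoids
d <- as.dist(1 - cor(g))
clust.pam.2 <- pam(d, k = 2, diss = TRUE)

clust.pam.2$clustering          # cluster label per sample
clust.pam.2$silinfo$avg.width   # average silhouette width
# plot(clust.pam.2) draws the silhouette plot
```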


Microarray Data Analysis Software

http://rana.lbl.gov/EisenSoftware.htm

http://classify.stanford.edu/

http://www.broad.mit.edu/cancer/software/software.html

http://homes.esat.kuleuven.be/~dna/Biol/Software.html

http://vortex.cs.wayne.edu/Projects.html

http://visitor.ics.uci.edu/cgi-bin/genex/rcluster/index.cgi

http://www-stat.stanford.edu/%7Etibs/SAM/index.html

http://maexplorer.sourceforge.net/

http://ihome.cuhk.edu.hk/~b400559/arraysoft.html

http://genome.ws.utk.edu/resources/restech.shtml

