
Post on 19-Dec-2015


1 of 40

Henrik Bengtsson

Get started with R

Mathematical Statistics,

Centre for Mathematical Sciences,

Lund University.

hb@maths.lth.se

Slides available @ http://www.maths.lth.se/bioinformatics/calendar/20031113/

2 of 40

Why not R?

• Currently there are no good image analysis packages. Matlab is still better here.
• Currently S-Plus (commercial software similar to R) can handle larger data sets.
• ...yeah, why not?

3 of 40

Why R?

• R was written by scientists to be used by scientists.
• R is a great language.
• Runs on Windows, Mac, Unix, Linux, ...
• Easy to install.
• R is free.
• Hundreds(!) of packages available.
• Papers are often published with an R package.
• Great help and documentation.
• 24/7 online mail support.
• A great community - friendly and helpful people...
• Easy to call Fortran, C, Java, ... libraries.
• Everyone else is and will be using R (-Yes, I'm willing to bet!)

4 of 40

Starting and quitting R

On Windows, click on the R icon; on Unix, type R:

markov{hb}: R

R : Copyright 2003, The R Development Core Team
Version 1.7.1 (2003-06-16)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type `license()' or `licence()' for distribution details.

R is a collaborative project with many contributors.
Type `contributors()' for more information.

Type `demo()' for some demos, `help()' for on-line help, or
`help.start()' for a HTML browser interface to help.
Type `q()' to quit R.

> 1+2+3
[1] 6
> q()
Save workspace image? [y/n/c]: n

markov{hb}:


5 of 40

Getting help

• Browse the help

> help.start()

• Getting help on a specific function

> help(lm)
> ?lm

• Searching for help

> help.search("linear model")
Help files with alias or title matching 'linear model' using fuzzy matching:

anova.glm(base)        Analysis of Deviance for Generalized Linear Model Fits
anova.lm(base)         ANOVA for Linear Model Fits
glm(base)              Fitting Generalized Linear Models
family.glm(base)       Accessing Generalized Linear Model Fits
summary.glm(base)      Summarizing Generalized Linear Model Fits
lm(base)               Fitting Linear Models
family.lm(base)        Accessing Linear Model Fits
summary.lm(base)       Summarizing Linear Model Fits
lm.fit(base)           Fitter Functions for Linear Models
loglin(base)           Fitting Log-Linear Models
predict.lm(base)       Predict method for Linear Model Fits
cv.glm(boot)           Cross-validation for Generalized Linear Models
glm.diag.plots(boot)   Diagnostics plots for generalized linear

6 of 40

Running help examples

> help(lm)
> example(lm)

lm> ctl <- c(4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14)
lm> trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
lm> group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
lm> weight <- c(ctl, trt)
lm> anova(lm.D9 <- lm(weight ~ group))

Analysis of Variance Table

Response: weight
          Df Sum Sq Mean Sq F value Pr(>F)
group      1 0.6882  0.6882  1.4191  0.249
Residuals 18 8.7293  0.4850

lm> summary(lm.D90 <- lm(weight ~ group - 1))

Call:
lm(formula = weight ~ group - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0710 -0.4938  0.0685  0.2462  1.3690

Coefficients:
         Estimate Std. Error t value Pr(>|t|)
groupCtl   5.0320     0.2202   22.85 9.55e-15 ***
groupTrt   4.6610     0.2202   21.16 3.62e-14 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-Squared: 0.9818,  Adjusted R-squared: 0.9798
F-statistic: 485.1 on 2 and 18 DF,  p-value: < 2.2e-16

lm> summary(resid(lm.D9) - resid(lm.D90))
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-6.661e-16 -6.245e-17  2.776e-17  1.544e-17  1.110e-16  3.331e-16

lm> opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
lm> plot(lm.D9, las = 1)
lm> par(opar)


7 of 40

r-help mailing list

• Ask questions on the r-help mailing list.
• Get answers 24/7 (including Xmas day!)
• Sign up on http://www.r-project.org/
• Things to think of before sending a question:
  – Make sure you read the help/docs first.
  – Try to include a code example of what you are trying to do.
  – State which version of R you are using.

8 of 40

A first example

Vectors:

> x <- c(6,5,4,3,2,1)
> x
[1] 6 5 4 3 2 1
> sum(x)
[1] 21
> x[c(1,3,5)]
[1] 6 4 2
> x <- 6:1
> x
[1] 6 5 4 3 2 1
> x[1:3]
[1] 6 5 4
> x[1:3]+x[6:4]
[1] 7 7 7
> y <- c(0,10)
> x+y
[1]  6 15  4 13  2 11

Matrices (default is to fill column by column):

> x <- matrix(1:18, nrow=3)
> x
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7   10   13   16
[2,]    2    5    8   11   14   17
[3,]    3    6    9   12   15   18
> x[,3]
[1] 7 8 9
> x[-1,4:6]
     [,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
> str(x)
 int [1:3, 1:6] 1 2 3 4 5 6 7 8 9 10 ...

9 of 40

R is multiplying elementwise!

Matrices:

> X <- matrix(1:9, ncol=3, byrow=TRUE)
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> I <- diag(1, nrow=3)
> I
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> I * X
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    5    0
[3,]    0    0    9
> I %*% X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Why? Because elementwise multiplication turns out to be more common than vector and matrix multiplication.
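A minimal sketch of the two operators side by side, reusing X and I from this slide. The vector v and the call to crossprod() are illustrative additions and do not appear in the original slides.

```r
# Elementwise versus matrix multiplication (illustrative sketch)
X <- matrix(1:9, ncol=3, byrow=TRUE)
I <- diag(1, nrow=3)

elementwise <- I * X     # entry by entry: only the diagonal of X survives
matprod     <- I %*% X   # matrix product: the identity leaves X unchanged

# v is a hypothetical vector, added here only for illustration
v   <- c(1, 0, -1)
Xv  <- X %*% v           # matrix-vector product, returned as a 3x1 matrix
XtX <- crossprod(X)      # same result as t(X) %*% X

print(elementwise)
print(matprod)
print(Xv)
```

As a rule of thumb, `*` is for per-element arithmetic and `%*%` is for linear algebra.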

10 of 40

R recycles (loops over) the shorter vector - really useful indeed!

Vectors:

> x <- 7:1
> x
[1] 7 6 5 4 3 2 1
> y <- 1:3
> y
[1] 1 2 3
> x + y
[1] 8 8 8 5 5 5 2

R will expand the shorter vector before adding!

# Zero out every 2nd:
> x <- 3:8
> y <- c(1,0)
> x * y
[1] 3 0 5 0 7 0

Matrices:

> x <- matrix(1:16, nrow=3)
Warning message:
Replacement length not a multiple of the elements to replace in matrix(...)
> x
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7   10   13   16
[2,]    2    5    8   11   14    1
[3,]    3    6    9   12   15    2
> y <- c(-1,1)
# Multiply ("moving down the columns")
> x * y
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   -1    4   -7   10  -13   16
[2,]    2   -5    8  -11   14   -1
[3,]   -3    6   -9   12  -15    2
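The recycling shown on this slide can be made explicit with rep(). This is only a sketch of what R does conceptually; the names y_expanded, recycled and explicit are introduced here for illustration.

```r
# Recycling made explicit (illustrative sketch)
x <- 3:8
y <- c(1, 0)

# R conceptually repeats the shorter vector until it is as long as the longer one
y_expanded <- rep(y, length.out=length(x))

recycled <- x * y            # implicit recycling
explicit <- x * y_expanded   # the same computation written out

print(y_expanded)
print(recycled)
print(identical(recycled, explicit))
```

When the longer length is not a multiple of the shorter one (as in 7:1 + 1:3), current versions of R still recycle but also emit a warning, so it is worth checking lengths when the recycling is not intended.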

11 of 40

A first plot example

> x <- seq(from=1, to=2*pi, length=41)
> y <- sin(x)
# Scatter plot, with red data points of size 2
> plot(x,y, col="red", cex=2)
# A line with double width through data points
> lines(x,y, lwd=2)

Write last plot to file (also in menus on Windows):

> dev.print(png, "sin.png")
> dev.print(postscript, "sin.ps")

Loop over plot attributes:

> col <- c("red", "blue")
> plot(x,y, col=col, cex=1:3, lwd=4)
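dev.print() copies whatever is on the current screen device, so it needs an interactive plot window. In a script run without a screen, a common alternative is to open a file device first. The sketch below assumes the pdf() device and a temporary output path; neither is used in the original slides.

```r
# Plot straight to a file, without a screen device (illustrative sketch)
path <- file.path(tempdir(), "sin.pdf")   # hypothetical output location

x <- seq(from=1, to=2*pi, length.out=41)
y <- sin(x)

pdf(path)                    # open a file device; nothing is drawn on screen
plot(x, y, col="red", cex=2)
lines(x, y, lwd=2)
invisible(dev.off())         # close the device so the file is written completely

print(file.exists(path))
```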

12 of 40

Questions for Clarification?! (2 minutes)

13 of 40

Putting names on things - helps you avoid stupid bugs etc.

Vectors:

> x <- c(87,76.3,1.67)
> x
[1] 87.0 76.3 1.67
> names(x) <- c("age", "weight", "height")
> x
   age weight height
  87.0   76.3   1.67
# Alternatively
> x <- c(age=87, weight=76.3, height=1.67)

Why?

> x["age"]
[1] 87.0
> bmi <- x["weight"]/x["height"]^2

cf.

> x[1]
[1] 87.0
> bmi <- x[2]/x[3]^2

14 of 40

Putting names on things

Matrices:

> x1 <- c(87,76.3,1.67)
> x2 <- c(78,96.3,1.84)
> x3 <- c(45,62.9,1.54)
> x <- matrix(c(x1,x2,x3), nrow=3, byrow=TRUE)
> x
     [,1] [,2] [,3]
[1,]   87 76.3 1.67
[2,]   78 96.3 1.84
[3,]   45 62.9 1.54
> colnames(x) <- c("age", "weight", "height")
> rownames(x) <- c("jon", "kim", "dan")
> x
    age weight height
jon  87   76.3   1.67
kim  78   96.3   1.84
dan  45   62.9   1.54
> x["jon",]
   age weight height
 87.00  76.30   1.67
> x[,c("weight","age")]
    weight age
jon   76.3  87
kim   96.3  78
dan   62.9  45
> bmi <- x[,"weight"]/x[,"height"]^2
> bmi
     jon      kim      dan
27.35846 28.44400 26.52218

15 of 40

Lists

> x <- c(a=1:2, b=3, c=5:8)
> x
a1 a2  b c1 c2 c3 c4
 1  2  3  5  6  7  8
> x <- list(a=1:2, b=3, c=5:8)
> x
$a
[1] 1 2
$b
[1] 3
$c
[1] 5 6 7 8
> x$a
[1] 1 2
> x[["a"]]
[1] 1 2
> x[2:3]
$b
[1] 3
$c
[1] 5 6 7 8
> x[c("b","c")]
$b
[1] 3
$c
[1] 5 6 7 8

Vectors and matrices can only contain one type of data at a time, e.g. either numbers or strings, but lists can carry mixed types:

> x <- list(a=1:4, b=4:6+2i, src="GenePix")
> x
$a
[1] 1 2 3 4
$b
[1] 4+2i 5+2i 6+2i
$src
[1] "GenePix"
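A short sketch of the difference between single and double brackets on a list, using the same list as the slide. The variable names sub and elt are introduced here for illustration.

```r
# Single versus double brackets on a list (illustrative sketch)
x <- list(a=1:4, b=4:6+2i, src="GenePix")

sub <- x["a"]     # single brackets return a list (here of length 1)
elt <- x[["a"]]   # double brackets (and $) return the element itself

print(class(sub))
print(class(elt))

# Each element keeps its own type
types <- sapply(x, class)
print(types)
```

Single brackets are the natural choice for picking several elements at once, as in x[c("b","c")], while double brackets or $ are what you want when you need the contents of one element.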

16 of 40

Data frames

Data frames are very powerful. Simply speaking, you can treat them as both lists and matrices:

> df <- data.frame(name=c("jon","kim","dan"), age=c(87,78,45),
+                  weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54))
> df
  name age weight height
1  jon  87   76.3   1.67
2  kim  78   96.3   1.84
3  dan  45   62.9   1.54
> df$weight
[1] 76.3 96.3 62.9
> df[,c("name","age")]
  name age
1  jon  87
2  kim  78
3  dan  45

> as.matrix(df)
  name  age  weight height
1 "jon" "87" "76.3" "1.67"
2 "kim" "78" "96.3" "1.84"
3 "dan" "45" "62.9" "1.54"
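The sketch below shows list-style and matrix-style access on the same data frame. The bmi column and the row filter are illustrative additions, not part of the original slide.

```r
# Treating a data frame as a list and as a matrix (illustrative sketch)
df <- data.frame(name=c("jon","kim","dan"), age=c(87,78,45),
                 weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54))

# List-style access: one column at a time
w1 <- df$weight
w2 <- df[["weight"]]

# Matrix-style access: rows and columns by index, name or logical condition
kim_age <- df[2, "age"]
over_70 <- df[df$age > 70, c("name", "age")]

# bmi is a new column added for illustration
df$bmi <- df$weight / df$height^2
print(df)
```

Note that as.matrix(df) on a data frame with a text column converts everything to strings, as the slide output shows, so numeric work is usually better done on the data frame itself or on its numeric columns only.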

17 of 40

Importing data

Data in text (ASCII) files can be imported using read.table():

> df <- read.table("foo.txt", header=TRUE)
> dim(df)
[1] 221952      6
> summary(df)
     slide           spot             R
 Min.   :1.00   Min.   :    1   Min.   :   30.0
 1st Qu.:1.75   1st Qu.:13873   1st Qu.:   60.0
 Median :2.50   Median :27745   Median :  102.0
 Mean   :2.50   Mean   :27745   Mean   :  487.5
 3rd Qu.:3.25   3rd Qu.:41616   3rd Qu.:  233.0
 Max.   :4.00   Max.   :55488   Max.   :65211.0
       G                 Rb                Gb
 Min.   :   24.0   Min.   :  32.00   Min.   : 24.00
 1st Qu.:   38.0   1st Qu.:  40.00   1st Qu.: 26.00
 Median :   59.0   Median :  49.00   Median : 32.00
 Mean   :  194.7   Mean   :  51.85   Mean   : 31.92
 3rd Qu.:  110.0   3rd Qu.:  64.00   3rd Qu.: 36.00
 Max.   :62341.0   Max.   :1234.00   Max.   :255.00
> str(df)
`data.frame': 221952 obs. of 6 variables:
 $ slide: int 1 1 1 1 1 1 1 1 1 1 ...
 $ spot : int 1 2 3 4 5 6 7 8 9 10 ...
 $ R    : int 4416 335 39 568 42 43 56 40 7912 51 ...
 $ G    : int 1533 155 47 211 50 110 64 45 4535 65 ...
 $ Rb   : int 39 38 39 39 39 39 39 40 39 39 ...
 $ Gb   : int 43 42 43 42 42 44 42 43 42 48 ...
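Since foo.txt is not distributed with these slides, here is a self-contained sketch that writes a tiny tab-separated file to a temporary location and reads it back. The file contents, the temporary path and the explicit sep argument are assumptions made for illustration.

```r
# Round trip through a small text file (illustrative sketch)
path <- tempfile(fileext=".txt")   # hypothetical stand-in for "foo.txt"
writeLines(c("slide\tspot\tR\tG",
             "1\t1\t4416\t1533",
             "1\t2\t335\t155",
             "2\t1\t39\t47"), path)

# header=TRUE takes the column names from the first line
df <- read.table(path, header=TRUE, sep="\t")

print(dim(df))
str(df)

unlink(path)                       # remove the temporary file
```

Checking dim() and str() right after the import, as the slide does, is a cheap way to catch a wrong separator or a missing header line.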

18 of 40

Boxplots and density plots

> df <- read.table("foo.txt", header=TRUE)
> dim(df)
[1] 221952      6
> boxplot(df[,c("R","G","Rb","Gb")], col=c("red","green"))
# Ooops
> boxplot(df[,c("R","G","Rb","Gb")], col=c("red","green"), ylim=c(0,300))
# Better!

> plot(density(log(df$G,base=2)), col="green", main="Densities of R and G")
> lines(density(log(df$R,base=2)), col="red")

19 of 40

Summary of "data types"

Vectors (only one type at a time):

> x <- 7:1
> x
[1] 7 6 5 4 3 2 1
> x <- c("a",3,"e")
> x
[1] "a" "3" "e"

Matrices (rectangular, only one type at a time):

> x <- matrix(1:18, nrow=3)
> x
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7   10   13   16
[2,]    2    5    8   11   14   17
[3,]    3    6    9   12   15   18
> x <- matrix(letters[1:12], nrow=3)
> x
     [,1] [,2] [,3] [,4]
[1,] "a"  "d"  "g"  "j"
[2,] "b"  "e"  "h"  "k"
[3,] "c"  "f"  "i"  "l"

Lists (any shape, mixed type):

> x <- list(a=1:4, b=4:6+2i, src="GenePix")
> x
$a
[1] 1 2 3 4
$b
[1] 4+2i 5+2i 6+2i
$src
[1] "GenePix"

Data frames (rectangular, mixed type):

> x <- data.frame(name=c("jon","kim","dan"), age=c(87,78,45),
+                 weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54))
> x
  name age weight height
1  jon  87   76.3   1.67
2  kim  78   96.3   1.84
3  dan  45   62.9   1.54

20 of 40

Questions for Clarification?! (2 minutes)

21 of 40

Installing R on Windows

22 of 40

Random numbers

> x <- rnorm(10000, mean=2, sd=4)
> length(x)
[1] 10000
> mean(x)
[1] 2.007904
> sd(x)
[1] 3.969784
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-12.670  -0.607   2.000   2.008   4.701  15.850
> hist(x)
# or use probabilities (not counts) on y-axis
> hist(x, probability=TRUE)
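Because rnorm() draws new numbers on every call, the values on this slide will differ from run to run. A sketch of how set.seed() makes a simulation repeatable follows; the seed value 42 and the extra runif() and rpois() calls are illustrative additions.

```r
# Reproducible random numbers (illustrative sketch)
set.seed(42)                       # any fixed seed works; 42 is arbitrary
x1 <- rnorm(5, mean=2, sd=4)

set.seed(42)                       # same seed, so the same draws again
x2 <- rnorm(5, mean=2, sd=4)

print(identical(x1, x2))

# Other distributions follow the same r<name>() naming pattern
u <- runif(1000, min=0, max=1)     # uniform on [0, 1]
p <- rpois(1000, lambda=3)         # Poisson counts

print(range(u))
print(table(p))
```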

23 of 40

Defining functions and scripts

> cubic <- function(x) 1 + x^2 - 0.5*x^3
> cubic(0:4)
[1]   1.0   1.5   1.0  -3.5 -15.0

# Read R code from file
> source("cubic.R")
> cubic(0:4)
[1]   1.0   1.5   1.0  -3.5 -15.0
> cubic
function(x) {
  1 + x^2 - 0.5*x^3
}
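To try source() without creating cubic.R by hand, the sketch below writes the function definition to a temporary file first. The temporary path is an assumption for illustration; the slide itself uses a file called cubic.R in the working directory.

```r
# Define a function in a script file and load it with source() (illustrative sketch)
path <- tempfile(fileext=".R")     # hypothetical stand-in for "cubic.R"
writeLines(c("cubic <- function(x) {",
             "  1 + x^2 - 0.5*x^3",
             "}"), path)

source(path)                       # evaluates the file, which defines cubic()

values <- cubic(0:4)
print(values)

unlink(path)                       # remove the temporary file
```

Keeping functions in script files and loading them with source() makes an analysis easy to rerun from scratch in a fresh R session.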

24 of 40

Adding data points to an existing plot

> f <- function(x) x^2

> x <- seq(-10,10, by=0.01)
> eps <- rnorm(length(x), sd=4)
> y <- f(x) + eps

# plot() creates a new plot
> plot(x,y)

# points() adds data points to
# an existing plot
> points(x,f(1.1*x), col="blue")

# Same for abline() and others
> abline(h=0, col="green")
> abline(v=0, col="purple")
> abline(a=0, b=1, col="orange")

# Same for curve() with add=TRUE
> curve(f, col="red", add=TRUE)

(On request by Linda)

25 of 40

Packages

• On CRAN (the Comprehensive R Archive Network) there are today 300+ packages published!
• Browse CRAN at http://www.r-project.org/
• Find what you want.
• Dirt simple to install a package!
• At the Centre for Mathematical Sciences we try to keep all packages installed and up to date on our system.

26 of 40

Install a package

• On Windows extremely easy!
• On all systems:

> install.packages(c("adapt","maptools"))
trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/PACKAGES'
Content type `text/plain; charset=iso-8859-1' length 12674 bytes
opened URL
downloaded 12Kb

trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/adapt_1.0-3.zip'
Content type `application/zip' length 39304 bytes
opened URL
downloaded 38Kb

trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/maptools_0.3-2.zip'
Content type `application/zip' length 129634 bytes
opened URL
downloaded 126Kb

Delete downloaded files (y/N)? y
updating HTML package descriptions
> library(maptools)

27 of 40

Update packages

• On all systems:

> update.packages()
trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/PACKAGES'
Content type `text/plain; charset=iso-8859-1' length 12674 bytes
opened URL
downloaded 12Kb
cluster :
 Version 1.7.3 in c:/PROGRA~1/R/rw1071/library
 Version 1.7.6 on CRAN
Update (y/N)? y
foreign :
 Version 0.6-1 in c:/PROGRA~1/R/rw1071/library
 Version 0.6-3 on CRAN
Update (y/N)? y
...
trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/foreign_0.6-3.zip'
Content type `application/zip' length 109855 bytes
opened URL
downloaded 107Kb

Delete downloaded files (y/N)? y
updating HTML package descriptions
>

28 of 40

Using/loading a package

• On all systems:

> library(maptools)

# Some packages load other packages too
> library(com.braju.sma)
Loading required package: R.oo
R.oo v0.44 (2003/10/29) was successfully loaded.
Loading required package: R.io
R.io v0.44 (2003/10/29) was successfully loaded.
Loading required package: R.graphics
R.graphics v0.44 (2003/10/29) was successfully loaded.
com.braju.sma v0.64 (2003/10/31) was successfully loaded.

> example(MAData)

29 of 40

Questions for Clarification?! (2 minutes)

30 of 40

Fit a linear model (i)

1. Find function to use:

> help.search("linear model")
Help files with alias or title matching 'linear model' using fuzzy matching:

anova.glm(base)        Analysis of Deviance for Generalized Linear Model Fits
anova.lm(base)         ANOVA for Linear Model Fits
glm(base)              Fitting Generalized Linear Models
family.glm(base)       Accessing Generalized Linear Model Fits
summary.glm(base)      Summarizing Generalized Linear Model Fits
lm(base)               Fitting Linear Models
family.lm(base)        Accessing Linear Model Fits
summary.lm(base)       Summarizing Linear Model Fits
lm.fit(base)           Fitter Functions for Linear Models
loglin(base)           Fitting Log-Linear Models
predict.lm(base)       Predict method for Linear Model Fits
cv.glm(boot)           Cross-validation for Generalized Linear Models
glm.diag.plots(boot)   Diagnostics plots for generalized linear models

It says that there is a function called lm() in a package called base.
(Package base is always loaded when R is started.)

31 of 40

Fit a linear model (ii)

2. Look at the help:

> help(lm)

32 of 40

Fit a linear model (iii)

3. Test it on our data:

# Data with noise
> f <- function(x) 3.3 + 0.85*x
> x <- seq(0,10, by=0.01)
> eps <- rnorm(length(x), sd=2)
> y <- f(x) + eps

# Plot data
> plot(x,y)

# Fit a linear model. See ?lm
> fit <- lm(y ~ x)
> fit
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
     3.2780       0.8386

# Plot fitted line
> abline(fit, col="red")

33 of 40

Fit a linear model (iv)

4. Investigate fitted line in detail:

> summary(fit)
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-8.2449 -1.3234  0.1234  1.2670  6.5321

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.27797    0.12608   26.00   <2e-16 ***
x            0.83863    0.02183   38.41   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.996 on 999 degrees of freedom
Multiple R-Squared: 0.5963,  Adjusted R-squared: 0.5959
F-statistic: 1476 on 1 and 999 DF,  p-value: < 2.2e-16

34 of 40

Fit a linear model (v)

5. What does lm() return?

> str(fit)
List of 12
 $ coefficients : Named num [1:2] 3.357 0.824
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
 $ residuals    : Named num [1:1001] 1.09757 -0.00305 -0.95878 1.64874 -2.59366 ...
  ..- attr(*, "names")= chr [1:1001] "1" "2" "3" "4" ...
 $ effects      : Named num [1:1001] -236.55 75.33 -0.99 1.62 -2.63 ...
  ..- attr(*, "names")= chr [1:1001] "(Intercept)" "x" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:1001] 3.36 3.36 3.37 3.38 3.39 ...
  ..- attr(*, "names")= chr [1:1001] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:1001, 1:2] -31.6386 0.0316 0.0316 0.0316 0.0316 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:1001] "1" "2" "3" "4" ...

[ SNIP ]

  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=length 10 <environment>
  .. .. ..- attr(*, "predvars")= language list(y, x)
 - attr(*, "class")= chr "lm"

35 of 40

Fit a linear model (vi)

6. Related functions etc.

> res <- residuals(fit)
> hist(res)
> qqnorm(res)
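Rather than digging into the list that str(fit) shows, the fitted model is usually queried through extractor functions. The sketch below repeats the simulated data from step 3 with a fixed seed; the seed, the confint() call and the new x values passed to predict() are illustrative additions.

```r
# Extractor functions for a fitted linear model (illustrative sketch)
set.seed(1)                          # arbitrary seed, for repeatability only
f <- function(x) 3.3 + 0.85*x
x <- seq(0, 10, by=0.01)
y <- f(x) + rnorm(length(x), sd=2)
fit <- lm(y ~ x)

b    <- coef(fit)                    # named vector: intercept and slope
res  <- residuals(fit)               # observed minus fitted values
yhat <- fitted(fit)                  # fitted values at the observed x
ci   <- confint(fit)                 # confidence intervals, 95% by default

# Predictions at new x values (chosen arbitrarily here)
new <- predict(fit, newdata=data.frame(x=c(2.5, 12)))

print(b)
print(ci)
print(new)
```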

36 of 40

Questions for Clarification?! (2 minutes)

37 of 40

Local Polynomial Regression Fitting (LOESS)

# Read microarray data
> df <- read.table("foo.txt", header=TRUE)

# Use only R & G from slide 1
> df <- df[df$slide == 1, c("R","G")]

# Get the log-ratios and log-intensities
> M <- log(df$R/df$G, base=2)
> A <- log(df$R*df$G, base=2)/2

# Plot M vs A
> plot(A,M)

# Fit a LOESS curve
> fit <- loess(M ~ A)
> Mcurve <- predict(fit, A)
> points(A, Mcurve, col="red")

# Remove trend
> M <- M - Mcurve
> points(A,M, col="blue")
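The same detrending steps can be tried without the microarray file. The sketch below replaces the real data with simulated A and M values; the sine-shaped trend, the noise level, the seed and the name Mnorm are all assumptions made for illustration.

```r
# LOESS detrending on simulated data (illustrative sketch)
set.seed(1)                                   # arbitrary seed
n <- 500
A <- runif(n, min=6, max=14)                  # simulated log-intensities
M <- 0.5*sin(A/2) + rnorm(n, sd=0.2)          # simulated log-ratios with a smooth trend

fit    <- loess(M ~ A)                        # local polynomial regression
Mcurve <- predict(fit, newdata=data.frame(A=A))
Mnorm  <- M - Mcurve                          # log-ratios with the trend removed

# The spread should shrink once the trend is gone
print(sd(M))
print(sd(Mnorm))
```

Keeping the detrended values in a separate variable, instead of overwriting M as on the slide, makes it easier to plot before and after side by side.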

38 of 40

Column-by-column standardization of matrix

# Subtract the mean and scale by the std. dev.
> zscore <- function(y, ...) {
+   (y - mean(y, ...)) / sd(y, ...)
+ }

# Generate data
> x <- matrix(rnorm(1000, mean=2, sd=10), ncol=2)

# Standardize data. If there are NA values, pass argument
# na.rm=TRUE to both mean() and sd() via ..., which tells
# them to remove NAs first
> z <- apply(x, MARGIN=2, FUN=zscore, na.rm=TRUE)

# Create a vector with the mean and the std. dev. of the first column
> c(mean(x[,1]),sd(x[,1]))
[1]  2.44114 10.25328

> c(mean(z[,1]),sd(z[,1]))
[1] -1.616485e-16  1.000000e+00

(On request by Halfdan)
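As a cross-check, base R's scale() performs the same column-by-column standardization. The sketch below compares it with the zscore() approach from this slide on data containing one missing value; the seed, the injected NA and the comparison itself are illustrative additions.

```r
# zscore() via apply() compared with scale() (illustrative sketch)
zscore <- function(y, ...) {
  (y - mean(y, ...)) / sd(y, ...)
}

set.seed(1)                                   # arbitrary seed
x <- matrix(rnorm(1000, mean=2, sd=10), ncol=2)
x[3, 1] <- NA                                 # inject a missing value for illustration

z  <- apply(x, MARGIN=2, FUN=zscore, na.rm=TRUE)
z2 <- scale(x)                                # centers and scales each column, skipping NAs

print(colMeans(z, na.rm=TRUE))                # close to 0 for every column
print(apply(z, 2, sd, na.rm=TRUE))            # 1 for every column
print(all.equal(z, z2, check.attributes=FALSE))
```

The apply() version is the more general pattern, since FUN can be any function of a column, while scale() is a convenient shortcut for this particular case.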

39 of 40

Links

http://www.r-project.org/
- homepage of the R project
- download and install R
- sign up on the r-help mailing list
- more manuals and examples
- additional packages (CRAN)
- conferences and other large-scale projects
- ...

http://www.maths.lth.se/help/R/
- Help @ MC
- Local packages

Although I’m finishing up my thesis... I can help out, but please make sure you read the help first. Don’t forget about the r-help list. Thanks!

40 of 40

Session open for any questions (remaining time)