+ All Categories
Home > Documents > 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical...

1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical...

Date post: 19-Dec-2015
Category:
View: 214 times
Download: 0 times
Share this document with a friend
Popular Tags:
40
1 of 40 Henrik Bengtsson Get started with R Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. [email protected] des available @ http://www.maths.lth.se/bioinformatics/calendar/20031113/
Transcript
Page 1: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

1 of 40

Henrik Bengtsson

Get started with RGet started with R

Mathematical Statistics,

Centre for Mathematical Sciences,

Lund University.

[email protected]

Slides available @ http://www.maths.lth.se/bioinformatics/calendar/20031113/

Page 2: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

2 of 40

Why not R?• Currently there are no good image analysis packages.

Matlab is still better here.• Currently S-Plus (commercial software similar to R) can

handle larger data set.• ...yeah, why not?

Page 3: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

3 of 40

Why R?• R was written by scientists to be used by scientists.• R is a great language.• Runs on Windows, Mac, Unix, Linux, ...• Easy to install.• R is free.• Hundreds(!) of packages available.• Papers are often published with a R package.• Great help and documentation.• 24/7 online mail support.• A great community – friendly and helpful people...• Easy to call Fortran, C, Java, ... libraries• Everyone else is and will be using R

(-Yes, I‘m willing to bet!)

Page 4: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

4 of 40

Starting and quitting ROn Windows click on the R icon or as on Unix type R:

markov{hb}: R

R : Copyright 2003, The R Development Core TeamVersion 1.7.1 (2003-06-16)

R is free software and comes with ABSOLUTELY NO WARRANTY.You are welcome to redistribute it under certain conditions.Type `license()' or `licence()' for distribution details.

R is a collaborative project with many contributors.Type `contributors()' for more information.

Type `demo()' for some demos, `help()' for on-line help, or`help.start()' for a HTML browser interface to help.Type `q()' to quit R.

> 1+2+3[1] 6> q()Save workspace image? [y/n/c]: n

markov{hb}:

markov{hb}: R

R : Copyright 2003, The R Development Core TeamVersion 1.7.1 (2003-06-16)

R is free software and comes with ABSOLUTELY NO WARRANTY.You are welcome to redistribute it under certain conditions.Type `license()' or `licence()' for distribution details.

R is a collaborative project with many contributors.Type `contributors()' for more information.

Type `demo()' for some demos, `help()' for on-line help, or`help.start()' for a HTML browser interface to help.Type `q()' to quit R.

> 1+2+3[1] 6> q()Save workspace image? [y/n/c]: n

markov{hb}:

Page 5: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

5 of 40

Getting help

> help.start()> help.start()

• Browse the help

> help(lm)> help(lm)

> ?lm> ?lm

• Getting help on a specific function

> help.search(“linear model”)> help.search(“linear model”)

• Searching for help

> help.search(“linear model”)Help files with alias or title matching 'linear model' using fuzzymatching:

anova.glm(base) Analysis of Deviance for Generalized Linear Model Fitsanova.lm(base) ANOVA for Linear Model Fitsglm(base) Fitting Generalized Linear Modelsfamily.glm(base) Accessing Generalized Linear Model Fitssummary.glm(base) Summarizing Generalized Linear Model Fitslm(base) Fitting Linear Modelsfamily.lm(base) Accessing Linear Model Fitssummary.lm(base) Summarizing Linear Model Fitslm.fit(base) Fitter Functions for Linear Modelsloglin(base) Fitting Log-Linear Modelspredict.lm(base) Predict method for Linear Model Fitscv.glm(boot) Cross-validation for Generalized Linear Modelsglm.diag.plots(boot) Diagnostics plots for generalized linear

> help.search(“linear model”)Help files with alias or title matching 'linear model' using fuzzymatching:

anova.glm(base) Analysis of Deviance for Generalized Linear Model Fitsanova.lm(base) ANOVA for Linear Model Fitsglm(base) Fitting Generalized Linear Modelsfamily.glm(base) Accessing Generalized Linear Model Fitssummary.glm(base) Summarizing Generalized Linear Model Fitslm(base) Fitting Linear Modelsfamily.lm(base) Accessing Linear Model Fitssummary.lm(base) Summarizing Linear Model Fitslm.fit(base) Fitter Functions for Linear Modelsloglin(base) Fitting Log-Linear Modelspredict.lm(base) Predict method for Linear Model Fitscv.glm(boot) Cross-validation for Generalized Linear Modelsglm.diag.plots(boot) Diagnostics plots for generalized linear

Page 6: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

6 of 40

Running help examples> help(lm)> example(lm)lm> ctl <- c(4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14)lm> trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)lm> group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))lm> weight <- c(ctl, trt)lm> anova(lm.D9 <- lm(weight ~ group))

Analysis of Variance TableResponse: weight Df Sum Sq Mean Sq F value Pr(>F)group 1 0.6882 0.6882 1.4191 0.249Residuals 18 8.7293 0.4850

lm> summary(lm.D90 <- lm(weight ~ group - 1))

Call:lm(formula = weight ~ group - 1)

Residuals: Min 1Q Median 3Q Max-1.0710 -0.4938 0.0685 0.2462 1.3690

Coefficients: Estimate Std. Error t value Pr(>|t|)groupCtl 5.0320 0.2202 22.85 9.55e-15 ***groupTrt 4.6610 0.2202 21.16 3.62e-14 ***---Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.6964 on 18 degrees of freedomMultiple R-Squared: 0.9818, Adjusted R-squared: 0.9798F-statistic: 485.1 on 2 and 18 DF, p-value: < 2.2e-16

lm> summary(resid(lm.D9) - resid(lm.D90)) Min. 1st Qu. Median Mean 3rd Qu. Max.-6.661e-16 -6.245e-17 2.776e-17 1.544e-17 1.110e-16 3.331e-16

lm> opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))lm> plot(lm.D9, las = 1)lm> par(opar)

> help(lm)> example(lm)lm> ctl <- c(4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14)lm> trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)lm> group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))lm> weight <- c(ctl, trt)lm> anova(lm.D9 <- lm(weight ~ group))

Analysis of Variance TableResponse: weight Df Sum Sq Mean Sq F value Pr(>F)group 1 0.6882 0.6882 1.4191 0.249Residuals 18 8.7293 0.4850

lm> summary(lm.D90 <- lm(weight ~ group - 1))

Call:lm(formula = weight ~ group - 1)

Residuals: Min 1Q Median 3Q Max-1.0710 -0.4938 0.0685 0.2462 1.3690

Coefficients: Estimate Std. Error t value Pr(>|t|)groupCtl 5.0320 0.2202 22.85 9.55e-15 ***groupTrt 4.6610 0.2202 21.16 3.62e-14 ***---Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.6964 on 18 degrees of freedomMultiple R-Squared: 0.9818, Adjusted R-squared: 0.9798F-statistic: 485.1 on 2 and 18 DF, p-value: < 2.2e-16

lm> summary(resid(lm.D9) - resid(lm.D90)) Min. 1st Qu. Median Mean 3rd Qu. Max.-6.661e-16 -6.245e-17 2.776e-17 1.544e-17 1.110e-16 3.331e-16

lm> opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))lm> plot(lm.D9, las = 1)lm> par(opar)

Page 7: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

7 of 40

r-help mailing list

• Ask questions on the r-help mailing list.• Get answers 24/7 (including Xmas day!)• Sign up on http://www.r-project.org/

• Things to think of before sending a question:– Make sure you read the help/docs first.– Try to add a code example what you are trying to

do.– Tell what version of R you are using.

Page 8: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

8 of 40

A first example

> x <- c(6,5,4,3,2,1)> x[1] 6 5 4 3 2 1> sum(x)[1] 21> x[c(1,3,5)][1] 6 4 2> x <- 6:1> x[1] 6 5 4 3 2 1> x[1:3][1] 6 5 4> x[1:3]+x[6:4][1] 7 7 7> y <- c(0,10)> x+y[1] 6 15 4 13 2 11

> x <- c(6,5,4,3,2,1)> x[1] 6 5 4 3 2 1> sum(x)[1] 21> x[c(1,3,5)][1] 6 4 2> x <- 6:1> x[1] 6 5 4 3 2 1> x[1:3][1] 6 5 4> x[1:3]+x[6:4][1] 7 7 7> y <- c(0,10)> x+y[1] 6 15 4 13 2 11

> x <- matrix(1:18, nrow=3)> x [,1] [,2] [,3] [,4] [,5] [,6][1,] 1 4 7 10 13 16[2,] 2 5 8 11 14 17[3,] 3 6 9 12 15 18> x[,3][1] 7 8 9> x[-1,4:6] [,1] [,2] [,3][1,] 11 14 17[2,] 12 15 18> str(x) int [1:3, 1:6] 1 2 3 4 5 6 7 8 9 10 ...

> x <- matrix(1:18, nrow=3)> x [,1] [,2] [,3] [,4] [,5] [,6][1,] 1 4 7 10 13 16[2,] 2 5 8 11 14 17[3,] 3 6 9 12 15 18> x[,3][1] 7 8 9> x[-1,4:6] [,1] [,2] [,3][1,] 11 14 17[2,] 12 15 18> str(x) int [1:3, 1:6] 1 2 3 4 5 6 7 8 9 10 ...

Vectors: Matrices (default is to fill column by column):

Page 9: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

9 of 40

R is multiplying elementwise!Matrices:> X <- matrix(1:9, ncol=3, byrow=TRUE)> X [,1] [,2] [,3][1,] 1 2 3[2,] 4 5 6[3,] 7 8 9> I <- diag(1, nrow=3)> I [,1] [,2] [,3][1,] 1 0 0[2,] 0 1 0[3,] 0 0 1> I * X [,1] [,2] [,3][1,] 1 0 0[2,] 0 5 0[3,] 0 0 9> I %*% X [,1] [,2] [,3][1,] 1 2 3[2,] 4 5 6[3,] 7 8 9

> X <- matrix(1:9, ncol=3, byrow=TRUE)> X [,1] [,2] [,3][1,] 1 2 3[2,] 4 5 6[3,] 7 8 9> I <- diag(1, nrow=3)> I [,1] [,2] [,3][1,] 1 0 0[2,] 0 1 0[3,] 0 0 1> I * X [,1] [,2] [,3][1,] 1 0 0[2,] 0 5 0[3,] 0 0 9> I %*% X [,1] [,2] [,3][1,] 1 2 3[2,] 4 5 6[3,] 7 8 9

Why? Because it turns out that it is more common than vector and matrix multiplications.

Page 10: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

10 of 40

R is looping over missed values- really useful indeed!

> x <- matrix(1:16, nrow=3)Warning message:Replacement length not a multiple of the elements to replace in matrix(...)> x [,1] [,2] [,3] [,4] [,5] [,6][1,] 1 4 7 10 13 16[2,] 2 5 8 11 14 1[3,] 3 6 9 12 15 2> y <- c(-1,1)# Multiply (“moving down the columns”)> x * y [,1] [,2] [,3] [,4] [,5] [,6][1,] -1 4 -7 10 -13 16[2,] 2 -5 8 -11 14 -1[3,] -3 6 -9 12 -15 2

> x <- matrix(1:16, nrow=3)Warning message:Replacement length not a multiple of the elements to replace in matrix(...)> x [,1] [,2] [,3] [,4] [,5] [,6][1,] 1 4 7 10 13 16[2,] 2 5 8 11 14 1[3,] 3 6 9 12 15 2> y <- c(-1,1)# Multiply (“moving down the columns”)> x * y [,1] [,2] [,3] [,4] [,5] [,6][1,] -1 4 -7 10 -13 16[2,] 2 -5 8 -11 14 -1[3,] -3 6 -9 12 -15 2

Matrices:Vectors:> x <- 7:1> x[1] 7 6 5 4 3 2 1> y <- 1:3> y[1] 1 2 3> x + y[1] 8 8 8 5 5 5 2

> x <- 7:1> x[1] 7 6 5 4 3 2 1> y <- 1:3> y[1] 1 2 3> x + y[1] 8 8 8 5 5 5 2

R will expand the shortest vector before adding!

# Zero out every 2nd:> x <- 3:8> y <- c(1,0)> x * y[1] 3 0 5 0 7 0

# Zero out every 2nd:> x <- 3:8> y <- c(1,0)> x * y[1] 3 0 5 0 7 0

Page 11: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

11 of 40

> col <- c(“red”, “blue”)> plot(x,y, col=col, cex=1:3, lwd=4)

> col <- c(“red”, “blue”)> plot(x,y, col=col, cex=1:3, lwd=4)

Loop over plot attributes:

A first plot example

> x <- seq(from=1, to=2*pi, length=41)> y <- sin(x)# Scatter plot, with red data points of size 2> plot(x,y, col="red", cex=2)# A line with double width through data points> lines(x,y, lwd=2)

> x <- seq(from=1, to=2*pi, length=41)> y <- sin(x)# Scatter plot, with red data points of size 2> plot(x,y, col="red", cex=2)# A line with double width through data points> lines(x,y, lwd=2)

> dev.print(png, "sin.png")> dev.print(postscript, “sin.ps”)

> dev.print(png, "sin.png")> dev.print(postscript, “sin.ps”)

Write last plot to file:

(also in menus on Windows)

Page 12: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

12 of 40

Questions for Clarification?!(2 minutes)

Page 13: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

13 of 40

Putting names on things - helps you avoid stupid bugs etc.

Vectors: > x <- c(87,76.3,1.67)> x[1] 87.0 76.3 1.67> names(x) <- c(“age”, “weight”, “height”) > x age weight height 87.0 76.3 1.67# Alternatively> x <- c(age=87, weight=76.3, height=1.67)

> x <- c(87,76.3,1.67)> x[1] 87.0 76.3 1.67> names(x) <- c(“age”, “weight”, “height”) > x age weight height 87.0 76.3 1.67# Alternatively> x <- c(age=87, weight=76.3, height=1.67)

Why? > x[“age”][1] 87.0> bmi <- x[“weight”]/x[“height”]^2

> x[“age”][1] 87.0> bmi <- x[“weight”]/x[“height”]^2

> x[1][1] 87.0> bmi <- x[2]/x[3]^2

> x[1][1] 87.0> bmi <- x[2]/x[3]^2

cf.

Page 14: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

14 of 40

Putting names on things

Matrices: > x1 <- c(87,76.3,1.67)> x2 <- c(78,96.3,1.84)> x3 <- c(45,62.9,1.54)> x <- matrix(c(x1,x2,x3), nrow=3, byrow=TRUE) > x [,1] [,2] [,3][1,] 87 76.3 1.67[2,] 78 96.3 1.84[3,] 45 62.9 1.54> colnames(x) <- c("age", "weight", "height")> rownames(x) <- c(“jon”, “kim”, “dan”)> x age weight heightjon 87 76.3 1.67kim 78 96.3 1.84dan 45 62.9 1.54> x[“jon”,] age weight height 87.00 76.30 1.67> x[,c("weight","age")] weight agejon 76.3 87kim 96.3 78dan 62.9 45> bmi <- x[,"weight"]/x[,"height"]^2> bmi jon kim dan27.35846 28.44400 26.52218

> x1 <- c(87,76.3,1.67)> x2 <- c(78,96.3,1.84)> x3 <- c(45,62.9,1.54)> x <- matrix(c(x1,x2,x3), nrow=3, byrow=TRUE) > x [,1] [,2] [,3][1,] 87 76.3 1.67[2,] 78 96.3 1.84[3,] 45 62.9 1.54> colnames(x) <- c("age", "weight", "height")> rownames(x) <- c(“jon”, “kim”, “dan”)> x age weight heightjon 87 76.3 1.67kim 78 96.3 1.84dan 45 62.9 1.54> x[“jon”,] age weight height 87.00 76.30 1.67> x[,c("weight","age")] weight agejon 76.3 87kim 96.3 78dan 62.9 45> bmi <- x[,"weight"]/x[,"height"]^2> bmi jon kim dan27.35846 28.44400 26.52218

Page 15: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

15 of 40

Lists> x <- c(a=1:2, b=3, c=5:8)> xa1 a2 b c1 c2 c3 c4 1 2 3 5 6 7 8> x <- list(a=1:2, b=3, c=5:8)> x$a[1] 1 2$b[1] 3$c[1] 5 6 7 8> x$a[1] 1 2> x[[“a”]][1] 1 2> x[2:3]$b[1] 3$c[1] 5 6 7 8> x[c(“b”,”c”)]$b[1] 3$c[1] 5 6 7 8

> x <- c(a=1:2, b=3, c=5:8)> xa1 a2 b c1 c2 c3 c4 1 2 3 5 6 7 8> x <- list(a=1:2, b=3, c=5:8)> x$a[1] 1 2$b[1] 3$c[1] 5 6 7 8> x$a[1] 1 2> x[[“a”]][1] 1 2> x[2:3]$b[1] 3$c[1] 5 6 7 8> x[c(“b”,”c”)]$b[1] 3$c[1] 5 6 7 8

> x <- list(a=1:4, b=4:6+2i, src=“GenePix”)> x$a[1] 1 2 3 4$b[1] 4+2i 5+2i 6+2i$src[1] "GenePix“

> x <- list(a=1:4, b=4:6+2i, src=“GenePix”)> x$a[1] 1 2 3 4$b[1] 4+2i 5+2i 6+2i$src[1] "GenePix“

Vectors and matrices can only containone type of data at the time, e.g. either numbers or strings, but lists can carry mixed types:

Page 16: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

16 of 40

Data frames

> df <- data.frame(name=c(“jon”,”kim”,”dan”), age=c(87,78,45), weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54))> df name age weight height1 jon 87 76.3 1.672 kim 78 96.3 1.843 dan 45 62.9 1.54> df$weight[1] 76.3 96.3 62.9> df[,c("name","age")] name age1 jon 872 kim 783 dan 45

> as.matrix(df) name age weight height1 "jon" "87" "76.3" "1.67"2 "kim" "78" "96.3" "1.84"3 "dan" "45" "62.9" "1.54"

> df <- data.frame(name=c(“jon”,”kim”,”dan”), age=c(87,78,45), weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54))> df name age weight height1 jon 87 76.3 1.672 kim 78 96.3 1.843 dan 45 62.9 1.54> df$weight[1] 76.3 96.3 62.9> df[,c("name","age")] name age1 jon 872 kim 783 dan 45

> as.matrix(df) name age weight height1 "jon" "87" "76.3" "1.67"2 "kim" "78" "96.3" "1.84"3 "dan" "45" "62.9" "1.54"

Data frames are very powerful. Simply speaking you can treat them as both lists and matrices:

Page 17: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

17 of 40

Importing data

> df <- read.table(“foo.txt”, header=TRUE)> dim(df)[1] 221952 6> summary(df) slide spot R Min. :1.00 Min. : 1 Min. : 30.0 1st Qu.:1.75 1st Qu.:13873 1st Qu.: 60.0 Median :2.50 Median :27745 Median : 102.0 Mean :2.50 Mean :27745 Mean : 487.5 3rd Qu.:3.25 3rd Qu.:41616 3rd Qu.: 233.0 Max. :4.00 Max. :55488 Max. :65211.0 G Rb Gb Min. : 24.0 Min. : 32.00 Min. : 24.00 1st Qu.: 38.0 1st Qu.: 40.00 1st Qu.: 26.00 Median : 59.0 Median : 49.00 Median : 32.00 Mean : 194.7 Mean : 51.85 Mean : 31.92 3rd Qu.: 110.0 3rd Qu.: 64.00 3rd Qu.: 36.00 Max. :62341.0 Max. :1234.00 Max. :255.00> str(df)`data.frame': 221952 obs. of 6 variables: $ slide: int 1 1 1 1 1 1 1 1 1 1 ... $ spot : int 1 2 3 4 5 6 7 8 9 10 ... $ R : int 4416 335 39 568 42 43 56 40 7912 51 ... $ G : int 1533 155 47 211 50 110 64 45 4535 65 ... $ Rb : int 39 38 39 39 39 39 39 40 39 39 ... $ Gb : int 43 42 43 42 42 44 42 43 42 48 ...

> df <- read.table(“foo.txt”, header=TRUE)> dim(df)[1] 221952 6> summary(df) slide spot R Min. :1.00 Min. : 1 Min. : 30.0 1st Qu.:1.75 1st Qu.:13873 1st Qu.: 60.0 Median :2.50 Median :27745 Median : 102.0 Mean :2.50 Mean :27745 Mean : 487.5 3rd Qu.:3.25 3rd Qu.:41616 3rd Qu.: 233.0 Max. :4.00 Max. :55488 Max. :65211.0 G Rb Gb Min. : 24.0 Min. : 32.00 Min. : 24.00 1st Qu.: 38.0 1st Qu.: 40.00 1st Qu.: 26.00 Median : 59.0 Median : 49.00 Median : 32.00 Mean : 194.7 Mean : 51.85 Mean : 31.92 3rd Qu.: 110.0 3rd Qu.: 64.00 3rd Qu.: 36.00 Max. :62341.0 Max. :1234.00 Max. :255.00> str(df)`data.frame': 221952 obs. of 6 variables: $ slide: int 1 1 1 1 1 1 1 1 1 1 ... $ spot : int 1 2 3 4 5 6 7 8 9 10 ... $ R : int 4416 335 39 568 42 43 56 40 7912 51 ... $ G : int 1533 155 47 211 50 110 64 45 4535 65 ... $ Rb : int 39 38 39 39 39 39 39 40 39 39 ... $ Gb : int 43 42 43 42 42 44 42 43 42 48 ...

Data in text (ASCII) files can be imported using read.table():

Page 18: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

18 of 40

Boxplots and density plots

> df <- read.table(“foo.txt”, header=TRUE)> dim(df)[1] 221952 6> boxplot(df[,c("R","G","Rb","Gb")], col=c("red","green"))# Ooops> boxplot(df[,c("R","G","Rb","Gb")], col=c("red","green"), ylim=c(0,300))# Better!

> df <- read.table(“foo.txt”, header=TRUE)> dim(df)[1] 221952 6> boxplot(df[,c("R","G","Rb","Gb")], col=c("red","green"))# Ooops> boxplot(df[,c("R","G","Rb","Gb")], col=c("red","green"), ylim=c(0,300))# Better!

> plot(density(log(df$G,base=2)), col="green", main="Densities of R and G")> lines(density(log(df$R,base=2)), col="red")

> plot(density(log(df$G,base=2)), col="green", main="Densities of R and G")> lines(density(log(df$R,base=2)), col="red")

Page 19: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

19 of 40

Summary of “data types”Vectors (only one type at the time):

> x <- 7:1> x[1] 7 6 5 4 3 2 1> x <- c("a",3,"e")> x[1] "a" “3" "e"

> x <- 7:1> x[1] 7 6 5 4 3 2 1> x <- c("a",3,"e")> x[1] "a" “3" "e"

> x <- matrix(1:18, nrow=3)> x [,1] [,2] [,3] [,4] [,5] [,6][1,] 1 4 7 10 13 16[2,] 2 5 8 11 14 17[3,] 3 6 9 12 15 18> x <- matrix(letters[1:12], nrow=3)> x [,1] [,2] [,3] [,4][1,] "a" "d" "g" "j"[2,] "b" "e" "h" "k"[3,] "c" "f" "i" "l"

> x <- matrix(1:18, nrow=3)> x [,1] [,2] [,3] [,4] [,5] [,6][1,] 1 4 7 10 13 16[2,] 2 5 8 11 14 17[3,] 3 6 9 12 15 18> x <- matrix(letters[1:12], nrow=3)> x [,1] [,2] [,3] [,4][1,] "a" "d" "g" "j"[2,] "b" "e" "h" "k"[3,] "c" "f" "i" "l"

Matrices (rectangular, only one type at the time):

> x <- list(a=1:4, b=4:6+2i, src=“GenePix”)> x$a[1] 1 2 3 4$b[1] 4+2i 5+2i 6+2i$src[1] "GenePix“

> x <- list(a=1:4, b=4:6+2i, src=“GenePix”)> x$a[1] 1 2 3 4$b[1] 4+2i 5+2i 6+2i$src[1] "GenePix“

Lists (any shape, mixed type):

> x <- data.frame(name=c(“jon”,”kim”,”dan”), age=c(87,78,45), weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54))> x name age weight height1 jon 87 76.3 1.672 kim 78 96.3 1.843 dan 45 62.9 1.54

> x <- data.frame(name=c(“jon”,”kim”,”dan”), age=c(87,78,45), weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54))> x name age weight height1 jon 87 76.3 1.672 kim 78 96.3 1.843 dan 45 62.9 1.54

Data frames (rectangular, mixed type):

Page 20: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

20 of 40

Questions for Clarification?!(2 minutes)

Page 21: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

21 of 40

Installing R on Windows

Page 22: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

22 of 40

Random numbers> x <- rnorm(10000, mean=2, sd=4)> length(x)[1] 10000> mean(x)[1] 2.007904> sd(x)[1] 3.969784> summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max.-12.670 -0.607 2.000 2.008 4.701 15.850> hist(x)# or use probabilities (not counts) on y-axis> hist(x, probability=TRUE)

> x <- rnorm(10000, mean=2, sd=4)> length(x)[1] 10000> mean(x)[1] 2.007904> sd(x)[1] 3.969784> summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max.-12.670 -0.607 2.000 2.008 4.701 15.850> hist(x)# or use probabilities (not counts) on y-axis> hist(x, probability=TRUE)

Page 23: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

23 of 40

Defining functions and scripts

> cubic <- function(x) 1 + x^2 - 0.5*x^3> cubic(0:4)[1] 1.0 1.5 1.0 -3.5 -15.0

> cubic <- function(x) 1 + x^2 - 0.5*x^3> cubic(0:4)[1] 1.0 1.5 1.0 -3.5 -15.0

# Read R code from file> source(“cubic.R”)> cubic(0:4)[1] 1.0 1.5 1.0 -3.5 -15.0> cubicfunction(x) { 1 + x^2 - 0.5*x^3}

# Read R code from file> source(“cubic.R”)> cubic(0:4)[1] 1.0 1.5 1.0 -3.5 -15.0> cubicfunction(x) { 1 + x^2 - 0.5*x^3}

Page 24: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

24 of 40

Adding data points to an existing plot> f <- function(x) x^2

> x <- seq(-10,10, by=0.01)> eps <- rnorm(length(x), sd=4)> y <- f(x) + eps

# plot() creates a new plot> plot(x,y)

# points() add data points to # an existing plot> points(x,f(1.1*x), col="blue")

# Same for abline() and others> abline(h=0, col="green")> abline(v=0, col="purple")> abline(a=0, b=1, col="orange")

# Same for curve() with add=TRUE> curve(f, col="red“, add=TRUE)

(On request by Linda)

Page 25: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

25 of 40

Packages• On CRAN - Comprehensive R Archive

Network – there are today 300+ packages published!

• Browse CRAN at http://www.r-project.org/

• Find what you want.• Dirt simple to install package!

• At the Centre for Mathematical Sciences we try to keep install and update all package on our system.

Page 26: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

26 of 40

Install a package• On Windows extremely easy!

• On all systems:> install.packages(c("adapt","maptools"))trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/PACKAGES'Content type `text/plain; charset=iso-8859-1' length 12674 bytesopened URLdownloaded 12Kb

trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/adapt_1.0-3.zip'Content type `application/zip' length 39304 bytesopened URLdownloaded 38Kb

trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/maptools_0.3-2.zip'Content type `application/zip' length 129634 bytesopened URLdownloaded 126Kb

Delete downloaded files (y/N)? yupdating HTML package descriptions> library(maptools)

Page 27: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

27 of 40

Update packages• On all systems:

> update.packages() trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/PACKAGES'Content type `text/plain; charset=iso-8859-1' length 12674 bytes opened URLdownloaded 12Kbcluster : Version 1.7.3 in c:/PROGRA~1/R/rw1071/library Version 1.7.6 on CRANUpdate (y/N)? yforeign : Version 0.6-1 in c:/PROGRA~1/R/rw1071/library Version 0.6-3 on CRANUpdate (y/N)? y...trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/foreign_0.6-3.zip'Content type `application/zip' length 109855 bytes opened URLdownloaded 107Kb

Delete downloaded files (y/N)? yupdating HTML package descriptions>

Page 28: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

28 of 40

Using/loading a package• On all systems:

> library(maptools)

# Some package loads other package too> library(com.braju.sma)Loading required package: R.ooR.oo v0.44 (2003/10/29) was successfully loaded.Loading required package: R.ioR.io v0.44 (2003/10/29) was successfully loaded.Loading required package: R.graphicsR.graphics v0.44 (2003/10/29) was successfully loaded.com.braju.sma v0.64 (2003/10/31) was successfully loaded.

> example(MAData)

Page 29: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

29 of 40

Questions for Clarification?!(2 minutes)

Page 30: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

30 of 40

Fit a linear model (i)

> help.search(“linear model”)Help files with alias or title matching 'linear model' using fuzzymatching:

anova.glm(base) Analysis of Deviance for Generalized Linear Model Fitsanova.lm(base) ANOVA for Linear Model Fitsglm(base) Fitting Generalized Linear Modelsfamily.glm(base) Accessing Generalized Linear Model Fitssummary.glm(base) Summarizing Generalized Linear Model Fitslm(base) Fitting Linear Modelsfamily.lm(base) Accessing Linear Model Fitssummary.lm(base) Summarizing Linear Model Fitslm.fit(base) Fitter Functions for Linear Modelsloglin(base) Fitting Log-Linear Modelspredict.lm(base) Predict method for Linear Model Fitscv.glm(boot) Cross-validation for Generalized Linear Modelsglm.diag.plots(boot) Diagnostics plots for generalized linear models

1. Find function to use:

It says that there is a function called lm() in a package called base.(Package base is always loaded when R is started)

Page 31: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

31 of 40

Fit a linear model (ii)

> help(lm)

2. Look at the help:

Page 32: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

32 of 40

Fit a linear model (iii)

# Data with noise> f <- function(x) 3.3 + 0.85*x> x <- seq(0,10, by=0.01)> eps <- rnorm(length(x), sd=2)> y <- f(x) + eps

# Plot data> plot(x,y)

# Fit a linear model. See ?lm> fit <- lm(y ~ x)> fitCall:lm(formula = y ~ x)Coefficients:(Intercept) x 3.2780 0.8386

# Plot fitted line> abline(fit, col=“red")

3. Test it on our data:

Page 33: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

33 of 40

Fit a linear model (iv)

4. Investigate fitted line in detail:> summary(fit)Call:lm(formula = y ~ x)Residuals: Min 1Q Median 3Q Max-8.2449 -1.3234 0.1234 1.2670 6.5321

Coefficients: Estimate Std. Error t value Pr(>|t|)(Intercept) 3.27797 0.12608 26.00 <2e-16 ***x 0.83863 0.02183 38.41 <2e-16 ***---Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.996 on 999 degrees of freedomMultiple R-Squared: 0.5963, Adjusted R-squared: 0.5959F-statistic: 1476 on 1 and 999 DF, p-value: < 2.2e-16

Page 34: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

34 of 40

Fit a linear model (v)5. What does lm() return?

> str(fit)List of 12 $ coefficients : Named num [1:2] 3.357 0.824 ..- attr(*, "names")= chr [1:2] "(Intercept)" "x" $ residuals : Named num [1:1001] 1.09757 -0.00305 -

0.95878 1.64874 -2.59366 ... ..- attr(*, "names")= chr [1:1001] "1" "2" "3" "4" ... $ effects : Named num [1:1001] -236.55 75.33 -0.99

1.62 -2.63 ... ..- attr(*, "names")= chr [1:1001] "(Intercept)" "x" "" ""

... $ rank : int 2 $ fitted.values: Named num [1:1001] 3.36 3.36 3.37 3.38

3.39 ... ..- attr(*, "names")= chr [1:1001] "1" "2" "3" "4" ... $ assign : int [1:2] 0 1 $ qr :List of 5 ..$ qr : num [1:1001, 1:2] -31.6386 0.0316 0.0316

0.0316 0.0316 ... .. ..- attr(*, "dimnames")=List of 2 .. .. ..$ : chr [1:1001] "1" "2" "3" "4" ...

[ SNIP ]

.. .. ..- attr(*, "order")= int 1

.. .. ..- attr(*, "intercept")= int 1

.. .. ..- attr(*, "response")= int 1

.. .. ..- attr(*, ".Environment")=length 10 <environment>

.. .. ..- attr(*, "predvars")= language list(y, x) attr(*, "class")= chr "lm"

Page 35: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

35 of 40

Fit a linear model (vi)6. Related functions etc.

> res <- residuals(fit)> hist(res)> qqnorm(res)

Page 36: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

36 of 40

Questions for Clarification?!(2 minutes)

Page 37: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

37 of 40

Local Polynomial Regression Fitting (LOESS)

# Read microarray data> df <- read.table("foo.txt", header=TRUE)

# Use only R & G from slide 1> df <- df[df$slide == 1, c(“R”,”G”)]

# Get the log-ratios and log-intensities> M <- log(df$R/df$G, base=2)> A <- log(df$R*df$G, base=2)/2

# Plot M vs A> plot(A,M)

# Fit a LOESS curve> fit <- loess(M ~ A)> Mcurve <- predict(fit, A)> points(A, Mcurve, col="red")

# Remove trend> M <- M - Mcurve> points(A,M, col=”blue”)

Page 38: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

38 of 40

Column-by-column standardization of matrix

# Subract the mean and scale by the std. dev.> zscore <- function(y, ...) {+ (y - mean(y, ...)) / sd(y, ...)+ }

# Generate data> x <- matrix(rnorm(1000, mean=2, sd=10), ncol=2)

# Standardize data. If there are NA values, pass argument # na.rm=TRUE to both mean() and sd() via ..., which tells # them to remove NAs first)> z <- apply(x, MARGIN=2, FUN=zscore, na.rm=TRUE)

# Create a vector with the mean and the std. dev. of the first column> c(mean(x[,1]),sd(x[,1]))[1] 2.44114 10.25328

> c(mean(z[,1]),sd(z[,1]))[1] -1.616485e-16 1.000000e+00

(On request by Halfdan)

Page 39: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

39 of 40

Links

http://www.r-project.org/ - homepage of the R project - download and install R - sign up on the r-help mailing list - more manuals and examples. - additional packages (CRAN) - conferences and other large-scale projects - ...

http://www.maths.lth.se/help/R/- Help @ MC- Local packages

Although I’m finishing up my thesis... I can help out, but please make sure you read the help first. Don’t forget about the r-help list. Thanks!

Page 40: 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available.

40 of 40

Session open for any questions(remaining time)


Recommended