
Week 14+ Class Activities File: week-14PLUS-R-stuff-07dec08.doc Directory (hp/compaq): C:..\sta402\handouts

TOPICS IN STATISTICAL PROGRAMMING (varies)
* Introduction to quantitative programming in R/S-Plus
* objects – vectors, lists, matrices, dataframes
* reading data [scan, read.table, sas.get]
* summarizing data sets [mean, var, summary, table]
* graphical displays [plot, pairs, trellis displays]
* writing functions

References:
R Project downloads, FAQs and manuals http://www.r-project.org/

[BM] Braun WJ and Murdoch DJ 2007. A First Course in Statistical Programming with R. Cambridge University Press. Cambridge, UK ISBN 978-0-521-69424-7

Muenchen RA. 2008. R for SAS and SPSS Users. Springer-Verlag. ISBN 978-0-387-09417-5

Venables WN and Ripley BD. 1997. Modern Applied Statistics with S-Plus. Springer-Verlag.

Venables WN and Ripley BD. 2000. S Programming. Springer-Verlag.

Spector P. 1994. An Introduction to S and S-Plus. Duxbury Press.

Krause A and Olson M. 1997. The Basics of S and S-Plus. Springer.

S-Plus for Windows Documentation http://216.211.131.2/support/doc_splus_win.asp

Getting Started with S-Plus for Windows (GUI stuff) http://216.211.131.2/support/splus60win/tutorial.pdf

Past SAS activities and how these might relate to R/S-Plus ideas
Week | SAS ideas | R/S-Plus ideas


1 BASIC CONCEPTS *Review basic concepts of statistical computing and research data management * Introduce SAS data sets * Review the form of SAS Statements and SAS names * Introduce SAS procedures * Review the structure of SAS programs * Describe SAS data libraries and what they can contain * Show documenting SAS programs using comments * Illustrate running SAS programs and basic debugging

Introduce language elements Objects in R/S-Plus (scalars, vectors, matrices, dataframes, lists) Object-oriented functionality

2 USING SAS PROCEDURES * Introduce the idea of SAS system options * Briefly review statements that can be used with most procedures (BY, WHERE, TITLE, FOOTNOTE, LABEL, FORMAT) * PROC CONTENTS for describing a data set * PROC PRINT for listing the observations in a data set * PROC CHART and PROC PLOT for producing low resolution graphs * PROC FREQ for one-way frequency tables and n-way cross-tabulations * PROC UNIVARIATE for descriptive statistics and distributional information * PROC MEANS for descriptive statistics * PROC SORT for sorting a data set * SAS documentation and the online help system

* options() * titles built into most displays * names(dataframe) * low resolution graphs via printer() * table() – frequency tables * summary() * mean(), var(), quantile(), etc. * sort(), order() * ?command

3 REPORT WRITING * Introduce the Output Delivery System (ODS) for customizing procedure output * PROC TABULATE for producing nicely-formatted tables

No ODS analog although can save graphics in a host of different formats

4 AN INTRODUCTION TO STATISTICAL MODELING * PROC REG for linear modeling (a very basic introduction) * PROC GLM for anova models

lm (linear models), glm (generalized linear models)

5 HIGH-RESOLUTION GRAPHICS AND FORMATS * Introduce concepts related to high-resolution graphs * PROC GCHART and PROC GPLOT for producing high-resolution graphs * SAS-supplied formats and PROC FORMAT for user-defined formats

plot, pairs, boxplot,


6 TRANSFORMING SAS DATA SETS * Creating SAS data sets with DATA steps: flow of execution, including the program data vector * Creating variables in DATA steps with assignment statements * Statements: DATA, SET, OUTPUT, RETURN, WHERE, IF, DROP, KEEP, LENGTH * Subsetting observations and variables * Using SAS functions and operators * Working with SAS date values (also time and date-time) * Introduction to missing values

Can operate on Data frames in manner similar to IML manipulations

7 SAS PROGRAMMING * Declarative vs. executable statements * Statements: RETAIN, RENAME, LABEL, FORMAT, SUM * Using formats in DATA steps * Conditional execution * DO groups * Arrays * More on missing values

for (i in 1:n) { } loops; if / else

8 COMBINING AND MANAGING SAS DATA SETS * SET statement for concatenation and interleaving * MERGE statement for joining observations * UPDATE statement for updating a master file (maybe) * Special variables: IN, END, FIRST, and LAST * Creating multiple data sets in one DATA step * Reshaping data sets * Managing data sets using PROC COPY and PROC DATASETS * Transporting data sets between hosts

9 WRITING EXTERNAL FILES * Statements: FILE, PUT * Using DATA _NULL_ * PUT function * Creating customized reports using DATA steps

data.dump (S-Plus); write.table, cat and dump in R

10 MACRO LANGUAGE * Why use macros? * Macro variables – system-defined and user-defined * Macros * Macro parameters * Macro functions * Conditional execution and DO loops * CALL SYMPUT

functions


11 SAS/IML Programming * Basic matrix concepts: rows, columns, scalars * matrix operators * subscripting * matrix functions * creating matrices from data sets and vice versa * sample applications

Matrix manipulation is part of the language

Overview and abbreviated history (from Krause/Olson and R project page)
S is a quantitative programming environment developed at AT&T Bell Labs
* 1976 start
* 1981 rewrite in C and port to unix
* 1985 function concept introduced
* 1987 S-Plus, a supported and extended version of S, started
* 1989 S-Plus for DOS
* 1993 S-Plus for Windows
* 1997 GUI for S-Plus for Windows
* ca 1997 R development team working (after R was started by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland)

R is a language and environment for statistical computing and graphics.

* a GNU project that is similar to the S language; R can be viewed as a different implementation of S

* provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.

The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation's (http://www.gnu.org/) GNU General Public License (http://www.r-project.org/COPYING) in source code form. It compiles and runs out of the box on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux). It also compiles and runs on Windows 9x/NT/2000 and MacOS.

GUI vs. Command Line
S-Plus has a Graphical User Interface (GUI) available -- see http://216.211.131.2/support/splus60win/tutorial.pdf for more information.
R has GUI packages –
1. Rcmdr (http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/)
2. Tinn-R (http://www.sciviews.org/Tinn-R/), a GUI IDE that links to R and runs it. It works a lot like SAS: you write code and hit run.
We focus on command line interactions with R/S-Plus.

Getting help
1. menu
2. > help() [where ">" is the default command prompt in S/R]
3. > ?quantile
   > help(quantile) # for help on a specific command
   > ?"^"
   > help("^") # for help on an operator

Syntax
Assignment (via "<-" or "_"; the underscore form is accepted by S-Plus but not by current versions of R)

Functions/Operators
1. HELP: ?fun.name, help(fun.name), args(fun.name)

Object-oriented: Classes and Methods (introduced into S in 1992)
Classes = type of object under consideration
Methods = what to do with object
Most commonly used Methods: print, plot, summary (others: predict, fitted, resid, coef)
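A minimal sketch of how a class and its methods fit together in R. The class name "temperature" and its constructor are made up for illustration; the point is only that print() and summary() find the method whose name ends in the object's class.

```r
# Constructor for a hypothetical class "temperature" (illustrative name only)
temperature <- function(celsius) {
  obj <- list(celsius = celsius)
  class(obj) <- "temperature"
  obj
}

# print method: used by print(obj) and when obj is typed at the prompt
print.temperature <- function(x, ...) {
  cat("Temperatures (C):", x$celsius, "\n")
  invisible(x)
}

# summary method: returns a named numeric vector
summary.temperature <- function(object, ...) {
  c(n = length(object$celsius), mean = mean(object$celsius))
}

temps <- temperature(c(18.5, 21.0, 19.5))
print(temps)                     # dispatches to print.temperature
temp.summary <- summary(temps)   # dispatches to summary.temperature
class(temps)
```

Removing the class attribute with unclass(temps) and printing again shows the default method at work instead.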


Methods implementation
Methodname.classname
e.g., plot has a default method (plot.default, just as summary has summary.default) and specific methods such as plot.lm and plot.data.frame

methods(plot)
(S-Plus output; each method name is printed beneath the library that supplies it)
nlme3 nlme3 splus splus splus splus
"plot.ACF" "plot.Variogram" "plot.aareg" "plot.acf" "plot.agnes" "plot.arima"
nlme3 splus splus nlme3
"plot.augPred" "plot.censorReg" "plot.compare.fits" "plot.compareFits"
splus splus stat main
"plot.cox.zph" "plot.data.frame" "plot.dataframe" "plot.default"
splus splus splus splus stat
"plot.design" "plot.diana" "plot.discrim" "plot.factor" "plot.formula"
splus splus nlme3 splus nlme3
"plot.gam" "plot.glm" "plot.gls" "plot.hexbin" "plot.intervals.lmList"
splus splus splus nlme3
"plot.jack.after.bootstrap" "plot.kaplanMeier" "plot.lm" "plot.lmList"
splus nlme3 splus splus splus splus
"plot.lmRobMM" "plot.lme" "plot.lms" "plot.loadings" "plot.loess" "plot.lts"
splus splus splus splus splus
"plot.mcd" "plot.mlm" "plot.mona" "plot.moving.sd" "plot.multicomp"
splus nlme3 nlme3 nlme3
"plot.mve" "plot.nffGroupedData" "plot.nfnGroupedData" "plot.nls"
nlme3 splus nlme3 splus
"plot.nmGroupedData" "plot.partition" "plot.pdMat" "plot.preplot.gam"
splus splus stat nlme3
"plot.preplot.loess" "plot.princomp" "plot.profile" "plot.ranef.lmList"
nlme3 splus splus trellis
"plot.ranef.lme" "plot.raw.Internal" "plot.resamp" "plot.shingle"
splus nlme3 splus stat
"plot.signalSeries" "plot.simulate.lme" "plot.size.scale" "plot.stl"
splus splus splus splus
"plot.survfit" "plot.timeSeries" "plot.times" "plot.tree"
splus trellis splus splus
"plot.tree.sequence" "plot.trellis" "plot.varcomp" "plot.xy"

methods(class=lm)
nlme3 nlme3 splus splus splus splus stat
"AIC.lm" "BIC.lm" "add1.lm" "alias.lm" "all.equal.lm" "anova.lm" "coef.lm"
stat splus splus splus stat
"deviance.lm" "drop1.lm" "dummy.coef.lm" "durbinWatson.lm" "effects.lm"
stat stat splus stat nlme3 stat
"family.lm" "formula.lm" "kappa.lm" "labels.lm" "logLik.lm" "model.frame.lm"
splus splus splus splus splus
"model.matrix.lm" "multicomp.lm" "plot.lm" "predict.lm" "print.lm"
splus splus nlme3 splus splus
"print.summary.lm" "proj.lm" "qqnorm.lm" "residuals.lm" "ssType3.lm"
stat splus menu menu menu
"step.lm" "summary.lm" "tabPlot.lm" "tabPredict.lm" "tabSummary.lm"

Operators
+ - * / ^
- (unary minus)
%/% integer divide (7 %/% 2 = 3)
%% modulo (7 %% 2 = 1)
%*% matrix multiplication
%o% outer product
: sequence

Logical operators
== != < <= > >=
& = elementwise AND
vec1 <- c(2,4,8)
vec2 <- c(3,9,27)
vec1 <=4 & vec2 <9
(T T F) AND (T F F) = (T F F)
| = elementwise OR
vec1 <- c(2,4,8)
vec2 <- c(3,9,27)
vec1 <=4 | vec2 <9
(T T F) OR (T F F) = (T T F)
&& = Control AND
(mode(vec1)=="numeric") && (min(vec1)>0)
T
|| = Control OR
! = unary not
!(vec1 <=4 | vec2 <9)
(F F T)

Precedence (highest to lowest)
{} > () > $ > [ [[ > ^ > - (unary minus) > : > %*% %/% %% > * / > + - > < > <= >= == != > ! > [& && | ||] > <<- > <- _ ->

Example with functions
x1 <- c(5:1, 1:5)
x1
 [1] 5 4 3 2 1 1 2 3 4 5
x1==5
 [1] T F F F F F F F F T
y1 <- c(letters[5:1],LETTERS[1:5])
y1
 [1] "e" "d" "c" "b" "a" "A" "B" "C" "D" "E"
y1[x1==5]
[1] "e" "E"
(1:length(x1))[x1==5]
[1] 1 10
seq(along=x1)[x1==5]
[1] 1 10
x2 <- c(x1,NA)
x2
 [1] 5 4 3 2 1 1 2 3 4 5 NA
x2[!is.na(x2)]
 [1] 5 4 3 2 1 1 2 3 4 5
sort(x1)
 [1] 1 1 2 2 3 3 4 4 5 5
order(x1)
 [1] 5 6 4 7 3 8 2 9 1 10
y1[order(x1)]
 [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E"
new.mat <- matrix(16:1,nrow=4,byrow=T)
new.mat
     [,1] [,2] [,3] [,4]
[1,]   16   15   14   13
[2,]   12   11   10    9
[3,]    8    7    6    5
[4,]    4    3    2    1
new.mat[order(new.mat[,1]),] # sort rows by 1st col
     [,1] [,2] [,3] [,4]
[1,]    4    3    2    1
[2,]    8    7    6    5
[3,]   12   11   10    9
[4,]   16   15   14   13
new.mat[,c(2,3,4)]
     [,1] [,2] [,3]
[1,]   15   14   13
[2,]   11   10    9
[3,]    7    6    5
[4,]    3    2    1
new.mat[,-c(2,3,4)]
[1] 16 12 8 4
new.mat[,-c(2,3,4),drop=F]
     [,1]
[1,]   16
[2,]   12
[3,]    8
[4,]    4

# OBJECTS in R/S-Plus
Scalar < vector < matrix < data frame < list

# a simple regression example
xx <- 1:10
yy <- 3 + 5*xx + rnorm(10)
xx
 [1] 1 2 3 4 5 6 7 8 9 10
yy
 [1]  8.778154 12.582817 17.445294 22.923522 28.219180 32.282231
 [7] 40.679650 44.390966 49.811534 52.988893
plot(xx,yy)
myfit <- lm(yy ~ xx)
myfit
Call:
lm(formula = yy ~ xx)

Coefficients:
 (Intercept)       xx
    2.658308 5.154894

Degrees of freedom: 10 total; 8 residual
Residual standard error: 1.088049
> names(myfit)
 [1] "coefficients" "residuals"   "fitted.values"
 [4] "effects"      "R"           "rank"
 [7] "assign"       "df.residual" "contrasts"
[10] "terms"        "call"

pmin, pmax # return the position-by-position min/max of two vectors
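A short sketch contrasting the elementwise tools above (&, pmin, pmax) with their single-value counterparts (&&, min), plus two precedence checks. The helper name is.pos and the vectors a and b are made up for illustration; the code is written for R.

```r
vec1 <- c(2, 4, 8)
vec2 <- c(3, 9, 27)

int.div  <- 7 %/% 2                 # integer divide
remaind  <- 7 %% 2                  # modulo
elem.and <- vec1 <= 4 & vec2 < 9    # elementwise: one result per position

# Control AND: the right-hand side is evaluated only when the left side is
# TRUE, so min() is never reached for a non-numeric argument
is.pos <- function(v) is.numeric(v) && (min(v) > 0)
is.pos(vec1)
is.pos("a")

# Precedence: ^ binds tighter than unary minus, which binds tighter than :
neg.sq  <- -2^2     # same as -(2^2)
neg.seq <- -1:3     # same as (-1):3

# pmin/pmax work position by position; min collapses to a single value
a <- c(1, 7, 3)
b <- c(4, 2, 9)
lo <- pmin(a, b)
hi <- pmax(a, b)
overall.min <- min(a, b)
```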

Getting Data into R
* you can enter data directly (a la datalines in SAS)

/* Squid data – from Myers
y = squid weight
x1 = rostral length (in)
x2 = wing length (in)
x3 = rostral to notch length
x4 = notch to wing length
x5 = width
Ref: Myers */
data squid;
input x1 x2 x3 x4 x5 y @@;
datalines;
1.31 1.07 0.44 0.75 0.35 1.95   1.55 1.49 0.53 0.90 0.47 2.90
0.99 0.84 0.34 0.57 0.32 0.72   0.99 0.83 0.34 0.54 0.27 0.81
1.05 0.90 0.36 0.64 0.30 1.09   1.09 0.93 0.42 0.61 0.31 1.22
1.08 0.90 0.40 0.51 0.31 1.02   1.27 1.08 0.44 0.77 0.34 1.93
0.99 0.85 0.36 0.56 0.29 0.64   1.34 1.13 0.45 0.77 0.37 2.08
1.30 1.10 0.45 0.76 0.38 1.98   1.33 1.10 0.48 0.77 0.38 1.90
1.86 1.47 0.60 1.01 0.65 8.56   1.58 1.34 0.52 0.95 0.50 4.49
1.97 1.59 0.67 1.20 0.59 8.49   1.80 1.56 0.66 1.02 0.59 6.17
1.75 1.58 0.63 1.09 0.59 7.54   1.72 1.43 0.64 1.02 0.63 6.36
1.68 1.57 0.72 0.96 0.68 7.63   1.75 1.59 0.68 1.08 0.62 7.78
2.19 1.86 0.75 1.24 0.72 10.15  1.73 1.67 0.64 1.14 0.55 6.88
;

Squid <- scan(what=list(rlength=0,wing.length=0,r.to.notch=0,notch.to.wing=0,
                        width=0,weight=0))
1.31 1.07 0.44 0.75 0.35 1.95
1.55 1.49 0.53 0.90 0.47 2.90
0.99 0.84 0.34 0.57 0.32 0.72
0.99 0.83 0.34 0.54 0.27 0.81
1.05 0.90 0.36 0.64 0.30 1.09
1.09 0.93 0.42 0.61 0.31 1.22
1.08 0.90 0.40 0.51 0.31 1.02
1.27 1.08 0.44 0.77 0.34 1.93
0.99 0.85 0.36 0.56 0.29 0.64
1.34 1.13 0.45 0.77 0.37 2.08
1.30 1.10 0.45 0.76 0.38 1.98
1.33 1.10 0.48 0.77 0.38 1.90
1.86 1.47 0.60 1.01 0.65 8.56
1.58 1.34 0.52 0.95 0.50 4.49
1.97 1.59 0.67 1.20 0.59 8.49
1.80 1.56 0.66 1.02 0.59 6.17
1.75 1.58 0.63 1.09 0.59 7.54
1.72 1.43 0.64 1.02 0.63 6.36
1.68 1.57 0.72 0.96 0.68 7.63
1.75 1.59 0.68 1.08 0.62 7.78
2.19 1.86 0.75 1.24 0.72 10.15
1.73 1.67 0.64 1.14 0.55 6.88

Squid.df <- as.data.frame(Squid) # convert to a data frame
Squid.df # show contents of the data frame
pairs(Squid.df) # scatterplot matrix
lin.reg.fit <- lm(weight~rlength, data=Squid.df) # linear regression
summary(lin.reg.fit)
plot(lin.reg.fit)

* more commonly you want to read data from an external file – R functions for doing this include: read.table, read.csv, read.csv2
test <- read.table("C:/users/baileraj/desktop/cars-csv.csv",header=TRUE,sep=",")
test2 <- read.csv("C:/users/baileraj/desktop/cars-csv.csv")
names(test)
summary(test$hwy.mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.00   26.00   28.00   29.09   31.00   50.00

# confidence interval for mean HWY MPG
attach(test) # attach data frame so that you can reference
             # variable names without needing to specify data set
nobs <- length(hwy.mpg)
nobs
[1] 93
mean(hwy.mpg)+ c(-1,1)*qt(.975,nobs-1)*sqrt(var(hwy.mpg)/nobs)
[1] 27.98797 30.18408

# some plots (when "X" is factor vs. numeric)
plot(hwy.mpg~type,data=test)
plot(hwy.mpg~wt,data=test)
args(scatter.smooth)
function (x, y = NULL, span = 2/3, degree = 1, family = c("symmetric",
    "gaussian"), xlab = NULL, ylab = NULL, ylim = range(y, prediction$y,
    na.rm = TRUE), evaluation = 50, ...)
scatter.smooth(test$wt, test$hwy.mpg)
scatter.smooth(test$wt, test$hwy.mpg, span=1/3)
coplot(hwy.mpg~ wt | type, data=test)

[Figure: coplot of hwy.mpg versus wt, given type (panels: Compact, Large, Midsize, Small, Sporty, Van)]
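The hand-computed interval above is the usual one-sample t interval, mean ± t(.975, n-1) * sqrt(s^2/n). A small sketch checking that formula against t.test(); since the cars file is not available here, the data are simulated, and the seed, sample size, mean and sd are arbitrary stand-ins.

```r
# Simulated stand-in for hwy.mpg (illustrative values only)
set.seed(1)
mpg.sim <- rnorm(93, mean = 29, sd = 5)

nobs <- length(mpg.sim)
by.hand <- mean(mpg.sim) +
  c(-1, 1) * qt(0.975, nobs - 1) * sqrt(var(mpg.sim) / nobs)

# t.test() reports the same interval in its conf.int component
from.t.test <- as.numeric(t.test(mpg.sim, conf.level = 0.95)$conf.int)

by.hand
from.t.test
```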

# reading in the NITROFEN data
b1 <- as.data.frame(scan(what=list(id=0,conc=0,brood1=0)))
1 0 3   2 0 5   3 0 6   4 0 6   5 0 6
6 0 5   7 0 6   8 0 5   9 0 3   10 0 6
11 80 6   12 80 5   13 80 6   14 80 5   15 80 8
16 80 3   17 80 5   18 80 7   19 80 5   20 80 3
21 160 6   22 160 6   23 160 2   24 160 6   25 160 6
26 160 6   27 160 6   28 160 5   29 160 6   30 160 6
31 235 4   32 235 6   33 235 2   34 235 6   35 235 6
36 235 6   37 235 7   38 235 4   39 235 6   40 235 7
41 310 6   42 310 6   43 310 7   44 310 0   45 310 5
46 310 5   47 310 6   48 310 4   49 310 6   50 310 5

b2 <- as.data.frame(scan(what=list(id=0,conc=0,brood2=0)))
1 0 14   2 0 12   3 0 11   4 0 12   5 0 15
6 0 14   7 0 12   8 0 13   9 0 10   10 0 11
11 80 11   12 80 12   13 80 11   14 80 12   15 80 13
16 80 9   17 80 9   18 80 12   19 80 13   20 80 12
21 160 12   22 160 12   23 160 8   24 160 10   25 160 11
26 160 13   27 160 12   28 160 10   29 160 13   30 160 12
31 235 13   32 235 10   33 235 5   34 235 0   35 235 13
36 235 0   37 235 0   38 235 2   39 235 8   40 235 0
41 310 0   42 310 0   43 310 0   44 310 0   45 310 10
46 310 0   47 310 0   48 310 0   49 310 0   50 310 0

b3 <- as.data.frame(scan(what=list(id=0,conc=0,brood3=0)))
1 0 10   2 0 15   3 0 17   4 0 15   5 0 15
6 0 15   7 0 15   8 0 12   9 0 11   10 0 14
11 80 16   12 80 16   13 80 18   14 80 16   15 80 15
16 80 14   17 80 13   18 80 12   19 80 14   20 80 14
21 160 11   22 160 11   23 160 13   24 160 11   25 160 13
26 160 12   27 160 12   28 160 11   29 160 10   30 160 11
31 235 6   32 235 5   33 235 0   34 235 6   35 235 8
36 235 10   37 235 6   38 235 9   39 235 7   40 235 10
41 310 0   42 310 0   43 310 0   44 310 0   45 310 0
46 310 0   47 310 0   48 310 0   49 310 0   50 310 0

Nitro.df <- cbind(b1, brood2=b2$brood2, brood3=b3$brood3,
                  total=b1[,3]+b2[,3]+b3[,3])
names(Nitro.df)
[1] "id" "conc" "brood1" "brood2" "brood3" "total"
attach(Nitro.df)
par(mfrow=c(2,2))
scatter.smooth(conc, brood1)
scatter.smooth(conc, brood2)
scatter.smooth(conc, brood3)
scatter.smooth(conc, total)

[Figure: 2 x 2 panel of scatter.smooth plots of brood1, brood2, brood3 and total versus conc (0 to 300)]
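The cbind() step above relies on the three brood data frames listing the animals in the same row order. When that cannot be guaranteed, merge() on the id column is one alternative; the sketch below uses tiny made-up data frames (b1.demo, b2.demo), not the nitrofen values.

```r
# Small hypothetical stand-ins for b1 and b2; note the scrambled row order in b2.demo
b1.demo <- data.frame(id = 1:4, conc = c(0, 0, 80, 80), brood1 = c(3, 5, 6, 5))
b2.demo <- data.frame(id = c(3, 1, 4, 2), brood2 = c(11, 14, 12, 10))

# merge() matches rows on the key column, so row order does not matter
both <- merge(b1.demo, b2.demo, by = "id")
both$total <- both$brood1 + both$brood2
both
```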

Saving your work
When you leave R, you're prompted about saving your workspace.

You can find out where your work is currently saved:
getwd()
[1] "C:/Users/baileraj/Documents"

You can change the location for storing your work . . .
setwd("C:/Users/baileraj/Desktop")
getwd()
[1] "C:/Users/baileraj/Desktop"
objects()
 [1] "age"            "all.age"        "all.counts"     "all.health"
 [5] "all.status"     "amph.odds"      "avg"            "boot.mpg"
 [9] "Counts"         "disenroll.tree" "health.status"  "lin.reg.fit"
[13] "logSA"          "mpg.boot"       "my.mean"        "No.Counts"
[17] "nobs"           "non.amph.odds"  "Squid"          "Squid.df"
[21] "test"           "test2"          "toy.df"         "x32"
[25] "x50"            "y32"            "y50"

save.image(file="MyWorkspace.RData") # default is to save to .RData

Next time you launch R, you can load these data …
load("MyWorkspace.RData")

You can use this for project management. In one sense, setwd is analogous to a libref in SAS, while save.image is analogous to using a two-level data set name to permanently store a SAS data set.
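save.image() writes the whole workspace. If only a few objects are worth keeping, save() with explicit names is an option. The sketch below writes to a temporary directory so it does not touch your working directory; the file name demo-workspace.RData and the two objects are made up for illustration.

```r
# Hypothetical file name, placed in R's per-session temporary directory
ws.file <- file.path(tempdir(), "demo-workspace.RData")

x50 <- 1:50
my.mean <- function(x) sum(x) / length(x)

# save() stores only the named objects; save.image() stores everything
save(x50, my.mean, file = ws.file)

rm(x50, my.mean)             # mimic starting a fresh session
restored <- load(ws.file)    # load() returns the names of the restored objects
restored
my.mean(x50)
```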

Loading packages
* R is rich with user-contributed software that is bundled together in packages
* install.packages("Rcmdr", dependencies=TRUE) [this package depends on a large number of other packages]

EXAMPLE: Bootstrap-based confidence intervals
# FIRST – install the "boot" package along with any
# dependencies
install.packages("boot", dependencies=TRUE)
# SECOND – load the package for use
library("boot")
# Aside – can get help on the package and functions
# in the package
help(package=boot)
help(boot.ci,package=boot)
# need to construct a function that takes an index as its second
# argument for use in the "boot" function
my.mean <- function(x, ii) sum(x[ii])/length(x)
boot.mpg <- boot(data = hwy.mpg, statistic = my.mean, R = 1999)
boot.mpg

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = hwy.mpg, statistic = my.mean, R = 1999)

Bootstrap Statistics :
    original        bias    std. error
t1* 29.08602 -0.005610332    0.5461516

boot.ci(boot.mpg)
Warning in boot.ci(boot.mpg) :
  bootstrap variances needed for studentized intervals
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1999 bootstrap replicates

CALL :
boot.ci(boot.out = boot.mpg)

Intervals :
Level      Normal              Basic
95%   (28.02, 30.16 )   (27.98, 30.13 )

Level     Percentile            BCa
95%   (28.04, 30.19 )   (28.07, 30.27 )
Calculations and Intervals on Original Scale

APPLY Functions
apply(obj, dim, function)
lapply(list, function)
sapply(list,function)
tapply(object,indices,function)

state.tmp <- as.data.frame(state.x77)
attach(state.tmp)
inc <- cut(Income,breaks=quantile(Income),include.lowest=T)
inc
 [1] 1 4 3 1 4 4 4 3 4 2 4 2 4 2 3 3 1 1 1 4 3 3 3 1 2 2 2 4 2 4 1 4 1 4 3 1 3
[38] 2 3 1 2 1 2 2 1 3 4 1 2 3
attr(, "levels"):
[1] "3098.00+ thru 3992.75" "3992.75+ thru 4519.00" "4519.00+ thru 4813.50"
[4] "4813.50+ thru 6315.00"
attr(, "class"):
[1] "category"

tapply(Murder,inc,mean)
 3098.00+ thru 3992.75 3992.75+ thru 4519.00 4519.00+ thru 4813.50
              9.707692              6.191667              5.658333
 4813.50+ thru 6315.00
              7.730769

unlist(lapply(split(Murder,inc),mean))
 3098.00+ thru 3992.75 3992.75+ thru 4519.00 4519.00+ thru 4813.50
              9.707692              6.191667              5.658333
 4813.50+ thru 6315.00
              7.730769

By functions
by(nitrofen$total,nitrofen$conc,function(x) {sqrt(var(x))})
INDICES:0
         x
x 3.596294
-----------------------------------------------------------
INDICES:80
        x
x 3.27448
-----------------------------------------------------------
INDICES:160
         x
x 2.359378
-----------------------------------------------------------
INDICES:235
         x
x 5.902918
-----------------------------------------------------------
INDICES:310
         x
x 3.711843

lapply(split(nitrofen$total,nitrofen$conc),summary)
$"0":
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 "24.00" "30.25" "32.50" "31.40" "33.75" "36.00"

$"80":
   Min. 1st Qu. Median   Mean 3rd Qu.   Max.
 "26.0"  "29.5" "32.5" "31.5"  "33.0" "36.0"

$"160":
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 "23.00" "27.50" "29.00" "28.30" "29.75" "31.00"

$"235":
   Min. 1st Qu. Median   Mean 3rd Qu.   Max.
 " 7.0"  "13.5" "16.5" "17.2"  "21.0" "27.0"

$"310":
 Min. 1st Qu. Median Mean 3rd Qu. Max.
 " 0"    " 5"   " 6" " 6"    " 6" "15"
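A compact R sketch of the apply family on the same state.x77 data (it ships with R's datasets package). It avoids attach() by naming the data frame explicitly, and checks that tapply() and the split()/sapply() route give the same group means. No output is shown because R labels the cut() groups differently from the S-Plus listing above.

```r
state.tmp <- as.data.frame(state.x77)

# income quartile group for each state
inc <- cut(state.tmp$Income, breaks = quantile(state.tmp$Income),
           include.lowest = TRUE)

via.tapply <- tapply(state.tmp$Murder, inc, mean)
via.sapply <- sapply(split(state.tmp$Murder, inc), mean)  # simplifies to a vector

# apply() over the columns (margin 2) of the numeric matrix
col.means <- apply(state.x77, 2, mean)

via.tapply
via.sapply
```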

WRITING FUNCTIONS
* basic syntax
function(arguments) { commands }
* lazy evaluation = arguments evaluated only as needed when they appear in a function
* can use "fix" to modify existing function or create new function
* conditional execution
if (condition) { expression }
if (condition) { expression 1 } else { expression 2 }
ifelse(test, True option, False option)
* looping/iteration
for (name in values) { commands }
while (condition) { commands }
apply(object, margin, function)
* writing output
cat(object,"\n")
cat("Another plot?")
ans <- readline()
if ((ans=="Y") || (ans=="y")) { expression }
* couple of tricks
1. eval(parse(text=character-with-command))
cmd="objects()"
cmd
[1] "objects()"
eval(parse(text=cmd))
 [1] ".Last.fixed" ".Last.value"   ".Random.seed"   "Temp"
 [5] "ans"         "cmd"           "imat"           "inc"
 [9] "intensity"   "last.dump"     "my.df"          "my.lst"
[13] "myfunc"      "new.mat"       "nitro.linreg"   "nitrofen"
[17] "ord.int"     "par.old"       "pi.est"         "pi.est.fcn"
[21] "pi.est.fcn2" "plot.any.func" "plot.any.func2" "state.tmp"
[25] "tmp"         "trt"           "x"              "x1"
[29] "x2"          "x3"            "x4"             "x5"
[33] "xmat1"       "xmat1.df"      "xmat2"          "xmat3"
[37] "xmat4"       "y"             "y1"             "y2"
2. deparse(substitute(object-name)) - constructs character variable with value "object-name"
3. paste to build commands
xmat1
   C1 C2 C3
R1  1  4  7
R2  2  5  8
R3  3  6  9
dimnames(xmat1) <- list(paste("Row",1:3,sep=""),paste("Col",1:3,sep=""))
xmat1
     Col1 Col2 Col3
Row1    1    4    7
Row2    2    5    8
Row3    3    6    9

example: MC estimate of PI
pi.est.fcn2 <- function(nsims=1400,do.plot=T) {
  xsim <- runif(nsims)
  ysim <- runif(nsims)
  icheck <- as.numeric( ysim <= sqrt(1-xsim^2) )
  points.under <- sum(icheck)/nsims
  if (do.plot==T) {
    plot(xsim,ysim,type="n")
    text(xsim,ysim,icheck)
    xp <- sort(xsim)
    lines(xp,sqrt(1-xp^2))
  }
  piest <- 4*points.under
  se.piest <- 4*sqrt(points.under*(1-points.under)/nsims)
  CL <- piest + c(-1,1)*qnorm(.975)*se.piest
  list(est.pi = piest, se.est.pi = se.piest, conf.int = CL)
}

example: plotting an arbitrary function
plot.any.func <- function(func,x.low,x.high,
                          ltype="l",mtitle="") {
  xp <- seq(from=x.low,to=x.high,length=1000)
  yp <- func(xp)
  plot(xp,yp,type=ltype,
       ylab=deparse(substitute(func)),
       main=mtitle)
}
plot.any.func(cos,-2*pi,2*pi,mtitle="Cosine over -2*pi to 2*pi")

[Figure: line plot titled "Cosine over -2*pi to 2*pi"; x axis xp (-6 to 6), y axis cos (-1.0 to 1.0)]
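The notes on lazy evaluation, if versus ifelse, and deparse(substitute()) can be tried out with a few one-liners. The function names lazy.demo and arg.name are made up for this sketch.

```r
# y is never touched when x > 0, so the call succeeds without it
lazy.demo <- function(x, y) {
  if (x > 0) "positive" else y
}
r1 <- lazy.demo(1)               # y not supplied, and not needed
r2 <- lazy.demo(-1, "fallback")  # y is needed here, so it must be supplied

# if() tests a single condition; ifelse() works element by element
v <- c(-2, 0, 3)
signs <- ifelse(v > 0, "pos", "non-pos")

# deparse(substitute()) recovers the expression the caller typed
arg.name <- function(obj) deparse(substitute(obj))
nm <- arg.name(cos)
```

This is the mechanism plot.any.func uses to label the y axis with the name of the function it was given.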

myfunc <- function(x) { 3 + 5*x + rnorm(length(x)) }
plot.any.func(myfunc,ltype="p",0,10)

[Figure: point plot of myfunc over 0 to 10; x axis xp, y axis myfunc (0 to 50)]

Example: One sample t-based CI
t.based.ci <- function(xxx) {
  mean(xxx) + c(-1,1)*qt(.975,length(xxx)-1)*
    sqrt(var(xxx)/length(xxx))
}
one.sample.ci.sim <- function(Nsims=1400,Nobs=15,mu=0) {
  # generate the data with mean mu so the coverage check below is valid for any mu
  xmat <- matrix(rnorm(Nsims*Nobs,mean=mu),nrow=Nsims,ncol=Nobs)
  xres <- t(apply(xmat,1,t.based.ci))
  check <- (xres[,1]<=mu)&(mu<=xres[,2])
  cover <- sum(check)/Nsims
  cat("Normal population:\n Number of observations = ",Nobs,
      "\n Number of simulated experiments = ",Nsims,
      "\n Coverage Probability = ",cover,"\n\n")
}
for (iNobs in c(15,30,45)) {
  one.sample.ci.sim(Nsims=1400,Nobs=iNobs)
}
Normal population:
 Number of observations = 15
 Number of simulated experiments = 1400
 Coverage Probability = 0.958571428571429

Normal population:
 Number of observations = 30
 Number of simulated experiments = 1400
 Coverage Probability = 0.963571428571429

Normal population:
 Number of observations = 45
 Number of simulated experiments = 1400
 Coverage Probability = 0.946428571428571
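Each estimated coverage above is a binomial proportion based on Nsims experiments, so its Monte Carlo standard error is roughly sqrt(p(1-p)/Nsims). The sketch below computes that for a nominal 0.95 and flags which of the three reported estimates fall within two standard errors; the function name coverage.se is made up, and the two-standard-error band is a rough rule of thumb, not a formal test.

```r
# Monte Carlo standard error of an estimated coverage probability
coverage.se <- function(cover, Nsims) sqrt(cover * (1 - cover) / Nsims)

nominal <- 0.95
se.nominal <- coverage.se(nominal, Nsims = 1400)

# rough band for estimates when the true coverage is 0.95
band <- nominal + c(-2, 2) * se.nominal

# the three estimates reported in the output above
observed <- c(0.958571428571429, 0.963571428571429, 0.946428571428571)
inside <- observed >= band[1] & observed <= band[2]

se.nominal
band
inside
```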

LAB EXERCISES
1. Finding stuff
a) Type search() to get the search list and find("state.x77") to see where this data set resides
b) Type objects() to get a listing of all objects you have created
c) Type objects(where="data") or objects(where=find("state.x77")) to get a list of all data files stored in the same location as state.x77
d) Find other "state" data in same location as state.x77
objects(where=find("state.x77"),pattern="*state*")

2. Play with the "rep" function, vectors and matrices
a) compare
rep(3,1)
rep(2,3)
rep(1:5,2)
rep(1:5, 5:1)
b) create a matrix with rows 1:4, 8:5, 9:12, 13:10
c) create a matrix with columns 1:4, 8:5, 9:12, 13:10

3. Look at some multivariate data – state.x91
a) look at some basic summary statistics
apply(state.x91,2,summary)
b) check out multivariate graphical displays
pairs(state.x91)
pairs(state.x91[,c(1,2,3,6,7,8)]) # what does this do?
faces(state.x91,labels=state.abb)
stars(state.x91,labels=state.abb)

4. Generate a data set with x = rep(c(0,80,160,240), mu.x = exp(log(30) + .008*x - .00015*x*x) and Y~Poisson(mu.x).
a) Plot randomly generated Y (use rpois) vs. x.
b) Generate the linear regression fit of sqrt(y) = b0 + b1 X + b2 X^2 (sqrt transformation is the variance stabilizing transformation for Poisson data).
c) Superimpose this fit on a scatterplot of the data.

5. Modify the plotting function to pass the number of x values you wish to plot. Note that you can also add the ability to pass other non-specified arguments via an ellipsis (...) in the argument list.
plot.any.func2 <- function(func,x.low,x.high,
                           ltype="l",mtitle="",...) {
  xp <- seq(from=x.low,to=x.high,length=100)
  yp <- func(xp)
  plot(xp,yp,type=ltype,
       ylab=deparse(substitute(func)),
       main=mtitle,...)
}
# check out how other parameters are passed in the function
# calls below
plot.any.func2(cos,0,3*pi,ltype="l",lwd=13)
plot.any.func2(cos,0,3*pi,ltype="p",cex=3)
plot.any.func2(cos,0,3*pi,ltype="p",pch=6)

6. Modify the t-based CI simulation function (one.sample.ci.sim) to study samples from uniform and exponential populations.

7. Import "Roberts Data.xls" (an Excel spreadsheet) into a data frame. Examine variables in this data frame. Generate summary statistics for each numeric variable in this data set. Plot X=Treatment..Hours. Y=Liver.SOD

8. Import "Manatee" SAS data set into a data frame. Plot number of manatee deaths versus number of boats registered. Generate a linear regression fit relating number of deaths to number of boats registered. Superimpose this fit on the manatee plot. You may want to look at "lm", "plot", "lines", "abline" and "coef" functions.

Q: How do I delete a bunch of similarly named objects?
Suppose you have a number of objects with "tmp" as part of the name. . . (in R, pattern is a regular expression, so pattern="tmp" is the usual form)
xtmp <- 32
ytmp <- 111
ztmp <- "more junk"
objects(pattern="*tmp*")
rm(list=objects(pattern="*tmp*"))

Q: How do I manipulate multiple graphs?


Need to use “dev.set”, and related functions – see example below

graphsheet()

qqnorm(x <- runif(100))

graphsheet()

qqnorm(y <- rnorm(100))

dev.set(dev.prev())

abline(a=0,b=1)

dev.set(dev.next())

abline(a=0,b=1)

names(dev.list())
dev.set(2)
title("Random UNIFORM")
dev.set(3)
title("Random NORMAL")
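graphsheet() is the S-Plus call for opening a graphics window. A sketch of the same device juggling in R, assuming dev.new() as the replacement; the helper make.device is made up here so that the code also runs in a non-interactive session, where it opens a throwaway pdf device instead of a window.

```r
# Open a device: a window when interactive, otherwise a pdf device with no file
make.device <- function() {
  if (interactive()) dev.new() else pdf(NULL)
  dev.cur()   # number of the device just opened
}

first <- make.device()
qqnorm(x <- runif(100), main = "Random UNIFORM")
second <- make.device()
qqnorm(y <- rnorm(100), main = "Random NORMAL")

dev.set(first)          # make the first device current again
abline(a = 0, b = 1)
dev.set(second)
abline(a = 0, b = 1)

n.open <- length(dev.list())
dev.off(second)
dev.off(first)
```

Storing the value of dev.cur() when each device is opened avoids having to remember which plot is on device 2 and which is on device 3.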

Q: How should I build a production graphic?
Carefully :-) Here is an example
# a little data
age.grp <- 1:7
little <- c(77,70,65,61,46,34,25)
moderate <- c(17,21,25,24,29,28,25)
severe <- c(6,9,10,15,25,38,50)
#
# take a first look at the data
plot(1:7,little,ylim=c(0,80),
     xlab=" ",ylab="Percentage Disabled")

[Figure: first-look scatterplot of little versus age group (1 to 7); y axis "Percentage Disabled" (0 to 80)]

#
# now customize the axes and use lines and text to annotate
#
plot(1:7,little,type="n",xaxt="n",ylim=c(0,80),
     xlab=" ",ylab="Percentage Disabled")
axis(1,at=1:7,labels=c("65-69","70-74","75-79","80-84","85-89","90-94","95+"))
lines(1:7,little,lty=1,col=2)
text(1:7,little,"L",col=2)
lines(1:7,moderate,lty=2)
text(1:7,moderate,"M")
lines(1:7,severe,lty=3,col=3)
text(1:7,severe,"S",col=3)

[Figure: the three disability series drawn as lines labelled L, M and S against the age groups 65-69 to 95+; y axis "Percentage Disabled" (0 to 80)]

#
# the final plot – add title, footnotes, suppress moderate scale
#
plot(1:7,little,type="n",xaxt="n",ylim=c(0,80),
     xlab=" ",ylab="Percentage Disabled")
axis(1,at=1:7,labels=c("65-69","70-74","75-79","80-84","85-89","90-94","95+"))
lines(1:7,little,lty=1,col=2)
text(1:7,little,"L",col=2)
#lines(1:7,moderate,lty=2)
#text(1:7,moderate,"M")
lines(1:7,severe,lty=3,col=3)
text(1:7,severe,"S",col=3)
title("Estimated Distribution of Disability Status\nin Ohio's Older population by age, 1995")
# title(sub="NOTE: Moderate Disability slightly increases with age")
text(4,-11,"Age")
text(0.5,-13,"NOTE: Moderate Disability slightly increases with age",adj=-1,cex=.75)
text(0.5,-16,"Source: Mehdizadeh et al. (2001) Projections of Ohio's Older Disabled Population: 2015 to 2050\nOhio Long-Term Care Research Project Report. Scripps Gerontology Center, Miami University, Oxford, OH 45056",adj=-1,cex=.75)
text(3,70,"Little or No Disability is DECREASING with age",adj=-1,col=2)
#text(3,30,"Moderate Disability slightly increases with age",adj=0)
text(3,5,"Severe Disability is INCREASING with age",adj=-1,col=3)
arrows(4.5,65,5.5,49,col=2)
arrows(4.5,15,5.5,25,col=3)

[Figure: final plot titled "Estimated Distribution of Disability Status in Ohio's Older population by age, 1995", showing the L and S series against the age groups 65-69 to 95+, with the annotations "Little or No Disability is DECREASING with age" and "Severe Disability is INCREASING with age", an "Age" label, and the NOTE and Source footnotes below the axis]
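The footnotes in the final plot are placed with text() at negative y values, which lie outside the plot region; in R, such text is clipped by default unless par(xpd=TRUE) is set. One alternative, sketched here rather than taken from the handout, is to enlarge the bottom margin and write the notes with mtext(). The margin size and line positions are guesses that would need tuning, and pdf(NULL) is used only so the sketch runs without opening a window.

```r
pdf(NULL)   # throwaway device; drop this line to draw on screen

little <- c(77, 70, 65, 61, 46, 34, 25)
severe <- c(6, 9, 10, 15, 25, 38, 50)
age.labels <- c("65-69", "70-74", "75-79", "80-84", "85-89", "90-94", "95+")

old.par <- par(mar = c(7, 4, 4, 2) + 0.1)   # extra bottom margin for footnotes
plot(1:7, little, type = "n", xaxt = "n", ylim = c(0, 80),
     xlab = "", ylab = "Percentage Disabled")
axis(1, at = 1:7, labels = age.labels)
lines(1:7, little, lty = 1, col = 2)
text(1:7, little, "L", col = 2)
lines(1:7, severe, lty = 3, col = 3)
text(1:7, severe, "S", col = 3)

# mtext() writes in the margin: side 1 is the bottom, line counts outward
mtext("Age", side = 1, line = 2.5)
mtext("NOTE: Moderate Disability slightly increases with age",
      side = 1, line = 4.5, adj = 0, cex = 0.75)

bottom.margin <- par("mar")[1]
par(old.par)
dev.off()
```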

STA 502 Homework (Due: 15 Dec 08)
(STA 402 students can do this for extra credit)
Prepare a comparison using SAS and R to conduct the same analysis. Examples include doing a regression analysis and accompanying residual analysis (using SAS GLM vs. R lm with associated graphics), contingency table analysis (using SAS FREQ and R table), generating a complex graphical display, etc. Your solution should be a 2 COLUMN display with the SAS analysis on one side and the R analysis on the other side (e.g. you could do this by inserting a 2 column table in a Word document and placing the code in the SAS or the R "column" of this table).
