Using R R Language Basics Dealing with the workspace Reading Data Data Manipulation
R Basics
Peter Dalgaard
Department of Biostatistics, University of Copenhagen
Mixed Models in R, Copenhagen, January 2006
R Basics Department of Biostatistics University of Copenhagen
Outline
Using R
R Language Basics
Dealing with the workspace
Reading Data
Data Manipulation
Basics of R
- What is R?
- Interacting with R
- Extended user interfaces
- Later: Dealing with R’s workspace
Key Points about R
- Environment built around the programming language R (an Open Source dialect of the S language).
- R is Free Software, and runs on a variety of platforms (I’ll be using Linux. Computer labs run on Windows.)
- Command-line execution based on function calls
- Extensible with user functions
- Workspace containing data and functions
- Graphics devices
R packages
- Collections of R functions, data, and compiled code
- Well-defined format that ensures easy installation and a basic standard of documentation, and enhances portability and reliability.
Interacting with R
- Command line interface (CLI)
- The basic mode of interaction is “read – evaluate – print”
  - User types an expression at the command line,
  - R evaluates it
  - ... and prints the result
- Batch variation: read commands from a file
Extended Interfaces
- Windows, Macintosh GUI: Fairly simple extensions of the CLI, mostly offloading some tasks to a menu interface and adding command recall
- Script editing: The ability to work with multiple lines of R code, save them to a file for later use, etc. A simple script editor is built into the R GUI in recent versions.
- External editor interfaces: TINN-R and R-WinEdt add syntax highlighting. Highly recommended.
- R embedded in a text editor (ESS, Emacs Speaks Statistics). Popular on Unix/Linux systems.
Demo 1
    2+2
    log(10)
    help(log)
    summary(airquality)
    demo(graphics) # pretty pictures...
Basic Vector Types
- R is a vector-based language; data types include
  - Numeric (integer/double) vectors
  - Character (string) vectors
  - Logical vectors
- These types are combined and extended to form more complex objects
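As a small illustration of these types, the following sketch builds one vector of each kind and inspects it with typeof() and class(). The variable names and values are invented for this example.

```r
# Hypothetical example values, not part of the course data
num <- c(1.5, 2, 3.25)        # numeric (double) vector
int <- 1:3                    # integer vector
chr <- c("a", "b", "c")       # character vector
lgl <- num > 1.75             # logical vector from a vectorized comparison

typeof(num)   # "double"
typeof(int)   # "integer"
class(chr)    # "character"
lgl           # FALSE TRUE TRUE
```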
Basic operations
- Standard arithmetic is vectorized: x + y adds each element of x to the corresponding element of y
- c (concatenation)
- seq or from:to (sequences)
- rep, gl (replication)
- sum, mean, range, math functions, ...
Demo 2
    x <- round(rnorm(10,mean=20,sd=5)) # simulate data
    x
    mean(x)
    m <- mean(x)
    m
    x - m # notice recycling
    (x - m)^2
    sum((x - m)^2)
    sqrt(sum((x - m)^2)/9)
    sd(x)
Concatenation
- The c() function joins a number of vectors end to end
- An important special case is where the vectors are just numbers
    > c(7,9,13)
    [1] 7 9 13
- This is used all over the place where you need to pass a short vector to a function call (e.g., ylim=c(0,100) in a plot).
Sequences
- The seq() function generates regular sequences
- The arguments are from, to, by
- If by is 1 (default), you can also use from:to
    > seq(1,9,2)
    [1] 1 3 5 7 9
    > 1:5
    [1] 1 2 3 4 5
Replication
- The rep() function generates a vector by replicating elements from another vector a given number of times
- The arguments are x, times, but there are alternate forms, notably the each argument
    > x <- c(7,9,13)
    > rep(x,2)
    [1] 7 9 13 7 9 13
    > rep(x,each=2)
    [1] 7 7 9 9 13 13
    > rep(x, 1:3)
    [1] 7 9 9 13 13 13
Generating regular designs
- The gl() function “generates levels” for a regular design
- The arguments are
  - Number of levels
  - Block size
  - Total length
- Notice that this generates a factor (more about this later)
    > gl(2,3,12) # 2 levels, blocks of 3, total length 12
    [1] 1 1 1 2 2 2 1 1 1 2 2 2
    Levels: 1 2
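A minimal sketch of how two gl() calls might be combined into a full two-way layout. The factor names and level labels are invented for illustration; the labels argument of gl() replaces the default level names 1, 2, ...

```r
# 2 treatments x 3 time points, the whole layout repeated twice (12 rows)
treat <- gl(2, 3, 12, labels = c("placebo", "active"))  # blocks of 3
time  <- gl(3, 1, 12, labels = c("t0", "t1", "t2"))     # blocks of 1

design <- data.frame(treat, time)
design
table(treat, time)   # every combination occurs the same number of times
```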
Classed Objects
- In R, objects can have classes
- These are used as the basis for function dispatch
- I.e., the same (generic) function can have different methods for different classes
- Print methods are a prototypical example
- There are two object systems, based (roughly) on S version 3 and version 4. I will not go into details.
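To make the dispatch idea concrete, here is a sketch in the S version 3 style. The class name "temperature" and its print method are made up for this example; only structure(), print(), cat(), class() and unclass() are standard R.

```r
# A made-up class "temperature" with its own print method (S3 style)
temp <- structure(list(value = 21.5, unit = "C"), class = "temperature")

print.temperature <- function(x, ...) {
  cat("Temperature:", x$value, x$unit, "\n")
  invisible(x)   # print methods conventionally return their argument
}

temp            # autoprinting dispatches to print.temperature
print(temp)     # the same method, called explicitly
class(temp)     # "temperature"
unclass(temp)   # the underlying list, printed by the default method
```

The generic print() looks at the class of its argument and calls the matching method, which is why the same call prints a data frame, a factor, or this object differently.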
Factors
- Factors are used to describe groupings (the term originates from factorial designs)
- Basically, these are just integer codes plus a set of names for the levels
- They have class "factor", making them (a) print nicely and (b) maintain consistency
- A factor can also be ordered (class "ordered"), signifying that there is a natural sort order on the levels
- In model specifications, factors play a fundamental role by indicating that a variable should be treated as a classification rather than as a quantitative variable (similar to a CLASS statement in SAS)
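The points above can be checked on a small factor. This is a sketch with invented data (pain scores for five patients); it shows the integer codes behind the level names and what declaring an order adds.

```r
# Invented data: pain scores for five patients
pain <- factor(c("mild", "severe", "none", "mild", "none"),
               levels = c("none", "mild", "severe"))
pain
levels(pain)        # the level names
as.integer(pain)    # the underlying integer codes: 2 3 1 2 1
class(pain)         # "factor"

# The same grouping with a declared order on the levels
pain_ord <- factor(pain, levels = levels(pain), ordered = TRUE)
class(pain_ord)       # "ordered" "factor"
pain_ord < "severe"   # comparisons are meaningful for ordered factors
```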
Creating factors
- Factors can be created during read (but not always correctly)
- The factor function is used when, e.g., groups have been read as numeric codes
    > sexnr <- c(0,0,1,1,0,1)
    > (sex <- factor(sexnr, levels=c(1,0),
    +                labels=c("male", "female")))
    [1] female female male   male   female male
    Levels: male female
- Notice the slightly confusing use of the levels and labels arguments.
  - levels are the value codes on input
  - labels are the value codes on output (and become the levels of the resulting factor)
Indexing
- R has several useful indexing mechanisms:
  - a[5]       single element
  - a[5:7]     several elements
  - a[-6]      all except the 6th
  - a[b>200]   logical index
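The four forms can be tried on two short vectors. The values of a and b below are made up for this sketch; b must have the same length as a for the logical index to line up.

```r
# Made-up vectors of equal length
a <- c(10, 20, 30, 40, 50, 60, 70)
b <- c(150, 250, 100, 300, 180, 220, 90)

a[5]         # single element: 50
a[5:7]       # several elements: 50 60 70
a[-6]        # all except the 6th: 10 20 30 40 50 70
a[b > 200]   # elements of a where b exceeds 200: 20 40 60
```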
Lists
- A vector where the elements can have different types
- Functions often return lists
- lst <- list(A=rnorm(5), B="hello")
- Special indexing:
  - lst$A
  - lst[[1]] first element
  - (lst[1] list containing the first element)
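The difference between double and single brackets is easy to miss, so here is a short sketch. It uses fixed numbers in place of rnorm(5) so that the result is reproducible; everything else follows the slide.

```r
lst <- list(A = c(1.2, 3.4, 5.6), B = "hello")  # fixed numbers instead of rnorm(5)

lst$A          # the numeric vector
lst[[1]]       # the same vector, extracted by position
lst[["B"]]     # extraction by name: "hello"
lst[1]         # a list of length 1 that still contains A

class(lst[[1]])   # "numeric"
class(lst[1])     # "list"
```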
Matrices/Tables/Arrays
- Used in matrix calculus and as input to, e.g., chisq.test(). Results of tabulation.
- Matrices: Generate with matrix
- Indexing methods are like [i,j], [i,], [,j]
- Leaving out the row index gives all rows, etc.
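A brief sketch of matrix construction and indexing, followed by a 2 x 2 table passed to chisq.test(). The counts and the dimension names are invented for illustration.

```r
m <- matrix(1:6, nrow = 2)   # filled column by column: 2 rows, 3 columns
m
m[2, 3]    # single element: 6
m[1, ]     # first row: 1 3 5
m[, 2]     # second column: 3 4
dim(m)     # 2 3

# A 2 x 2 table of invented counts as input to chisq.test()
tab <- matrix(c(12, 5, 7, 9), nrow = 2,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))
tab
chisq.test(tab)$p.value
```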
Data frames
- Like a data set in other packages
- Technically: Lists of vectors/factors of the same length
- Indexed like matrices (Beware, though: Data frames are not matrices) or as lists
- Generate from a read operation or with data.frame
- Many sample data frames are available using data()
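A small sketch of a data frame built with data.frame() and indexed both ways. The variables and values are invented for this example.

```r
# Invented example data
d <- data.frame(id = 1:4,
                sex = factor(c("F", "M", "M", "F")),
                height = c(165, 180, 175, 170))

d[2, ]          # matrix-style: second row
d[, "height"]   # matrix-style: one column as a vector
d$height        # list-style access to the same column
d[["sex"]]      # list-style, by name

is.list(d)      # TRUE
is.matrix(d)    # FALSE: columns may have different types
```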
Demo 3
    data(airquality)
    airquality[1:10,]
    airquality$Month
    airquality[airquality$Month==5,]
    oz <- airquality[airquality$Month==5,]$Ozone
    mean(oz)
    mean(oz, na.rm=TRUE)
The workspace
- The global environment contains R objects created on the command line.
- There is an additional search path of loaded packages and attached data frames.
- When you request an object by name, R looks first in the global environment, and if it doesn’t find it there, it continues along the search path.
- The search path is maintained by library(), attach(), and detach()
- Notice that objects in the global environment may mask objects in packages and attached data frames
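Masking can be seen directly with a sketch like the following, which is meant to be typed at the command line. It deliberately creates an object called pi, the same name as the constant in package base, and then removes it again.

```r
search()       # the search path; ".GlobalEnv" comes first
pi <- 3        # a new object with the same name as base::pi
pi             # this copy masks the one in package base
find("pi")     # lists the places on the search path that hold a "pi"
base::pi       # the package version is still reachable explicitly
rm(pi)         # remove the masking copy
pi             # base::pi is visible again
```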
Demo 4
    attach(airquality)
    mean(Ozone, na.rm=TRUE)
    tapply(Ozone, Month, mean, na.rm=TRUE)
    detach()
    search()
    library(ISwR)
    data(intake) # From ISwR
    ls()
    attach(intake)
    search()
    ls("intake") # show variables in data frame
    post - pre
    rm(intake) # remove data frame
    detach() # remove from search path
A Common Mistake
    attach(mydata)
    sex <- factor(sex)
    tapply(height, sex, mean)
    detach()
    attach(subset(mydata, age > 25))
    sex <- factor(sex)
    tapply(height, sex, mean)
You get an error saying that height and sex are of different lengths. What went wrong?
The second time around, sex was found in the global environment before the attached data frame, so the full-length sex from the first run was paired with height from the subset.
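One way to sidestep the problem is to avoid attach() and global assignments altogether. The sketch below uses with(), which evaluates an expression with the data frame's columns looked up first; the data frame is a made-up stand-in for mydata, and this is one possible approach rather than the only fix.

```r
# Made-up stand-in for mydata
mydata <- data.frame(sex = c(1, 2, 1, 2, 1, 2),
                     age = c(20, 22, 30, 35, 40, 28),
                     height = c(170, 182, 165, 178, 168, 185))

# All subjects: columns are found in the data frame first,
# and nothing is assigned in the global environment
with(mydata, tapply(height, factor(sex), mean))

# Subjects over 25: no leftover global 'sex' can get in the way
older <- subset(mydata, age > 25)
with(older, tapply(height, factor(sex), mean))
```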
Getting Organized
Several possibilities:
- Save/restore entire workspace (objects only)
- Save selected objects and load them
- source() script files
- Batch processing (R CMD BATCH file.R)
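A sketch of the save/load and source() routes. The object names and file names are hypothetical, and the files are written to R's temporary directory so that nothing is left behind in the working directory.

```r
# Hypothetical objects and file names, written to a temporary directory
x <- c(1, 2, 3)
f <- file.path(tempdir(), "x.RData")

save(x, file = f)      # save selected objects
rm(x)
load(f)                # restores x under its original name
x

# save.image() saves the entire workspace in the same format

# A script file can be re-run with source()
script <- file.path(tempdir(), "analysis.R")
writeLines("y <- mean(c(2, 4, 6))", script)
source(script)
y                      # 4
```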
Reading Data, Overview
- Simple data vectors can be read using scan()
- Data frames can be read from most reasonably structured text file formats (space-separated columns, tab- and comma-delimited files) using read.table() or read.delim().
- The foreign package can read files from Stata, SAS export libraries, SPSS, Epi-Info, Minitab, and some S-PLUS versions.
- For spreadsheets and databases, the quick and easy way is to export to a delimited file, but you can work via ODBC connections and database access packages
The Simplest Way to Read Data
- This is what you’d normally want to do:
  - Have data in a plain text file
  - Columns separated by whitespace
  - Missing values coded as the string "NA"
  - Preferably have a row of variable names at the top
- Use d <- read.table("myfile", header=TRUE)
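A self-contained sketch of this recipe. Since no real data file is assumed here, the example first writes a few invented lines to a temporary file and then reads them back.

```r
# Write a small invented data set to a temporary file
myfile <- tempfile(fileext = ".txt")
writeLines(c("id weight height",
             "1  70.5   172",
             "2  NA     165",
             "3  82.0   180"), myfile)

d <- read.table(myfile, header = TRUE)
d
str(d)                         # column types are detected automatically
mean(d$weight, na.rm = TRUE)   # the "NA" entry was read as missing
```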
Demo 5
    dir <- system.file("data", package="ISwR")
    fname <- file.path(dir, "thuesen.txt")
    fname
    file.show(fname)
    read.table(fname, header=TRUE)
(Notice the use of portable constructs to find the data directory inside a package and the construction of the full pathname.)
Options and Details
- read.table has quite a few options and details
  - Different codings of missing values (na.strings)
  - Different decimal separators (dec argument)
  - Text strings can be quoted if they contain embedded blanks
  - You may skip lines, read a limited number of lines, and more. Please consult the manual page for details.
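A sketch that exercises several of these options at once. The input lines are invented: a junk first line, decimal commas, "." for missing values, and a quoted string with an embedded blank. The text argument of read.table() is used here only so that the example needs no file.

```r
# Invented input: one line to skip, decimal commas, "." as missing value
raw <- c('exported from the lab database',
         'id weight group',
         '1 70,5 "group A"',
         '2 . "group B"',
         '3 82,0 "group A"')

d <- read.table(text = raw, header = TRUE, skip = 1,
                dec = ",", na.strings = ".")
d
d$weight            # numeric, with a missing value in position 2

# nrows limits the number of data lines that are read
d2 <- read.table(text = raw, header = TRUE, skip = 1,
                 dec = ",", na.strings = ".", nrows = 2)
nrow(d2)
```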
Data Manipulation Functions
- Single-column modifications
- Modifying and subsetting data frames
Constructors
- R deals with many kinds of objects besides data sets
- Need to have ways of constructing them from the command line
- We have seen the c and list functions
- Extracting and setting names with names(x)
- For matrices and arrays, use the (surprise) matrix and array functions. data.frame for data frames.
- It is also fairly common to construct a matrix from its columns using cbind
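The constructors listed above, sketched on small invented values:

```r
x <- c(7, 9, 13)
names(x) <- c("a", "b", "c")   # setting names
names(x)                       # extracting them again
x["b"]                         # names can be used for indexing

m   <- matrix(1:6, nrow = 2, ncol = 3)         # 2 x 3 matrix
arr <- array(1:24, dim = c(2, 3, 4))           # 2 x 3 x 4 array
cb  <- cbind(low = c(1, 2), high = c(10, 20))  # matrix built from its columns
cb
df  <- data.frame(low = c(1, 2), high = c(10, 20))
df
```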
The cut Function
- The cut function converts a numerical variable into groups according to a set of break points
- Notice that the number of breaks is one more than the number of intervals
- Notice also that the intervals are left-open, right-closed by default (right=FALSE changes that)
- ... and that the lowest endpoint is not included by default (set include.lowest=TRUE if it bothers you)
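The endpoint rules are easiest to see on values that sit exactly on the break points. In this sketch the ages and breaks are invented.

```r
age <- c(0, 5, 17, 18, 40, 65)   # invented ages
breaks <- c(0, 18, 65)           # 3 break points give 2 intervals

cut(age, breaks)                         # (0,18] and (18,65]; age 0 becomes NA
cut(age, breaks, include.lowest = TRUE)  # [0,18] and (18,65]; age 0 is kept
cut(age, breaks, right = FALSE)          # [0,18) and [18,65); now age 65 becomes NA
table(cut(age, breaks, include.lowest = TRUE))
```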
Modifying and Subsetting Data Frames
- The syntax for indexing data frames gets awkward:
    airquality[airquality$Month == 5 &
               airquality$Ozone > 50,]
- The subset function allows you to say
    subset(airquality, Month == 5 & Ozone > 50)
  I.e., it evaluates the second argument within the data frame.
- The transform function is similar. It allows you to define new variables or modify old ones using code like
    juulnew <- transform(juul,
        sex=factor(sex, labels=c("M","F")),
        tanner=factor(tanner))
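A sketch comparing the two subsetting styles on the built-in airquality data, followed by a transform() call. One difference worth knowing: with bracket indexing, rows where the condition is NA come back as all-NA rows, whereas subset() drops them. The new column name TempC is invented for this example.

```r
# Bracket indexing keeps rows where the condition is NA (as all-NA rows)
a <- airquality[airquality$Month == 5 & airquality$Ozone > 50, ]
# subset() drops rows where the condition is NA
b <- subset(airquality, Month == 5 & Ozone > 50)
nrow(a)
nrow(b)

# transform(): modify an existing column and add a new one
aq <- transform(airquality,
                Month = factor(Month),
                TempC = (Temp - 32) / 1.8)   # Temp is in degrees Fahrenheit
head(aq)
```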