BIO5312: R Session 1
An Introduction to R and Descriptive Statistics
Yujin Chung
August 30th, 2016
Fall, 2016
Yujin Chung R Session 1 Fall, 2016 1/24
Introduction to R
R software
R is both open source and open development.
You can look at the source code and you can propose changes
You can write R functions and publish them.
R is available for many platforms:
  • Unix of many flavors, including Linux, Solaris, FreeBSD, AIX
  • Windows 95 and later
  • Mac OS X
Binaries and source code are available from www.r-project.org
The R Console
Basic interaction with R is by typing in the console, a.k.a. terminal or command line
You type in commands, R gives back answers (or errors)
RStudio
RStudio allows the user to run R in a more user-friendly environment.
It is open source and free to download at http://www.rstudio.com/
R console
R script/editor
Environment/Workspace: all the active objects
History: a list of commands used so far
Files: shows all the files and folders in your default working directory
  • Changing the working directory: More → Set As Working Directory
Plots, Packages, Help
Quick start
• Mathematical operators/functions
> log(64) # natural logarithm
[1] 4.158883
> sqrt(2) # square root
[1] 1.414214
Q) What are the R outputs of the following?
7+5, 7/5, 7*5, 7-2, 7^2, 7%%5, 7%/%5
• Comparisons are also binary operators: they take two objects, like numbers, and give a Boolean
> 7 == 5 # 7 is equal to 5
[1] FALSE
> 7 > 5 # 7 is larger than 5
[1] TRUE
> 7 != 5 # 7 is not equal to 5
[1] TRUE
Quick start II
• Boolean operators: & (and), | (or), ! (not)
Q) What are the R outputs of the following?
(5>7) & (6*7 == 42)
(5>7) | (6*7 == 42)
!(5>7) | (6*7 == 42)
• R help
> help(log) # or
> ?log
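Beyond help() and ?, a couple of other base functions can help locate a function when its exact name is not known. The following is a minimal sketch; the variable name read_funs is only illustrative, and the list printed depends on which packages are loaded.

```r
# Ways to look things up from the console (a sketch; output varies by installation)
read_funs <- apropos("^read\\.")  # names of visible objects starting with "read."
print(read_funs)
print(args(log))                  # shows the argument list of log(): x and base
# help(log) or ?log opens the full help page; example(log) runs its examples
```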
Operators
Arithmetic operators
Operator     Description
+            addition
-            subtraction
*            multiplication
/            division
^ or **      exponentiation
x %% y       remainder
x %/% y      quotient

Logical operators
Operator     Description
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           exactly equal to
!=           not equal to
!x           not x
x | y        x OR y
x & y        x AND y
isTRUE(x)    test if x is TRUE
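A short sketch of the less familiar operators in the tables above may help. The variable names (rem, quo, a, b) are chosen only for illustration.

```r
rem <- 7 %% 5     # remainder of 7 divided by 5
quo <- 7 %/% 5    # integer quotient of 7 divided by 5
5 * quo + rem     # quotient and remainder reconstruct the original number
2^3 == 2**3       # both spellings of exponentiation agree

a <- 7 > 5        # TRUE
b <- 7 == 5       # FALSE
a & b             # AND
a | b             # OR
!b                # NOT
isTRUE(a)
isTRUE(c(TRUE, TRUE))  # FALSE: isTRUE() requires a single TRUE value
```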
Data type
We can give names to data objects; these give us variables! Variables are created with the assignment operator: = or <- (arrow)
Numeric: numbers, either floating point or integer
> x=5 # or x <- 5
Character: a character string
> x = "I like chocolate ice cream"
Logical: TRUE or FALSE
> x = (1 > 2)
Built-in variables, e.g. TRUE (or T) and FALSE (or F)
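The type of a variable can be checked with class(). The sketch below also uses the L suffix to create an integer, which is not shown on the slide.

```r
x <- 5
class(x)        # "numeric" (stored as floating point)
n <- 5L         # the L suffix requests an integer
class(n)        # "integer"
s <- "I like chocolate ice cream"
class(s)        # "character"
b <- (1 > 2)
class(b)        # "logical"
is.numeric(n)   # TRUE: integers also count as numeric
```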
Data structures
Data structures group data values into one object. Types include:
vector
data frame
list
matrix
factors
tables
Some R packages have their own data structure
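As a rough illustration, each of the structures listed above can be created with a base R function. All object names and values here are hypothetical.

```r
v   <- c(1.5, 2.0, 3.5)                      # vector
m   <- matrix(1:6, nrow = 2)                 # 2 x 3 matrix, filled column by column
f   <- factor(c("low", "high", "low"))       # factor: categorical values with levels
tab <- table(f)                              # table: counts per level
lst <- list(id = 1, name = "a", scores = v)  # list: components may differ in type
df  <- data.frame(id = 1:3, score = v)       # data frame: columns of equal length
str(lst)                                     # compact display of the list's structure
```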
Data structures: vectors
Vectors: a sequence of values that are all numeric, character or logical.
The function c() returns a vector containing all its arguments in order.
numeric vector
> x = c(1,2,3)
> x
[1] 1 2 3
> length(x) # the length of x
[1] 3
character vector
> x=c("a","b")
> x
[1] "a" "b"
> length(x)
[1] 2
Data structures: vectors II
Sequence generators
> x = seq(from=1, to=3, by=1) # or seq(1,3,1)
> x
[1] 1 2 3
> x = 1:3 # same as seq(1,3,1)
Extracting sub-vectors
> x[2] # return the 2nd element
[1] 2
> x[2:3] # extracting subset from the 2nd to 3rd elements
[1] 2 3
> x[c(2,3)] # same as x[2:3]
> x[-2] # drop the 2nd element
[1] 1 3
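A few more variations on sequence generation and indexing, as a sketch with made-up values. The length.out argument of seq() is not used on the slide.

```r
s1 <- seq(1, 10, by = 3)         # steps of 3 starting at 1
s2 <- seq(0, 1, length.out = 5)  # five equally spaced values from 0 to 1
x  <- c(10, 20, 30, 40)
x[c(1, 4)]                       # first and fourth elements
x[-c(1, 2)]                      # everything except the first two elements
```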
Data structures: vectors III
Element-wise arithmetic
> x = 1:5
> x+1
[1] 2 3 4 5 6
> x <= 3
[1] TRUE TRUE TRUE FALSE FALSE
Pairwise arithmetic
> x = 1:5
> y = c(-1, 0, 3:5)
> x+y
[1] 0 2 6 8 10
> x == y
[1] FALSE FALSE TRUE TRUE TRUE
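The logical vectors produced by element-wise comparisons can be used to subset and to count. A minimal sketch using the same x and y as above:

```r
x <- 1:5
y <- c(-1, 0, 3:5)
x[x <= 3]       # keeps the elements where the condition is TRUE
sum(x == y)     # TRUE counts as 1, so this is the number of matching positions
which(x == y)   # positions of the matches
```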
Data structure: data frames
Data frames: a data set that can be represented as a set of observations (rows) on several variables (columns).
Example: Fisher's or Anderson's iris data set (built-in)
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
## Basic information of data frame
> names(iris) # variable names
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> dim(iris) # the numbers of rows and columns
[1] 150 5
> nrow(iris) # the number of rows
[1] 150
> ncol(iris) # the number of columns
[1] 5
Data structure: data frames II
Extracting subset
> iris$Sepal.Length # extracting variable (1st column)
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
> iris[,1] # extracting the 1st column
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
> iris[2,] # extracting the 2nd row
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2 4.9 3 1.4 0.2 setosa
> iris[1,1] # the element of the 1st column & the 1st row
[1] 5.1
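Rows can also be selected by a logical condition and columns by name. The following sketch uses the built-in iris data; the object names setosa and long are only illustrative.

```r
# rows where Species is "setosa", keeping two columns by name
setosa <- iris[iris$Species == "setosa", c("Sepal.Length", "Species")]
dim(setosa)
head(setosa, 3)

# all columns, only rows whose sepal length exceeds 7
long <- iris[iris$Sepal.Length > 7, ]
nrow(long)
```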
Data structure: data frames III
Each column, or part of a column, is a vector (a single row such as iris[2,] is returned as a one-row data frame)
> sepal = iris$Sepal.Length # extracting variable (1st column)
> length(sepal)
[1] 150
> temp = iris[1:10,2]
> length(temp)
[1] 10
Writing data
Writing a data set to a text or CSV file
write.table(x,file,quote,sep,row.names,col.names,...)
x: data object to save
file: file name
quote: if TRUE, character elements are surrounded by double quotes. If FALSE, nothing is quoted
sep: the field separator string. Values within each row are separated by this string
row.names: a logical value indicating whether the row names of x are to be written along with x
col.names: a logical value indicating whether the column names of x are to be written along with x
> write.table(iris, file = "iris.txt", quote=F, sep = " ",
row.names = F, col.names=T)
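To see the effect of the sep and quote arguments, the sketch below writes a comma-separated file and prints its first two lines. It writes to a temporary file via tempfile() rather than the working directory; that choice and the object names are assumptions made for illustration.

```r
out <- tempfile(fileext = ".csv")   # temporary file path, removed when R exits
write.table(iris, file = out, quote = FALSE, sep = ",",
            row.names = FALSE, col.names = TRUE)
first_lines <- readLines(out, n = 2)  # header line and first data row
print(first_lines)
```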
Reading data
Reading a data set from a text or CSV file
read.table(file, header, sep, ...)
file: data file to read
header: a logical value indicating whether the file contains the names of the variables as its first line
sep: the field separator string. Values within each row are separated by this string
> iris2 = read.table("iris.txt", header=T, sep=" ")
> lead = read.table("LEAD.DAT.txt",header=T)
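A round trip is a simple way to check that a file was written and read as intended. This sketch again uses a temporary file (an assumption, to avoid touching the working directory). Whether Species comes back as a character vector or a factor depends on the R version.

```r
tmp <- tempfile(fileext = ".txt")
write.table(iris, file = tmp, quote = FALSE, sep = " ",
            row.names = FALSE, col.names = TRUE)
iris2 <- read.table(tmp, header = TRUE, sep = " ")  # first line holds variable names
dim(iris2)   # should match dim(iris)
str(iris2)   # column types as read from the file
```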
Functions
Built-in functions: write.table, read.table, read.csv, etc.
> names(lead) ## variable names of "lead"
[1] "id" "area" "ageyrs" "sex" "iqv_inf" "iqv_comp"
[7] "iqv_ar" "iqv_ds" "iqv_raw" "iqp_pc" "iqp_bd" "iqp_oa"
[13] "iqp_cod" "iqp_raw" "hh_index" "iqv" "iqp" "iqf"
[19] "iq_type" "lead_grp" "Group" "ld72" "ld73" "fst2yrs"
[25] "totyrs" "pica" "colic" "clumsi" "irrit" "convul"
[31] "X_2plat_r" "X_2plar_l" "visrea_r" "visrea_l" "audrea_r" "audrea_l"
[37] "fwt_r" "fwt_l" "hyperact" "maxfwt"
> dim(lead)
[1] 124 40
min(), max() and range()
min(x): the minimum of the argument x
max(x): the maximum of the argument x
range(x): the minimum and maximum of the argument x
> fwt = lead$maxfwt # extracting "maxfwt" and creating a new variable
> fwt
[1] 72 61 49 48 51 49 50 58 50 51 59 65 57 53 74 50 84 46 52 64 59 55 99 46 52
[26] 63 52 42 57 23 65 38 59 26 53 50 56 49 76 68 60 46 57 45 46 64 40 62 13 79
> min(fwt)
[1] 13
> max(fwt)
[1] 99
> range(fwt)
[1] 13 99
> diff(range(fwt)) # the "range" of fwt
[1] 86
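Because the LEAD data file may not be at hand, here is a small self-contained sketch using only the first five maxfwt values printed above. which.min() and which.max() are additions not shown on the slide.

```r
v <- c(72, 61, 49, 48, 51)  # first five maxfwt values from the printout above
min(v)
max(v)
r <- range(v)               # c(minimum, maximum)
diff(r)                     # maximum minus minimum
which.min(v)                # position of the smallest value
which.max(v)                # position of the largest value
```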
mean()
mean(x): the arithmetic mean of the argument x
sum(x): the sum of the argument x
> mean(fwt) # arithmetic mean of fwt
[1] 61.44355
> sum(fwt)/length(fwt)
[1] 61.44355
> colMeans(lead) # the mean of each column
id area ageyrs sex iqv_inf iqv_comp iqv_ar
240.233871 1.717742 8.935000 1.387097 6.766129 7.532258 8.306452
See also: colMeans(), rowMeans(), colSums(), rowSums()
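colMeans() and its relatives require numeric columns. As a sketch on the built-in iris data, the non-numeric Species column is dropped first. The object names are illustrative.

```r
num <- iris[, 1:4]                   # keep only the four numeric columns
cm  <- colMeans(num)                 # mean of each column
cm
by_hand <- colSums(num) / nrow(num)  # same quantity from column sums
head(rowMeans(num), 3)               # mean of each row, first three shown
```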
median() and quantile()
median(x): the median of the argument x
quantile(x): returns the minimum, Q1, Q2 (median), Q3 and maximum of x
IQR(x): the IQR of x
> median(fwt) # the median of fwt
[1] 56
> quantile(fwt)
0% 25% 50% 75% 100%
13 48 56 72 99
> IQR(fwt)
[1] 24
> quantile(fwt,probs=.25) # Q1, the 25th percentile
25%
48
> quantile(fwt,probs=.75) - quantile(fwt,probs=.25) # IQR
75%
24
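The same functions on a small vector whose quartiles are easy to verify by eye. The data are hypothetical, and quantile() is used with its default method.

```r
v <- c(9, 1, 5, 3, 7, 2, 8, 4, 6)   # the integers 1 to 9, unsorted
median(v)                           # middle value after sorting
q <- quantile(v, probs = c(0.25, 0.75))
q                                   # Q1 and Q3
IQR(v)                              # Q3 minus Q1
```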
var() and sd()
var(x): the variance of the argument x
sd(x): the standard deviation of the argument x
> var(fwt)
[1] 490.4114
> sd(fwt)
[1] 22.14523
> n = length(fwt)
> n
[1] 124
> sum( (fwt - mean(fwt))^2 )/ (n-1) # variance
[1] 490.4114
> sqrt( sum( (fwt - mean(fwt))^2 )/(n-1) ) # standard deviation
[1] 22.14523
summary()
summary(x): the minimum, Q1, median, mean, Q3 and maximum of the argument x
> summary(fwt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 48.00 56.00 61.44 72.00 99.00
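When a numeric vector contains missing values, summary() computes the statistics from the non-missing values and reports the number of NAs separately. A minimal sketch with made-up data:

```r
v <- c(1, 2, 3, 4, NA)   # hypothetical data with one missing value
s <- summary(v)
s                        # six statistics plus a count of NA's
```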
Writing and calling functions
The structure of a function
<function name> = function(arg1, arg2, ... ){
statements
return(object)
}
We write another summary function, called mySummary, that returns the mean and standard deviation of an argument variable after removing missing values.
> mySummary = function(dat){ # define a function
# na.rm = TRUE removes missing values (NA) before computing
res = c(mean(dat, na.rm = TRUE), sd(dat, na.rm = TRUE))
return(res)
}
> mySummary(fwt) # calling a function
[1] 61.44355 22.14523
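To see why removing missing values matters, the sketch below defines a variant with the hypothetical name mySummaryNA. It passes na.rm = TRUE and labels its output. Without na.rm = TRUE, a single NA makes both mean() and sd() return NA.

```r
mySummaryNA <- function(dat) {
  # na.rm = TRUE drops missing values (NA) before computing each statistic
  res <- c(mean = mean(dat, na.rm = TRUE), sd = sd(dat, na.rm = TRUE))
  return(res)
}

v <- c(1, 2, 3, NA)   # hypothetical data with one missing value
mean(v)               # NA, because the missing value is not removed
mySummaryNA(v)        # statistics computed from 1, 2, 3 only
```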
More intro
Some R resources
  • The official intro, “An Introduction to R”, available online at http://cran.r-project.org/doc/manuals/R-intro.pdf
  • Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design
  • Phil Spector, Data Manipulation with R
  • Paul Teetor, The R Cookbook