
BIO5312: R Session 1
An Introduction to R and Descriptive Statistics

Yujin Chung

August 30th, 2016

Fall, 2016

Yujin Chung R Session 1 Fall, 2016 1/24

Introduction to R

R software

R is both open source and open development.

You can look at the source code and you can propose changes

You can write R functions and publish them.

R is available for many platforms:
  • Unix of many flavors, including Linux, Solaris, FreeBSD, AIX
  • Windows 95 and later
  • Mac OS X

Binaries and source code are available from www.r-project.org

The R Console

Basic interaction with R is by typing in the console, a.k.a. the terminal or command line

You type in commands, R gives back answers (or errors)


RStudio

RStudio allows the user to run R in a more user-friendly environment. It is open-source (i.e. free) and available at http://www.rstudio.com/

R console

R script/editor

Environment/Workspace: all the active objects

History: a list of commands used so far

Files: shows all the files and folders in your default working directory

  • Changing the working directory: More → Set As Working Directory

Plots, Packages, Help


Quick start

• Mathematical operators/functions

> log(64) # natural logarithm

[1] 4.158883

> sqrt(2) # square root

[1] 1.414214

Q) What does R return for each of the following?

7+5, 7/5, 7*5, 7-2, 7^2, 7%%5, 7%/%5

• Comparisons are also binary operators: they take two objects, like numbers, and give a Boolean

> 7 == 5 # is 7 equal to 5?

[1] FALSE

> 7 > 5 # is 7 larger than 5?

[1] TRUE

> 7 != 5 # is 7 not equal to 5?

[1] TRUE


Quick start II

• Boolean operators: & (and), | (or), ! (not)

Q) What does R return for each of the following?

(5>7) & (6*7 == 42)

(5>7) | (6*7 == 42)

!(5>7) | (6*7 == 42)

• R help

> help(log) # or

> ?log


Operators

Arithmetic operators

Operator     Description
+            addition
-            subtraction
*            multiplication
/            division
^ or **      exponentiation
x %% y       remainder
x %/% y      quotient

Logical operators

Operator     Description
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           exactly equal to
!=           not equal to
!x           NOT x
x | y        x OR y
x & y        x AND y
isTRUE(x)    test if x is TRUE
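
As a small illustration of the operator tables, the sketch below checks that the quotient and remainder fit together and combines comparisons with the logical operators. The values 17 and 5 are arbitrary choices for the example.

```r
# Quotient and remainder satisfy: a == (a %/% b) * b + (a %% b)
a <- 17
b <- 5
quotient  <- a %/% b                   # integer quotient: 3
remainder <- a %% b                    # remainder: 2
rebuilt   <- quotient * b + remainder  # 3 * 5 + 2 = 17

# Logical operators combine the results of comparisons
both    <- (a > b) & (remainder == 0)  # FALSE: the second condition fails
either  <- (a > b) | (remainder == 0)  # TRUE: the first condition holds
negated <- !(a > b)                    # FALSE

print(c(quotient, remainder, rebuilt))
print(c(both, either, negated))
```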


Data type

We can give names to data objects; these give us variables! Variables are created with the assignment operator: = or <- (arrow)

Numeric: numbers, either floating point or integer

> x=5 # or x <- 5

Character: a character string

> x = "I like chocolate ice cream"

Logical: TRUE or FALSE

> x = (1 > 2)

built-in variables. E.g. TRUE (or T), FALSE (or F)
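
One way to check which type a variable holds is the class() function. The sketch below reuses the three assignments from this slide; the variable names y, z and types are only for illustration.

```r
x <- 5
y <- "I like chocolate ice cream"
z <- (1 > 2)

# class() reports the type of each object
types <- c(class(x), class(y), class(z))
print(types)   # "numeric" "character" "logical"
print(z)       # FALSE, because 1 is not greater than 2

# T and F are ordinary variables that can be overwritten,
# so spelling out TRUE and FALSE is the safer habit.
```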


Data structures

A data structure groups data values into one object. Types include:

vector

data frame

list

matrix

factors

tables

Some R packages have their own data structure


Data structures: vectors

Vectors: a sequence of values that are all numeric, all character or all logical. The function c() returns a vector containing all its arguments in order.

numeric vector

> x = c(1,2,3)

> x

[1] 1 2 3

> length(x) # the length of x

[1] 3

character vector

> x=c("a","b")

> x

[1] "a" "b"

> length(x)

[1] 2


Data structures: vectors II

Sequence generators

> x = seq(from=1, to=3, by=1) # or seq(1,3,1)

> x

[1] 1 2 3

> x = 1:3 # same as seq(1,3,1)

Extracting sub-vectors

> x[2] # return the 2nd element

[1] 2

> x[2:3] # extract the 2nd through 3rd elements

[1] 2 3

> x[c(2,3)] # same as x[2:3]

> x[-2] # drop the 2nd element

[1] 1 3


Data structures: vectors III

Element-wise arithmetic

> x = 1:5

> x+1

[1] 2 3 4 5 6

> x <= 3

[1] TRUE TRUE TRUE FALSE FALSE

Pairwise arithmetic

> x = 1:5

> y = c(-1, 0, 3:5)

> x+y

[1] 0 2 6 8 10

> x == y

[1] FALSE FALSE TRUE TRUE TRUE
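
The logical vectors produced by element-wise comparisons can also be used inside square brackets to pick out elements. This is a minimal sketch using the same x as above; the names keep, small and n_small are illustrative.

```r
x <- 1:5

keep <- x <= 3         # TRUE TRUE TRUE FALSE FALSE
small <- x[keep]       # keeps the elements where keep is TRUE: 1 2 3
n_small <- sum(keep)   # TRUE counts as 1, so this counts matches: 3

print(small)
print(n_small)
```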


Data structure: data frames

Data frames: a data set that can be represented as a set of observations (rows) on several variables (columns).

Example: Fisher's or Anderson's iris data set (built-in)

> iris

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

## Basic information of data frame

> names(iris) # variable names

[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

> dim(iris) # the numbers of rows and columns

[1] 150 5

> nrow(iris) # the number of rows

[1] 150

> ncol(iris) # the number of columns

[1] 5


Data structure: data frames II

Extracting subset

> iris$Sepal.Length # extracting variable (1st column)

[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1

[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0

> iris[,1] # extracting the 1st column

[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1

[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0

> iris[2,] # extracting the 2nd row

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

2 4.9 3 1.4 0.2 setosa

> iris[1,1] # the element of the 1st column & the 1st row

[1] 5.1
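
Row and column indices can be combined, and columns can be selected by name instead of position. The sketch below uses the built-in iris data; the object names first_rows and setosa are illustrative.

```r
# Rows 1 to 3, two columns selected by name
first_rows <- iris[1:3, c("Sepal.Length", "Species")]
print(first_rows)
print(dim(first_rows))   # 3 rows, 2 columns

# Rows for which a condition on one column is TRUE
setosa <- iris[iris$Species == "setosa", ]
print(nrow(setosa))      # 50 setosa flowers
```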


Data structure: data frames III

Each column is a vector (a single row such as iris[2,] is returned as a one-row data frame)

> sepal = iris$Sepal.Length # extracting variable (1st column)

> length(sepal)

[1] 150

> temp = iris[1:10,2]

> length(temp)

[1] 10


Writing data

Writing a data set to a text or CSV file

write.table(x,file,quote,sep,row.names,col.names,...)

x: data object to save
file: file name
quote: if TRUE, elements are surrounded by double quotes. If FALSE, nothing is quoted
sep: the field separator string. Values within each row are separated by this string
row.names: a logical value indicating whether the row names of x are to be written along with x
col.names: a logical value indicating whether the column names of x are to be written along with x

> write.table(iris, file = "iris.txt", quote=F, sep = " ",

row.names = F, col.names=T)


Reading data

Reading a data set from a text or CSV file

read.table(file, header, sep, ...)

file: data file to read
header: a logical value indicating whether the file contains the names of the variables as its first line
sep: the field separator string. Values within each row are separated by this string

> iris2 = read.table("iris.txt", header=T, sep=" ")

> lead = read.table("LEAD.DAT.txt",header=T)
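
A quick way to check that writing and reading agree is a round trip through a file. The sketch below writes iris to a temporary file (an assumption made here so that nothing is left in the working directory) and reads it back; the names path, iris2 and same_values are illustrative.

```r
# Temporary file used only for this example
path <- tempfile(fileext = ".txt")

# Write with a header line, space-separated, no quotes, no row names
write.table(iris, file = path, quote = FALSE, sep = " ",
            row.names = FALSE, col.names = TRUE)

# Read it back, telling R that the first line holds the variable names
iris2 <- read.table(path, header = TRUE, sep = " ")

print(dim(iris2))     # same shape as iris: 150 rows, 5 columns
print(names(iris2))

# The numeric values survive the round trip
same_values <- isTRUE(all.equal(iris$Sepal.Length, iris2$Sepal.Length))
print(same_values)

unlink(path)          # remove the temporary file
```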


Functions

Built-in functions: write.table, read.table, etc. Packages add more, e.g. read.cross from the qtl package

> names(lead) ## variable names of "lead"

[1] "id" "area" "ageyrs" "sex" "iqv_inf" "iqv_comp"

[7] "iqv_ar" "iqv_ds" "iqv_raw" "iqp_pc" "iqp_bd" "iqp_oa"

[13] "iqp_cod" "iqp_raw" "hh_index" "iqv" "iqp" "iqf"

[19] "iq_type" "lead_grp" "Group" "ld72" "ld73" "fst2yrs"

[25] "totyrs" "pica" "colic" "clumsi" "irrit" "convul"

[31] "X_2plat_r" "X_2plar_l" "visrea_r" "visrea_l" "audrea_r" "audrea_l"

[37] "fwt_r" "fwt_l" "hyperact" "maxfwt"

> dim(lead)

[1] 124 40


min(), max() and range()

min(x): the minimum of the argument x
max(x): the maximum of the argument x
range(x): the minimum and maximum of the argument x

> fwt = lead$maxfwt # extracting "maxfwt" and creating a new variable

> fwt

[1] 72 61 49 48 51 49 50 58 50 51 59 65 57 53 74 50 84 46 52 64 59 55 99 46 52

[26] 63 52 42 57 23 65 38 59 26 53 50 56 49 76 68 60 46 57 45 46 64 40 62 13 79

> min(fwt)

[1] 13

> max(fwt)

[1] 99

> range(fwt)

[1] 13 99

> diff(range(fwt)) # the "range" of fwt

[1] 86


mean()

mean(x): the arithmetic mean of the argument x
sum(x): the sum of the argument x

> mean(fwt) # arithmetic mean of fwt

[1] 61.44355

> sum(fwt)/length(fwt)

[1] 61.44355

> colMeans(lead) # the mean of each column

id area ageyrs sex iqv_inf iqv_comp iqv_ar

240.233871 1.717742 8.935000 1.387097 6.766129 7.532258 8.306452

cf) colMeans(), rowMeans(), colSums(), rowSums()
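
colMeans() works on lead because all of its columns are numeric. For a data frame with a non-numeric column, such as Species in iris, one option is to select the numeric columns first. A minimal sketch, with illustrative object names:

```r
# Keep only the four numeric measurement columns of iris
num <- iris[, 1:4]

col_means <- colMeans(num)    # one mean per variable
row_means <- rowMeans(num)    # one mean per observation

# The same column means, computed one column at a time
manual <- sapply(num, mean)

print(col_means)
print(row_means[1])           # first flower: (5.1 + 3.5 + 1.4 + 0.2) / 4 = 2.55
```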


median() and quantile()

median(x): the median of the argument x
quantile(x): returns the minimum, Q1, Q2 (median), Q3, maximum of x
IQR(x): the IQR of x

> median(fwt) # the median of fwt

[1] 56

> quantile(fwt)

0% 25% 50% 75% 100%

13 48 56 72 99

> IQR(fwt)

[1] 24

> quantile(fwt,probs=.25) # Q1, the 25th percentile

25%

48

> quantile(fwt,probs=.75) - quantile(fwt,probs=.25) # IQR

75%

24
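
The relationship between quantile(), median() and IQR() can be checked on any numeric vector. The sketch below uses a small made-up vector x (not the lead data) and R's default quantile method.

```r
# A small illustrative vector
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# Several quantiles can be requested at once through probs
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
print(q)

# The IQR is Q3 minus Q1; unname() drops the "75%" label
iqr_manual <- unname(q["75%"] - q["25%"])
print(iqr_manual)
print(IQR(x))       # agrees with the manual value

# The 50% quantile is the median
print(median(x))
```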

var() and sd()

var(x): the variance of the argument x
sd(x): the standard deviation of the argument x

> var(fwt)

[1] 490.4114

> sd(fwt)

[1] 22.14523

> n = length(fwt)

> n

[1] 124

> sum( (fwt - mean(fwt))^2 )/ (n-1) # variance

[1] 490.4114

> sqrt( sum( (fwt - mean(fwt))^2 )/(n-1) ) # standard deviation

[1] 22.14523


summary()

summary(x): the minimum, Q1, median, mean, Q3 and maximum of the argument x

> summary(fwt)

Min. 1st Qu. Median Mean 3rd Qu. Max.

13.00 48.00 56.00 61.44 72.00 99.00


Writing and calling functions

The structure of a function

<function name> = function(arg1, arg2, ... ){

statements

return(object)

}

We write another summary function, called mySummary, that returns the mean and standard deviation of an argument variable after removing missing values.

> mySummary = function(dat){ # define a function

  # na.rm = TRUE removes missing values (NA) before computing
  res = c(mean(dat, na.rm = TRUE), sd(dat, na.rm = TRUE))

  return(res)

}

> mySummary(fwt) # calling a function

[1] 61.44355 22.14523
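
To see what removing missing values involves, the sketch below applies a similar function to a vector that contains an NA. The function name meanSd and the vector y are hypothetical and used only for this illustration.

```r
# Hypothetical helper: mean and standard deviation, ignoring NA values
meanSd <- function(dat) {
  res <- c(mean = mean(dat, na.rm = TRUE), sd = sd(dat, na.rm = TRUE))
  return(res)
}

y <- c(1, NA, 3)

plain <- mean(y)    # NA: by default one missing value makes the mean missing
res <- meanSd(y)    # computed from 1 and 3 only: mean 2, sd sqrt(2)

print(plain)
print(res)
```

Naming the elements of the returned vector (mean and sd) makes the output easier to read than two unlabeled numbers.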


More intro

Some R resources
  • The official intro, “An Introduction to R”, available online at http://cran.r-project.org/doc/manuals/R-intro.pdf
  • Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design
  • Phil Spector, Data Manipulation with R
  • Paul Teetor, The R Cookbook