R: A Statistics Program For Teaching & Research Josué Guzmán 11 Nov. 2007 JGuzmanPhD@Gmail.Com.


© J. Guzmán, 2007 R: Stat. Prog. for Teach. & Res. 2

Some Useful R Links

• R Home Page www.r-project.org

• CRAN http://cran.r-project.org

• Precompiled Binary Distributions

• Windows (95 and later)

• R Manuals


R Installation

• R: Statistical Analysis & Graphics

• Freely Available Under GPL

• Binary Distributions

• Installation – Standard Steps


Running R


Statistical Programming with R

• Learn Language Basics

• Learn Documentation / Help System

• Learn Data Manipulation & Graphics

• Perform Basic Statistical Analysis


First Steps: Interacting with R

• Type a Command & Press Enter

• R Executes (printing the result if relevant)

• R waits for more input


Some Examples

2 * 2

[1] 4

exp(-2)

[1] 0.1353353

rdmnorm = rnorm(1000)


R Functions

• exp, log and rnorm are functions

• Function calls are indicated by the presence of parentheses

Example: hist(rdmnorm, col = "magenta")


Variables and Assignments

The = operator; the <- operator also works

x = 2.2

y = x + 3.5

sqrt(x)

y

x ^ y
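
As a small illustration of the two assignment operators, the sketch below reuses the slide's values; the names z and m are hypothetical additions. It also shows that "=" inside a function call names an argument rather than assigning.

```r
# Both assignment operators bind a value to a name.
x = 2.2
y <- x + 3.5      # y is 5.7

# "<-" is the form seen in most R documentation; "=" also works at top level.
z <- x ^ 2        # 4.84

# Inside a function call, "=" names an argument instead of assigning:
m <- mean(x = c(1, 2, 3))   # m is 2; x in the workspace is still 2.2

print(c(x = x, y = y, z = z, m = m))
```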


Variables and Assignments [cont.]

• Variable names cannot start with a digit

• Names are Case-Sensitive

• Some common names are already used by R

• Examples: c, q, t, C, D, F, I, T

• Should be avoided
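
The naming rules above can be tried out directly. This is only a sketch; the variable names total, Total, is_true and restored are made up for the illustration.

```r
# Names are case-sensitive: these are two different variables.
total <- 10
Total <- 20

# T is only a variable that initially holds TRUE, so it can be overwritten;
# TRUE itself is a reserved word and cannot.
T <- 0
is_true <- isTRUE(T)   # FALSE now, because T holds 0
rm(T)                  # remove our copy; the built-in T is visible again
restored <- isTRUE(T)  # TRUE
```

This is why reusing names such as T, F or c for data tends to cause confusing results later in a session.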


Vectorized Arithmetic

• Elementary data types in R are all vectors

• The c(...) construct used to create vectors:

• Bolstad, 2004, exercise 13.2, page 253

fertilizer = c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5)

fertilizer


Vectorized Arithmetic [cont.]

• Arithmetic operations (+, -, *, /, ^) and mathematical functions (sin, cos, log, …) work element-wise on vectors

yield = c(25, 31, 27, 28, 36, 35, 32, 34)

log(yield)


Vectorized Arithmetic [cont.]

sum.yield = sum(yield)

sum.yield

n = length(yield)

n

avg.yield = sum.yield/n

avg.yield
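
A short sketch of element-wise arithmetic on the two vectors from the slides; yield_per_unit and centered are hypothetical names added for the illustration.

```r
yield <- c(25, 31, 27, 28, 36, 35, 32, 34)
fertilizer <- c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5)

# Operators act on each element in turn.
yield_per_unit <- yield / fertilizer   # 25/1, 31/1.5, ...
centered <- yield - mean(yield)        # deviations from the mean

# The hand computation from the slide agrees with the built-in mean().
avg.yield <- sum(yield) / length(yield)
avg.yield      # 31
mean(yield)    # 31
```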


Graphics

• plot(x, y) function – simple way to produce R graphics:

plot(fertilizer, log(yield), main = "Fertilizer vs. Yield")


Getting Help

• help.start( ) Starts a browser window with an HTML help interface. Links to the manual An Introduction to R, as well as topic-wise listings.

• help(topic) Help page for a particular topic or function. Every R function has a help page.

• help.search("search string") Subject/keyword search


Getting Help [cont.]

• Shortcut: the question mark (?). help(plot) can be written as ?plot

• To know about a specific subject, use help.search function. Example:

help.search("logarithm")


apropos( )

• apropos function - lists the objects whose names partially match its argument:

apropos("plot")[1:10]

[1] ".__C__recordedplot" "biplot"

[3] "interaction.plot" "lag.plot"

[5] "monthplot" "plot.TukeyHSD"

[7] "plot.density" "plot.ecdf"

[9] "plot.lm" "plot.mlm"


R Packages

• R makes use of a system of packages

• Each package is a collection of routines with a common theme

• The core of R itself is a package called base

• A collection of packages is called a library

• Some packages are already loaded when R starts up

• Other packages need to be loaded using the library function


R Packages [cont.]

Several packages come pre-installed with R:

installed.packages( )[, 1]

[1] "ISwR" "KernSmooth" "MASS" "base"

[5] "boot" "class" "cluster" "foreign"

[9] "graphics" "grid" "lattice" "methods"

[13] "mgcv" "nlme" "nnet" "rpart"

[17] "spatial" "splines" "stats" "stats4"

[21] "survival" "tcltk" "tools" "utils"


Contributed Packages

• Many packages are available from CRAN

• Some packages are already loaded when R starts up. To list the currently loaded packages, use search:

search( )

[1] ".GlobalEnv" "package:tools" "package:methods"

[4] "package:stats" "package:graphics" "package:utils"

[7] "Autoloads" "package:base"


R Packages

• Can be loaded by the user. Example: UsingR package

library(UsingR)

• New packages downloaded using the install.packages function:

install.packages("UsingR")

library(help = UsingR)


Data Types

• vector – Set of elements in a specified order

• matrix – Two-dimensional array of elements of the same mode

• factor – Vector of categorical data

• data frame – Two-dimensional array whose columns may represent data of different modes

• list – Set of components that can be any other object type
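
To make the five types concrete, here is a minimal sketch that builds one object of each kind. All object names and values are invented for the illustration.

```r
v <- c(1.5, 2.0, 3.5)                                 # vector
m <- matrix(1:6, nrow = 2)                            # 2 x 3 matrix, filled by column
f <- factor(c("low", "high", "low"))                  # factor with levels "high", "low"
df <- data.frame(id = 1:3, group = f, score = v)      # columns of different modes
lst <- list(numbers = v, table = df, label = "demo")  # mixed components

dim(m)        # 2 3
levels(f)     # "high" "low"
str(df)       # compact description of each column
```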


Editing Data Sets

• Can create and modify data sets on the command line

xx = seq(from = 1, to = 5)

xx

x2 = 1 : 5

x2

yy = scan( )

5 8 10 4 2 6 20

11 21 32 43 55

yy

• Can edit a data set once it is created

edit(mydata)

data.entry(mydata)


Built-in Data

Data from a library:

library(UsingR)

attach(cfb) # Consumer-Finances Survey

cfb$INCOME

cfb$EDUC

educ.fac = factor(EDUC)

plot(INCOME ~ educ.fac, xlab = "EDUCATION", ylab = "INCOME")

detach(cfb)


Data Modes

• logical – Binary mode, values represented as TRUE or FALSE

• numeric – Numeric mode [integer, single, & double precision]

• complex – Complex numeric values

• character – Character values represented as strings


Data Frames

• read.table( ) – Reads in data from an external file

read.table("data.txt" , header = T)

read.table(file = file.choose( ), header = T)

• data.frame – Binds R objects of various kinds
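
The two routes to a data frame can be compared side by side. In this sketch the inline string raw stands in for the contents of a file such as "data.txt"; the column names and values are assumptions borrowed from the fertilizer example.

```r
# Inline text stands in for the contents of a file such as "data.txt".
raw <- "fertilizer yield
1.0 25
1.5 31
2.0 27"

# header = TRUE takes the column names from the first line.
from_text <- read.table(text = raw, header = TRUE)

# data.frame() binds existing vectors into the same structure.
from_vectors <- data.frame(fertilizer = c(1, 1.5, 2), yield = c(25, 31, 27))

names(from_text)   # "fertilizer" "yield"
nrow(from_text)    # 3
```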


read.table Function

• Reads an ASCII file, creates a data frame

• Data in tables of rows and columns

• If first line contains column labels: Use argument header = T

• Field separator is white space

• Also read.csv and read.csv2

– Assume , and ; separations, respectively

• Treats characters as factors (the default before R 4.0.0; later versions keep character columns unless stringsAsFactors = TRUE)


save( ) and load( )

• Used for R Functions and Objects

• The saved file can be read back only with load( )

x = 23

y = 44

save(x, y, file = "xy.Rdata")

load("xy.Rdata")
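
A round trip makes the behaviour visible. This sketch writes to a temporary file instead of "xy.Rdata" so it leaves nothing behind; path and restored are hypothetical names.

```r
x <- 23
y <- 44

# Write both objects to a temporary file (stand-in for "xy.Rdata").
path <- tempfile(fileext = ".Rdata")
save(x, y, file = path)

# Remove them, then restore; load() returns the names it recreated.
rm(x, y)
restored <- load(path)
restored   # the names "x" and "y"
```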


Comparison Operators

!= Not Equal To

< Less Than

<= Less Than or Equal To

== Exactly Equal To

> Greater Than

>= Greater Than or Equal To


Some Logical Operators

! Not

| Or (element-wise, for vectors and arrays of logicals)

& And (element-wise, for vectors and arrays of logicals)
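
Comparison and logical operators are most useful for selecting elements. The sketch below applies them to the yield vector from the earlier slides; high, mid and edge are hypothetical names.

```r
yield <- c(25, 31, 27, 28, 36, 35, 32, 34)

high <- yield >= 32                 # one TRUE/FALSE per element
mid  <- yield > 27 & yield < 35     # both conditions must hold
edge <- yield == 25 | yield == 36   # either condition may hold

yield[high]      # 36 35 32 34
sum(mid)         # 4: TRUE counts as 1
!high            # flips every element
```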


Some Mathematical Functions

abs Absolute Value

ceiling Next Larger Integer

floor Next Smaller Integer

cos, sin, tan Trigonometric Functions

exp(x) e^x [e = 2.71828 …]

log Natural Logarithm

log10 Logarithm Base 10

sqrt Square Root


Statistical Summary Functions

length Length of Object

max Maximum Value

mean Arithmetic Mean

median Median

min Minimum Value

prod Product of Values

quantile Empirical Quantiles

sum Sum

var Variance - Covariance

sd Standard Deviation

cor Correlation Between Vectors or Matrices


Sorting and Other Functions

rev Put Values of Vectors in Reverse Order

sort Sort Values of Vector

order Permutation of Elements to Produce Sorted Order

rank Ranks of Values in Vector

match Detect Occurrences in a Vector

cumsum Cumulative Sums of Values in Vector

cumprod Cumulative Products
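
The difference between sort, order and rank is easiest to see on one small vector. This sketch reuses the yield data from the earlier slides; ord is a hypothetical name.

```r
yield <- c(25, 31, 27, 28, 36, 35, 32, 34)

sort(yield)              # 25 27 28 31 32 34 35 36
rev(sort(yield))         # descending order
ord <- order(yield)      # 1 3 4 2 7 8 6 5: positions that would sort the vector
yield[ord]               # same as sort(yield)
rank(yield)              # 1 4 2 3 8 7 5 6: rank of each original element
cumsum(yield)            # running totals; the last value is 248
match(c(36, 99), yield)  # 5 NA: position of first match, NA if absent
```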


Plotting Functions Useful for One-Dimensional Data

barplot Bar plot

boxplot Box & Whisker plot

hist Histogram

dotchart Dot plot

pie Pie chart


Plotting Functions Useful for Two-Dimensional Data

plot Creates a scatter plot: plot(x, y)

qqnorm Quantile-quantile plot sample vs. N(0, 1): qqnorm(x)

qqplot Quantile-quantile plot for two samples: qqplot(x, y)

pairs Creates a pairs or scatter plot matrix:

attach(babies)

pairs(babies[ , c("gestation", "wt", "age", "inc")])


Three-Dimensional Plotting Functions

contour Contour plot

persp Perspective plot

image Image plot


Probability Distributions Using R

• Pseudo-random sampling

sample(0:20, 5) # select 5 without replacement (WOR)

sample(0:20, 5, replace = T) # select 5 with replacement (WR)

• Coin toss simulation [0 = tail; 1 = head] 20 tosses:

sample(c(0, 1), 20, replace=T)
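
Because the draws are pseudo-random, fixing a seed makes a run repeatable. The sketch below is one way to do that; the seed value and the names wor, wr, tosses and heads are arbitrary choices for the illustration.

```r
set.seed(2007)   # arbitrary seed so the draws can be reproduced

wor <- sample(0:20, 5)                    # without replacement: no repeats
wr  <- sample(0:20, 5, replace = TRUE)    # with replacement: repeats possible

# 20 coin tosses (0 = tail, 1 = head); the sum counts the heads.
tosses <- sample(c(0, 1), 20, replace = TRUE)
heads  <- sum(tosses)
heads
```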


For Any Probability Distribution

ddist density or probability

pdist cumulative probability

qdist quantiles [percentiles]

rdist pseudo-random selection


Binomial Distribution

X ~ Binomial(n , p) ; x = 0, 1, …, n

dbinom(x , n , p ) Density or point probability

pbinom(x , n , p ) Cumulative distribution

qbinom(q , n , p ) Quantiles [ 0 < q < 1 ]

rbinom(m , n , p ) Pseudo-random numbers
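
The four functions fit together as sketched below for 20 tosses of a fair coin. The variable names and the seed are assumptions made for the example.

```r
n <- 20; p <- 0.5

# P(X = 10): point probability.
p10 <- dbinom(10, size = n, prob = p)

# P(X <= 8): the cumulative probability is the sum of point probabilities.
cum8 <- pbinom(8, size = n, prob = p)
by_hand <- sum(dbinom(0:8, size = n, prob = p))

# qbinom() inverts pbinom(): the smallest x whose cumulative
# probability reaches the requested level.
med <- qbinom(0.5, size = n, prob = p)    # 10

set.seed(1)                               # arbitrary seed
draws <- rbinom(5, size = n, prob = p)    # five simulated head counts
```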


Binomial Distribution

Coin toss simulation: x = 0:20 # num. of heads in 20 tosses

px = dbinom(x , size = 20, prob = 0.5)

plot(x , px, type = "h") # graph display

curve(dnorm(x, 10, sqrt(20*.5*.5)), col=2, add=T)


[Figure: binomial probabilities px plotted against x (0 to 20), with the approximating normal curve overlaid]


Normal Distribution

X ~ Normal(µ, σ)

dnorm(x , µ, σ) Density

pnorm(x , µ, σ) Cumulative probability

qnorm(q , µ, σ) Quantiles

rnorm(m , µ, σ) Random numbers


Standard Normal

x = seq(-3.5,3.5,0.1) # x ~ N(0,1)

prx = dnorm(x) # M = 0 , SD = 1

plot(x , prx , type = "l" )

Or using: curve(dnorm(x), from = -3.5 , to = 3.5)


Cumulative Normal & Quantiles

curve(pnorm(x), from=-3.5,to=3.5)

qnorm(.25) #Percentile 25, x~N(0,1)

qnorm(.75, m=50, sd=2) # M=50,SD=2

qnorm(c(.1,.3,.7,.9), m=65, sd=3)
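
The relationship between quantiles and cumulative probabilities can be checked numerically. A brief sketch, with hypothetical variable names:

```r
# qnorm() and pnorm() are inverses of each other.
q25 <- qnorm(0.25)            # about -0.674 for N(0, 1)
back <- pnorm(q25)            # returns 0.25

# A quantile of N(mean, sd) is the standard quantile, rescaled.
q75 <- qnorm(0.75, mean = 50, sd = 2)
same <- 50 + 2 * qnorm(0.75)

# Probability within one standard deviation of the mean.
within1 <- pnorm(1) - pnorm(-1)   # about 0.683
```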


Poisson Distribution

X ~ Poisson( λ ) ; X = 0, 1, 2, 3, …

x = 0:20 # Suppose λ = 3.5

prx = dpois(x, lambda = 3.5)

plot(x , prx, type = "h", main = "Poisson Distribution")

text(10, .10, "Lambda = 3.5")


[Figure: "Poisson Distribution"; prx plotted against x (0 to 20), annotated "Lambda = 3.5"]


Sampling Distributions

n = 25; curve(dnorm(x , 0, 1/sqrt(n)), -3, 3,

xlab = "Mean", ylab = "Densities of Sample Mean", bty = "l" )

n=5 ; curve(dnorm(x, 0, 1/sqrt(n)), add=T)

n=1 ; curve(dnorm(x, 0, 1/sqrt(n)), add=T)
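
The curves above use a standard deviation of 1/sqrt(n) for the mean of n standard normal observations. A simulation can support that choice; this is only a sketch, and the helper sd_of_means, the seed, and the number of replications are assumptions.

```r
set.seed(49)   # arbitrary seed for reproducibility

# Standard deviation of `reps` simulated sample means for a given n.
sd_of_means <- function(n, reps = 5000) {
  means <- replicate(reps, mean(rnorm(n)))
  sd(means)
}

sizes <- c(1, 5, 25)
simulated <- sapply(sizes, sd_of_means)
theory <- 1 / sqrt(sizes)     # 1, 0.447, 0.2

round(rbind(simulated, theory), 3)
```

The simulated values should fall close to the theoretical ones, and closer still as reps grows.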


[Figure: densities of the sample mean for n = 25, 5 and 1; x-axis "Mean" (-3 to 3), y-axis "Densities of Sample Mean"]


t-Distribution as df Increases

curve(dnorm(x), -4, 4, main = "Normal & t Distributions", ylab = "Densities")

k=3; curve(dt(x , df = k ), lty = k, add = T)

k=5; curve(dt(x , df = k ), lty = k, add = T)

k=15; curve(dt(x , df = k ), lty = k, add = T)

k=100; curve(dt(x , df = k ), lty = k, add = T)


[Figure: "Normal & t Distributions"; densities plotted against x (-4 to 4), y-axis "Densities"]


Binomial-Normal Approximation

• Coin toss example: n = 100, p = .5

• P(X ≤ 40)?

Using Larget’s prob.R file:

source(file.choose( ))

gbinom(100, .5, b = 40)

Normal approximation: µ = 50, σ = 5

gnorm(50, 5, b = 40.5)
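
The same two probabilities can be obtained without prob.R. The sketch below swaps in base R's pbinom and pnorm for gbinom and gnorm, so it gives the numbers only, not the plots; the variable names are hypothetical.

```r
# Exact binomial probability P(X <= 40) for n = 100, p = 0.5.
exact <- pbinom(40, size = 100, prob = 0.5)

# Normal approximation with continuity correction:
# mean = n * p = 50, sd = sqrt(n * p * (1 - p)) = 5.
mu <- 100 * 0.5
sigma <- sqrt(100 * 0.5 * 0.5)
approx_p <- pnorm(40.5, mean = mu, sd = sigma)

round(c(exact = exact, approx = approx_p), 4)   # 0.0284 0.0287
```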


[Figure: "Binomial Distribution n = 100 , p = 0.5"; x-axis "Possible Values" (30 to 70), y-axis "Probability"; P(0 <= Y <= 40) = 0.028444]


[Figure: "Normal Distribution with µ = 50, σ = 5"; x-axis "Possible Values" (30 to 70), y-axis "Probability Density"; P( X < 40.5 ) = 0.0287, P( X > 40.5 ) = 0.9713]


One-Sample t-test

Ho: µ = µ0 Null Hypothesis

Ha: µ ≠ µ0 Two-sided

Ha: µ > µ0 One-sided

Ha: µ < µ0 One-sided


R One-Sample t.test

x = c(x1, x2, …, xn) # data set

t.test(x, mu = Mo) # two-sided

t.test(x, mu = Mo, alt = "g") # one-sided

t.test(x, mu = Mo, alt = "l") # one-sided
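
To see what t.test computes, the statistic can be rebuilt by hand. This sketch uses the yield vector from the earlier slides purely as sample data; the null value mu0 = 30 is a hypothetical choice.

```r
yield <- c(25, 31, 27, 28, 36, 35, 32, 34)   # data from the earlier slides
mu0 <- 30                                     # hypothetical null value

# t statistic by hand: (mean - mu0) / (sd / sqrt(n)).
n <- length(yield)
t_by_hand <- (mean(yield) - mu0) / (sd(yield) / sqrt(n))
p_by_hand <- 2 * pt(-abs(t_by_hand), df = n - 1)   # two-sided p-value

result <- t.test(yield, mu = mu0)
result$statistic   # same value as t_by_hand
result$p.value     # same value as p_by_hand
```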


R One-Sample t.test [cont.]

Example: Text, Problem 8.11, page 226

library(UsingR)

attach(stud.recs)

x = sat.m # Math SAT Scores

hist(x) # Visual display

qqnorm(x) # Normal quantile plot

qqline(x, col=2) # Add reference line

t.test(x, mu = 500)

detach(stud.recs)


Normality Test

Shapiro-Wilk test:

Ho: X ~ Normal

Ha: X !~ Normal

Command: shapiro.test(x)

# Examine p-value


Normality Test [cont.]

Example: On Base %

data(OBP)

summary(OBP)

boxplot(OBP)

[Figure: box plot of OBP; axis from 0.2 to 0.5]


Normality Test [cont.]

qqnorm(OBP)

qqline(OBP, col=2)

shapiro.test(OBP)

wilcox.test(OBP, mu=.330)

[Figure: "Normal Q-Q Plot"; x-axis "Theoretical Quantiles" (-3 to 3), y-axis "Sample Quantiles" (0.2 to 0.5)]


One-Sample Proportion Test

x total successes; n sample size

prop.test(x, n, p = Po) # two-sided

prop.test(x, n, p = Po, alt= "g")

prop.test(x, n, p = Po, alt= "l")


Or Using Binomial “Exact” Test

binom.test(x, n, p = Po)

binom.test(x, n, p = Po, alt = "g")

binom.test(x, n, p = Po, alt = "l")


Proportion Test

Text, Example 8.3: Survey US Poverty Rate

Ho: P = 0.113 # Year 2000 Rate

Ha: P > 0.113 # Year 2001 Rate Increased

x = 5850 # Sample people UPL

n = 50000 # Sample size

prop.test(x, n, p = 0.113, alt = "g")

binom.test(x, n, p = 0.113, alt = "g")
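
For the one-sided "greater" alternative, the exact test's p-value is simply an upper binomial tail. The sketch below checks this with the slide's numbers; exact and tail_prob are hypothetical names.

```r
x <- 5850; n <- 50000; p0 <- 0.113

exact <- binom.test(x, n, p = p0, alternative = "greater")

# For the "greater" alternative the exact p-value is P(X >= x) under Ho.
tail_prob <- pbinom(x - 1, size = n, prob = p0, lower.tail = FALSE)

exact$p.value      # same value as tail_prob
exact$estimate     # sample proportion x / n = 0.117
```

prop.test answers the same question with a chi-squared approximation, so its p-value will be close to, but not identical with, the exact one.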


Some Modeling Functions/Packages

Linear Models: anova, car, lm, glm

Graphics: graphics, grid, lattice

Multivariate: mva, cluster

Survey: survey

SQC: qcc

Time Series: tseries

Bayesian: BRugs, MCMCpack, …

Simulation: boot, bootstrap, Zelig

You Perform An Experiment In Order To Learn, Not To Prove.

W. Edwards Deming