Statistics Workshop 1: Introduction to R. Tuesday May 26, 2009

    Assignments

Generally speaking, there are three basic forms of assigning data. Case one is the single atom, or a single number. Assigning a number to an object in this case is quite trivial. All we need is to use <- or = to assign a number or an atom to a name. In the following, > refers to the prompt in R.

The second form is the vector form. In this form, we assign a name to an array of numbers. This can be done with the command c, which stands for concatenation. The interesting fact is that we can call any member of the vector, replace that member with a new one, or perform various arithmetic operations on the vector, as shown below.

Finally, the third form of storing data is to put them in matrix form. The command is matrix. First we input the data set of interest, then tell R the dimensionality of the matrix. For example, we can put an array of 9 numbers into a matrix with 3 rows and 3 columns. We demonstrate all of these below.

    Atoms, Vectors and Matrices

    (a) Atoms:

    > sam=2

    > sam

    [1] 2

    > sam+sam

    [1] 4

    > (2*sam*2)/2

    [1] 4

    > sam^(1/3)

    [1] 1.259921

    > sqrt(sam)

    [1] 1.414214

    > abs(-sam)

    [1] 2

(b) Vectors

    > class.age=c(35,35,36,37,37,38,38,39,40.5,43,44,44.5,50,19)

    > class.age

    [1] 35.0 35.0 36.0 37.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0

    > class.age[3]

    [1] 36

    > class.age[1:5]

    [1] 35 35 36 37 37

    > class.age[-5]

    [1] 35.0 35.0 36.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0

    > class.age[-c(2,7)]

    [1] 35.0 36.0 37.0 37.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0

    > class.age*2

    [1] 70 70 72 74 74 76 76 78 81 86 88 89 100 38

    > sqrt(class.age)

[1] 5.916080 5.916080 6.000000 6.082763 6.082763 6.164414 6.164414 6.244998
[9] 6.363961 6.557439 6.633250 6.670832 7.071068
[14] 4.358899

    > class.age^(-1)

[1] 0.02857143 0.02857143 0.02777778 0.02702703 0.02702703 0.02631579
[7] 0.02631579 0.02564103 0.02469136 0.02325581
[11] 0.02272727 0.02247191 0.02000000 0.05263158

    > class.age*class.age

[1] 1225.00 1225.00 1296.00 1369.00 1369.00 1444.00 1444.00 1521.00 1640.25
[10] 1849.00 1936.00 1980.25 2500.00  361.00

    > class.age^2

[1] 1225.00 1225.00 1296.00 1369.00 1369.00 1444.00 1444.00 1521.00 1640.25
[10] 1849.00 1936.00 1980.25 2500.00  361.00

    > mean(class.age)

[1] 38.28571

    > median(class.age)

    [1] 38

    > class.age=(class.age)/2

    > class.age

    [1] 17.50 17.50 18.00 18.50 18.50 19.00 19.00 19.50 20.25 21.50 22.00 22.25 25.00 9.50

    > class.age=class.age*2

    Often it is useful to create an empty vector. Here is the way this is done:

    > hi=numeric(10)

    > hi

    [1] 0 0 0 0 0 0 0 0 0 0

    Vectors do not have to be numerical. We can create a vector of characters:

    > hi=c("hello","whasup","longday")

    > hi

    [1] "hello" "whasup" "longday"

    Later, it becomes useful to ask R the length of a vector:

    > length(class.age)

    [1] 14

    (c) Matrices

    > sam = matrix(nrow=3,ncol=4)

    > sam

    [,1] [,2] [,3] [,4]

    [1,] NA NA NA NA

    [2,] NA NA NA NA

    [3,] NA NA NA NA

    > sam = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,byrow=T)

> sam

    [,1] [,2] [,3] [,4]

    [1,] 1 2 3 4

    [2,] 5 6 7 8

    [3,] 9 10 11 12

> sam=matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,byrow=F)
> sam

    [,1] [,2] [,3] [,4]

    [1,] 1 4 7 10

    [2,] 2 5 8 11

    [3,] 3 6 9 12

    > sally =c(1,2,3,4,5,6,7,8,9,10,11,12)

    > sam=matrix(sally,nrow=3,byrow=T)

    > sam

    [,1] [,2] [,3] [,4]

    [1,] 1 2 3 4

    [2,] 5 6 7 8

    [3,] 9 10 11 12

    > v1=c(1,2,3,4)

    > v2=c(5,6,7,8)

    > v3=c(9,10,11,12)

    > sam=matrix(c(v1,v2,v3),nrow=3,byrow=T)

    > sam

    [,1] [,2] [,3] [,4]

    [1,] 1 2 3 4

    [2,] 5 6 7 8

    [3,] 9 10 11 12

> sam[1,]

    [1] 1 2 3 4

    > sam[,2]

    [1] 2 6 10

    > sam[1,3]

    [1] 3

> sam[3,]=sam[2,]
> sam

    [,1] [,2] [,3] [,4]

    [1,] 1 2 3 4

    [2,] 5 6 7 8

    [3,] 5 6 7 8

> sam[1,]=log(sam[1,])
> sam

    [,1] [,2] [,3] [,4]

    [1,] 0 0.6931472 1.098612 1.386294

    [2,] 5 6.0000000 7.000000 8.000000

    [3,] 5 6.0000000 7.000000 8.000000

(d) Lists

R provides a powerful additional storage structure called a list. The importance of a list is that we can store objects of different natures, such as matrices, vectors, or atoms, in a single object, and then call the different parts of that object separately.

Let's assume that we would like to store the following three objects in a list object called s:

    > s1=3

    > s2=seq(1,10,2)

    > s3=matrix(c(1:9),nrow=3)

    > s1

    [1] 3

    > s2

[1] 1 3 5 7 9

    > s3

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

> s=list(s1,s2,s3)
> s

    [[1]]

    [1] 3

    [[2]]

    [1] 1 3 5 7 9

    [[3]]

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    > s[[1]]

    [1] 3

    > s[[2]]

    [1] 1 3 5 7 9

    > s[[3]]

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    The for loop

Oftentimes, it becomes necessary to repeat certain calculations a number of times. This is done in R using a simple command called for. Here are some examples:

> for(i in 1:3)

    + {

    + print("sam")

    + }

    [1] "sam"

    [1] "sam"

    [1] "sam"

> s=matrix(c(1,2,3,4,5,6,7,8,9),nrow=3)

    > for(i in 1:3)

    + print(s)

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    or:

    > for(i in 1:3)

    + { print(s)}

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

> for(i in 1:3)

    +{

    + print(s[i,])

    + }

    [1] 1 4 7

    [1] 2 5 8

    [1] 3 6 9

    Functions

In principle, there are two sorts of functions in R. The most common and useful ones are the library functions, i.e. the already-written commands. For example, mean and sd are commands that calculate the average and the standard deviation of an object, say a vector, respectively. Here are a couple of examples:

> s2=seq(1,10,2)
> s2

    [1] 1 3 5 7 9

    > mean(s2)

    [1] 5

    > var(s2)

    [1] 10

    > sd(s2)

    [1] 3.162278

    > median(s2)

    [1] 5

The second type of functions are those that the users of R create. These functions will remain in the command memory of the software unless you delete them or overwrite them. Expectedly, the command to create a function is function.
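As a warm-up, here is a minimal sketch of a user-defined function (the name cube is illustrative, not part of the workshop):

```r
# A minimal user-defined function (illustrative example):
cube <- function(x)
{
  x^3        # the value of the last expression is what the function returns
}
cube(2)      # returns 8
```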

Here is an example of a function that gets a matrix and calculates the standard deviation divided by the mean of its rows. This measurement is called the coefficient of variation. Note that in writing this function, I use the commands mean and sd.

In general, any time you are not sure what an R command does, or want to learn about its specifics, just type a question mark followed by the command at the prompt.
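For example (using mean, though any command name works here):

```r
# "?" and help() perform the same documentation lookup; assigning the result
# avoids opening the help pager, which is convenient inside scripts.
h <- help("mean")      # equivalent to typing ?mean at the prompt
length(h) >= 1         # TRUE when a help page for "mean" was found
```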

> m.cv=function(m){apply(m,1,sd)/apply(m,1,mean)}
> m.cv(sm3)

    [1] 0.75 0.60 0.50

Visualizing Data: Pie Charts, Stem plots, Histograms

    Categorical Data

For categorical data, we keep track of counts or relative frequencies of each group. Therefore, a schematic presentation should reflect the percentage of occurrences in each category. This is usually done via two types of graphs: 1- Pie charts, and 2- Barplots. Both graphs are easy to create in R. An important issue here is that in most cases, it would make sense to label the categories. We show you how to do this below.

Example 1. The counts and the percentages of the marital status of American women were collected by the Current Population Survey in 1995, as follows:

Marital Status    Count (millions)    Percent
Never Married     43.9                22.9
Married           116.7               60.9
Widowed           13.4                7.0
Divorced          17.6                9.2

    Here are the commands to provide the pie-chart for these data (figure 1):

> married=c(43.9,116.7,13.4,17.6)
> pie(married)

Alternatively, we could label each piece of the pie by creating a vector that contains the name of each piece (figure 2):

> married.code=c("never married","married","widowed","divorced")
> pie(married,married.code)

To create a barplot for the married data, it is sufficient to execute the following (figure 3):

> barplot(married,names.arg=married.code)

Figure 1: Pie chart for the Married data.

    Stemplots and Histograms

For quantitative data, stemplots and histograms are the useful visual tools.

Example 2. Let's revisit the class-age data we introduced previously. To create the stemplot for these data, we can do the following:

> class.age=c(35,35,36,37,37,38,38,39,40.5,43,44,44.5,50,19)
> class.age

    [1] 35.0 35.0 36.0 37.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0

    > stem(class.age)

    The decimal point is 1 digit(s) to the right of the |

    1 | 9

    2 |

    3 | 55677889

    4 | 1345

Figure 2: Pie chart for the Married data with labels.

    5 | 0

    > stem(class.age,scale=2)

    The decimal point is 1 digit(s) to the right of the |

    1 | 9

    2 |

    2 |

    3 |

    3 | 55677889

    4 | 134

    4 | 5

    5 | 0

> test=c(0.00,0.01,0.22,0.31,0.34,0.36,0.36,0.40,0.45,0.55,0.65)
> stem(test)

Figure 3: Barplot for the Married data with labels.

    The decimal point is 1 digit(s) to the left of the |

    0 | 01

    2 | 21466

    4 | 055

    6 | 5

    > stem(test,scale=2)

    The decimal point is 1 digit(s) to the left of the |

    0 | 01

    1 |

    2 | 2

    3 | 1466

    4 | 05

    5 | 5

    6 | 5

To create a histogram for the class-age data, it is sufficient to use the hist command (figure 4):

    > hist(class.age)

Figure 4: The histogram of class.age.

    We can make the bars finer. Here is a simple trick (figure 5):

> b1=seq(15,48,3)
> b1

[1] 15 18 21 24 27 30 33 36 39 42 45 48

> b1=c(b1,51)
> hist(class.age,breaks=b1)

Figure 5: The histogram of class.age with finer classes.

Measuring Center: The Mean, The Median, and the Quartiles

The measures of centrality play a fundamental role in understanding statistical distributions. The most important ones are the mean, the median, and the other quantiles.

    > mean(class.age)

    [1] 38.28571

    > median(class.age)

    [1] 38

    > quantile(class.age)

    0% 25% 50% 75% 100%

    19.000 36.250 38.000 42.375 50.000

    > quantile(class.age,prob=0.66)

66%

    39.87

    Comparing Mean and Median

    The Symmetric Case

For symmetric distributions, such as the one in figure 6, the median and the mean are close to each other.

Figure 6: A symmetric distribution. Mean = 100.07, Median = 99.71.

For distributions that are skewed to the left, such as the one in figure 7, the mean is smaller than the median (why?).

    Finally, for the right-skewed distributions, the mean is larger than the median (figure 8).

    Measuring Spread: The Standard Deviation

The standard deviation reflects the typical number of units by which the observations deviate from the mean. For example, the standard deviations for the data in figures 6, 7, and 8 are 10.03, 0.19, and 0.19 respectively.

    To calculate the variance and the standard deviation for the class.age data, we can type:

Figure 7: Left-skewed Distribution. Mean = 0.75, Median = 0.79.

    > var(class.age)

    [1] 49.1044

    > sd(class.age)

    [1] 7.007453

    Project 1. Visualizing Grades.

    First, read the file grades.txt from the webpage. To do this, run the following code in R:

    > grades=read.table("http://math.fullerton.edu/sbehseta/grades.txt",header=T)

This will generate a data frame of size 100 × 3 called grades in R for you. Rows are students, and columns represent the Verbal SAT score, the Math SAT score, and the GPA for each student. To examine the dimensionality of this object you can type:

    > dim(grades)

    [1] 100 3

This confirms what we planned initially. Now, we are in a position to answer the following questions:

Figure 8: Right-skewed Distribution. Mean = 0.24, Median = 0.19.

(1.) Create barplots, dotplots, stemplots, and histograms for the three variables of interest. To make life easier, use the command attach to make the data file grades your working data. Then proceed by just typing the name of the column of interest. Here is what I mean:

    > attach(grades)

    > GPA

    [1] 2.6 2.3 2.4 3.0 3.1 2.9 3.1 3.3 2.3 3.3 2.6 3.3 2.0 3.0 1.9 2.7 2.0 3.3

    [19] 2.0 2.3 3.3 2.8 1.7 2.4 3.4 2.8 2.4 1.9 2.5 2.3 3.4 2.8 1.9 3.0 3.7 2.3

    [37] 2.9 3.3 2.1 1.2 3.3 2.0 3.1 2.6 2.4 2.4 2.3 3.0 2.9 3.4 2.3 1.4 2.8 2.4

    [55] 3.4 2.5 3.6 2.6 3.6 2.9 2.6 3.8 3.0 2.5 3.5 2.0 3.0 2.0 1.8 2.3 2.1 3.0

    [73] 3.3 3.0 3.2 2.3 3.3 3.3 3.9 2.1 2.6 2.4 3.3 3.1 3.6 2.9 2.4 1.8 2.4 2.9

    [91] 3.5 3.4 2.3 2.9 1.8 2.8 2.3 2.5 2.4 2.9

(2.) Calculate the min, the max, the mean, the median, the first quartile, the third quartile, and the standard deviation of each variable. A good chunk of that information may be obtained with the command summary:

    > summary(GPA)

Min. 1st Qu. Median Mean 3rd Qu. Max.

    1.200 2.300 2.750 2.706 3.125 3.900

(3.) Report your findings in detail. Compare the verbal scores with the math scores. Comment on the symmetry, measures of centrality, measures of spread, and the potential outliers in each distribution. Make sure to comment on the statistical features of the GPA as well.

    Boxplots

Boxplots are efficient tools for representing data distributions. The five-number summary can be traced on a boxplot. Additionally, we can identify outliers with boxplots.

Remember the three distributions in figures 6, 7 and 8. Note that these distributions are symmetric, left-skewed, and right-skewed respectively. Here is how I created figure 9 below:

    > n1=rnorm(100000,2,3)

    > n2=rpois(100000,3)

    > n3=rbeta(100000,12,3)

    > par(mfrow=c(3,2))

    > hist(n1)

    > boxplot(n1)

    > hist(n2)

    > boxplot(n2)

    > hist(n3)

    > boxplot(n3)

Note that the command par(mfrow) creates a 3 by 2 grid in the graphic area. Side-by-side boxplots are very helpful in comparing two or more distributions. For example, figure 10 shows side-by-side boxplots for two of the distributions of figure 9.

    > boxplot(n1,n2)

Linear Transformations and Their Effect on x̄ and s: Standardization

To experiment with the idea of linear transformations, let's go back to the first dataset n1 and calculate its mean and standard deviation:

    > mean(n1)

    [1] 2.000206

Figure 9: Histograms along with boxplots for the three simulated datasets.

    > sd(n1)

    [1] 2.999017

Let us perform the following simple linear transformation on these numbers:

Z = (n1 − x̄)/s

where x̄ and s are the mean and the standard deviation of n1.

    Here is the code:

> z=(n1-mean(n1))/sd(n1)
> mean(z)

    [1] 7.186852e-17

    > sd(z)

    [1] 1

Figure 10: Side-by-side boxplots for two of the distributions of figure 9.

    > hist(z)

Figure 11: Standardized version of n1. Note that the mean is roughly 0, and the standard deviation is 1.

    Verification of the 68% - 95% - 99.7% Rule

To verify the rule, we reconsider the standardized vector z. Next, we count the number of elements of z whose values are between -1 and 1, -2 and 2, and -3 and 3 respectively:

> length(z[z>-1 & z<1])
> length(z[z>-2 & z<2])
> length(z[z>-3 & z<3])
> u=rnorm(10000,3,4)

    > m=mean(u)

    > s=sd(u)

> length(u[u>m-s & u<m+s])
> length(u[u>m-2*s & u<m+2*s])
> length(u[u>m-3*s & u<m+3*s])
> pnorm(1,0,1)-pnorm(-1,0,1)

    [1] 0.6826895

    > pnorm(2,0,1)-pnorm(-2,0,1)

    [1] 0.9544997

    > pnorm(3,0,1)-pnorm(-3,0,1)

    [1] 0.9973002
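The same rule can also be checked with proportions rather than raw counts; here is a quick sketch with freshly simulated data (so the exact figures vary slightly from run to run):

```r
# Checking the 68%-95%-99.7% rule via proportions of a standard normal sample.
set.seed(1)                        # for reproducibility of this sketch
z <- rnorm(100000)                 # standard normal draws
c(mean(z > -1 & z < 1),            # close to 0.683
  mean(z > -2 & z < 2),            # close to 0.954
  mean(z > -3 & z < 3))            # close to 0.997
```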

    Areas Under Normal Distribution: General

So, the command pnorm provides the area below a given point for any normal distribution. Suppose we know that grades in statistics follow a normal distribution with a mean of 78 and a standard deviation of 7. We would like to know where a grade of 83 stands:

    > pnorm(83,78,7)

    [1] 0.7624747

Roughly 76% of all grades are below 83. We can also do the reverse calculation. Suppose that we would like to find the same grade, this time knowing the area below it:

    > qnorm(0.7624747,78,7)

    [1] 83

Assessing Normality: Normal Quantile Plots or QQ-plots

Quantile-Quantile plots are among the most powerful tools for assessing the normality of a data set. The idea is relatively simple. We want to know whether the empirical quantiles of our data match the theoretical quantiles of a standard normal. The data will form a straight line if normality holds. R provides QQ-plots through the command qqnorm. Here is the QQ-plot for the data set n1 (figure 12):

    > qqnorm(n1)

Below, we demonstrate the qqplot of a symmetric, a left-skewed, and a right-skewed distribution. The code shows you how we generated the next figure.

Figure 12: qq-plot for n1.

    > n4=rnorm(1000,3,5)

    > n5=rpois(1000,3)

    > n6=rbeta(1000,8,3)

    > par(mfrow=c(3,2))

    > hist(n4)

    > qqnorm(n4)

    > hist(n5)

    > qqnorm(n5)

    > hist(n6)

    > qqnorm(n6)

    Importing Text-files

    Reminder:

    > grades=read.table("http://math.fullerton.edu/sbehseta/grades.txt",header=T)

Figure 13: The Normal Quantile plots for symmetric and asymmetric distributions.

    > attach(grades)

    > dim(grades)

    [1] 100 3

    > Verbal

    [1] 623 454 643 585 719 693 571 646 613 655 662 585 580 648 405 506 669 558

    [19] 577 487 682 565 552 567 745 610 493 571 682 600 740 593 488 526 630 586

    [37] 610 695 539 490 509 667 597 662 566 597 604 519 643 606 500 460 717 592

    [55] 752 695 610 620 682 524 552 703 584 550 659 585 578 533 532 708 537 635

    [73] 591 552 557 599 540 752 726 630 558 646 643 606 682 565 578 488 361 560

    [91] 630 666 719 669 571 520 571 539 580 629

    > hist(Verbal)

    > summary(Verbal)

    Min. 1st Qu. Median Mean 3rd Qu. Max.

361.0 552.0 592.5 598.5 649.8 752.0

> stem(Verbal)

    The decimal point is 2 digit(s) to the right of the |

    3 | 6

    4 | 1

    4 | 5699999

    5 | 0112223334444

    5 | 55556666777777778888889999999

    6 | 000001111112233334444

    6 | 5556666777788889

    7 | 000122234

    7 | 555

Scatterplots and Pearson's Correlation

    To create scatterplots, the command is simply plot.

    > plot(Verbal,Math)

To calculate the correlation coefficient between any two variables, we can use the cor command:

    > cor(Verbal,Math)

    [1] 0.4306938

    > cor(Verbal,GPA)

    [1] 0.4847681

    > cor(Math,GPA)

    [1] 0.2236183

    Central Limit Theorem

The central limit theorem says that if x ~ D(μ, σ), where D is a probability density or mass function (regardless of the form of the distribution), then when the sample size n is large enough, (x̄ − μ)/(σ/√n) is approximately N(0, 1). When the data come from a Normal distribution with mean μ and standard deviation σ, we can assume (x̄ − μ)/(σ/√n) ~ N(0, 1) for any sample size.

To verify these results, let's look at figure 14, which shows a clearly right-skewed population with μ = 2.029 and σ = 1.382. For 1000 times, we take samples of size 2 from this population, obtaining 1000 sample averages associated with those random samples. Then we repeat this process for samples of sizes 3, 6, 10, 20, and 100, each time keeping track of the mean and the distribution of those 1000 sample averages. We also plot histograms and qq-plots for each scenario. It turns out that as the sample size increases, the distribution of the 1000 sample averages converges to normality (figures 15 and 16). Also, the standard deviations of those sampling distributions get closer to σ/√n (Table 1).

Next, I sample from a normal population with μ = 99.602 and σ = 10.2211 (figure 17). Then I repeat the same procedure for the sample averages. It turns out that regardless of the sample size, the results associated with the central limit theorem hold (figures 18, 19, and Table 2).

Sample Size            2          3          6         10         20        100
Mean              2.0945   2.061667      2.016     2.0301     2.0186    2.02654
Standard Dev.  0.9860487    0.78204  0.5660369   0.453773  0.3010642  0.1300328

Table 1. Results for population 1. The mean and standard deviations of the sample mean with different sample sizes.

Sample Size            2          3          6         10         20        100
Mean            99.95793   99.30017    99.5832    99.5639     99.653   99.57605
Standard Dev.   7.199265    5.83941   4.132356   3.290563   2.240824  0.9427503

Table 2. Results for population 2. The mean and standard deviations of the sample mean with different sample sizes.

Figure 14: Case one: Population distribution. The distribution is skewed to the right.

Figure 15: Sampling Distributions of Sample means for sample sizes 2, 3, 6, 10, 20, 100.

    R-codes For Simulation

Population 1: Right-Skewed Distribution. We can simulate from a Poisson distribution:

    > test1=rpois(1000,2)

    > hist(test1)

    > mean(test1)

    [1] 2.029

    > sd(test1)

    [1] 1.382777

Population 1: Obtaining 1000 Samples of Sizes 2, 3, 6, 10, 20, 100. Here is the case for size 2; the others are similar.

Figure 16: QQ-plots for the sample mean distributions for different sample sizes.

    > test=matrix(nrow=1000,ncol=2)

    > for(i in 1:1000)

+ {
+ test[i,]=sample(test1,2)
+ }

    > mean.size2=apply(test,1,mean)

    > mean(mean.size2)

    [1] 2.0945

    > sd(mean.size2)

    [1] 0.9860487

Figure 17: Case two: Population distribution. The distribution is symmetric.

Population 2: Obtaining 1000 Samples of Sizes 2, 3, 6, 10, 20, 100. Again, only the case for size 2 is included.

> test=matrix(nrow=1000,ncol=2)
> for(i in 1:1000)
+ {
+ test[i,]=sample(test2,2)
+ }
> mean.size2.norm=apply(test,1,mean)

    > mean(mean.size2.norm)

    [1] 99.95793

    > sd(mean.size2.norm)

Figure 18: Sampling Distributions of Sample means for sample sizes 2, 3, 6, 10, 20, 100.

    [1] 7.199265

    Population 1: Plotting Histograms and QQ-plots

    par(mfrow=c(3,2))

    hist(mean.size2)

    hist(mean.size3)

    hist(mean.size6)

    hist(mean.size10)

    hist(mean.size20)

    hist(mean.size100)

Figure 19: QQ-plots for the sample mean distributions for different sample sizes.

    par(mfrow=c(3,2))

    qqnorm(mean.size2)

    qqnorm(mean.size3)

    qqnorm(mean.size6)

    qqnorm(mean.size10)

    qqnorm(mean.size20)

    qqnorm(mean.size100)

SAT Scores Again

    Verbal: t-test and Confidence Interval

    > mean(Verbal)

    [1] 598.49

    > t.test(Verbal,mu=600)

    One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.843
alternative hypothesis: true mean is not equal to 600
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49

    Verbal: Two-sided Versus One-sided Tests

    > t.test(Verbal,mu=600,alternative="two.sided")

    One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.843
alternative hypothesis: true mean is not equal to 600
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49

    > t.test(Verbal,mu=600,alternative="less")

    One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.4215
alternative hypothesis: true mean is less than 600
95 percent confidence interval:
     -Inf 611.1138
sample estimates:
mean of x
   598.49

    > t.test(Verbal,mu=600,alternative="greater")

One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.5785
alternative hypothesis: true mean is greater than 600
95 percent confidence interval:
 585.8662      Inf
sample estimates:
mean of x
   598.49

Verbal: Changing the Hypothesized Mean

    > t.test(Verbal,mu=620)

    One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.00565
alternative hypothesis: true mean is not equal to 620
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49

    > t.test(Verbal,mu=620,alternative="less")

    One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.002825
alternative hypothesis: true mean is less than 620
95 percent confidence interval:
     -Inf 611.1138
sample estimates:
mean of x
   598.49

    > t.test(Verbal,mu=620,alternative="greater")

    One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.9972
alternative hypothesis: true mean is greater than 620

Math: Confidence Interval and t-test

    > t.test(Math)

    One Sample t-test

data:  Math
t = 99.6647, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 641.0874 667.1326
sample estimates:
mean of x
   654.11

    Two-sample t-test for Math and Verbal

> t.test(Math,Verbal,mu=0)

    Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 9.88e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 35.81078 75.42922
sample estimates:
mean of x mean of y
   654.11    598.49

> t.test(Math,Verbal,mu=0,alternative="greater")

    Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 4.94e-08
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 39.02003      Inf
sample estimates:
mean of x mean of y
   654.11    598.49

> t.test(Math,Verbal,mu=0,alternative="less")

Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 1
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 72.21997
sample estimates:
mean of x mean of y
   654.11    598.49

> t.test(Math,Verbal,mu=50)

    Welch Two Sample t-test

data:  Math and Verbal
t = 0.5595, df = 193.867, p-value = 0.5764
alternative hypothesis: true difference in means is not equal to 50
95 percent confidence interval:
 35.81078 75.42922
sample estimates:
mean of x mean of y
   654.11    598.49

Project 2

(1) (Moore and McCabe, 1998) Crop researchers plant 15 plots with a new variety of corn. The yields, in bushels per acre, are:

138.0 139.1 113.0 132.5 140.7 109.7 118.9 134.8 109.6 127.3 115.6 130.4 130.2 111.7 105.5

Assume that the population of yields is normal.

    (a) Find the 90% confidence interval for the mean yield for this variety of corn.

    (b) Find the 95% confidence interval.

    (c) Find the 99% confidence interval.

(d) How does the margin of error (sampling error) in (a), (b), and (c) change as the confidence level increases?
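As a sketch of the mechanics (not the write-up itself), t.test reports a confidence interval at any requested level, so the intervals in (a)-(c) can be obtained and compared directly:

```r
# Corn yields from part (1); t.test gives a t-based confidence interval.
yield <- c(138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.9, 134.8,
           109.6, 127.3, 115.6, 130.4, 130.2, 111.7, 105.5)
ci90 <- t.test(yield, conf.level = 0.90)$conf.int   # 90% interval
ci95 <- t.test(yield, conf.level = 0.95)$conf.int   # 95% interval
ci99 <- t.test(yield, conf.level = 0.99)$conf.int   # 99% interval
# Higher confidence means a wider interval:
diff(ci99) > diff(ci95)   # TRUE
```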

(2) (Moore and McCabe, 1998) The table below gives the pretest and posttest scores on the MLA listening test in Spanish for 20 high school Spanish teachers who attended an intensive summer course in Spanish.

Subject Pretest Posttest   Subject Pretest Posttest
1       30      29         11      30      32
2       28      30         12      29      28
3       31      32         13      31      34
4       26      30         14      29      32
5       20      16         15      34      32
6       30      25         16      20      27
7       34      31         17      26      28
8       15      18         18      25      29
9       28      33         19      31      32
10      20      25         20      29      32

Give a 90% confidence interval for the mean increase in listening score due to attending the summer institute.
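As a hint on the mechanics: in a paired design like this one, the quantity of interest is the per-subject increase (posttest minus pretest), and t.test can handle it either via the differences or via its paired option; both give the same interval:

```r
# Pretest and posttest scores for subjects 1-20 (from the table above):
pre  <- c(30,28,31,26,20,30,34,15,28,20,30,29,31,29,34,20,26,25,31,29)
post <- c(29,30,32,30,16,25,31,18,33,25,32,28,34,32,32,27,28,29,32,32)
# A 90% interval for the mean increase, two equivalent ways:
t.test(post - pre, conf.level = 0.90)$conf.int
t.test(post, pre, paired = TRUE, conf.level = 0.90)$conf.int
```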

(3) Download the dataset grades.txt from the course webpage. Build 95% confidence intervals for math and verbal. Are there overlaps? Interpret your findings.

(4) Bonus: Consider the verbal scores in the grades dataset. First, show that the verbal scores follow a normal distribution. Then, construct a 95% confidence interval for the population mean of the verbal scores. An alternative way of constructing a 95% confidence interval is to use the quantiles of the data: consider (grades_0.025, grades_0.975), where grades_0.025 and grades_0.975 are simply the 2.5% and the 97.5% percentiles of the dataset. Does this confidence interval agree with the confidence interval you constructed before? Why? Note that x̄ = 598.49 and s = 76.029 for grades. Generate 100,000 normal values from Normal(598, 76/√100). Create a quantile confidence interval for the latter

dataset. Does this confidence interval agree with the first confidence interval? Can you think of an explanation for this agreement (disagreement)? Hint: look at the command quantile.