Statistics Workshop 1: Introduction to R. Tuesday May 26, 2009
Assignments
Generally speaking, there are three basic forms of assigning data. The first case is the single atom, or a single number. Assigning a number to an object in this case is quite trivial. All we need is to use <- or = to assign a number or an atom to a name. In the following, > refers to the prompt in R.
The second form is the vector form. In this form, we assign a name to an array of numbers. This can be done with the command c, which stands for concatenation. The interesting fact is that we can call any member of the vector, replace that member with a new member, or perform various arithmetic operations on that vector, as shown below.
Finally, the third form of storing data is to put them in matrix form. The command is matrix. First we input the data set of interest, and then we tell R the dimensionality of the matrix. For example, we can put an array of 9 numbers into a matrix with 3 rows and 3 columns. We demonstrate all of these below.
Atoms, Vectors and Matrices
(a) Atoms:
> sam=2
> sam
[1] 2
> sam+sam
[1] 4
> (2*sam*2)/2
[1] 4
> sam^(1/3)
[1] 1.259921
> sqrt(sam)
[1] 1.414214
> abs(-sam)
[1] 2
(b) Vectors
> class.age=c(35,35,36,37,37,38,38,39,40.5,43,44,44.5,50,19)
> class.age
[1] 35.0 35.0 36.0 37.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0
> class.age[3]
[1] 36
> class.age[1:5]
[1] 35 35 36 37 37
> class.age[-5]
[1] 35.0 35.0 36.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0
> class.age[-c(2,7)]
[1] 35.0 36.0 37.0 37.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0
> class.age*2
[1] 70 70 72 74 74 76 76 78 81 86 88 89 100 38
> sqrt(class.age)
[1] 5.916080 5.916080 6.000000 6.082763 6.082763 6.164414 6.164414 6.244998
[9] 6.363961 6.557439 6.633250 6.670832 7.071068
[14] 4.358899
> class.age^(-1)
[1] 0.02857143 0.02857143 0.02777778 0.02702703 0.02702703 0.02631579
[7] 0.02631579 0.02564103 0.02469136 0.02325581
[11] 0.02272727 0.02247191 0.02000000 0.05263158
> class.age*class.age
[1] 1225.00 1225.00 1296.00 1369.00 1369.00 1444.00 1444.00 1521.00 1640.25
[10] 1849.00 1936.00 1980.25 2500.00  361.00
> class.age^2
[1] 1225.00 1225.00 1296.00 1369.00 1369.00 1444.00 1444.00 1521.00 1640.25
[10] 1849.00 1936.00 1980.25 2500.00  361.00
> mean(class.age)
[1] 38.28571
> median(class.age)
[1] 38
> class.age=(class.age)/2
> class.age
[1] 17.50 17.50 18.00 18.50 18.50 19.00 19.00 19.50 20.25 21.50 22.00 22.25 25.00 9.50
> class.age=class.age*2
Often it is useful to create an empty vector. Here is the way this is done:
> hi=numeric(10)
> hi
[1] 0 0 0 0 0 0 0 0 0 0
Vectors do not have to be numerical. We can create a vector of characters:
> hi=c("hello","whasup","longday")
> hi
[1] "hello" "whasup" "longday"
Later, it becomes useful to ask R the length of a vector:
> length(class.age)
[1] 14
(c) Matrices
> sam = matrix(nrow=3,ncol=4)
> sam
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] NA NA NA NA
> sam = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,byrow=T)
> sam
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
> sam = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3)
> sam
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
> sally =c(1,2,3,4,5,6,7,8,9,10,11,12)
> sam=matrix(sally,nrow=3,byrow=T)
> sam
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
> v1=c(1,2,3,4)
> v2=c(5,6,7,8)
> v3=c(9,10,11,12)
> sam=matrix(c(v1,v2,v3),nrow=3,byrow=T)
> sam
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
> sam[1,]
[1] 1 2 3 4
> sam[,2]
[1] 2 6 10
> sam[1,3]
[1] 3
> sam[3,]=sam[2,]
> sam
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 5 6 7 8
> sam[1,]=log(sam[1,])
> sam
[,1] [,2] [,3] [,4]
[1,] 0 0.6931472 1.098612 1.386294
[2,] 5 6.0000000 7.000000 8.000000
[3,] 5 6.0000000 7.000000 8.000000
(d) Lists
R provides a powerful additional storing structure called list. The importance of list is that we can store various objects of different natures, such as matrices, vectors, or atoms, in a single space, and then call the different parts of that object separately.
Let's assume that we would like to store the following three objects into a list object called s:
> s1=3
> s2=seq(1,10,2)
> s3=matrix(c(1:9),nrow=3)
> s1
[1] 3
> s2
[1] 1 3 5 7 9
> s3
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> s=list(s1,s2,s3)
> s
[[1]]
[1] 3
[[2]]
[1] 1 3 5 7 9
[[3]]
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> s[[1]]
[1] 3
> s[[2]]
[1] 1 3 5 7 9
> s[[3]]
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
The for loop
Oftentimes, it becomes necessary to repeat certain calculations a number of times. This is done in R using a simple command called for. Here are some examples:
> for(i in 1:3)
+ {
+ print("sam")
+ }
[1] "sam"
[1] "sam"
[1] "sam"
> s=matrix(c(1,2,3,4,5,6,7,8,9),nrow=3)
> for(i in 1:3)
+ print(s)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
or:
> for(i in 1:3)
+ { print(s)}
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> for(i in 1:3)
+{
+ print(s[i,])
+ }
[1] 1 4 7
[1] 2 5 8
[1] 3 6 9
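A common pattern combines for with an empty vector created by numeric(): allocate the vector first, then fill it inside the loop. Here is a small sketch (the vector name squares is made up for illustration):

```r
# Allocate an empty vector, then fill each entry inside the loop.
squares = numeric(10)
for (i in 1:10)
{
  squares[i] = i^2
}
squares
# returns 1 4 9 16 25 36 49 64 81 100
```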
Functions
In principle, there are two sorts of functions in R. The most common and useful ones are the library functions, or the already written commands. For example, mean and sd are commands that calculate the average and the standard deviation of an object, say a vector, respectively. Here are a couple of examples:
> s2=seq(1,10,2)
> s2
[1] 1 3 5 7 9
> mean(s2)
[1] 5
> var(s2)
[1] 10
> sd(s2)
[1] 3.162278
> median(s2)
[1] 5
The second type of functions are those that the users of R create. These functions will remain in the command memory of the software unless you delete them or overwrite them. Expectedly, the command to create a function is function.
Here is an example of a function that gets a matrix and calculates the standard deviation divided by the mean of its rows. This measurement is called the coefficient of variation. Note that in writing this function, I use the commands mean and sd.
In general, any time you are not sure what an R command does, or to learn about its specifics, just type a question mark followed by the command at the prompt.
> m.cv(sm3)
[1] 0.75 0.60 0.50
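The definition of m.cv does not survive in this transcript; here is one sketch consistent with the description above, computing the row-wise sd divided by the row-wise mean via apply. The test matrix m below is made up for illustration and is not the sm3 used above:

```r
# A sketch of m.cv: coefficient of variation (sd/mean) of each row of a matrix.
m.cv = function(m)
{
  apply(m, 1, sd) / apply(m, 1, mean)
}

# Made-up test matrix: each row has sd/mean = 0.5.
m = matrix(c(1, 2, 3,
             2, 4, 6), nrow=2, byrow=TRUE)
m.cv(m)
# [1] 0.5 0.5
```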
Visualizing Data: Pie Charts, Stem plots, Histograms
Categorical Data
For categorical data, we keep track of counts or relative frequencies of each group. Therefore, a schematic presentation should reflect the percentage of occurrences in each category. This is usually done via two types of graphs: 1- Pie charts, and 2- Barplots. Both graphs are easy to create in R. An important issue here is that in most cases, it would make sense to label the categories. We show you how to do this below.
Example 1. The counts and the percentages of the marital status of American women were collected by the Current Population Survey in 1995 as follows:
Marital Status    Count (millions)    Percent
Never Married           43.9            22.9
Married                116.7            60.9
Widowed                 13.4             7.0
Divorced                17.6             9.2
Here are the commands to provide the pie-chart for these data (figure 1):
> married=c(43.9,116.7,13.4,17.6)
> married.code=c(1,2,3,4)
> pie(married,married.code)
Alternatively, we could label each piece of the pie by creating a vector that contains the names of each piece (figure 2):
> married.code=c("never married","married","widowed","divorced")
> pie(married,married.code)
To create a barplot for the married data, it is sufficient to execute the following function (figure 3):
> barplot(married,names.arg=married.code)
Figure 1: Pie chart for the Married data.
Stemplots and Histograms
For quantitative data, stemplots and histograms are the useful visual tools.
Example 2. Let's revisit the class-age data we introduced previously. To create the stemplot for these data, we can do the following:
> class.age=c(35,35,36,37,37,38,38,39,40.5,43,44,44.5,50,19)
> class.age
[1] 35.0 35.0 36.0 37.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0
> stem(class.age)
The decimal point is 1 digit(s) to the right of the |
1 | 9
2 |
3 | 55677889
4 | 1345
Figure 2: Pie chart for the Married data with labels.
5 | 0
> stem(class.age,scale=2)
The decimal point is 1 digit(s) to the right of the |
1 | 9
2 |
2 |
3 |
3 | 55677889
4 | 134
4 | 5
5 | 0
> stem(test)
Figure 3: Barplot for the Married data with labels.
The decimal point is 1 digit(s) to the left of the |
0 | 01
2 | 21466
4 | 055
6 | 5
> stem(test,scale=2)
The decimal point is 1 digit(s) to the left of the |
0 | 01
1 |
2 | 2
3 | 1466
4 | 05
5 | 5
6 | 5
To create a histogram for the class-age data, it is sufficient to use the hist command (figure 4):
> hist(class.age)
Figure 4: The histogram of class.age.
We can make the bars finer. Here is a simple trick (figure 5):
> b1=seq(15,48,3)
> b1
[1] 15 18 21 24 27 30 33 36 39 42 45 48
> hist(class.age,breaks=b1)
Figure 5: The histogram of class.age with finer classes.
Measuring Center: The Mean, The Median, and the Quartiles
The measures of centrality play a fundamental role in understanding statistical distributions. The most important ones are the mean, the median, and the other quantiles.
> mean(class.age)
[1] 38.28571
> median(class.age)
[1] 38
> quantile(class.age)
0% 25% 50% 75% 100%
19.000 36.250 38.000 42.375 50.000
> quantile(class.age,prob=0.66)
15
66%
39.87
Comparing Mean and Median
The Symmetric Case
For symmetric distributions such as the one in figure 6, the median and the mean are close to each other.
Figure 6: A symmetric distribution. Mean= 100.07, Median= 99.71.
For distributions that are skewed to the left, such as the one in figure 7, the mean is smaller than the median (why?).
Finally, for the right-skewed distributions, the mean is larger than the median (figure 8).
Measuring Spread: The Standard Deviation
The standard deviation measures the typical distance of the observations from the mean. For example, the standard deviations for the data in figures 6, 7, and 8 are 10.03, 0.19, and 0.19, respectively.
To calculate the variance and the standard deviation for the class.age data, we can type:
Figure 7: Left-skewed Distribution. Mean= 0.75 , Median= 0.79.
> var(class.age)
[1] 49.1044
> sd(class.age)
[1] 7.007453
Project 1. Visualizing Grades.
First, read the file grades.txt from the webpage. To do this, run the following code in R:
> grades=read.table("http://math.fullerton.edu/sbehseta/grades.txt",header=T)
This will generate a data frame of size 100 × 3 called grades in R for you. Rows are students, and columns represent the Verbal SAT score, the Math SAT score, and the GPA for each student. To examine the dimensionality of this object you can type:
> dim(grades)
[1] 100 3
which confirms what we planned initially. Now, we are in a position to answer the following questions:
Figure 8: Right-skewed Distribution. Mean= 0.24, Median= 0.19.
(1.) Create barplots, dotplots, stemplots, and histograms for the three variables of interest. To make life easier, use the command attach to make the data file grades your working data. Then proceed by just typing the name of the column of interest. Here is what I mean:
> attach(grades)
> GPA
[1] 2.6 2.3 2.4 3.0 3.1 2.9 3.1 3.3 2.3 3.3 2.6 3.3 2.0 3.0 1.9 2.7 2.0 3.3
[19] 2.0 2.3 3.3 2.8 1.7 2.4 3.4 2.8 2.4 1.9 2.5 2.3 3.4 2.8 1.9 3.0 3.7 2.3
[37] 2.9 3.3 2.1 1.2 3.3 2.0 3.1 2.6 2.4 2.4 2.3 3.0 2.9 3.4 2.3 1.4 2.8 2.4
[55] 3.4 2.5 3.6 2.6 3.6 2.9 2.6 3.8 3.0 2.5 3.5 2.0 3.0 2.0 1.8 2.3 2.1 3.0
[73] 3.3 3.0 3.2 2.3 3.3 3.3 3.9 2.1 2.6 2.4 3.3 3.1 3.6 2.9 2.4 1.8 2.4 2.9
[91] 3.5 3.4 2.3 2.9 1.8 2.8 2.3 2.5 2.4 2.9
(2.) Calculate the min, the max, the mean, the median, the first quartile, the third quartile, and the standard deviation of each variable. A good chunk of that information may be obtained through the command summary:
> summary(GPA)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.200 2.300 2.750 2.706 3.125 3.900
(3.) Report your findings in detail. Compare the verbal scores with the math scores. Comment on the symmetry, measures of centrality, measures of spread, and the potential outliers in each distribution. Make sure to comment on the statistical features of the GPA as well.
Boxplots
Boxplots are efficient tools for representing data distributions. The five number summary can be traced on a boxplot. Additionally, we can identify outliers with boxplots.
Remember the three distributions in figures 6, 7 and 8. Note that these distributions are symmetric, left skewed and right skewed, respectively. Here is how I created figure 9 below:
> n1=rnorm(100000,2,3)
> n2=rpois(100000,3)
> n3=rbeta(100000,12,3)
> par(mfrow=c(3,2))
> hist(n1)
> boxplot(n1)
> hist(n2)
> boxplot(n2)
> hist(n3)
> boxplot(n3)
Note that the command par(mfrow) creates a 3 by 2 grid in the graphic area. Side-by-side boxplots are very helpful in comparing two or more distributions. For example, figure 10 shows side-by-side boxplots for two of the distributions of figure 9.
> boxplot(n1,n2)
Linear Transformations and Their Effect on x̄ and s: Standardization
To experiment with the idea of linear transformations, let's go back to the first dataset n1 and calculate its mean and standard deviation:
> mean(n1)
[1] 2.000206
Figure 9: Histograms along with boxplots for the three simulated datasets.
> sd(n1)
[1] 2.999017
Let us perform the following simple linear transformation on these numbers:

    z = (n1 - mean(n1)) / s

where s is the standard deviation of n1.
Here is the code:
> z=(n1-mean(n1))/sd(n1)
> mean(z)
[1] 7.186852e-17
> sd(z)
[1] 1
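Base R's scale() function performs the same centering and rescaling, which gives a quick way to check the transformation; the small vector x below is made up for illustration:

```r
# scale() centers by the mean and divides by the sd,
# the same as z = (x - mean(x)) / sd(x).
x = c(2, 4, 6, 8)
z1 = (x - mean(x)) / sd(x)
z2 = as.vector(scale(x))   # scale() returns a matrix; drop its attributes
all.equal(z1, z2)
# [1] TRUE
```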
Figure 10: Side-by-side boxplots for two of the distributions of figure 9.
> hist(z)
Figure 11: Standardized version of n1. Note that the mean is roughly 0, and the standard deviation is 1.
Verification of the 68% - 95% - 99.7% Rule
To verify the rule, we reconsider the standardized vector z. Next, we count the number of elements of z whose values are between -1 and 1, -2 and 2, and -3 and 3, respectively:
> length(z[z>-1 & z<1])
> length(z[z>-2 & z<2])
> length(z[z>-3 & z<3])

The rule can also be checked on a normal sample with a different mean and standard deviation:

> u=rnorm(10000,3,4)
> m=mean(u)
> s=sd(u)
22
> length(u[u>m-s & u<m+s])
> length(u[u>m-2*s & u<m+2*s])
> length(u[u>m-3*s & u<m+3*s])

The corresponding areas can be computed exactly for the standard normal with pnorm:

> pnorm(1,0,1)-pnorm(-1,0,1)
[1] 0.6826895
> pnorm(2,0,1)-pnorm(-2,0,1)
[1] 0.9544997
> pnorm(3,0,1)-pnorm(-3,0,1)
[1] 0.9973002
Areas Under the Normal Distribution: General
So, the command pnorm provides the area below a given point for any normal distribution. Suppose we know that grades in statistics follow a normal distribution with a mean of 78 and a standard deviation of 7. We would like to know where a grade of 83 stands:
> pnorm(83,78,7)
[1] 0.7624747
Roughly, 76% of all grades are below 83. We can also do the reverse calculation. Suppose that we would like to find the same grade, this time knowing the area below it:
> qnorm(0.7624747,78,7)
[1] 83
Assessing Normality: Normal Quantile Plots or QQ-plots
Quantile-Quantile plots are among the most powerful tools for assessing the normality of a data set. The idea is relatively simple. We want to know whether the empirical quantiles of our data match the theoretical quantiles of a standard normal. The data will form a straight line if normality holds. R provides QQ-plots through the command qqnorm. Here is the QQ-plot for n1 (figure 12):
> qqnorm(n1)
Below, we demonstrate the QQ-plots of a symmetric, a left-skewed, and a right-skewed distribution. The code shows how we generated the next figure.
Figure 12: qq-plot for n1.
> n4=rnorm(1000,3,5)
> n5=rpois(1000,3)
> n6=rbeta(1000,8,3)
> par(mfrow=c(3,2))
> hist(n4)
> qqnorm(n4)
> hist(n5)
> qqnorm(n5)
> hist(n6)
> qqnorm(n6)
Importing Text-files
Reminder:
> grades=read.table("http://math.fullerton.edu/sbehseta/grades.txt",header=T)
Figure 13: The Normal Quantile plots for symmetric and asymmetric distributions.
> attach(grades)
> dim(grades)
[1] 100 3
> Verbal
[1] 623 454 643 585 719 693 571 646 613 655 662 585 580 648 405 506 669 558
[19] 577 487 682 565 552 567 745 610 493 571 682 600 740 593 488 526 630 586
[37] 610 695 539 490 509 667 597 662 566 597 604 519 643 606 500 460 717 592
[55] 752 695 610 620 682 524 552 703 584 550 659 585 578 533 532 708 537 635
[73] 591 552 557 599 540 752 726 630 558 646 643 606 682 565 578 488 361 560
[91] 630 666 719 669 571 520 571 539 580 629
> hist(Verbal)
> summary(Verbal)
Min. 1st Qu. Median Mean 3rd Qu. Max.
361.0 552.0 592.5 598.5 649.8 752.0
> stem(Verbal)
The decimal point is 2 digit(s) to the right of the |
3 | 6
4 | 1
4 | 5699999
5 | 0112223334444
5 | 55556666777777778888889999999
6 | 000001111112233334444
6 | 5556666777788889
7 | 000122234
7 | 555
Scatterplots and Pearson's Correlation
To create scatterplots, the command is simply plot.
> plot(Verbal,Math)
To calculate the correlation coefficient between any two random variables, we can use the cor command:
> cor(Verbal,Math)
[1] 0.4306938
> cor(Verbal,GPA)
[1] 0.4847681
> cor(Math,GPA)
[1] 0.2236183
Central Limit Theorem
The central limit theorem says that if x ~ D(μ, σ), where D is a probability density or mass function (regardless of the form of the distribution), then when the sample size n is large enough, (x̄ - μ)/(σ/√n) is approximately N(0, 1). When the data come from a Normal distribution with mean μ and standard deviation σ, we have (x̄ - μ)/(σ/√n) ~ N(0, 1) exactly.

To verify these results, let's look at figure 14, which shows a clearly right-skewed population with μ = 2.029 and σ = 1.382. For 1000 times, we take samples of size 2 from this population, and we obtain the 1000 sample averages associated with those random samples. Then we repeat this process for samples of sizes 3, 6, 10, 20, 100, and each time we keep track of the mean and the distribution of those 1000 sample averages. We also plot histograms and qq-plots for each scenario. It turns out that as the sample size increases, the distribution of the 1000 sample averages converges to normality (figures 15 and 16). Also, the standard deviations of those sampling distributions get closer to σ/√n (table 1).

Next, I sample from a normal population with μ = 99.602 and σ = 10.2211 (figure 17). Then I repeat the same procedure for the sample averages. It turns out that regardless of sample size, the results associated with the central limit theorem hold (figures 18, 19, and table 2).
Sample Size              2          3          6         10         20        100
Mean                 2.0945   2.061667      2.016     2.0301     2.0186    2.02654
Standard Deviation 0.9860487   0.78204  0.5660369   0.453773  0.3010642  0.1300328

Table 1. Results for population 1. The means and standard deviations of the sample mean for different sample sizes.
Sample Size              2          3          6         10         20        100
Mean               99.95793   99.30017    99.5832    99.5639     99.653   99.57605
Standard Deviation 7.199265    5.83941   4.132356   3.290563   2.240824  0.9427503

Table 2. Results for population 2. The means and standard deviations of the sample mean for different sample sizes.
Figure 14: Case one: Population distribution. The distribution is skewed to the right.
Figure 15: Sampling Distributions of Sample means for sample sizes 2, 3, 6, 10, 20, 100.
R-codes For Simulation
Population 1: Right Skewed Distribution. We can simulate from a Poisson distribution:
> test1=rpois(1000,2)
> hist(test1)
> mean(test1)
[1] 2.029
> sd(test1)
[1] 1.382777
Population 1: Obtaining 1000 Samples With Size 2, 3, 6, 10, 20, 100. Here is the case for size 2. The others are similar.
Figure 16: QQ-plots for the sample mean distributions for different sample sizes.
> test=matrix(nrow=1000,ncol=2)
> for(i in 1:1000)
+ {
+ test[i,]=sample(test1,2)
+ }
> mean.size2=apply(test,1,mean)
> mean(mean.size2)
[1] 2.0945
> sd(mean.size2)
[1] 0.9860487
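The simulated standard deviation 0.986 can be compared with the CLT prediction σ/√n, using the population standard deviation reported above (1.382777):

```r
# CLT prediction for the sd of the sample mean with n = 2:
sigma = 1.382777            # sd of the population test1, from above
sigma / sqrt(2)             # about 0.978, close to the simulated 0.986
```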
Figure 17: Case two: Population distribution. The distribution is symmetric.
Population 2: Obtaining 1000 Samples With Size 2, 3, 6, 10, 20, 100. Again, only the case for size 2 is included.
test=matrix(nrow=1000,ncol=2)
for(i in 1:1000)
{
test[i,]=sample(test2,2)
}
mean.size2.norm=apply(test,1,mean)
> mean(mean.size2.norm)
[1] 99.95793
> sd(mean.size2.norm)
Figure 18: Sampling Distributions of Sample means for sample sizes 2, 3, 6, 10, 20, 100.
[1] 7.199265
Population 1: Plotting Histograms and QQ-plots
par(mfrow=c(3,2))
hist(mean.size2)
hist(mean.size3)
hist(mean.size6)
hist(mean.size10)
hist(mean.size20)
hist(mean.size100)
Figure 19: QQ-plots for the sample mean distributions for different sample sizes.
par(mfrow=c(3,2))
qqnorm(mean.size2)
qqnorm(mean.size3)
qqnorm(mean.size6)
qqnorm(mean.size10)
qqnorm(mean.size20)
qqnorm(mean.size100)
SAT Scores Again
Verbal: t-test and Confidence Interval
> mean(Verbal)
[1] 598.49
> t.test(Verbal,mu=600)

        One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.843
alternative hypothesis: true mean is not equal to 600
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49
Verbal: Two-sided Versus One-sided Tests
> t.test(Verbal,mu=600,alternative="two.sided")

        One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.843
alternative hypothesis: true mean is not equal to 600
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49
> t.test(Verbal,mu=600,alternative="less")

        One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.4215
alternative hypothesis: true mean is less than 600
95 percent confidence interval:
     -Inf 611.1138
sample estimates:
mean of x
   598.49
> t.test(Verbal,mu=600,alternative="greater")

        One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.5785
alternative hypothesis: true mean is greater than 600
95 percent confidence interval:
 585.8662      Inf
sample estimates:
mean of x
   598.49
Verbal: Changing μ
> t.test(Verbal,mu=620)

        One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.00565
alternative hypothesis: true mean is not equal to 620
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49
> t.test(Verbal,mu=620,alternative="less")

        One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.002825
alternative hypothesis: true mean is less than 620
95 percent confidence interval:
     -Inf 611.1138
sample estimates:
mean of x
   598.49
> t.test(Verbal,mu=620,alternative="greater")

        One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.9972
alternative hypothesis: true mean is greater than 620
Math: Confidence Interval and t-test
> t.test(Math)

        One Sample t-test

data:  Math
t = 99.6647, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 641.0874 667.1326
sample estimates:
mean of x
   654.11
Two-sample t-test for Math and Verbal
> t.test(Math,Verbal,mu=0)

        Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 9.88e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 35.81078 75.42922
sample estimates:
mean of x mean of y
   654.11    598.49
> t.test(Math,Verbal,mu=0,alternative="greater")

        Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 4.94e-08
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 39.02003      Inf
sample estimates:
mean of x mean of y
   654.11    598.49
> t.test(Math,Verbal,mu=0,alternative="less")

        Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 1
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 72.21997
sample estimates:
mean of x mean of y
   654.11    598.49
> t.test(Math,Verbal,mu=50)

        Welch Two Sample t-test

data:  Math and Verbal
t = 0.5595, df = 193.867, p-value = 0.5764
alternative hypothesis: true difference in means is not equal to 50
95 percent confidence interval:
 35.81078 75.42922
sample estimates:
mean of x mean of y
   654.11    598.49
Project 2
(1) (Moore and McCabe, 1998) Crop researchers plant 15 plots with a new variety of corn. The yields in bushels per acre are:

138.0 139.1 113.0 132.5 140.7 109.7 118.9 134.8
109.6 127.3 115.6 130.4 130.2 111.7 105.5

Assume that the population of yields is normal.
(a) Find the 90% confidence interval for the mean yield for this variety of corn.
(b) Find the 95% confidence interval.
(c) Find the 99% confidence interval.
(d) How does the margin of error (sampling error) in (a), (b), and (c) change as the confidence level increases?
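Parts (a) through (c) can be obtained from t.test by varying its conf.level argument; here is a sketch for the 90% interval (the same call with conf.level=0.95 or 0.99 gives the others):

```r
# Yields from the problem statement; conf.level controls the interval width.
yield = c(138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.9, 134.8,
          109.6, 127.3, 115.6, 130.4, 130.2, 111.7, 105.5)
t.test(yield, conf.level=0.90)$conf.int
```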
(2) (Moore and McCabe, 1998) The table below gives the pretest and posttest scores on an MLA listening test in Spanish for 20 high school Spanish teachers who attended an intensive summer course in Spanish.

Subject  Pretest  Posttest    Subject  Pretest  Posttest
   1       30       29          11       30       32
   2       28       30          12       29       28
   3       31       32          13       31       34
   4       26       30          14       29       32
   5       20       16          15       34       32
   6       30       25          16       20       27
   7       34       31          17       26       28
   8       15       18          18       25       29
   9       28       33          19       31       32
  10       20       25          20       29       32
Give a 90% confidence interval for the mean increase in listening score due to attendingthe summer institute.
(3) Download the dataset grades.txt from the course webpage. Build 95% confidence intervals for Math and Verbal. Are there overlaps? Interpret your findings.
(4) Bonus: Consider the verbal scores in the grades dataset. First, show that the verbal scores follow a normal distribution. Then, construct a 95% confidence interval for the population mean of the verbal scores.

An alternative way of constructing a 95% confidence interval is to use the quantiles of the data: consider (grades_0.025, grades_0.975), where grades_0.025 and grades_0.975 are simply the 2.5% and the 97.5% percentiles of the dataset. Does this confidence interval agree with the confidence interval you constructed before? Why?

Note that x̄ = 598.49 and s = 76.029 for the verbal scores. Generate 100,000 normal values from Normal(598, 76/√100). Create a quantile confidence interval for the latter dataset. Does this confidence interval agree with the first confidence interval? Can you think of an explanation for this agreement (disagreement)? Hint: look at the command quantile.
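The simulation described in the hint can be sketched as follows; quantile() with probs=c(0.025, 0.975) gives the quantile-based interval:

```r
# Draw 100,000 values from Normal(598, 76/sqrt(100)), as the hint suggests,
# then take the 2.5% and 97.5% sample quantiles as an interval.
sim = rnorm(100000, mean=598, sd=76/sqrt(100))
quantile(sim, probs=c(0.025, 0.975))
```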