Statistics Workshop 1: Introduction to R. Tuesday May 26, 2009

    Assignments

Generally speaking, there are three basic forms of assigning data. Case one is the single atom, or a single number. Assigning a number to an object in this case is quite trivial. All we need is to use <- or = to assign a number or an atom to a name. In the following, > refers to the prompt in R.

The second form is the vector form. In this form, we assign a name to an array of numbers. This can be done with the command c, which stands for concatenation. The interesting fact is that we can call any member of the vector, replace that member with a new one, or perform various arithmetic operations on the vector, as shown below.

Finally, the third form of storing data is to put them in matrix form. The command is matrix. First we input the data set of interest, then tell R the dimensionality of the matrix. For example, we can put an array of 9 numbers into a matrix with 3 rows and 3 columns. We demonstrate all of these below.

    Atoms, Vectors and Matrices

    (a) Atoms:

    > sam=2

    > sam

    [1] 2

    > sam+sam

    [1] 4

    > (2*sam*2)/2

    [1] 4

    > sam^(1/3)

    [1] 1.259921

    > sqrt(sam)

    [1] 1.414214

    > abs(-sam)

    [1] 2

(b) Vectors

    > class.age=c(35,35,36,37,37,38,38,39,40.5,43,44,44.5,50,19)

    > class.age

    [1] 35.0 35.0 36.0 37.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0

    > class.age[3]

    [1] 36

    > class.age[1:5]

    [1] 35 35 36 37 37

    > class.age[-5]

    [1] 35.0 35.0 36.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0

    > class.age[-c(2,7)]

    [1] 35.0 36.0 37.0 37.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0

    > class.age*2

    [1] 70 70 72 74 74 76 76 78 81 86 88 89 100 38

    > sqrt(class.age)

[1] 5.916080 5.916080 6.000000 6.082763 6.082763 6.164414 6.164414 6.244998
[9] 6.363961 6.557439 6.633250 6.670832 7.071068
[14] 4.358899

    > class.age^(-1)

[1] 0.02857143 0.02857143 0.02777778 0.02702703 0.02702703 0.02631579
[7] 0.02631579 0.02564103 0.02469136 0.02325581
[11] 0.02272727 0.02247191 0.02000000 0.05263158

    > class.age*class.age

[1] 1225.00 1225.00 1296.00 1369.00 1369.00 1444.00 1444.00 1521.00 1640.25
[10] 1849.00 1936.00 1980.25 2500.00  361.00

    > class.age^2

[1] 1225.00 1225.00 1296.00 1369.00 1369.00 1444.00 1444.00 1521.00 1640.25
[10] 1849.00 1936.00 1980.25 2500.00  361.00

    > mean(class.age)

[1] 38.28571

    > median(class.age)

    [1] 38

    > class.age=(class.age)/2

    > class.age

    [1] 17.50 17.50 18.00 18.50 18.50 19.00 19.00 19.50 20.25 21.50 22.00 22.25 25.00 9.50

    > class.age=class.age*2

    Often it is useful to create an empty vector. Here is the way this is done:

    > hi=numeric(10)

    > hi

    [1] 0 0 0 0 0 0 0 0 0 0

    Vectors do not have to be numerical. We can create a vector of characters:

    > hi=c("hello","whasup","longday")

    > hi

    [1] "hello" "whasup" "longday"

    Later, it becomes useful to ask R the length of a vector:

    > length(class.age)

    [1] 14

    (c) Matrices

    > sam = matrix(nrow=3,ncol=4)

    > sam

    [,1] [,2] [,3] [,4]

    [1,] NA NA NA NA

    [2,] NA NA NA NA

    [3,] NA NA NA NA

    > sam = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,byrow=T)

> sam

    [,1] [,2] [,3] [,4]

    [1,] 1 2 3 4

    [2,] 5 6 7 8

    [3,] 9 10 11 12

> sam=matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,byrow=F)
> sam

    [,1] [,2] [,3] [,4]

    [1,] 1 4 7 10

    [2,] 2 5 8 11

    [3,] 3 6 9 12

    > sally =c(1,2,3,4,5,6,7,8,9,10,11,12)

    > sam=matrix(sally,nrow=3,byrow=T)

    > sam

    [,1] [,2] [,3] [,4]

    [1,] 1 2 3 4

    [2,] 5 6 7 8

    [3,] 9 10 11 12

    > v1=c(1,2,3,4)

    > v2=c(5,6,7,8)

    > v3=c(9,10,11,12)

    > sam=matrix(c(v1,v2,v3),nrow=3,byrow=T)

    > sam

    [,1] [,2] [,3] [,4]

    [1,] 1 2 3 4

    [2,] 5 6 7 8

    [3,] 9 10 11 12

> sam[1,]

    [1] 1 2 3 4

    > sam[,2]

    [1] 2 6 10

    > sam[1,3]

    [1] 3

> sam[3,]=sam[2,]
> sam

    [,1] [,2] [,3] [,4]

    [1,] 1 2 3 4

    [2,] 5 6 7 8

    [3,] 5 6 7 8

> sam[1,]=log(sam[1,])
> sam

    [,1] [,2] [,3] [,4]

    [1,] 0 0.6931472 1.098612 1.386294

    [2,] 5 6.0000000 7.000000 8.000000

    [3,] 5 6.0000000 7.000000 8.000000

(d) Lists

R provides a powerful additional storage structure called a list. The importance of a list is that we can store objects of different natures, such as matrices, vectors, or atoms, in a single object, and then call the different parts of that object separately.

Let's assume that we would like to store the following three objects in a list object called s:

    > s1=3

    > s2=seq(1,10,2)

    > s3=matrix(c(1:9),nrow=3)

    > s1

    [1] 3

    > s2

[1] 1 3 5 7 9

    > s3

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

> s=list(s1,s2,s3)
> s

    [[1]]

    [1] 3

    [[2]]

    [1] 1 3 5 7 9

    [[3]]

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    > s[[1]]

    [1] 3

    > s[[2]]

    [1] 1 3 5 7 9

    > s[[3]]

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    The for loop

Oftentimes, it becomes necessary to repeat certain calculations a number of times. This is done in R using a simple command called for. Here are some examples:

> for(i in 1:3)

    + {

    + print("sam")

    + }

    [1] "sam"

    [1] "sam"

    [1] "sam"

> s=matrix(c(1,2,3,4,5,6,7,8,9),nrow=3)

    > for(i in 1:3)

    + print(s)

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    or:

    > for(i in 1:3)

    + { print(s)}

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    [3,] 3 6 9

> for(i in 1:3)

    +{

    + print(s[i,])

    + }

    [1] 1 4 7

    [1] 2 5 8

    [1] 3 6 9

    Functions

In principle, there are two sorts of functions in R. The most common and useful ones are the library functions, i.e. the already-written commands. For example, mean and sd are commands that calculate the average and the standard deviation of an object, say a vector, respectively. Here are a couple of examples:

> s2=seq(1,10,2)
> s2

    [1] 1 3 5 7 9

    > mean(s2)

    [1] 5

    > var(s2)

    [1] 10

    > sd(s2)

    [1] 3.162278

    > median(s2)

    [1] 5

The second type of functions are those that the users of R create. These functions will remain in the command memory of the software unless you delete them or overwrite them. Expectedly, the command to create a function is function.
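As a warm-up, here is a minimal sketch of a user-defined function (the name cube is illustrative, not part of the workshop):

```r
# A minimal user-defined function (illustrative example):
cube <- function(x)
{
  x^3        # the value of the last expression is what the function returns
}
cube(2)      # returns 8
```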

Here is an example of a function that gets a matrix and calculates the standard deviation divided by the mean of its rows. This measurement is called the coefficient of variation. Note that in writing this function, I use the commands mean and sd.

In general, any time you are not sure what an R command does, or want to learn about its specifics, just type a question mark followed by the command at the prompt.
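For example (using mean, though any command name works here):

```r
# "?" and help() perform the same documentation lookup; assigning the result
# avoids opening the help pager, which is convenient inside scripts.
h <- help("mean")      # equivalent to typing ?mean at the prompt
length(h) >= 1         # TRUE when a help page for "mean" was found
```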

> m.cv=function(m){apply(m,1,sd)/apply(m,1,mean)}
> m.cv(sm3)

    [1] 0.75 0.60 0.50

Visualizing Data: Pie Charts, Stem plots, Histograms

    Categorical Data

For categorical data, we keep track of counts or relative frequencies of each group. Therefore, a schematic presentation should reflect the percentage of occurrences in each category. This is usually done via two types of graphs: 1- Pie charts, and 2- Barplots. Both graphs are easy to create in R. An important issue here is that in most cases, it would make sense to label the categories. We show you how to do this below.

Example 1. The counts and the percentages of the marital status of American women were collected by the Current Population Survey in 1995, as follows:

Marital Status    Count (millions)    Percent
Never Married     43.9                22.9
Married           116.7               60.9
Widowed           13.4                7.0
Divorced          17.6                9.2

    Here are the commands to provide the pie-chart for these data (figure 1):

> married=c(43.9,116.7,13.4,17.6)
> pie(married)

Alternatively, we could label each piece of the pie by creating a vector that contains the name of each piece (figure 2):

> married.code=c("never married","married","widowed","divorced")
> pie(married,married.code)

To create a barplot for the married data, it is sufficient to execute the following (figure 3):

> barplot(married,names.arg=married.code)

Figure 1: Pie chart for the Married data.

    Stemplots and Histograms

For quantitative data, stemplots and histograms are the useful visual tools.

Example 2. Let's revisit the class-age data we introduced previously. To create the stemplot for these data, we can do the following:

> class.age=c(35,35,36,37,37,38,38,39,40.5,43,44,44.5,50,19)
> class.age

    [1] 35.0 35.0 36.0 37.0 37.0 38.0 38.0 39.0 40.5 43.0 44.0 44.5 50.0 19.0

    > stem(class.age)

    The decimal point is 1 digit(s) to the right of the |

    1 | 9

    2 |

    3 | 55677889

    4 | 1345

Figure 2: Pie chart for the Married data with labels.

    5 | 0

    > stem(class.age,scale=2)

    The decimal point is 1 digit(s) to the right of the |

    1 | 9

    2 |

    2 |

    3 |

    3 | 55677889

    4 | 134

    4 | 5

    5 | 0

> test=c(0.00,0.01,0.22,0.31,0.34,0.36,0.36,0.40,0.45,0.55,0.65)
> stem(test)

Figure 3: Barplot for the Married data with labels.

    The decimal point is 1 digit(s) to the left of the |

    0 | 01

    2 | 21466

    4 | 055

    6 | 5

    > stem(test,scale=2)

    The decimal point is 1 digit(s) to the left of the |

    0 | 01

    1 |

    2 | 2

    3 | 1466

    4 | 05

    5 | 5

    6 | 5

To create a histogram for the class-age data, it is sufficient to use the hist command (figure 4):

    > hist(class.age)

Figure 4: The histogram of class.age.

    We can make the bars finer. Here is a simple trick (figure 5):

> b1=seq(15,48,3)
> b1

[1] 15 18 21 24 27 30 33 36 39 42 45 48

> b1=c(b1,51)
> hist(class.age,breaks=b1)

Figure 5: The histogram of class.age with finer classes.

Measuring Center: The Mean, The Median, and the Quartiles

The measures of centrality play a fundamental role in understanding statistical distributions. The most important ones are the mean, the median, and the other quantiles.

    > mean(class.age)

    [1] 38.28571

    > median(class.age)

    [1] 38

    > quantile(class.age)

    0% 25% 50% 75% 100%

    19.000 36.250 38.000 42.375 50.000

    > quantile(class.age,prob=0.66)

66%

    39.87

    Comparing Mean and Median

    The Symmetric Case

For symmetric distributions, such as the one in figure 6, the median and the mean are close to each other.

Figure 6: A symmetric distribution. Mean = 100.07, Median = 99.71.

For distributions that are skewed to the left, such as the one in figure 7, the mean is smaller than the median (why?).

    Finally, for the right-skewed distributions, the mean is larger than the median (figure 8).

    Measuring Spread: The Standard Deviation

The standard deviation reflects the typical number of units by which the observations deviate from the mean. For example, the standard deviations for the data in figures 6, 7, and 8 are 10.03, 0.19, and 0.19 respectively.

    To calculate the variance and the standard deviation for the class.age data, we can type:

Figure 7: Left-skewed Distribution. Mean = 0.75, Median = 0.79.

    > var(class.age)

    [1] 49.1044

    > sd(class.age)

    [1] 7.007453

    Project 1. Visualizing Grades.

    First, read the file grades.txt from the webpage. To do this, run the following code in R:

    > grades=read.table("http://math.fullerton.edu/sbehseta/grades.txt",header=T)

This will generate a data frame of size 100 × 3 called grades in R for you. Rows are students, and columns represent the Verbal SAT score, the Math SAT score, and the GPA for each student. To examine the dimensionality of this object you can type:

    > dim(grades)

    [1] 100 3

This confirms what we planned initially. Now, we are in a position to answer the following questions:

Figure 8: Right-skewed Distribution. Mean = 0.24, Median = 0.19.

(1.) Create barplots, dotplots, stemplots, and histograms for the three variables of interest. To make life easier, use the command attach to make the data file grades your working data. Then proceed by just typing the name of the column of interest. Here is what I mean:

    > attach(grades)

    > GPA

    [1] 2.6 2.3 2.4 3.0 3.1 2.9 3.1 3.3 2.3 3.3 2.6 3.3 2.0 3.0 1.9 2.7 2.0 3.3

    [19] 2.0 2.3 3.3 2.8 1.7 2.4 3.4 2.8 2.4 1.9 2.5 2.3 3.4 2.8 1.9 3.0 3.7 2.3

    [37] 2.9 3.3 2.1 1.2 3.3 2.0 3.1 2.6 2.4 2.4 2.3 3.0 2.9 3.4 2.3 1.4 2.8 2.4

    [55] 3.4 2.5 3.6 2.6 3.6 2.9 2.6 3.8 3.0 2.5 3.5 2.0 3.0 2.0 1.8 2.3 2.1 3.0

    [73] 3.3 3.0 3.2 2.3 3.3 3.3 3.9 2.1 2.6 2.4 3.3 3.1 3.6 2.9 2.4 1.8 2.4 2.9

    [91] 3.5 3.4 2.3 2.9 1.8 2.8 2.3 2.5 2.4 2.9

(2.) Calculate the min, the max, the mean, the median, the first quartile, the third quartile, and the standard deviation of each variable. A good chunk of that information may be obtained with the command summary:

    > summary(GPA)

Min. 1st Qu. Median Mean 3rd Qu. Max.

    1.200 2.300 2.750 2.706 3.125 3.900

(3.) Report your findings in detail. Compare the verbal scores with the math scores. Comment on the symmetry, measures of centrality, measures of spread, and the potential outliers in each distribution. Make sure to comment on the statistical features of the GPA as well.

    Boxplots

Boxplots are efficient tools for representing data distributions. The five-number summary can be traced on a boxplot. Additionally, we can identify outliers with boxplots.

Remember the three distributions in figures 6, 7 and 8. Note that these distributions are symmetric, left-skewed, and right-skewed respectively. Here is how I created figure 9 below:

    > n1=rnorm(100000,2,3)

    > n2=rpois(100000,3)

    > n3=rbeta(100000,12,3)

    > par(mfrow=c(3,2))

    > hist(n1)

    > boxplot(n1)

    > hist(n2)

    > boxplot(n2)

    > hist(n3)

    > boxplot(n3)

Note that the command par(mfrow) creates a 3 by 2 grid in the graphic area. Side-by-side boxplots are very helpful in comparing two or more distributions. For example, figure 10 shows side-by-side boxplots for two of the distributions of figure 9.

    > boxplot(n1,n2)

Linear Transformations and Their Effect on x̄ and s: Standardization

To experiment with the idea of linear transformations, let's go back to the first dataset n1 and calculate its mean and standard deviation:

    > mean(n1)

    [1] 2.000206

Figure 9: Histograms along with boxplots for the three simulated datasets.

    > sd(n1)

    [1] 2.999017

Let us perform the following simple linear transformation on these numbers:

Z = (n1 − x̄)/s

where x̄ and s are the mean and the standard deviation of n1.

    Here is the code:

> z=(n1-mean(n1))/sd(n1)
> mean(z)

    [1] 7.186852e-17

    > sd(z)

    [1] 1

Figure 10: Side-by-side boxplots for two of the distributions of figure 9.

    > hist(z)

Figure 11: Standardized version of n1. Note that the mean is roughly 0, and the standard deviation is 1.

    Verification of the 68% - 95% - 99.7% Rule

To verify the rule, we reconsider the standardized vector z. Next, we count the number of elements of z whose values are between -1 and 1, -2 and 2, and -3 and 3 respectively:

> length(z[z>-1 & z<1])
> length(z[z>-2 & z<2])
> length(z[z>-3 & z<3])
> u=rnorm(10000,3,4)

    > m=mean(u)

    > s=sd(u)

> length(u[u>m-s & u<m+s])
> length(u[u>m-2*s & u<m+2*s])
> length(u[u>m-3*s & u<m+3*s])
> pnorm(1,0,1)-pnorm(-1,0,1)

    [1] 0.6826895

    > pnorm(2,0,1)-pnorm(-2,0,1)

    [1] 0.9544997

    > pnorm(3,0,1)-pnorm(-3,0,1)

    [1] 0.9973002
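The same rule can also be checked with proportions rather than raw counts; here is a quick sketch with freshly simulated data (so the exact figures vary slightly from run to run):

```r
# Checking the 68%-95%-99.7% rule via proportions of a standard normal sample.
set.seed(1)                        # for reproducibility of this sketch
z <- rnorm(100000)                 # standard normal draws
c(mean(z > -1 & z < 1),            # close to 0.683
  mean(z > -2 & z < 2),            # close to 0.954
  mean(z > -3 & z < 3))            # close to 0.997
```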

    Areas Under Normal Distribution: General

So, the command pnorm provides the area below a given point for any normal distribution. Suppose we know that grades in statistics follow a normal distribution with a mean of 78 and a standard deviation of 7. We would like to know where a grade of 83 stands:

    > pnorm(83,78,7)

    [1] 0.7624747

Roughly 76% of all grades are below 83. We can also do the reverse calculation. Suppose that we would like to find the same grade, this time knowing the area below it:

    > qnorm(0.7624747,78,7)

    [1] 83

Assessing Normality: Normal Quantile Plots or QQ-plots

Quantile-Quantile plots are among the most powerful tools for assessing the normality of a data set. The idea is relatively simple. We want to know whether the empirical quantiles of our data match the theoretical quantiles of a standard normal. The data will form a straight line if normality holds. R provides QQ-plots through the command qqnorm. Here is the QQ-plot for the data set n1 (figure 12):

    > qqnorm(n1)

Below, we demonstrate the qqplot of a symmetric, a left-skewed, and a right-skewed distribution. The code shows you how we generated the next figure.

Figure 12: qq-plot for n1.

    > n4=rnorm(1000,3,5)

    > n5=rpois(1000,3)

    > n6=rbeta(1000,8,3)

    > par(mfrow=c(3,2))

    > hist(n4)

    > qqnorm(n4)

    > hist(n5)

    > qqnorm(n5)

    > hist(n6)

    > qqnorm(n6)

    Importing Text-files

    Reminder:

    > grades=read.table("http://math.fullerton.edu/sbehseta/grades.txt",header=T)

Figure 13: The Normal Quantile plots for symmetric and asymmetric distributions.

    > attach(grades)

    > dim(grades)

    [1] 100 3

    > Verbal

    [1] 623 454 643 585 719 693 571 646 613 655 662 585 580 648 405 506 669 558

    [19] 577 487 682 565 552 567 745 610 493 571 682 600 740 593 488 526 630 586

    [37] 610 695 539 490 509 667 597 662 566 597 604 519 643 606 500 460 717 592

    [55] 752 695 610 620 682 524 552 703 584 550 659 585 578 533 532 708 537 635

    [73] 591 552 557 599 540 752 726 630 558 646 643 606 682 565 578 488 361 560

    [91] 630 666 719 669 571 520 571 539 580 629

    > hist(Verbal)

    > summary(Verbal)

    Min. 1st Qu. Median Mean 3rd Qu. Max.

361.0 552.0 592.5 598.5 649.8 752.0

> stem(Verbal)

    The decimal point is 2 digit(s) to the right of the |

    3 | 6

    4 | 1

    4 | 5699999

    5 | 0112223334444

    5 | 55556666777777778888889999999

    6 | 000001111112233334444

    6 | 5556666777788889

    7 | 000122234

    7 | 555

Scatterplots and Pearson's Correlation

    To create scatterplots, the command is simply plot.

    > plot(Verbal,Math)

To calculate the correlation coefficient between any two variables, we can use the cor command:

    > cor(Verbal,Math)

    [1] 0.4306938

    > cor(Verbal,GPA)

    [1] 0.4847681

    > cor(Math,GPA)

    [1] 0.2236183

    Central Limit Theorem

The central limit theorem says that if x ~ D(μ, σ), where D is a probability density or mass function (regardless of the form of the distribution), then when the sample size n is large enough, (x̄ − μ)/(σ/√n) is approximately N(0, 1). When the data come from a Normal distribution with mean μ and standard deviation σ, we can assume (x̄ − μ)/(σ/√n) ~ N(0, 1) for any sample size.

To verify these results, let's look at figure 14, which shows a clearly right-skewed population with μ = 2.029 and σ = 1.382. For 1000 times, we take samples of size 2 from this population, obtaining 1000 sample averages associated with those random samples. Then we repeat this process for samples of sizes 3, 6, 10, 20, and 100, each time keeping track of the mean and the distribution of those 1000 sample averages. We also plot histograms and qq-plots for each scenario. It turns out that as the sample size increases, the distribution of the 1000 sample averages converges to normality (figures 15 and 16). Also, the standard deviations of those sampling distributions get closer to σ/√n (Table 1).

Next, I sample from a normal population with μ = 99.602 and σ = 10.2211 (figure 17). Then I repeat the same procedure for the sample averages. It turns out that regardless of the sample size, the results associated with the central limit theorem hold (figures 18, 19, and Table 2).

Sample Size            2          3          6         10         20        100
Mean              2.0945   2.061667      2.016     2.0301     2.0186    2.02654
Standard Dev.  0.9860487    0.78204  0.5660369   0.453773  0.3010642  0.1300328

Table 1. Results for population 1. The mean and standard deviations of the sample mean with different sample sizes.

Sample Size            2          3          6         10         20        100
Mean            99.95793   99.30017    99.5832    99.5639     99.653   99.57605
Standard Dev.   7.199265    5.83941   4.132356   3.290563   2.240824  0.9427503

Table 2. Results for population 2. The mean and standard deviations of the sample mean with different sample sizes.

Figure 14: Case one: Population distribution. The distribution is skewed to the right.

Figure 15: Sampling Distributions of Sample means for sample sizes 2, 3, 6, 10, 20, 100.

    R-codes For Simulation

Population 1: Right-Skewed Distribution. We can simulate from a Poisson distribution:

    > test1=rpois(1000,2)

    > hist(test1)

    > mean(test1)

    [1] 2.029

    > sd(test1)

    [1] 1.382777

Population 1: Obtaining 1000 Samples of Sizes 2, 3, 6, 10, 20, 100. Here is the case for size 2; the others are similar.

Figure 16: QQ-plots for the sample mean distributions for different sample sizes.

    > test=matrix(nrow=1000,ncol=2)

    > for(i in 1:1000)

+ {
+ test[i,]=sample(test1,2)
+ }

    > mean.size2=apply(test,1,mean)

    > mean(mean.size2)

    [1] 2.0945

    > sd(mean.size2)

    [1] 0.9860487

Figure 17: Case two: Population distribution. The distribution is symmetric.

Population 2: Obtaining 1000 Samples of Sizes 2, 3, 6, 10, 20, 100. Again, only the case for size 2 is included.

> test=matrix(nrow=1000,ncol=2)
> for(i in 1:1000)
+ {
+ test[i,]=sample(test2,2)
+ }
> mean.size2.norm=apply(test,1,mean)

    > mean(mean.size2.norm)

    [1] 99.95793

    > sd(mean.size2.norm)

Figure 18: Sampling Distributions of Sample means for sample sizes 2, 3, 6, 10, 20, 100.

    [1] 7.199265

    Population 1: Plotting Histograms and QQ-plots

    par(mfrow=c(3,2))

    hist(mean.size2)

    hist(mean.size3)

    hist(mean.size6)

    hist(mean.size10)

    hist(mean.size20)

    hist(mean.size100)

Figure 19: QQ-plots for the sample mean distributions for different sample sizes.

    par(mfrow=c(3,2))

    qqnorm(mean.size2)

    qqnorm(mean.size3)

    qqnorm(mean.size6)

    qqnorm(mean.size10)

    qqnorm(mean.size20)

    qqnorm(mean.size100)

SAT Scores Again

    Verbal: t-test and Confidence Interval

    > mean(Verbal)

    [1] 598.49

    > t.test(Verbal,mu=600)

    One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.843
alternative hypothesis: true mean is not equal to 600
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49

    Verbal: Two-sided Versus One-sided Tests

    > t.test(Verbal,mu=600,alternative="two.sided")

    One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.843
alternative hypothesis: true mean is not equal to 600
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49

    > t.test(Verbal,mu=600,alternative="less")

    One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.4215
alternative hypothesis: true mean is less than 600
95 percent confidence interval:
     -Inf 611.1138
sample estimates:
mean of x
   598.49

    > t.test(Verbal,mu=600,alternative="greater")

One Sample t-test

data:  Verbal
t = -0.1986, df = 99, p-value = 0.5785
alternative hypothesis: true mean is greater than 600
95 percent confidence interval:
 585.8662      Inf
sample estimates:
mean of x
   598.49

Verbal: Changing the Hypothesized Mean

    > t.test(Verbal,mu=620)

    One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.00565
alternative hypothesis: true mean is not equal to 620
95 percent confidence interval:
 583.4042 613.5758
sample estimates:
mean of x
   598.49

    > t.test(Verbal,mu=620,alternative="less")

    One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.002825
alternative hypothesis: true mean is less than 620
95 percent confidence interval:
     -Inf 611.1138
sample estimates:
mean of x
   598.49

    > t.test(Verbal,mu=620,alternative="greater")

    One Sample t-test

data:  Verbal
t = -2.8292, df = 99, p-value = 0.9972
alternative hypothesis: true mean is greater than 620

Math: Confidence Interval and t-test

    > t.test(Math)

    One Sample t-test

data:  Math
t = 99.6647, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 641.0874 667.1326
sample estimates:
mean of x
   654.11

    Two-sample t-test for Math and Verbal

> t.test(Math,Verbal,mu=0)

    Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 9.88e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 35.81078 75.42922
sample estimates:
mean of x mean of y
   654.11    598.49

> t.test(Math,Verbal,mu=0,alternative="greater")

    Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 4.94e-08
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 39.02003      Inf
sample estimates:
mean of x mean of y
   654.11    598.49

> t.test(Math,Verbal,mu=0,alternative="less")

Welch Two Sample t-test

data:  Math and Verbal
t = 5.5377, df = 193.867, p-value = 1
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 72.21997
sample estimates:
mean of x mean of y
   654.11    598.49

> t.test(Math,Verbal,mu=50)

    Welch Two Sample t-test

data:  Math and Verbal
t = 0.5595, df = 193.867, p-value = 0.5764
alternative hypothesis: true difference in means is not equal to 50
95 percent confidence interval:
 35.81078 75.42922
sample estimates:
mean of x mean of y
   654.11    598.49

Project 2

(1) (Moore and McCabe, 1998) Crop researchers plant 15 plots with a new variety of corn. The yields, in bushels per acre, are:

138.0 139.1 113.0 132.5 140.7 109.7 118.9 134.8 109.6 127.3 115.6 130.4 130.2 111.7 105.5

Assume that the population of yields is normal.

    (a) Find the 90% confidence interval for the mean yield for this variety of corn.

    (b) Find the 95% confidence interval.

    (c) Find the 99% confidence interval.

(d) How does the margin of error (sampling error) in (a), (b), and (c) change as the confidence level increases?
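As a sketch of the mechanics (not the write-up itself), t.test reports a confidence interval at any requested level, so the intervals in (a)-(c) can be obtained and compared directly:

```r
# Corn yields from part (1); t.test gives a t-based confidence interval.
yield <- c(138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.9, 134.8,
           109.6, 127.3, 115.6, 130.4, 130.2, 111.7, 105.5)
ci90 <- t.test(yield, conf.level = 0.90)$conf.int   # 90% interval
ci95 <- t.test(yield, conf.level = 0.95)$conf.int   # 95% interval
ci99 <- t.test(yield, conf.level = 0.99)$conf.int   # 99% interval
# Higher confidence means a wider interval:
diff(ci99) > diff(ci95)   # TRUE
```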

(2) (Moore and McCabe, 1998) The table below gives the pretest and posttest scores on the MLA listening test in Spanish for 20 high school Spanish teachers who attended an intensive summer course in Spanish.

Subject Pretest Posttest   Subject Pretest Posttest
1       30      29         11      30      32
2       28      30         12      29      28
3       31      32         13      31      34
4       26      30         14      29      32
5       20      16         15      34      32
6       30      25         16      20      27
7       34      31         17      26      28
8       15      18         18      25      29
9       28      33         19      31      32
10      20      25         20      29      32

Give a 90% confidence interval for the mean increase in listening score due to attending the summer institute.
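As a hint on the mechanics: in a paired design like this one, the quantity of interest is the per-subject increase (posttest minus pretest), and t.test can handle it either via the differences or via its paired option; both give the same interval:

```r
# Pretest and posttest scores for subjects 1-20 (from the table above):
pre  <- c(30,28,31,26,20,30,34,15,28,20,30,29,31,29,34,20,26,25,31,29)
post <- c(29,30,32,30,16,25,31,18,33,25,32,28,34,32,32,27,28,29,32,32)
# A 90% interval for the mean increase, two equivalent ways:
t.test(post - pre, conf.level = 0.90)$conf.int
t.test(post, pre, paired = TRUE, conf.level = 0.90)$conf.int
```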

(3) Download the dataset grades.txt from the course webpage. Build 95% confidence intervals for math and verbal. Are there overlaps? Interpret your findings.

(4) Bonus: Consider the verbal scores in the grades dataset. First, show that the verbal scores follow a normal distribution. Then, construct a 95% confidence interval for the population mean of the verbal scores. An alternative way of constructing a 95% confidence interval is to use the quantiles of the data: consider (grades_0.025, grades_0.975), where grades_0.025 and grades_0.975 are simply the 2.5% and the 97.5% percentiles of the dataset. Does this confidence interval agree with the confidence interval you constructed before? Why? Note that x̄ = 598.49 and s = 76.029 for grades. Generate 100,000 normal values from Normal(598, 76/√100). Create a quantile confidence interval for the latter

dataset. Does this confidence interval agree with the first confidence interval? Can you think of an explanation for this agreement (disagreement)? Hint: look at the command quantile.