
Marc Mehlman

Descriptive Statistics

Marc H. Mehlman, marcmehlman@yahoo.com

University of New Haven

Marc Mehlman (University of New Haven) Descriptive Statistics 1 / 44


Table of Contents

1 Data Distributions

2 Graphical Representation of Distributions

3 Measuring the Center

4 Measuring the Spread

5 Normal Distribution

6 Order Statistics

7 Chapter #1 R Assignment


Data Distributions


In any graph of data, look for the overall pattern and for striking deviations from that pattern.

You can describe the overall pattern by its shape, center, and spread.

An important kind of deviation is an outlier, an individual that falls outside the overall pattern.

Examining Distributions


A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other.

A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.

It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.

[Figure: three example distributions, labeled Symmetric, Skewed-left, and Skewed-right]


[Figure: distribution of the elderly share of each state's population, with Alaska and Florida labeled]

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

The overall pattern is fairly symmetrical except for two states that clearly do not belong to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Outliers


In a class of 200 students, let

$x_i$ = # pts out of 500 possible that student $i$ gets.

Class        Frequency   Relative Frequency   Cumulative Frequency
0 – 99           10              5%             ≤ 99     10
100 – 199         8              4%             ≤ 199    18
200 – 299        22             11%             ≤ 299    40
300 – 399       100             50%             ≤ 399   140
400 – 500        60             30%             ≤ 500   200

1 # classes should be between 5 and 20 inclusive.

2 class width ≈ (max value − min value) / (# of classes)
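As a minimal sketch, the relative and cumulative columns can be derived from the class counts in R. The counts are the ones in the frequency table; the names `freq`, `rel_freq`, and `cum_freq` are illustrative only.

```r
# Class counts copied from the frequency table
freq <- c("0-99" = 10, "100-199" = 8, "200-299" = 22,
          "300-399" = 100, "400-500" = 60)

rel_freq <- 100 * prop.table(freq)   # relative frequencies, in percent
cum_freq <- cumsum(freq)             # cumulative frequencies

print(data.frame(frequency = freq,
                 relative_pct = rel_freq,
                 cumulative = cum_freq))
```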


Graphical Representation of Distributions


Distribution of a Variable

To examine a single variable, we graphically display its distribution.

The distribution of a variable tells us what values it takes and how often it takes these values.

Distributions can be displayed using a variety of graphical tools. The proper choice of graph depends on the nature of the variable.


Categorical Variable: Pie chart, Bar graph

Quantitative Variable: Histogram, Stemplot


Categorical Variables

The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall into that category.

Pie Charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories.

Bar Graphs represent each category as a bar whose heights show the category counts or percents.


> pie.sales = c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)

> lbls = c("Blueberry", "Cherry", "Apple", "Boston Cream", "Other", "Vanilla Cream")

> pie(pie.sales, labels = lbls, main="Pie Sales")

[Figure: pie chart titled "Pie Sales" with one labeled slice per flavor]


> counts=c(40,30,20,10)

> colors=c("Red","Blue","Green","Brown")

> barplot(counts,names.arg=colors,main="Favorite Colors")

[Figure: bar graph titled "Favorite Colors" with bars for Red, Blue, Green, and Brown]


Quantitative Variables

The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.

Histograms show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class.

Stemplots separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.


Histograms

Histograms are used for quantitative variables that take many values and/or for large datasets.

Divide the possible values into classes (equal widths).

Count how many observations fall into each class (counts may be changed to percents).

Draw a picture representing the distribution: bar heights are equivalent to the number (percent) of observations in each class.


> hist(trees$Girth,main="Girth of Black Cherry Trees",xlab="Diameter in Inches")

[Figure: histogram titled "Girth of Black Cherry Trees"; x-axis "Diameter in Inches", ticks from 8 to 22]
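The class counts behind a histogram can also be inspected directly. The sketch below assumes equal-width classes of width 2 running from 8 to 22; that choice of `breaks` is illustrative, not something the slides prescribe.

```r
# 'trees' ships with R; the class boundaries are an illustrative choice
breaks <- seq(8, 22, by = 2)                          # equal-width classes of width 2
h <- hist(trees$Girth, breaks = breaks, plot = FALSE) # compute counts without drawing

print(h$counts)                                # observations per class
print(round(100 * h$counts / sum(h$counts)))   # the same, as percents
```

Passing the same `breaks` to `hist()` without `plot = FALSE` draws the corresponding histogram.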


To construct a stemplot:

Separate each observation into a stem (first part of the number) and a leaf (the remaining part of the number).

Write the stems in a vertical column; draw a vertical line to the right of the stems.

Write each leaf in the row to the right of its stem; order leaves if desired.

Stemplots


> Girth=trees$Girth

> stem(Girth) # stem and leaf plot

The decimal point is at the |

8 | 368

10 | 57800123447

12 | 099378

14 | 025

16 | 03359

18 | 00

20 | 6

> stem(Girth, scale=2)

The decimal point is at the |

8 | 368

10 | 578

11 | 00123447

12 | 099

13 | 378

14 | 025

16 | 03

17 | 359

18 | 00

20 | 6


Bivariate Data

Bivariate data comes from measuring two aspects of the same item/individual. For instance,

(70, 178), (72, 192), (74, 184), (68, 181)

is a random sample of size four obtained from four male college students. The bivariate data gives the height in inches and the weight in pounds of each of the four students. The third student sampled is 74 inches tall and weighs 184 pounds.

Can one variable be used to predict the other? Do tall people tend to weigh more?

Definition

A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable.


Student ID   Number of Beers   Blood Alcohol Content
     1              5                 0.1
     2              2                 0.03
     3              9                 0.19
     4              8                 0.12
     5              3                 0.04
     6              7                 0.095
     7              3                 0.07
     8              5                 0.06
     9              3                 0.02
    10              5                 0.05
    11              4                 0.07
    12              6                 0.1
    13              5                 0.085
    14              7                 0.09
    15              1                 0.01
    16              4                 0.05

Here we have two quantitative variables recorded for each of 16 students:

1. how many beers they drank

2. their resulting blood alcohol content (BAC)

Bivariate data

For each individual studied, we record data on two variables.

We then examine whether there is a relationship between these two variables: Do changes in one variable tend to be associated with specific changes in the other variable?


Scatterplots

A scatterplot is used to display quantitative bivariate data.

Each variable makes up one axis. Each individual is a point on the graph.
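A minimal sketch of a scatterplot for the beers and BAC data, entered in student-ID order. Treating the number of beers as the explanatory variable is an assumption made here for illustration; the vector names and axis labels are likewise illustrative.

```r
# Beers and BAC for students 1..16, copied from the table in this section
beers <- c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4)
bac   <- c(0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
           0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05)

# Assumed roles: explanatory variable on the x-axis, response on the y-axis
plot(bac ~ beers, main = "BAC vs Number of Beers",
     xlab = "Number of Beers", ylab = "Blood Alcohol Content")
```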


> plot(trees$Girth~trees$Height,main="girth vs height")

[Figure: scatterplot titled "girth vs height"; x-axis trees$Height, ticks from 65 to 85]


Measuring the Center


Measures of the Center

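Definition

Given $x_1, x_2, \cdots, x_n$, the sample mean is

$$\bar{x} \overset{\text{def}}{=} \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{j=1}^{n} x_j.$$

The population mean is

$$\mu \overset{\text{def}}{=} \frac{1}{N}\sum_{j=1}^{N} x_j.$$

The mode is the most frequent data value (it need not be unique). If one orders the data from smallest to largest, the median is

$$\begin{cases} \text{the middle value of the data} & \text{if } n \text{ is odd} \\ \text{the average of the middle two values of the data} & \text{if } n \text{ is even.} \end{cases}$$

Laymen refer to the mean as the average.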

Example

The median sales price of a house in Milford was $212,175 for Feb–Apr 2013. If Bill Gates buys a house in Milford for $100 million, what will that do to the mean cost of a house in Milford? To the median cost of a house in Milford? Which is a better measure of the cost of buying a house in Milford, the mean or the median?
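The effect of one extreme value can be tried out in R. The five sale prices below are made up for illustration (they are not actual Milford data); only the $100 million purchase comes from the example.

```r
# Hypothetical sale prices (made up for illustration; not actual Milford data)
prices <- c(180000, 195000, 212175, 230000, 250000)
with_mansion <- c(prices, 100e6)     # add one $100 million purchase

print(c(mean = mean(prices), median = median(prices)))
print(c(mean = mean(with_mansion), median = median(with_mansion)))
```

In this sketch the median barely moves, while the mean is pulled far above every ordinary price.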


“Statistically, if you lie with your head in the oven and your feet in the fridge, on average you will be comfortably warm.” – Anonymous

“Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates

“The average human has one breast and one testicle.” – humorist Des McHale


The mean and median measure center in different ways, and both are useful.

The mean and median of a roughly symmetric distribution are close together.

If the distribution is exactly symmetric, the mean and median are exactly the same.

In a skewed distribution, the mean is usually farther out in the long tail than is the median.


Comparing Mean and Median


Measuring the Spread


Definition

The population variance is $\sigma^2 \overset{\text{def}}{=} \frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2$ and the population standard deviation is $\sigma \overset{\text{def}}{=} \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2}$.

However, one often has only a random sample to examine, not the entire population. With only a random sample, one cannot calculate the population mean, $\mu$, so the best one can do is use the sample mean, $\bar{x}$, instead.

Definition

The sample variance is $s^2 \overset{\text{def}}{=} \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$ and the sample standard deviation is $s \overset{\text{def}}{=} \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2}$.

Notice the use of $n - 1$ instead of $n$ for the sample variance and standard deviation.


Example

Suppose our random sample is

4.5, 3.7, 2.8, 5.3, 4.6.

Then

$$\bar{x} = \frac{1}{5}\left[4.5 + 3.7 + 2.8 + 5.3 + 4.6\right] = 4.18$$

$$s^2 = \frac{1}{5-1}\left[(4.5-4.18)^2 + (3.7-4.18)^2 + (2.8-4.18)^2 + (5.3-4.18)^2 + (4.6-4.18)^2\right] = 0.917$$

$$s = \sqrt{0.917} = 0.9576012.$$

Using R:

> dat=c(4.5,3.7,2.8,5.3,4.6)

> mean(dat)

[1] 4.18

> var(dat)

[1] 0.917

> sd(dat)

[1] 0.9576012


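Why $s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$ and not $s^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2$?

Ans: $s^2$ is unbiased: on average $s^2$ will be $\sigma^2$. In contrast, $\frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2$ is biased: on average it will be $\frac{n-1}{n}\sigma^2$.

1 $s^2 = \dfrac{n\sum_{j=1}^{n} x_j^2 - \left(\sum_{j=1}^{n} x_j\right)^2}{n(n-1)}$ and $s = \sqrt{\dfrac{n\sum_{j=1}^{n} x_j^2 - \left(\sum_{j=1}^{n} x_j\right)^2}{n(n-1)}}$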

2 s2 = 0 ⇔ s = 0 ⇒ all of the random sample is the same.

3 x is an estimator of µ.

4 s2 is an (unbiased) estimator of σ2.

5 s is a (biased) estimator of σ.

6 s measures the amount the data is dispersed about the mean.

7 s has the same units of measurement as the data.

8 s is sensitive to the existence of outliers.
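A short R sketch can check two of the claims above: that the shortcut formula agrees with `var()`, and that dividing by n rather than n − 1 underestimates σ² on average. The simulation setup (20000 normal samples of size 5 with σ = 2, seed 1) is an assumption chosen purely for illustration.

```r
dat <- c(4.5, 3.7, 2.8, 5.3, 4.6)   # sample from the worked example
n <- length(dat)

# Shortcut formula: (n * sum(x^2) - (sum(x))^2) / (n * (n - 1))
s2_shortcut <- (n * sum(dat^2) - sum(dat)^2) / (n * (n - 1))
print(c(shortcut = s2_shortcut, var = var(dat)))   # the two should agree

# Assumed simulation setup: 20000 samples of size 5 from N(0, sd = 2), so sigma^2 = 4
set.seed(1)
samples <- matrix(rnorm(20000 * 5, mean = 0, sd = 2), ncol = 5)
s2_unbiased <- apply(samples, 1, var)        # var() divides by n - 1
s2_biased   <- s2_unbiased * (5 - 1) / 5     # same sums of squares, divided by n

print(mean(s2_unbiased))   # should be close to sigma^2 = 4
print(mean(s2_biased))     # should be close to (n - 1) / n * sigma^2 = 3.2
```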


Definition (Coefficient of Variation, CV)

$$\text{sample CV} \overset{\text{def}}{=} \frac{s}{\bar{x}} \times 100, \qquad \text{population CV} \overset{\text{def}}{=} \frac{\sigma}{\mu} \times 100.$$

CV standardizes standard deviation according to the mean.

Example

There are 250 dogs at a dog show that weigh an average of 25 pounds, with a standard deviation of 8 pounds. The 250 human owners had an average weight of 172 lbs with a standard deviation of 29 lbs. Do the humans or the dogs vary more in weight?

Sol: It depends how the question is interpreted. The standard deviation of weight for the owners is 29 lbs, which is greater than the standard deviation for the dogs, which is 8 lbs. However, for humans CV = $\frac{29}{172} \times 100 \approx 17$, which is less than the CV for dogs, which is $\frac{8}{25} \times 100 = 32$.
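The comparison in this example can be reproduced with a few lines of R. The helper `cv()` is a hypothetical function written for this sketch, not a built-in.

```r
# Coefficient of variation in percent: standard deviation relative to the mean
# (cv is a helper defined here for illustration, not part of base R)
cv <- function(sd_value, mean_value) 100 * sd_value / mean_value

cv_humans <- cv(29, 172)
cv_dogs   <- cv(8, 25)
print(c(humans = cv_humans, dogs = cv_dogs))
```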


Theorem (Chebyshev’s Theorem)

For any $K > 1$, at least $1 - \frac{1}{K^2}$ of data is within $K$ standard deviations of the mean.

Example

at least 3/4 ths of data is within 2 standard deviations of the mean

at least 8/9 ths of data is within 3 standard deviations of the mean

at least 15/16 ths of data is within 4 standard deviations of the mean
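As a quick empirical check, the bound can be compared with the observed proportions in a dataset. The sketch uses `trees$Volume` only as a convenient built-in example, and uses `sd()` (the sample standard deviation) as the spread.

```r
x <- trees$Volume        # built-in dataset, used here only as an example
K <- 2:4

# Observed proportion of the data within K standard deviations of the mean
props  <- sapply(K, function(k) mean(abs(x - mean(x)) <= k * sd(x)))
bounds <- 1 - 1 / K^2    # Chebyshev lower bounds: 3/4, 8/9, 15/16

print(data.frame(K = K, observed = round(props, 3), bound = round(bounds, 3)))
```

The observed proportions are typically well above the bounds, since Chebyshev's theorem makes no assumption about the shape of the distribution.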


Normal Distribution


A density curve is a curve that:

• is always on or above the horizontal axis

• has an area of exactly 1 underneath it

A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall in that range.
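Both properties can be checked numerically for one particular density curve. The sketch below uses the standard normal density `dnorm` as the example, and the range from −1 to 1 is an arbitrary choice.

```r
# Total area under the standard normal density curve
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Area above the range [-1, 1]: the proportion of observations in that range
area_within_1 <- integrate(dnorm, lower = -1, upper = 1)$value

print(total_area)             # approximately 1
print(area_within_1)          # approximately 0.683
print(pnorm(1) - pnorm(-1))   # the same proportion from the normal CDF
```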


Density Curves


Advantage of $s$ over $s^2$: $s$ uses the same unit of measurement as the data.

Definition (Standardized Scores (z–scores))

sample standardized score: $z \overset{\text{def}}{=} \dfrac{x - \bar{x}}{s}$

population standardized score: $z \overset{\text{def}}{=} \dfrac{x - \mu}{\sigma}$

Advantage of z–scores: unit independent.

Example (from Book)

z–score of Michael Jordan’s height = 3.21

z–score of Rebecca Lobo’s height = 4.96

Conclusion:

Jordan is 3.21 standard deviations taller than the average male height.

Lobo is 4.96 standard deviations taller than the average female height.

Note: negative z–scores ⇒ smaller than average.
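A minimal sketch of computing sample z-scores in R. `trees$Height` is used only as example data; any numeric vector would do.

```r
x <- trees$Height               # example data; any numeric vector works
z <- (x - mean(x)) / sd(x)      # sample standardized scores

print(round(head(z), 2))
print(c(mean = mean(z), sd = sd(z)))   # standardized scores have mean 0 and sd 1
```

The built-in `scale(x)` performs the same centering and scaling and returns the result as a one-column matrix.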


Order Statistics


Definition

percentile of $x$ = percentage of data less than $x$.

$P_k$ = $k$th percentile = the value whose percentile is $k$.

$Q_1$ = 1st quartile = $P_{25}$ = border of bottom 25% and top 75%

= median of all values ≤ overall median

$Q_2$ = 2nd quartile = $P_{50}$ = median.

$Q_3$ = 3rd quartile = $P_{75}$ = border of top 25% and bottom 75%

= median of all values ≥ overall median

5–number summary = min, $Q_1$, $Q_2$, $Q_3$, max

Interquartile Range = IQR = $Q_3 - Q_1$.

How to find $P_k$, the $k$th percentile, according to the book: Order $x_1, x_2, \cdots, x_n$ in increasing order. Count $L$ in to get $P_k$, where $L$ is the smallest integer $\geq \frac{k}{100} n$.

Example

A 5-year-old is in the 90th percentile, weight-wise, for his age group.


We now have a choice between two descriptions for center and spread

Mean and Standard Deviation

Median and Interquartile Range

The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.

Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.

NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!


Choosing Measures of Center and Spread


R Commands:

Example

> mean(trees$Volume)

[1] 30.17097

> median(trees$Volume)

[1] 24.2

> summary(trees$Volume)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.20 19.40 24.20 30.17 37.30 77.00

> IQR(trees$Volume)

[1] 17.9

> var(trees$Volume)

[1] 270.2028

> sd(trees$Volume)

[1] 16.43785


Outliers

Definition (The 1.5 x IQR Rule for Outliers)

Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

Outliers have effects on means and variance far greater than their numbers.


Example

The number of eggs laid by Farmer John’s chickens (hens) in September 2016 is given below.

18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7.

Are there any outlier hens?

Solution: One figures out the quartiles:

        Q1           M          Q3
 3   5   7   9  11  13  15  16  18  23  35

Since IQR = 18 − 7 = 11 and 35 − Q3 = 17 > 1.5 × IQR = 16.5, one identifies 35 as an outlier.
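The hand computation can be mirrored in R. This sketch follows the convention used in the example, where the quartiles are the medians of the lower and upper halves with the middle value left out when n is odd; the variable names are illustrative.

```r
eggs <- c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7)
x <- sort(eggs)
n <- length(x)
half <- floor(n / 2)      # size of each half; the middle value is left out when n is odd

q1 <- median(x[1:half])             # median of the lower half
q3 <- median(x[(n - half + 1):n])   # median of the upper half
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

print(c(Q1 = q1, Q3 = q3, IQR = iqr))
print(outliers)
```

R's built-in `quantile()` and `IQR()` use a different interpolation rule by default, so their quartiles need not match the hand computation.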


Boxplots

Definition

Given $x_1, x_2, \cdots, x_n$, to create a boxplot (also called a box and whiskers plot)

1 draw and label a vertical number line that includes the range of the distribution.

2 draw a box from height Q1 to Q3.

3 draw a horizontal line inside the box at the height of the median.

4 draw vertical line segments (whiskers) from the bottom and top of the box to the minimum and maximum data values that are not outliers.

5 sometimes outliers are identified with ◦’s (R does this).

Boxplots are often useful when comparing the values of two different variables.
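As a sketch, `boxplot()` also returns the numbers it draws, which makes the steps above easy to inspect. The data are the egg counts from the outlier example; the plot title is made up.

```r
eggs <- c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7)   # egg counts from the outlier example

b <- boxplot(eggs, main = "Eggs laid per hen")   # draws the plot and returns its statistics
print(b$stats)   # lower whisker end, box bottom, median, box top, upper whisker end
print(b$out)     # points drawn separately as outliers
```

R places the box edges at hinges, which can differ slightly from quartiles computed by hand, but the whiskers still stop at the most extreme values that are not flagged as outliers.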


> boxplot(trees$Height, main="Heights of Black Cherry Trees")

> boxplot(USJudgeRatings$DMNR,USJudgeRatings$DILG,

+ main="Lawyers’ Demeanor/Diligence ratings of US Superior Court state judges")

[Figure: two boxplot panels, titled "Heights of Black Cherry Trees" and "Lawyers' Demeanor/Diligence ratings of US Superior Court state judges"]


Chapter #1 R Assignment


Fifty-eight sailors are sampled and their eye color is noted as below

blue   brown   green   hazel   red
 11      32       8       5      2

1 Create a barplot and pie chart of eye color from the sailor sample.

2 Create a histogram and a stemplot of the height of loblolly trees from the dataset “Loblolly”. The dataset “Loblolly” comes with R, just as “trees” does. To observe “Loblolly”, type “Loblolly” at the R prompt (without the quotes). To learn more about the dataset, type “help(Loblolly)” at the R prompt.

3 Find the mean, median, five number summary, variance and standard deviation from the sample of heights in the dataset “Loblolly”.

4 Create a scatterplot of weight versus quarter mile times for the dataset “mtcars”. Assume the independent variable is the quarter mile times and the dependent variable is the weight.
