Post on 13-Jul-2020
Marc Mehlman
Descriptive Statistics
Marc H. Mehlman
marcmehlman@yahoo.com
University of New Haven
Marc Mehlman (University of New Haven) Descriptive Statistics 1 / 44
Table of Contents
1 Data Distributions
2 Graphical Representation of Distributions
3 Measuring the Center
4 Measuring the Spread
5 Normal Distribution
6 Order Statistics
7 Chapter #1 R Assignment
Data Distributions
Examining Distributions
In any graph of data, look for the overall pattern and for striking deviations from that pattern.
You can describe the overall pattern by its shape, center, and spread.
An important kind of deviation is an outlier, an individual that falls outside the overall pattern.
A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other.
A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.
It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.
[Figure: three example distributions: symmetric, skewed-left, and skewed-right]
Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
[Figure: distribution of the elderly share of state populations, with Alaska and Florida labeled]
The overall pattern is fairly symmetrical except for two states that clearly do not belong to the main trend. Alaska and Florida have unusual representation of the elderly in their population.
A large gap in the distribution is typically a sign of an outlier.
In a class of 200 students, let
$x_i$ = number of points, out of 500 possible, that student $i$ gets.

Class        Frequency   Relative frequency   Cumulative frequency
0 – 99           10              5%            ≤ 99      10
100 – 199         8              4%            ≤ 199     18
200 – 299        22             11%            ≤ 299     40
300 – 399       100             50%            ≤ 399    140
400 – 500        60             30%            ≤ 500    200

1 The number of classes should be between 5 and 20 inclusive.
2 class width ≈ $\frac{\text{max value} - \text{min value}}{\text{number of classes}}$
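The relative and cumulative columns above follow directly from the frequency column. A minimal sketch in R, assuming the class counts from the table; the variable names are illustrative and not part of the original slides:

```r
# Class labels and counts copied from the frequency table above
classes <- c("0-99", "100-199", "200-299", "300-399", "400-500")
freq <- c(10, 8, 22, 100, 60)

# Relative frequency as a percentage of all 200 students
rel_freq <- 100 * freq / sum(freq)

# Cumulative frequency: running total of the counts
cum_freq <- cumsum(freq)

freq_table <- data.frame(class = classes,
                         frequency = freq,
                         relative_pct = rel_freq,
                         cumulative = cum_freq)
print(freq_table)
```

The last entry of the cumulative column should always equal the total number of observations, which is a quick check on the table.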
Graphical Representation of Distributions
Distribution of a Variable
To examine a single variable, we graphically display its distribution.
The distribution of a variable tells us what values it takes and how often it takes these values.
Distributions can be displayed using a variety of graphical tools. The proper choice of graph depends on the nature of the variable.
Categorical variable: pie chart, bar graph
Quantitative variable: histogram, stemplot
Categorical Variables
The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall into that category.
Pie Charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories.
Bar Graphs represent each category as a bar whose height shows the category count or percent.
> pie.sales = c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)
> lbls = c("Blueberry", "Cherry", "Apple", "Boston Cream", "Other", "Vanilla Cream")
> pie(pie.sales, labels = lbls, main="Pie Sales")
[Figure: pie chart titled "Pie Sales" with slices labeled Blueberry, Cherry, Apple, Boston Cream, Other, and Vanilla Cream]
> counts=c(40,30,20,10)
> colors=c("Red","Blue","Green","Brown")
> barplot(counts,names.arg=colors,main="Favorite Colors")
[Figure: bar plot titled "Favorite Colors" with bars for Red, Blue, Green, and Brown; vertical axis from 0 to 40]
Quantitative Variables
The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.
Histograms show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class.
Stemplots separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.
Histograms
For quantitative variables that take many values and/or large datasets:
Divide the possible values into classes (equal widths).
Count how many observations fall into each interval (may change to percents).
Draw a picture representing the distribution: bar heights are equivalent to the number (percent) of observations in each interval.
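The histogram steps can be carried out by hand in R before drawing anything. The sketch below uses the built-in `trees` dataset; the choice of 7 classes is an assumption made for illustration, not a value taken from the slides.

```r
girth <- trees$Girth

# Step 1: divide the range into k equal-width classes (k = 7 is assumed here)
k <- 7
class_width <- (max(girth) - min(girth)) / k
breaks <- seq(min(girth), max(girth), length.out = k + 1)

# Step 2: count the observations in each class, then convert to percents
classes <- cut(girth, breaks = breaks, include.lowest = TRUE)
counts <- table(classes)
percents <- 100 * counts / length(girth)
print(counts)

# Step 3: draw the bars using the same class boundaries
hist(girth, breaks = breaks,
     main = "Girth of Black Cherry Trees", xlab = "Diameter in Inches")
```

Passing `include.lowest = TRUE` keeps the smallest observation in the first class; without it, `cut()` would leave that value unclassified.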
> hist(trees$Girth,main="Girth of Black Cherry Trees",xlab="Diameter in Inches")
[Figure: histogram titled "Girth of Black Cherry Trees"; horizontal axis "Diameter in Inches" from 8 to 22; vertical axis "Frequency" from 0 to 12]
Stemplots
To construct a stemplot:
Separate each observation into a stem (first part of the number) and a leaf (the remaining part of the number).
Write the stems in a vertical column; draw a vertical line to the right of the stems.
Write each leaf in the row to the right of its stem; order leaves if desired.
> Girth=trees$Girth
> stem(Girth) # stem and leaf plot
The decimal point is at the |
8 | 368
10 | 57800123447
12 | 099378
14 | 025
16 | 03359
18 | 00
20 | 6
> stem(Girth, scale=2)
The decimal point is at the |
8 | 368
9 |
10 | 578
11 | 00123447
12 | 099
13 | 378
14 | 025
15 |
16 | 03
17 | 359
18 | 00
19 |
20 | 6
Bivariate Data
Bivariate data comes from measuring two aspects of the same item/individual. For instance,
(70, 178), (72, 192), (74, 184), (68, 181)
is a random sample of size four obtained from four male college students. The bivariate data gives the height in inches and the weight in pounds of each of the four students. The third student sampled is 74 inches tall and weighs 184 pounds.
Can one variable be used to predict the other? Do tall people tend to weigh more?
Definition
A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable.
Bivariate data
For each individual studied, we record data on two variables.
We then examine whether there is a relationship between these two variables: Do changes in one variable tend to be associated with specific changes in the other variable?
Here we have two quantitative variables recorded for each of 16 students:
1. how many beers they drank
2. their resulting blood alcohol content (BAC)

Student ID   Number of Beers   Blood Alcohol Content
 1           5                 0.1
 2           2                 0.03
 3           9                 0.19
 4           8                 0.12
 5           3                 0.04
 6           7                 0.095
 7           3                 0.07
 8           5                 0.06
 9           3                 0.02
10           5                 0.05
11           4                 0.07
12           6                 0.1
13           5                 0.085
14           7                 0.09
15           1                 0.01
16           4                 0.05
Scatterplots
A scatterplot is used to display quantitative bivariate data.
Each variable makes up one axis. Each individual is a point on the graph.
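As a sketch of how the beer and BAC table could be drawn as a scatterplot in R: the data frame name and column names are illustrative, and treating the number of beers as the explanatory variable (horizontal axis) is an assumption about how the data would be read.

```r
# Values copied from the beers/BAC table, ordered by student ID
bac_data <- data.frame(
  student = 1:16,
  beers = c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4),
  bac = c(0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
          0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05)
)

# Response ~ explanatory: BAC on the vertical axis, beers on the horizontal axis
plot(bac ~ beers, data = bac_data,
     main = "BAC vs number of beers",
     xlab = "Number of Beers", ylab = "Blood Alcohol Content")
```

Each of the 16 students appears as one point, so ties such as the four students who drank 5 beers show up as points stacked above the same horizontal position.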
> plot(trees$Girth~trees$Height,main="girth vs height")
[Figure: scatterplot titled "girth vs height"; horizontal axis trees$Height from 65 to 85; vertical axis trees$Girth from 8 to 20]
Measuring the Center
Measures of the Center
Definition
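Given $x_1, x_2, \dots, x_n$, the sample mean is $\bar{x} \overset{\text{def}}{=} \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{j=1}^{n} x_j$.
The population mean is $\mu \overset{\text{def}}{=} \frac{1}{N}\sum_{j=1}^{N} x_j$.
The mode is the most frequently occurring data value (it need not be unique).
If one orders the data from smallest to largest, the median is
$M \overset{\text{def}}{=} \begin{cases} \text{middle value of data} & \text{if } n \text{ is odd} \\ \text{the average of the middle two values of data} & \text{if } n \text{ is even.} \end{cases}$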
Laymen refer to the mean as the average.
Example
The median sales price of a house in Milford was $212,175 for Feb–Apr 2013. If Bill Gates buys a house in Milford for $100 million, what will that do to the mean cost of a house in Milford? To the median cost of a house in Milford? Which is a better measure of the cost of buying a house in Milford, the mean or the median?
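The effect in this example can be tried out in R. The five sale prices below are hypothetical, chosen only so that their median is $212,175; they are not actual Milford data.

```r
# Hypothetical sale prices with median 212175 (illustrative values only)
prices <- c(180000, 200000, 212175, 230000, 250000)
mean_before <- mean(prices)
median_before <- median(prices)

# Add one $100 million purchase
with_mansion <- c(prices, 100e6)
mean_after <- mean(with_mansion)
median_after <- median(with_mansion)

# The mean jumps by millions; the median barely moves
print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

With these made-up numbers, a single extreme sale moves the mean from roughly $214 thousand to over $16 million, while the median shifts only to the midpoint of the two middle prices.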
“Statistically, if you lie with your head in the oven and your feet in the fridge, on average you will be comfortably warm.” – Anonymous
“Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates
“The average human has one breast and one testicle.” – humorist Des McHale
Comparing Mean and Median
The mean and median measure center in different ways, and both are useful.
The mean and median of a roughly symmetric distribution are close together.
If the distribution is exactly symmetric, the mean and median are exactly the same.
In a skewed distribution, the mean is usually farther out in the long tail than is the median.
Measuring the Spread
Definition
The population variance is $\sigma^2 \overset{\text{def}}{=} \frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2$ and the population standard deviation is $\sigma \overset{\text{def}}{=} \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2}$.
However, one often has only a random sample to examine, not the entire population. With only a random sample, one cannot calculate the population mean, $\mu$, so the best one can do is use the sample mean, $\bar{x}$, instead.
Definition
The sample variance is $s^2 \overset{\text{def}}{=} \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$ and the sample standard deviation is $s \overset{\text{def}}{=} \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2}$.
Notice the use of $n - 1$ instead of $n$ for the sample variance and standard deviation.
Example
Suppose our random sample is
4.5, 3.7, 2.8, 5.3, 4.6.
Then
$\bar{x} = \frac{1}{5}\left[4.5 + 3.7 + 2.8 + 5.3 + 4.6\right] = 4.18$
$s^2 = \frac{1}{5-1}\left[(4.5 - 4.18)^2 + (3.7 - 4.18)^2 + (2.8 - 4.18)^2 + (5.3 - 4.18)^2 + (4.6 - 4.18)^2\right] = 0.917$
$s = \sqrt{0.917} = 0.9576012.$
Using R:
> dat=c(4.5,3.7,2.8,5.3,4.6)
> mean(dat)
[1] 4.18
> var(dat)
[1] 0.917
> sd(dat)
[1] 0.9576012
Why $s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$ and not $s^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2$?
Ans: $s^2$ is unbiased (on average $s^2$ will be $\sigma^2$), while $\frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2$ is biased (on average it will be $\frac{n-1}{n}\sigma^2$).
Note:
1 $s^2 = \frac{n\sum_{j=1}^{n} x_j^2 - \left(\sum_{j=1}^{n} x_j\right)^2}{n(n-1)}$ and $s = \sqrt{\frac{n\sum_{j=1}^{n} x_j^2 - \left(\sum_{j=1}^{n} x_j\right)^2}{n(n-1)}}$
2 $s^2 = 0 \Leftrightarrow s = 0 \Rightarrow$ all values in the random sample are the same.
3 $\bar{x}$ is an estimator of $\mu$.
4 $s^2$ is an (unbiased) estimator of $\sigma^2$.
5 $s$ is a (biased) estimator of $\sigma$.
6 $s$ measures the amount the data is dispersed about the mean.
7 $s$ has the same units of measurement as the data.
8 $s$ is sensitive to the existence of outliers.
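The shortcut formula in note 1 can be checked numerically against the definition. A small sketch, reusing the five-value sample from the earlier example; the variable names are illustrative:

```r
dat <- c(4.5, 3.7, 2.8, 5.3, 4.6)
n <- length(dat)

# Definition: squared deviations from the sample mean, divided by n - 1
s2_definition <- sum((dat - mean(dat))^2) / (n - 1)

# Shortcut formula from note 1
s2_shortcut <- (n * sum(dat^2) - sum(dat)^2) / (n * (n - 1))

# Dividing by n instead gives (n - 1) / n times the sample variance
s2_divide_by_n <- sum((dat - mean(dat))^2) / n

print(c(definition = s2_definition,
        shortcut = s2_shortcut,
        builtin = var(dat),
        divide_by_n = s2_divide_by_n))
```

R's `var()` and `sd()` use the $n - 1$ divisor, so the first three values agree up to floating-point rounding.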
Definition (Coefficient of Variation, CV)
sample CV $\overset{\text{def}}{=} \frac{s}{\bar{x}} \times 100$,
population CV $\overset{\text{def}}{=} \frac{\sigma}{\mu} \times 100$.
CV standardizes standard deviation according to the mean.
Example
There are 250 dogs at a dog show that weigh an average of 25 pounds, with a standard deviation of 8 pounds. The 250 human owners had an average weight of 172 lbs with a standard deviation of 29 lbs. Do the humans or the dogs vary more in weight?
Sol: It depends how the question is interpreted. The standard deviation of weight for the owners is 29 lbs, which is greater than the standard deviation for the dogs, which is 8 lbs. However, for humans CV = $\frac{29}{172} \times 100 \approx 17$, which is less than the CV for dogs, which is $\frac{8}{25} \times 100 = 32$.
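The comparison in this example takes only a few lines of R. The helper `cv()` below is a hypothetical function written for this sketch; base R has no built-in function of that name.

```r
# Coefficient of variation as a percentage: standard deviation relative to the mean
# (cv is a helper defined here for illustration, not a base R function)
cv <- function(std_dev, center) {
  100 * std_dev / center
}

cv_dogs <- cv(8, 25)
cv_humans <- cv(29, 172)

print(c(dogs = cv_dogs, humans = cv_humans))
```

Relative to their own average weight, the dogs vary about twice as much as their owners, even though the owners have the larger standard deviation in pounds.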
Theorem (Chebyshev’s Theorem)
At least $1 - \frac{1}{K^2}$ of the data is within $K$ standard deviations of the mean (for any $K > 1$).
Example
at least 3/4 of the data is within 2 standard deviations of the mean
at least 8/9 of the data is within 3 standard deviations of the mean
at least 15/16 of the data is within 4 standard deviations of the mean
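Chebyshev's bound can be compared with what actually happens in a dataset. The sketch below uses the `Volume` column of the built-in `trees` dataset as an arbitrary example; the choice of dataset and of $K = 2, 3, 4$ is an assumption for illustration.

```r
x <- trees$Volume
center <- mean(x)
spread <- sd(x)

K <- 2:4

# Guaranteed minimum proportion from Chebyshev's Theorem
bound <- 1 - 1 / K^2

# Observed proportion of the data within K standard deviations of the mean
observed <- sapply(K, function(k) mean(abs(x - center) <= k * spread))

print(data.frame(K = K, chebyshev_bound = bound, observed = observed))
```

The observed proportions are never below the bound. For most real datasets they are well above it, since the theorem has to hold for every possible distribution.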
Normal Distribution
Density Curves
A density curve is a curve that:
• is always on or above the horizontal axis
• has an area of exactly 1 underneath it
A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall in that range.
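Both properties can be checked numerically for a specific density curve. The sketch below uses the standard normal density, `dnorm`, as one example of a density curve; the range from -1 to 1 is chosen arbitrarily.

```r
# Total area under the standard normal density curve (should be 1)
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Area above the range [-1, 1]: the proportion of observations in that range
area_within_one <- integrate(dnorm, lower = -1, upper = 1)$value

# The same proportion from the cumulative distribution function
area_from_cdf <- pnorm(1) - pnorm(-1)

print(c(total = total_area, within_one = area_within_one, from_cdf = area_from_cdf))
```

For the standard normal curve the second area comes out near 0.68, meaning about 68% of observations fall within one unit of the center.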
Advantage of $s$ over $s^2$: $s$ uses the same unit of measurement as the data.
Definition (Standardized Scores (z–scores))
sample standardized score: $z \overset{\text{def}}{=} \frac{x - \bar{x}}{s}$,
population standardized score: $z \overset{\text{def}}{=} \frac{x - \mu}{\sigma}$.
Advantage of z–scores: unit independent.
Example (from Book)
z–score of Michael Jordan’s height = 3.21
z–score of Rebecca Lobo’s height = 4.96
Conclusion:
Jordan is 3.21 standard deviations taller than the average male height.
Lobo is 4.96 standard deviations taller than the average female height.
Note: negative z–scores ⇒ smaller than average.
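Sample z–scores follow directly from the definition. A minimal sketch, using the `Height` column of the built-in `trees` dataset as stand-in data, since the heights behind the book example are not given here:

```r
h <- trees$Height

# Sample standardized scores: (x - sample mean) / sample standard deviation
z <- (h - mean(h)) / sd(h)

# Base R's scale() performs the same centering and scaling
z_scaled <- as.numeric(scale(h))

# Standardized scores have mean 0 and standard deviation 1, whatever the original units
print(c(mean_z = mean(z), sd_z = sd(z)))

# Trees shorter than average have negative z-scores
print(sum(z < 0))
```

Because the units cancel in the ratio, the same z–scores would result if the heights were recorded in meters instead of feet.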
Order Statistics
Definition
percentile of $x$ = percentage of data less than $x$.
$P_k$ = the $k$th percentile.
$Q_1$ = 1st quartile = $P_{25}$ = border of bottom 25% and top 75%
 = median of all values ≤ overall median.
$Q_2$ = 2nd quartile = $P_{50}$ = median.
$Q_3$ = 3rd quartile = $P_{75}$ = border of top 25% and bottom 75%
 = median of all values ≥ overall median.
5–number summary = min, $Q_1$, $Q_2$, $Q_3$, max.
Interquartile Range = IQR = $Q_3 - Q_1$.
How to find $P_k$, the $k$th percentile, according to the book: Order $x_1, x_2, \dots, x_n$ in increasing order. Count $L$ in to get $P_k$, where $L$ is the smallest integer $\ge \frac{k}{100} n$.
Example
A 5-year-old is in the 90th percentile, weight-wise, for his age group.
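The book's counting rule for $P_k$ translates into a few lines of R. The function name `book_percentile` is hypothetical, and the data are the eleven egg counts from the outlier example later in these slides.

```r
# Book rule: sort the data, then take the L-th value, where L is the
# smallest integer >= (k / 100) * n. (book_percentile is a helper
# written for this sketch, not a base R function.)
book_percentile <- function(x, k) {
  x_sorted <- sort(x)
  L <- max(1, ceiling(k / 100 * length(x_sorted)))
  x_sorted[L]
}

eggs <- c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7)

q1 <- book_percentile(eggs, 25)
q2 <- book_percentile(eggs, 50)
q3 <- book_percentile(eggs, 75)

print(c(Q1 = q1, Q2 = q2, Q3 = q3, IQR = q3 - q1))
```

R's built-in `quantile()` interpolates between data values by default, so its quartiles can differ slightly from the ones this rule produces.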
Choosing Measures of Center and Spread
We now have a choice between two descriptions for center and spread:
Mean and Standard Deviation
Median and Interquartile Range
The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!
R Commands:
Example
> mean(trees$Volume)
[1] 30.17097
> median(trees$Volume)
[1] 24.2
> summary(trees$Volume)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.20 19.40 24.20 30.17 37.30 77.00
> IQR(trees$Volume)
[1] 17.9
> var(trees$Volume)
[1] 270.2028
> sd(trees$Volume)
[1] 16.43785
Outliers
Definition (The 1.5 × IQR Rule for Outliers)
Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
Outliers have effects on means and variances far greater than their numbers would suggest.
Example
The number of eggs laid by Farmer John’s chickens (hens) in September 2016 is given below.
18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7.
Are there any outlier hens?
Solution: One figures out the quartiles from the ordered data:
3 5 7 9 11 13 15 16 18 23 35
with $Q_1 = 7$, $M = 13$, and $Q_3 = 18$.
Since IQR = 18 − 7 = 11 and 35 − $Q_3$ = 17 > 1.5 × IQR = 16.5, one identifies 35 as an outlier.
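The same check can be scripted in R. This sketch calls `quantile()` with `type = 1`, which picks actual data values rather than interpolating; using that type to reproduce the quartiles of this example is an assumption of the sketch, not something the slides specify.

```r
eggs <- c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7)

# Quartiles without interpolation (type = 1 returns observed data values)
q <- quantile(eggs, probs = c(0.25, 0.75), type = 1, names = FALSE)
q1 <- q[1]
q3 <- q[2]
iqr <- q3 - q1

# 1.5 x IQR rule: flag anything beyond these fences
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- eggs[eggs < lower_fence | eggs > upper_fence]

print(c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```

With R's default quantile type the quartiles and fences shift slightly, so borderline observations can be classified differently depending on the quartile convention.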
Boxplots
Definition
Given $x_1, x_2, \dots, x_n$, to create a boxplot (also called a box and whiskers plot):
1 draw and label a vertical number line that includes the range of the distribution.
2 draw a box from height $Q_1$ to $Q_3$.
3 draw a horizontal line inside the box at the height of the median.
4 draw vertical line segments (whiskers) from the bottom and top of the box to the minimum and maximum data values that are not outliers.
5 sometimes outliers are identified with ◦’s (R does this).
Boxplots are often useful when comparing the values of two different variables.
> boxplot(trees$Height, main="Heights of Black Cherry Trees")
> boxplot(USJudgeRatings$DMNR,USJudgeRatings$DILG,
+ main="Lawyers’ Demeanor/Diligence ratings of US Superior Court state judges")
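[Figure: two boxplots. First: "Heights of Black Cherry Trees", vertical axis from 65 to 85. Second: "Lawyers' Demeanor/Diligence ratings of US Superior Court state judges", vertical axis from 5 to 9, with two outliers marked by ●]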
Chapter #1 R Assignment
Fifty-eight sailors are sampled and their eye color is noted as below:
blue   brown   green   hazel   red
11     32      8       5       2
1 Create a barplot and pie chart of eye color from the sailor sample.
2 Create a histogram and a stemplot of the height of loblolly trees from the dataset “Loblolly”. The dataset “Loblolly” comes with R, just as “trees” does. To observe “Loblolly”, type “Loblolly” at the R prompt (without the quotes). To learn more about the dataset, type “help(Loblolly)” at the R prompt.
3 Find the mean, median, five-number summary, variance, and standard deviation from the sample of heights in the dataset “Loblolly”.
4 Create a scatterplot of weight versus quarter mile times for the dataset “mtcars”. Assume that the independent variable is the quarter mile time and the dependent variable is the weight.