Lesson Plan
Representing data with graphs
Representing data with statistics
Homework:
1-8, 1-18, 2-6, 2-18
– p. 1/45
Data
Are there good ways and bad ways to represent data?
Yes, depending on the nature of the data, the representation may differ.
Therefore, we first need to realize what kind of data we can possibly observe.
In order to do that, it is necessary to understand why data can be different!!
A population is composed of individuals (in a general sense!! They could be persons, financial stocks, nations, trees, etc.).
A sample is a collection of individuals.
On each individual we can measure several different characteristics (variables).
Example
Name  Gender  Salary (thousands)  Education  # Family members
AA    F       50                  B          2
BB    M       43                  B          3
CC    F       65                  M          1
DD    M       200                 B          2
EE    M       60                  M          4
FF    M       25                  S          2
GG    F       15                  S          0
HH    F       80                  D          3
II    M       22                  S          1
JJ    F       69                  B          4
KK    F       70                  M          2
Where
In the Gender column:
    M stands for male
    F stands for female
In the Education column:
    S stands for senior high
    B stands for bachelor
    M stands for master
    D stands for doctorate
Do the data have the same nature?
Can they all be represented in the same way?
Can they be analyzed in the same way?
Terminology
A variable can usually be characterized according to many different criteria. We will say that variables can be
Qualitative (representing characteristics which cannot be naturally associated with a number)
    Ordinal (although not naturally associated with a number, the values can be somehow ordered)
    Non-ordinal
Quantitative (characteristics on which it is possible to apply arithmetic operations)
    Discrete (assuming values only on a discrete set, like the integers)
    Continuous
Gender
According to the data, there are 5 males and 6 females in the sample. A very simple representation is the bar chart.
[Figure: bar chart of Gender; x-axis: Gender (F, M); y-axis: Frequency]
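The counts behind the bar chart can be reproduced with a short Python sketch. The list `gender` below is simply the Gender column of the example table typed in by hand, and the text "bars" are only an illustration, not the plot used in the slides.

```python
from collections import Counter

# Gender column of the 11-person sample (AA..KK), copied from the example table.
gender = ["F", "M", "F", "M", "M", "M", "F", "F", "M", "F", "F"]

# Frequency of each category: this is what the bar chart displays.
gender_counts = Counter(gender)

for category in sorted(gender_counts):
    # One text "bar" per category, followed by its frequency.
    print(category, "#" * gender_counts[category], gender_counts[category])
```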
Salary: frequencies
The analysis of the salary is a little bit more complicated. A natural approach would be to divide the data into categories, for example smaller or bigger than 100 (thousands).
This representation is called a histogram.
[Figure: histogram of Salary with two cells, 0-100 and 100-200; x-axis: Salary; y-axis: Frequency]
Salary: distribution
The analysis of the salary can also be conducted in relative terms.
[Figure: histogram of Salary in relative terms; x-axis: Salary; y-axis: Density]
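The construction of the distribution requires a little bit of attention since it is based on the relative frequencies:
$$\text{relative frequency} = \frac{\text{frequency}}{\text{sum of all frequencies}}$$
$$f_{<100} = \frac{\#\,\text{salaries} < 100}{\#\,\text{salaries}} = \frac{10}{11}$$
$$f_{\geq 100} = \frac{\#\,\text{salaries} \geq 100}{\#\,\text{salaries}} = \frac{1}{11}$$
Then the relative frequencies must be distributed evenly along the cells:
$$\text{height}_{<100} = \frac{f_{<100}}{\text{base}_{<100}} = \frac{10}{11 \times (100 - 0)} = 0.009$$
$$\text{height}_{\geq 100} = \frac{f_{\geq 100}}{\text{base}_{\geq 100}} = \frac{1}{11 \times (200 - 100)} = 0.001$$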
There is an important property of the histogram built in relative terms:
The total area of the rectangles is equal to 1
$$\text{height}_{<100} \times \text{base}_{<100} + \text{height}_{\geq 100} \times \text{base}_{\geq 100} = \frac{10}{11 \times (100 - 0)} \times (100 - 0) + \frac{1}{11 \times (200 - 100)} \times (200 - 100) = 1$$
This property is worth remembering because it will be recalled later in the course.
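The heights and the unit-area property can be checked numerically. The sketch below assumes cells of the form (left, right], with the first cell also including its left edge; that convention is an assumption made here for illustration, and the variable names are not from the slides.

```python
# Salary column of the example table, in thousands.
salaries = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]
edges = [0, 100, 200]  # two cells: 0-100 and 100-200

n = len(salaries)
cells = list(zip(edges[:-1], edges[1:]))

heights = []
for i, (left, right) in enumerate(cells):
    # Assumed convention: (left, right], first cell also includes its left edge.
    count = sum(1 for s in salaries if left < s <= right or (i == 0 and s == left))
    # height = relative frequency / base
    heights.append(count / n / (right - left))

# Sum of height x base over all cells.
total_area = sum(h * (right - left) for h, (left, right) in zip(heights, cells))

print([round(h, 3) for h in heights], total_area)
```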
Salary: frequencies (higher detail)
The number of cells is however arbitrary (there is no natural distinction as in the gender case).
For example, it is possible to group the data according to the following scheme: (0, 50, 100, 150, 200)
[Figure: histogram of Salary with cells (0, 50, 100, 150, 200); x-axis: Salary; y-axis: Frequency]
Salary: frequencies (even higher detail)
For example, it is possible to group the data according to the following scheme: (0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200)
[Figure: histogram of Salary with cells of width 20 from 0 to 200; x-axis: Salary; y-axis: Frequency]
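To see how the choice of cells changes the picture, the three grouping schemes can be compared side by side. This is a minimal sketch: `bin_counts` is a hypothetical helper written for this illustration, and it assumes cells of the form (left, right] with the first cell also including its left edge.

```python
def bin_counts(data, edges):
    """Count observations per cell, assuming cells (left, right];
    the first cell also includes its left edge."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            left, right = edges[i], edges[i + 1]
            if left < x <= right or (i == 0 and x == left):
                counts[i] += 1
                break
    return counts

# Salary column of the example table, in thousands.
salaries = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

coarse = bin_counts(salaries, [0, 100, 200])
medium = bin_counts(salaries, [0, 50, 100, 150, 200])
fine = bin_counts(salaries, list(range(0, 201, 20)))

print(coarse)
print(medium)
print(fine)
```

The same eleven salaries give three different shapes, which is why the number of cells should always be reported together with the histogram.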
Education
The education variable is qualitative, like the gender. However, for the gender there is no natural ordering.
For the education variable there seems to be a general ordering: from SENIOR HIGH to DOCTORATE level.
[Figure: bar chart of Education; x-axis: Education (SH, B, M, D); y-axis: Frequency]
Number of family members
Finally, the variable number-of-family-members is quantitative, like the salary.
However, the number-of-family-members is restricted to the set of integer numbers.
[Figure: bar chart of Family Members; x-axis: Family Members (0 to 4); y-axis: Frequency]
P-Percentile
Definition
The p-percentile is the observation within the sample such that p% of the remaining observations are equal to or lower than the p-percentile.
The position of the p-percentile is $\frac{p}{100}(n + 1)$
The 25-percentile is called the 1st quartile
The 50-percentile is called the median
The 75-percentile is called the 3rd quartile
It is important to notice that in order to evaluate the percentile, the data must be qualitative ordinal or quantitative.
P-Percentile in practice (Salary)
For the salary
The original data are: (50 43 65 200 60 25 15 80 22 69 70)
The sorted data are: (15 22 25 43 50 60 65 69 70 80 200)
salary[0.25 × (11 + 1)] = salary[3] = 25
salary[0.5 × (11 + 1)] = salary[6] = 60
salary[0.75 × (11 + 1)] = salary[9] = 70
What if only 10 data were available? A little bit more difficult!
IMPORTANT: the evaluation of percentiles may differ according to the book!!
P-Percentile (Number of family members)
For the number of family members
The original data are: (2 3 1 2 4 2 0 3 1 4 2)
The sorted data are: (0 1 1 2 2 2 2 3 3 4 4)
nfm[0.25 × (11 + 1)] = nfm[3] = 1
nfm[0.5 × (11 + 1)] = nfm[6] = 2
nfm[0.75 × (11 + 1)] = nfm[9] = 3
In some cases, the p-percentile does not coincide exactly with an observation. This can be a problem for quantitative discrete data.
P-Percentile (Tricky Case)
Suppose you have the following data
y = 1, 3, 6, 8
What are q1, the median, and q3?
According to the percentile formula the locations are
0.25 × 5 = 1.25
0.5 × 5 = 2.5
0.75 × 5 = 3.75
Therefore
q1 = 1 + (3 − 1) × 0.25 = 1.5 (or between 1 and 3)
median = 3 + (6 − 3) × 0.5 = 4.5 (or between 3 and 6)
q3 = 6 + (8 − 6) × 0.75 = 7.5 (or between 6 and 8)
What about qualitative ordinal and quantitative discrete variables?
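The position rule and the interpolation used in the tricky case can be sketched in a few lines of Python. The function `percentile` below is a hypothetical helper written for this illustration: it uses the position p/100 × (n + 1) counted from 1, interpolates linearly between neighbouring observations, and (as an extra assumption not stated in the slides) clamps positions that fall outside 1..n. Other books and software use different conventions, so results may differ.

```python
def percentile(data, p):
    """p-percentile with position p/100 * (n + 1), counted from 1,
    and linear interpolation between neighbouring observations."""
    values = sorted(data)
    n = len(values)
    position = p / 100 * (n + 1)
    # Assumption: positions outside 1..n are clamped to the extremes.
    position = min(max(position, 1), n)
    lower = int(position)            # 1-based index just at or below the position
    fraction = position - lower      # how far we are towards the next observation
    if fraction == 0 or lower == n:
        return values[lower - 1]
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])

salaries = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]
tricky = [1, 3, 6, 8]

print([percentile(salaries, p) for p in (25, 50, 75)])
print([percentile(tricky, p) for p in (25, 50, 75)])
```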
The Box-plot
The box plot is a particular graphical representation which combines
min
q1
median
q3
max
The box plot cannot be built for the Gender variable.
The Education variable is ordinal. Therefore it is in theory possible to determine q1, median, q3, min, max. However, the spatial connotation would be arbitrary.
Box-plot for Salary
[Figure: box plot of Salary; axis: Salary, from 50 to 200]
Box-plot for Number-of-family-members
[Figure: box plot of Family Members; axis: Family Members, from 0 to 4]
Describing a distribution with numbers
So far we have learned how to represent a distribution with graphs.
Graphs are good at “marketing” your analysis, but not necessarily the best way to achieve a deep understanding of the phenomenon.
On the other hand, a table with lots of numbers gives you lots of information but it is not “appealing”.
What is the right way?
It depends mostly on the person you are talking to.
Center of a Distribution
The most obvious starting point is the center of the distribution.
Suppose I want to buy a house in San Francisco and I have no perspective, no clue, not even a vague idea of what the price for a studio is. What should I do?
I should probably contact a friend in San Francisco and ask: on average, how much should I expect to pay if I want to buy a decent studio?
I am basically asking what the center of the distribution is!!!
Terminology
We will say that
A parameter is a descriptive measure of a population.
A statistic is a descriptive measure of a sample.
In many cases we will say (later in the course) that statistics are approximations of parameters.
Some approximations will be better than others.
We will see which are good approximations.
For the moment just know that we are focusing on the sample part.
Mode
The mode of a variable is the most frequent observation occurring in the data set.
Mode(gender) = FEMALE [6(F) and 5(M)]
Mode(education) = BACHELOR [4(B), 3(S), 3(M), 1(D)]
Mode(# family members) = 2 [4(2), 2(1), 2(3), 2(4), 1(0)]
If you look at the graphs you will realize that the modes usually coincide with the peaks of the graphical representation.
What about Mode(salary)?
Because the variable is continuous, we need to group the data. But this is an arbitrary operation, therefore the mode will depend on the choice of the groups.
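The modes above can be reproduced by counting. In the sketch below, `mode` is a hypothetical helper that returns every value tied for the highest frequency; the columns are typed in from the example table. Applied to the ungrouped salaries it returns all eleven values, which illustrates why a continuous variable has to be grouped before the mode is meaningful.

```python
from collections import Counter

def mode(values):
    """Return all values that share the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Columns of the example table, in row order AA..KK.
gender = ["F", "M", "F", "M", "M", "M", "F", "F", "M", "F", "F"]
education = ["B", "B", "M", "B", "M", "S", "S", "D", "S", "B", "M"]
family = [2, 3, 1, 2, 4, 2, 0, 3, 1, 4, 2]
salaries = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

print(mode(gender), mode(education), mode(family))
# Every salary occurs exactly once, so each one is tied for "most frequent".
print(len(mode(salaries)))
```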
Salary Mode: TRICKY
[Figure: two density histograms of Salary built with different groupings; x-axis: Salary; y-axis: Density]
What is the mode?
Example of suggesting a fact which is not “TRUE”
Mean
Given that $y_1, y_2, \ldots, y_n$ are the n observations of a variable within a sample, the mean is defined to be
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
What about the mean of the gender and education?
They cannot be computed because they are qualitative variables.
You cannot apply arithmetic operations on words!!
mean(# family members) = (2 + 3 + 1 + 2 + 4 + 2 + 0 + 3 + 1 + 4 + 2)/11 = 2.18
2.18 people doesn’t make much sense! But it is not a big deal after all.
mean(salary) = ? Again it can be tricky!!!
Salary Mean: TRICKY
If the data are NOT grouped then the mean is equal to
(50 + 43 + 65 + 200 + 60 + 25 + 15 + 80 + 22 + 69 + 70)/11 = 63.54
If the data are grouped, each group is represented by its midpoint (10, 30, 50, 70, 130, 190):
group           0-20   20-40  40-60  60-80  80-180  180-200
n_i             1      2      3      4      0       1
n_i/n           .09    .18    .27    .36    0       .09
n_i/(n×base)    .005   .009   .014   .018   0       .005
(1×10 + 2×30 + 3×50 + 4×70 + 0×130 + 1×190)/11 = 62.73
(.09×10 + .18×30 + .27×50 + .36×70 + 0×130 + .09×190)/1 = 62.1
(.005×10 + .009×30 + .014×50 + .018×70 + 0×130 + .005×190)/1 × 20 = 64.6
The three grouped results differ only because the table entries are rounded.
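A short Python sketch makes the comparison between the ungrouped and the grouped mean explicit. It assumes, as in the table, that each group is represented by its midpoint; the variable names are chosen for this illustration. With unrounded relative frequencies the count-based and frequency-based versions of the grouped mean coincide.

```python
# Salary column of the example table, in thousands.
salaries = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

# Grouping scheme and group counts n_i from the table.
edges = [0, 20, 40, 60, 80, 180, 200]
counts = [1, 2, 3, 4, 0, 1]

n = sum(counts)
# Each group is approximated by its central value.
midpoints = [(left + right) / 2 for left, right in zip(edges[:-1], edges[1:])]

ungrouped_mean = sum(salaries) / len(salaries)
grouped_mean = sum(c * m for c, m in zip(counts, midpoints)) / n
# Same computation written with (unrounded) relative frequencies n_i / n.
grouped_mean_rel = sum((c / n) * m for c, m in zip(counts, midpoints))

print(round(ungrouped_mean, 2), round(grouped_mean, 2), round(grouped_mean_rel, 2))
```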
Median
The median is a statistic based on the order of the observations.
It is a particular case of percentile: the 50-percentile.
median(education) = Bachelor, since [S S S B B (B) B M M M D]
median(# family members) = 2, since [0 1 1 2 2 (2) 2 3 3 4 4]
median(gender) = ?
It cannot be evaluated because the variable is not ordinal.
median(salary) = ?
As usual, it takes extra care.
Salary Median: TRICKY
If the data are NOT grouped then the median is 60, since [15 22 25 43 50 (60) 65 69 70 80 200]
If the data are grouped:
group                               0-20   20-40  [40-60]  60-80  80-180  180-200
n_i                                 1      2      3        4      0       1
n_i/n                               .09    .18    .27      .36    0       .09
n_i/(n×base)                        .005   .009   .014     .018   0       .005
\sum_{i=1}^{k} n_i                  1      3      (6)      10     10      11
\sum_{i=1}^{k} n_i/n                0.09   0.27   (0.54)   0.90   0.90    1*
\sum_{i=1}^{k} n_i/(n×base)         0.005  0.014  (0.028)  0.046  0.046   0.050*
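The cumulative rows of the table identify the group that contains the median. A minimal sketch, assuming the median position 0.5 × (n + 1) from the percentile definition and using names invented for this example:

```python
from itertools import accumulate

groups = ["0-20", "20-40", "40-60", "60-80", "80-180", "180-200"]
counts = [1, 2, 3, 4, 0, 1]  # n_i from the table

n = sum(counts)
# Running totals of the group counts.
cumulative = list(accumulate(counts))

# Position of the median according to p/100 * (n + 1) with p = 50.
median_position = 0.5 * (n + 1)

# First group whose cumulative count reaches the median position.
median_class = next(g for g, c in zip(groups, cumulative) if c >= median_position)

print(cumulative, median_position, median_class)
```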
Left Skewness
[Figure: histogram of a left-skewed distribution; x-axis: x; y-axis: Frequency; legend: mean, median, mode]
Right Skewness
[Figure: histogram of a right-skewed distribution; x-axis: x; y-axis: Frequency; legend: mean, median, mode]
Symmetry
[Figure: histogram of a symmetric distribution; x-axis: x; y-axis: Frequency; legend: mean, median, mode]
When to use what
Mean: quantitative data and the frequency distribution is approximately symmetric
Median: quantitative data and the frequency distribution is skewed (left or right)
Mode: when the most frequent observation is the desired measure of central tendency, or the data are qualitative
Spread of a Distribution
When describing a distribution, the indication of the “center” may not be sufficient.
In many cases we want to know how much the data are dispersed.
For example: financial analysts are usually interested in the expected returns (think of it as the mean) and the risk (think of it as the spread).
In this sense, the “spread” indicates how much uncertainty characterizes the expected return.
Consider two hypothetical stock returns at times 1, 2, 3, 4, 5:
-2, -1, 0, 1, 2 with mean 0
-200, -100, 0, 100, 200 with mean 0
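A quick check in Python shows that the two return series share the same center but not the same spread. The helper names are made up for this sketch, and the range (max minus min) is used here only as the simplest possible measure of spread.

```python
returns_a = [-2, -1, 0, 1, 2]
returns_b = [-200, -100, 0, 100, 200]

def mean(values):
    # Arithmetic mean: sum of the observations divided by their number.
    return sum(values) / len(values)

def value_range(values):
    # Simplest measure of spread: largest minus smallest observation.
    return max(values) - min(values)

print(mean(returns_a), value_range(returns_a))
print(mean(returns_b), value_range(returns_b))
```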
Range
The range is simply defined as
range = max − min
range(salary) = 200 − 15 = 185
range(# family members) = 4 − 0 = 4
range(gender) = not possible
range(education) = not possible
Inter-quartile Range
The Inter-quartile range is simply defined as
IQR = q3 − q1
where q1 and q3 are the 1st and 3rd quartiles.
Therefore
IQR(salary) = 70 − 25 = 45
IQR(# family members) = 3 − 1 = 2
IQR(gender) =not possible
IQR(education) =not possible
Variance
The variance, $s^2$, is computed as the sum of the squared deviations about the mean, $\bar{x}$, divided by (n − 1).
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
variance(gender) = not possible!!
variance(education) = not possible!!
variance(# family members) = [(2 − 2.18)^2 + (3 − 2.18)^2 + (1 − 2.18)^2 + (2 − 2.18)^2 + (4 − 2.18)^2 + (2 − 2.18)^2 + (0 − 2.18)^2 + (3 − 2.18)^2 + (1 − 2.18)^2 + (4 − 2.18)^2 + (2 − 2.18)^2]/(11 − 1) = 1.56
Salary Variance
If data are not grouped
variance(salary) = [(50 − 63.54)^2 + (43 − 63.54)^2 + (65 − 63.54)^2 + (200 − 63.54)^2 + (60 − 63.54)^2 + (25 − 63.54)^2 + (15 − 63.54)^2 + (80 − 63.54)^2 + (22 − 63.54)^2 + (69 − 63.54)^2 + (70 − 63.54)^2]/(11 − 1) = 2515.07
If data are grouped, please read the book!!!!
The idea is simple and similar to the mean.
You need to approximate each group around the central value!
Standard Deviation
The standard deviation is defined as the square root of the variance:
sd(gender) = not possible!!
sd(education) = not possible!!
sd(# family members) = sqrt(1.56) = 1.25
sd(salary) = sqrt(2515.07) = 50.15 (if data are not grouped)
IMPORTANT: why do we need the standard deviation if we already have the variance?
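The variance and standard deviation values can be verified with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by (n − 1) as in the definition above. One common answer to the question about the standard deviation is visible in the output: it is expressed in the same units as the data (thousands), while the variance is in squared units.

```python
import statistics

# Columns of the example table.
family = [2, 3, 1, 2, 4, 2, 0, 3, 1, 4, 2]
salaries = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]  # thousands

# Sample variance (denominator n - 1) and its square root.
var_family = statistics.variance(family)
sd_family = statistics.stdev(family)
var_salary = statistics.variance(salaries)
sd_salary = statistics.stdev(salaries)

print(round(var_family, 2), round(sd_family, 2))
print(round(var_salary, 2), round(sd_salary, 2))
```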
Degrees of freedom
The degrees of freedom represent the effective number of values free to vary in the computation of a statistic??!!!
What does that mean?
What is the variance of a sample characterized by one observation (e.g. y = 3)? Why divide by (n − 1) and not n?
$$\mathrm{var}(y) = \frac{(3 - \bar{y})^2}{1}$$
since $\bar{y} = \frac{3}{1} = 3$
$$\mathrm{var}(y) = \frac{(3 - 3)^2}{1} = 0$$
Essentially, we cannot use y = 3 twice: it has already been used to compute $\bar{y}$.
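The one-observation case can be tried directly with Python's `statistics` module. Dividing by n (`pvariance`) returns 0, which would wrongly suggest that there is no variability at all, while the (n − 1) version (`variance`) refuses to give an answer for a single data point. This is only an illustration of the idea, not a formal argument.

```python
import statistics

single = [3]

# Dividing by n: (3 - 3)^2 / 1 = 0.
pop_var = statistics.pvariance(single)

# Dividing by n - 1 is undefined with one observation:
# the statistics module raises StatisticsError in that case.
try:
    statistics.variance(single)
    sample_var_defined = True
except statistics.StatisticsError:
    sample_var_defined = False

print(pop_var, sample_var_defined)
```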
Empirical Rule
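If a distribution is roughly bell shaped, then
Approximately 68% of the observations fall within 1 standard deviation of the mean
Approximately 95% of the observations fall within 2 standard deviations of the mean
Approximately 99.7% of the observations fall within 3 standard deviations of the mean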
[Figure: bell-shaped density with the regions containing 68%, 95%, and 99.7% of the distribution marked; y-axis: Density]
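The three percentages of the empirical rule can be recovered numerically under the assumption that "bell shaped" means an exact normal distribution (an assumption made for this sketch; real data only follow the rule approximately). For a normal variable, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)).

```python
import math

def within_k_sd(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean (assumes an exact normal distribution)."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * within_k_sd(k), 1))
```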
Outliers
Outliers are “strange” values which seem not to be in accordance with the rest of the distribution.
Think about the salary variable. One person earns $200,000, much more than the rest of the people in the sample.
Generally we can consider this observation an outlier.
As a general rule we can define an outlier to be any value
yi < q1 − 1.5 × IQR
yi > q3 + 1.5 × IQR
In the salary variable it turns out that these values are
25 − 1.5 × (70 − 25) = −42.5
70 + 1.5 × (70 − 25) = 137.5 < 200 !!!!!
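The 1.5 × IQR rule is easy to apply in code. The sketch below takes the quartiles q1 = 25 and q3 = 70 computed earlier for the salary as given, and the variable names are chosen for this example only.

```python
# Salary column of the example table, in thousands.
salaries = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

# Quartiles of the salary as computed with the (n + 1) position rule.
q1, q3 = 25, 70
iqr = q3 - q1

# Any value below the lower fence or above the upper fence is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in salaries if s < lower_fence or s > upper_fence]

print(lower_fence, upper_fence, outliers)
```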
Linear Transformation
Please read the book!