Lesson Plan - Statistical Sciencecm160/chapter2.pdfLesson Plan Representing data with graphs...

Lesson Plan

Representing data with graphs

Representing data with statistics

Homework:

1-8, 1-18, 2-6, 2-18


Data

Are there good ways and bad ways to represent data?

Yes, depending on the nature of the data, the representation may differ.

Therefore, we first need to understand what kinds of data we can possibly observe.

In order to do that, it is necessary to understand why data can be different!!

A population is composed of individuals (in a general sense!! They could be persons, financial stocks, nations, trees, etc.).

A sample is a collection of individuals taken from the population.

On each individual we can measure several different characteristics (variables).

Example

Name  Gender  Salary  Education  # Family members
AA    F       50      B          2
BB    M       43      B          3
CC    F       65      M          1
DD    M       200     B          2
EE    M       60      M          4
FF    M       25      S          2
GG    F       15      S          0
HH    F       80      D          3
II    M       22      S          1
JJ    F       69      B          4
KK    F       70      M          2

Where

In the Gender column:

M stands for male

F stands for female

In the Education column:

S stands for senior high

B stands for bachelor

M stands for master

D stands for doctorate

Do the data all have the same nature?

Can they all be represented in the same way?

Can they all be analyzed in the same way?

Terminology

A variable can usually be characterized according to many different criteria. We will say that variables can be

Qualitative (representing characteristics which cannot be naturally associated with a number)

    Ordinal (although not naturally associated with a number, the values can be somehow ordered)
    Non-ordinal

Quantitative (characteristics on which it is possible to apply arithmetic operations)

    Discrete (assuming values only on a discrete set, like the integers)
    Continuous

Gender

According to the data, there are 5 males and 6 females in the sample. A very simple representation is the bar chart.

[Figure: bar chart of Gender (F, M) against Frequency]
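The counts behind the bar chart can be reproduced with a short sketch. The list `gender` is transcribed from the example table; the text rendering with `#` characters is only an illustrative stand-in for the real plot.

```python
from collections import Counter

# Gender column of the example table, in row order AA..KK
gender = ["F", "M", "F", "M", "M", "M", "F", "F", "M", "F", "F"]

# Frequency of each category: this is what the bar heights represent
counts = Counter(gender)

# Crude text version of the bar chart, one row per category
for level in sorted(counts):
    print(f"{level} | {'#' * counts[level]} ({counts[level]})")
```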


Salary: frequencies

The analysis of the salary is a little bit more complicated. A natural approach would be to divide the data into classes, for example salaries smaller or bigger than 100 (thousands).

This representation is called a histogram.

[Figure: histogram of Salary (0 to 200) against Frequency]

Salary: distribution

The analysis of the salary can also be conducted in relative terms.

[Figure: histogram of Salary (0 to 200) against Density]

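The construction of the distribution requires a little bit of attention, since it is based on the relative frequencies

$$\text{relative frequency} = \frac{\text{frequency}}{\text{sum of all frequencies}}$$

$$f_{<100} = \frac{\#\,\text{salaries} < 100}{\#\,\text{salaries}} = \frac{10}{11}$$

$$f_{\geq 100} = \frac{\#\,\text{salaries} \geq 100}{\#\,\text{salaries}} = \frac{1}{11}$$

Then the relative frequencies must be distributed evenly along the cells

$$\text{height}_{<100} = \frac{f_{<100}}{\text{base}_{<100}} = \frac{10}{11 \times (100 - 0)} \approx 0.009$$

$$\text{height}_{\geq 100} = \frac{f_{\geq 100}}{\text{base}_{\geq 100}} = \frac{1}{11 \times (200 - 100)} \approx 0.001$$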
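As a quick check of the arithmetic, the relative frequencies and rectangle heights can be computed directly from the salary column. This is a minimal sketch; the variable names are illustrative.

```python
# Salary column of the example table (in thousands)
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]
n = len(salary)

# Relative frequency of each of the two classes
f_low = sum(s < 100 for s in salary) / n     # 10/11
f_high = sum(s >= 100 for s in salary) / n   # 1/11

# Height = relative frequency spread evenly over the base of the class
height_low = f_low / (100 - 0)
height_high = f_high / (200 - 100)

# Each rectangle's area is its relative frequency, so the areas add to 1
area = height_low * (100 - 0) + height_high * (200 - 100)

print(round(height_low, 3), round(height_high, 3), area)
```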


There is an important property of the histogram built in relative terms:

The total area of the rectangles is equal to 1

$$\text{height}_{<100} \times \text{base}_{<100} + \text{height}_{\geq 100} \times \text{base}_{\geq 100} = \frac{10}{11 \times (100 - 0)} \times (100 - 0) + \frac{1}{11 \times (200 - 100)} \times (200 - 100) = 1$$

This property is worth remembering because it will be recalled later in the course.

Salary: frequencies (higher detail)

The number of cells is, however, arbitrary (there is no natural distinction as in the gender case).

For example, it is possible to group the data according to the following scheme: (0, 50, 100, 150, 200)

[Figure: histogram of Salary against Frequency, with class boundaries at 0, 50, 100, 150, 200]

Salary: frequencies (even higher detail)

For example, it is possible to group the data according to the following scheme: (0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200)

[Figure: histogram of Salary against Frequency, with class boundaries every 20 from 0 to 200]

Education

The education variable is qualitative, like gender. However, for gender there is no natural ordering.

For the education variable there seems to be a general ordering: from SENIOR HIGH to DOCTORATE level.

[Figure: bar chart of Education (SH, B, M, D) against Frequency]

Number of family members

Finally, the variable number-of-family-members is quantitative, like the salary.

However, the number-of-family-members is restricted to the set of integer numbers.

[Figure: bar chart of Family Members (0, 1, 2, 3, 4) against Frequency]

P-Percentile

Definition

The p-percentile is the observation within the sample such that p% of the remaining observations are equal to or lower than the p-percentile.

The position of the p-percentile in the sorted sample is $\frac{p}{100}(n + 1)$

The 25-percentile is called the 1st quartile
The 50-percentile is called the median
The 75-percentile is called the 3rd quartile

It is important to notice that, in order to evaluate a percentile, the data must be qualitative ordinal or quantitative

P-Percentile in practice (Salary)

For the salary

The original data are: (50 43 65 200 60 25 15 80 22 69 70)

The sorted data are: (15 22 25 43 50 60 65 69 70 80 200)

Writing salary[k] for the k-th value of the sorted data:

salary[0.25 × (11 + 1)] = salary[3] = 25

salary[0.5 × (11 + 1)] = salary[6] = 60

salary[0.75 × (11 + 1)] = salary[9] = 70

What if only 10 observations were available? A little bit more difficult!

IMPORTANT: the evaluation of percentiles may differ according to the book!!

P-Percentile (Number of family members)

For the number of family members

The original data are: (2 3 1 2 4 2 0 3 1 4 2)

The sorted data are : (0 1 1 2 2 2 2 3 3 4 4)

nfm[0.25∗(11+1)] = nfm[3] = 1

nfm[0.5∗(11+1)] = nfm[6] = 2

nfm[0.75∗(11+1)] = nfm[9] = 3

In some cases, the p-percentile does not coincide exactly with an observation. This can be a problem for quantitative discrete data.

P-Percentile (Tricky Case)

Suppose you have the following data

y = 1, 3, 6, 8

What are q1, the median and q3?

According to the percentile formula, the locations are

0.25 × 5 = 1.25

0.5 × 5 = 2.5

0.75 × 5 = 3.75

Therefore

q1 = 1 + (3 − 1) × 0.25 = 1.5 (or between 1 and 3)

median = 3 + (6 − 3) × 0.5 = 4.5 (or between 3 and 6)

q3 = 6 + (8 − 6) × 0.75 = 7.5 (or between 6 and 8)
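The position rule and the interpolation step can be sketched as a small function. The name `percentile` is illustrative, and clamping to the smallest or largest observation when the position falls outside the sample is an assumption of this sketch, not something the slides specify. As noted earlier, other books use other rules.

```python
def percentile(data, p):
    """p-percentile via position p/100 * (n + 1), with linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    pos = p / 100 * (n + 1)   # 1-based position in the sorted data
    k = int(pos)              # observation just below the position
    frac = pos - k            # how far to move towards the next observation
    if k < 1:                 # assumption: clamp below the first observation
        return xs[0]
    if k >= n:                # assumption: clamp above the last observation
        return xs[-1]
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

y = [1, 3, 6, 8]
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

print([percentile(y, p) for p in (25, 50, 75)])
print([percentile(salary, p) for p in (25, 50, 75)])
```

When the position is a whole number, as for the salary data, `frac` is zero and the function simply returns the observation at that position.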

What about qualitative ordinal and quantitative discrete variables?

The Box-plot

The box plot is a particular graphical representation which combines

min

q1

median

q3

max

The box plot cannot be built for the Gender variable.

The Education variable is ordinal. Therefore it is in theory possible to determine q1, median, q3, min, max. However, the spacing between the categories on the axis would be arbitrary.
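The five numbers behind a box plot can be computed for the salary variable as follows. This sketch assumes Python 3.8 or later, where `statistics.quantiles` with `method="exclusive"` places the quartiles at positions based on (n + 1), the same rule used in this lesson.

```python
import statistics

# Salary column of the example table (in thousands)
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

# n=4 asks for the three cut points that split the data into quarters
q1, median, q3 = statistics.quantiles(salary, n=4, method="exclusive")

# The five values a box plot combines
five_number = (min(salary), q1, median, q3, max(salary))
print(five_number)
```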


Box-plot for Salary

[Figure: box plot of Salary, axis from 50 to 200]

Box-plot for Number-of-family-members

[Figure: box plot of Family Members, axis from 0 to 4]

Describing a distribution with numbers

So far we have learned how to represent a distribution with graphs.

Graphs are good at “marketing” your analysis, but not necessarily the best way to achieve a deep understanding of the phenomenon.

On the other hand, a table with lots of numbers gives you lots of information, but it is not “appealing”.

What is the right way?

It depends mostly on the person you are talking to.


Center of a Distribution

The most obvious starting point is the center of the distribution.

Suppose I want to buy a house in San Francisco and I have no perspective, no clue, not even a vague idea of what the price for a studio is. What should I do?

I should probably contact a friend in San Francisco and ask: on average, how much should I expect to pay if I want to buy a decent studio?

I am basically asking what the center of the distribution is!!!

Terminology

We will say that

A parameter is a descriptive measure of a population.

A statistic is a descriptive measure of a sample.

In many cases we will say (later in the course) that statistics are approximations of parameters.

Some approximations will be better than others.

We will see which are good approximations.

For the moment, just know that we are focusing on the sample part.

Mode

The mode of a variable is the most frequent observation occurring in the data set.

Mode(gender)=FEMALE [6(F) and 5(M)]

Mode(education)=BACHELOR [4(B), 3(S), 3(M), 1(D)]

Mode(# family members)=2 [4(2), 2(1), 2(3), 2(4), 1(0)]

If you look at the graphs you will realize that the modes usually coincide with the peaks of the graphical representation.

What about Mode(salary)? Because the variable is continuous, we need to group the data. But this is an arbitrary operation, therefore the mode will depend on the choice of the groups.
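For the three variables that need no grouping, the modes can be found by counting. A minimal sketch follows; the helper `modes` is illustrative and returns a list so that ties, if any, are all reported.

```python
from collections import Counter

# Columns of the example table, in row order AA..KK
gender = ["F", "M", "F", "M", "M", "M", "F", "F", "M", "F", "F"]
education = ["B", "B", "M", "B", "M", "S", "S", "D", "S", "B", "M"]
family = [2, 3, 1, 2, 4, 2, 0, 3, 1, 4, 2]

def modes(values):
    """Return every value that reaches the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

print(modes(gender), modes(education), modes(family))
```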


Salary Mode: TRICKY

[Figure: two histograms of Salary (0 to 200) against Density, drawn on different density scales]

What is the mode? This is an example of a graph suggesting a fact which is not “TRUE”.
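How the modal class moves with the grouping can be illustrated with a short sketch. The two grouping schemes below and the right-closed convention (lo, hi] are assumptions chosen for illustration; they are not necessarily the groupings used in the figures. Comparing raw counts is only fair here because the classes within each scheme have equal width.

```python
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

def class_counts(data, edges):
    """Count observations in right-closed classes (lo, hi]."""
    pairs = zip(edges[:-1], edges[1:])
    return {(lo, hi): sum(1 for x in data if lo < x <= hi) for lo, hi in pairs}

def modal_class(data, edges):
    """Class with the highest count (equal-width classes assumed)."""
    counts = class_counts(data, edges)
    return max(counts, key=counts.get)

fine = list(range(0, 201, 20))   # classes of width 20
coarse = [0, 100, 200]           # classes of width 100

print(modal_class(salary, fine), modal_class(salary, coarse))
```

With the fine scheme the peak sits in (60, 80]; with the coarse scheme the only thing that can be said is that it lies somewhere in (0, 100].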


Mean

Given that $y_1, y_2, \ldots, y_n$ are the n observations of a variable within a sample, the mean is defined to be

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

What about the mean of the gender and education?

They cannot be computed because they are qualitative variables.

You cannot apply arithmetic operations on words!!

$$\text{mean(\# family members)} = \frac{2+3+1+2+4+2+0+3+1+4+2}{11} = 2.18$$

2.18 people doesn’t make much sense! But it is not a big deal after all.

mean(salary) = ? Again it can be tricky!!!

Salary Mean: TRICKY

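If the data are NOT grouped, then the mean is equal to

$$\frac{50 + 43 + 65 + 200 + 60 + 25 + 15 + 80 + 22 + 69 + 70}{11} = 63.54$$

If the data are grouped

Class             0-20   20-40  40-60  60-80  80-180  180-200
n_i               1      2      3      4      0       1
n_i / n           .09    .18    .27    .36    0       .09
n_i / (n × base)  .005   .009   .014   .018   0       .005

$$\frac{1 \times 10 + 2 \times 30 + 3 \times 50 + 4 \times 70 + 0 \times 130 + 1 \times 190}{11} = 62.73$$

$$\frac{.09 \times 10 + .18 \times 30 + .27 \times 50 + .36 \times 70 + 0 \times 130 + .09 \times 190}{1} = 62.1$$

$$\frac{.005 \times 10 + .009 \times 30 + .014 \times 50 + .018 \times 70 + 0 \times 130 + .005 \times 190}{1} \times 20 = 64.6$$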
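The grouped computations can be reproduced with a short sketch in which each class is represented by its midpoint. The variable names are illustrative. The sketch also suggests where the small disagreement between the grouped results comes from: 62.1 is obtained from relative frequencies rounded to two decimals, while the unrounded version agrees with the count-based 62.73.

```python
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]
classes = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 180), (180, 200)]
counts = [1, 2, 3, 4, 0, 1]
n = sum(counts)

# Each class is approximated by its central value
mids = [(lo + hi) / 2 for lo, hi in classes]

raw_mean = sum(salary) / len(salary)                          # ungrouped data
grouped_mean = sum(c * m for c, m in zip(counts, mids)) / n   # from counts

# Same computation from relative frequencies, exact and rounded to 2 decimals
rel_mean = sum((c / n) * m for c, m in zip(counts, mids))
rel_mean_rounded = sum(round(c / n, 2) * m for c, m in zip(counts, mids))

print(raw_mean, grouped_mean, rel_mean, rel_mean_rounded)
```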

Median

The median is a statistic based on the order of the observations.

It is a particular case of percentile: the 50-percentile.

median(education) = Bachelor, since [S S S B B (B) B M M M D]

median(# family members) = 2, since [0 1 1 2 2 (2) 2 3 3 4 4]

median(gender) = ?

It cannot be evaluated because the variable is not ordinal.

median(salary) = ? As usual, it takes extra care.

Salary Median: TRICKY

If the data are NOT grouped, then the median is 60, since [15 22 25 43 50 (60) 65 69 70 80 200]

If the data are grouped

Class                     0-20   20-40  [40-60]  60-80  80-180  180-200
n_i                       1      2      3        4      0       1
n_i / n                   .09    .18    .27      .36    0       .09
n_i / (n × base)          .005   .009   .014     .018   0       .005
Σ_{i≤k} n_i               1      3      (6)      10     10      11
Σ_{i≤k} n_i / n           0.09   0.27   (0.54)   0.90   0.90    1*
Σ_{i≤k} n_i / (n × base)  0.005  0.014  (0.028)  0.046  0.046   0.050*
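The cumulative row of the table can be used to locate the class containing the median. The sketch below takes the median position as (n + 1)/2, following the percentile rule used earlier, and picks the first class whose cumulative count reaches it; the names are illustrative.

```python
from itertools import accumulate

classes = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 180), (180, 200)]
counts = [1, 2, 3, 4, 0, 1]
n = sum(counts)

# Running totals of the class counts: the cumulative frequencies
cumulative = list(accumulate(counts))

# Position of the median in the sorted data
position = (n + 1) / 2

# First class whose cumulative count reaches that position
median_class = next(c for c, cum in zip(classes, cumulative) if cum >= position)

print(cumulative, position, median_class)
```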


Left Skewness

[Figure: histogram of a left-skewed distribution (x against Frequency), with the mean, median and mode marked]

Right Skewness

[Figure: histogram of a right-skewed distribution (x against Frequency), with the mean, median and mode marked]

Symmetry

[Figure: histogram of a symmetric distribution (x against Frequency), with the mean, median and mode marked]

When to use what

Mean: quantitative data whose frequency distribution is approximately symmetric

Median: quantitative data whose frequency distribution is skewed (left or right)

Mode: when the most frequent observation is the desired measure of central tendency, or when the data are qualitative
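One way to see why the median is preferred for skewed data is to drop the largest salary and watch how each measure reacts. This is only an illustrative sketch on the example data, not a general proof.

```python
import statistics

salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]
without_top = [s for s in salary if s != 200]   # remove the largest salary

# How far each measure of center moves when one extreme value is removed
mean_shift = statistics.mean(salary) - statistics.mean(without_top)
median_shift = statistics.median(salary) - statistics.median(without_top)

print(round(mean_shift, 2), median_shift)
```

The mean moves by more than 13 (thousands), the median by 5.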


Spread of a Distribution

When describing a distribution, indicating the “center” may not be sufficient.

In many cases we want to know how much the data are dispersed.

For example: financial analysts are usually interested in the expected returns (think of it as the mean) and the risk (think of it as the spread).

In this sense, the “spread” indicates how much uncertainty characterizes the expected return.

Consider two hypothetical stock returns at times 1, 2, 3, 4, 5:

-2, -1, 0, 1, 2 with mean 0

-200, -100, 0, 100, 200 with mean 0
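The two series can be compared numerically. The sketch uses the sample standard deviation, which is defined later in this lesson, as the measure of spread; the variable names are illustrative.

```python
import statistics

returns_a = [-2, -1, 0, 1, 2]
returns_b = [-200, -100, 0, 100, 200]

# Same center ...
mean_a, mean_b = statistics.mean(returns_a), statistics.mean(returns_b)

# ... but very different spread (sample standard deviation, divisor n - 1)
sd_a, sd_b = statistics.stdev(returns_a), statistics.stdev(returns_b)

print(mean_a, mean_b, round(sd_a, 2), round(sd_b, 2))
```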


Range

The range is simply defined as

range = max − min

range(salary) = 200 − 15 = 185

range(# family members) = 4 − 0 = 4

range(gender) = not possible

range(education) = not possible

Inter-quartile Range

The Inter-quartile range is simply defined as

IQR = q3 − q1

where q1 and q3 are the 1st and 3rd quartiles. Therefore

IQR(salary) = 70 − 25 = 45

IQR(# family members) = 3 − 1 = 2

IQR(gender) =not possible

IQR(education) =not possible
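Both the range and the inter-quartile range can be checked with a short sketch. It assumes Python 3.8 or later, where `statistics.quantiles` with `method="exclusive"` uses positions based on (n + 1), matching the percentile rule of this lesson; the helper names are illustrative.

```python
import statistics

salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]
family = [2, 3, 1, 2, 4, 2, 0, 3, 1, 4, 2]

def value_range(data):
    """Largest observation minus smallest observation."""
    return max(data) - min(data)

def iqr(data):
    """Third quartile minus first quartile, using (n + 1)-based positions."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return q3 - q1

print(value_range(salary), iqr(salary), value_range(family), iqr(family))
```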


Variance

The variance, $s^2$, is computed as the sum of the squared deviations about the mean, $\bar{x}$, divided by $(n - 1)$.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

variance(gender) = not possible!!

variance(education) = not possible!!

variance(# family members) =

$$\frac{(2 - 2.18)^2 + (3 - 2.18)^2 + (1 - 2.18)^2 + (2 - 2.18)^2 + (4 - 2.18)^2 + (2 - 2.18)^2 + (0 - 2.18)^2 + (3 - 2.18)^2 + (1 - 2.18)^2 + (4 - 2.18)^2 + (2 - 2.18)^2}{11 - 1} = 1.56$$
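The formula translates almost line by line into code. A minimal sketch follows; the function name `sample_variance` is illustrative, and the result is compared with the standard library's `statistics.variance`, which also divides by n − 1.

```python
import statistics

family = [2, 3, 1, 2, 4, 2, 0, 3, 1, 4, 2]

def sample_variance(xs):
    """Sum of squared deviations about the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

print(round(sample_variance(family), 2))
```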


Salary Variance

If data are not grouped

variance(salary) =

$$\frac{(50 - 63.54)^2 + (43 - 63.54)^2 + (65 - 63.54)^2 + (200 - 63.54)^2 + (60 - 63.54)^2 + (25 - 63.54)^2 + (15 - 63.54)^2 + (80 - 63.54)^2 + (22 - 63.54)^2 + (69 - 63.54)^2 + (70 - 63.54)^2}{11 - 1} = 2515.07$$

If data are grouped, please read the book!!!!

The idea is simple and similar to the mean: you need to approximate each group by its central value!
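A rough sketch of that idea is given below. The formula used, with each class represented by its midpoint and the divisor n − 1 kept from the ungrouped case, is a common approximation and an assumption of this sketch; the book's exact formula may differ. The classes and counts are the ones used for the grouped mean.

```python
classes = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 180), (180, 200)]
counts = [1, 2, 3, 4, 0, 1]
n = sum(counts)

# Approximate every observation in a class by the class midpoint
mids = [(lo + hi) / 2 for lo, hi in classes]

# Grouped mean, then squared deviations weighted by the class counts
grouped_mean = sum(c * m for c, m in zip(counts, mids)) / n
grouped_var = sum(c * (m - grouped_mean) ** 2 for c, m in zip(counts, mids)) / (n - 1)

print(round(grouped_var, 2))
```

The result differs from the ungrouped 2515.07 because the midpoints only approximate the individual salaries.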


Standard Deviation

The standard deviation is defined as the square root of the variance, $s = \sqrt{s^2}$:

sd(gender) = not possible!!

sd(education) = not possible!!

sd(# family members) = sqrt(1.56) = 1.25

sd(salary) = sqrt(2515.07) = 50.15 (if the data are not grouped)

IMPORTANT: why do we need the standard deviation if we already have the variance?

Degrees of freedom

The degrees of freedom represent the effective number of values free to vary in the computation of a statistic??!!!

What does that mean?

What is the variance of a sample characterized by one observation (e.g. y = 3)? Why divide by (n − 1) and not n?

$$\text{var}(y) = \frac{(3 - \bar{y})^2}{1}$$

since $\bar{y} = \frac{3}{1} = 3$

$$\text{var}(y) = \frac{(3 - 3)^2}{1} = 0$$

Essentially, we cannot use y = 3 twice: once to compute $\bar{y}$ and again to measure the deviation from it.
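The same point can be seen on a larger sample: because the deviations from the mean always sum to zero, the last one is fixed by the others, so only n − 1 of them are free to vary. A minimal sketch on the family-members data follows (names are illustrative).

```python
family = [2, 3, 1, 2, 4, 2, 0, 3, 1, 4, 2]
n = len(family)
mean = sum(family) / n

# Deviations about the sample mean
deviations = [x - mean for x in family]

# They sum to zero (up to floating-point error) ...
total = sum(deviations)

# ... so the last deviation can be recovered from the first n - 1
last_from_others = -sum(deviations[:-1])

print(total, last_from_others, deviations[-1])
```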


Empirical Rule

If a distribution is roughly bell shaped, then

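Approximately 68% of the observations lie within 1 standard deviation of the mean

Approximately 95% of the observations lie within 2 standard deviations of the mean

Approximately 99.7% of the observations lie within 3 standard deviations of the mean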
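The rule can be checked by simulation. The sketch below draws from a standard normal distribution, which is one example of a bell-shaped distribution; the sample size and the seed are arbitrary choices made for illustration.

```python
import random

random.seed(0)   # arbitrary seed, only so the run is repeatable
sample = [random.gauss(0, 1) for _ in range(100_000)]

n = len(sample)
mean = sum(sample) / n
sd = (sum((x - mean) ** 2 for x in sample) / (n - 1)) ** 0.5

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in sample) / n

shares = {k: share_within(k) for k in (1, 2, 3)}
print(shares)
```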


[Figure: bell-shaped density curve (x from −4 to 4 against Density), with the central 68%, 95% and 99.7% regions marked]

Outliers

Outliers are “strange” values which do not seem to be in accordance with the rest of the distribution.

Think about the salary variable. One person earns $200,000, much more than the rest of the people in the sample.

Generally we can consider this observation an outlier.

As a general rule, we can define an outlier to be any value such that

$$y_i < q_1 - 1.5 \times \text{IQR} \quad \text{or} \quad y_i > q_3 + 1.5 \times \text{IQR}$$

In the salary variable it turns out that these values are

25 − 1.5 × (70 − 25) = −42.5

70 + 1.5 × (70 − 25) = 137.5 < 200 !!!!!
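The rule can be applied to the salary data in a few lines. The quartiles 25 and 70 are taken from the percentile computation earlier in the lesson; the variable names are illustrative.

```python
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

# Quartiles of the salary variable, as computed earlier in the lesson
q1, q3 = 25, 70
iqr = q3 - q1

# Values beyond these two limits are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [s for s in salary if s < lower or s > upper]
print(lower, upper, outliers)
```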

Linear Transformation

Please read the book!

