2 Descriptive Statistics

Descriptive statistics (Statistical Analysis)

Data: facts and figures collected for presentation and interpretation.

Analysis of data yields information

Variable: a characteristic of interest

Population

Sample

INFERENCE

Census: data collected on every member of the population

Expensive

Time consuming

Sample survey: data collected from a representative sample of the population

Less expensive

Takes much less time

Statistical analysis is used for this purpose

Approaches

Exploratory/descriptive

what does the data tell me?

Inferential

formulate a theory

construct a hypothesis

collect evidence to test the hypothesis

Why are descriptive statistics important?

Measures of variability (standard deviation) and central tendency (mean and median) give us an idea of which tests are appropriate. For example, for symmetric (usually normal) data the mean is a good estimator of central tendency, which points towards parametric tests. For asymmetric (usually non-normal) data the median is a good estimator of central tendency, which points towards non-parametric tests. Descriptive statistics give only a crude idea of the nature of the data; further tests are used to verify it.

A fundamental task in many statistical analyses is to characterize the location and variability of a data set.

Comparison of data sets

Descriptive statistics

Typical value

Spread or variability

Symmetry / asymmetry

Skewness

Kurtosis

Tables and Figures

Frequency tables

Bar charts

Pie charts

Dot plots

Histograms

Stem and leaf displays

Cross-tabulations

A measure of location summarises where the values are on the measuring or counting scale

LOCATION

Sample mean

Very well known but often misunderstood.

Many users are not aware of its limitations.

Median

Less well known, so less often used, but overcomes some of the limitations of the mean

Sample mean

Sample mean = (sum of all the values) / (number of values) = (x1 + x2 + ... + xn) / n

[Figure: dotplot of P/E ratios for banks and breweries etc.]

The median

The median is the middle value in a set of figures when they have been sorted into ascending order.

P/E ratios for the 19 companies quoted in the Breweries, Hotels and Restaurants sector

Rank    1     2     3     4     5     6     7
P/E     5.6   7.0   7.3   8.6   8.9   9.4   9.5

Rank    8     9     10    11    12    13    14
P/E     9.8   10.3  10.4  12.7  13.2  13.8  14.0

Rank    15    16    17    18    19
P/E     14.8  16.4  18.9  22.0  23.9

With 19 values the median is the 10th value which is 10.4.

P/E ratios for the 14 companies in the Banking sector

Rank    1     2     3     4     5     6     7
P/E     10.8  11.2  11.3  11.8  11.9  12.0  12.2

Rank    8     9     10    11    12    13    14
P/E     12.7  12.8  13.9  14.5  14.5  14.6  15.4

With 14 values the median is the average of the 7th and 8th values, so median = (12.2 + 12.7)/2 = 12.45

The median is very easy to interpret

There are as many values below the median as there are above it.

In fact there will be very close to 50% of values either side of the median when the data set contains an odd number of values and exactly 50% when there is an even number.
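As a quick check of the two medians worked out above, here is a minimal Python sketch using the standard library's statistics module. The variable names are illustrative only; the data are the P/E ratios listed above.

```python
import statistics

# P/E ratios copied from the two lists above.
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]
banks = [10.8, 11.2, 11.3, 11.8, 11.9, 12.0, 12.2,
         12.7, 12.8, 13.9, 14.5, 14.5, 14.6, 15.4]

# Odd count (19): the median is the single middle value, the 10th.
breweries_median = statistics.median(breweries)

# Even count (14): the median is the average of the 7th and 8th values.
banks_median = statistics.median(banks)

print(breweries_median)
print(round(banks_median, 2))
```

statistics.median sorts the data itself, so the values do not need to be in ascending order beforehand.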

Variable         N    Mean   Median
banks           14  12.829   12.450
breweries etc   19   12.45    10.40

Limitations of the mean

What does the mean really mean?

Are 50% of values below the mean?

Not as easy to interpret as people think.

Mean is most easily interpreted when the data are (nearly) symmetrically spread.

In this case approximately 50% of values either side of the mean

Example: annual income before tax for 10 people

25000, 19000, 22000, 23000, 19500, 27000, 25000, 22000, 24500, 250000

The same values sorted

19000, 19500, 22000, 22000, 23000, 24500, 25000, 25000, 27000, 250000

Mean income is 45700

All except one have incomes below this figure!

Median income is (23000 + 24500)/2 = 23750
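The effect of the single large income on the mean can be reproduced with a short Python sketch. The variable names are illustrative; the ten incomes are those listed above.

```python
import statistics

incomes = [25000, 19000, 22000, 23000, 19500,
           27000, 25000, 22000, 24500, 250000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# How many of the ten people earn less than the mean?
below_mean = sum(1 for x in incomes if x < mean_income)

print("mean:", mean_income)
print("median:", median_income)
print("incomes below the mean:", below_mean, "of", len(incomes))
```

The one value of 250000 pulls the mean well above nine of the ten incomes, while the median stays in the middle of the bulk of the data.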

Quartiles

They divide a set of values into quarters

Lower quartile, symbol Q1: 25% of values are no greater than Q1

Upper quartile, symbol Q3: 75% of values are no greater than Q3

The median is Q2

   25%   |   25%   |   25%   |   25%
         Q1        Q2        Q3

Easy interpretation: always valid, does not require symmetry

Measures of variability

Interquartile range: IQR = Q3 - Q1

It measures the spread of the middle 50% of the values.
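The quartiles and the interquartile range of the P/E ratios can be computed with the sketch below. Statistical packages use several slightly different quartile conventions; the "exclusive" method of Python's statistics.quantiles is used here on the assumption that it matches the convention behind the Minitab figures quoted in these notes. The helper name quartiles_and_iqr is hypothetical.

```python
import statistics

banks = [10.8, 11.2, 11.3, 11.8, 11.9, 12.0, 12.2,
         12.7, 12.8, 13.9, 14.5, 14.5, 14.6, 15.4]
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

def quartiles_and_iqr(values):
    """Return (Q1, Q2, Q3, IQR) for a list of numbers."""
    # method="exclusive" interpolates at position (n + 1) * p in the sorted data.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q1, q2, q3, q3 - q1

for name, values in [("banks", banks), ("breweries etc", breweries)]:
    q1, q2, q3, iqr = quartiles_and_iqr(values)
    print(f"{name}: Q1={q1:.3f} Q2={q2:.3f} Q3={q3:.3f} IQR={iqr:.3f}")
```

Other conventions (for example method="inclusive") can give slightly different quartiles on small data sets, so it is worth stating which one was used.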

Sample variance

s^2 = [ (x1 - mean)^2 + (x2 - mean)^2 + ... + (xn - mean)^2 ] / (n - 1)

Sample standard deviation

s = square root of the sample variance

Interpretation

Standard deviation is a summary statistic relating to spread.

Small standard deviation implies values fairly tightly clustered around the mean.

Large standard deviation implies large spread of values.

Practical interpretation of standard deviation?

Useful to think in terms of intervals centred on the mean.

If the values have a more or less symmetrical spread around the mean:

About 60-70% will be within 1 standard deviation of the mean

About 95% will be within 2 standard deviations of the mean

Almost all will be within 3 standard deviations of the mean
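The interval interpretation can be tried on the brewery P/E ratios with the sketch below. The helper share_within is a hypothetical name, and statistics.stdev is the sample standard deviation (divisor n - 1). These data are somewhat right-skewed, so the rule of thumb is only expected to hold roughly.

```python
import statistics

breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

mean = statistics.mean(breweries)
sd = statistics.stdev(breweries)  # sample standard deviation, divisor n - 1

def share_within(values, centre, spread, k):
    """Fraction of values lying within k * spread of centre."""
    inside = [x for x in values if abs(x - centre) <= k * spread]
    return len(inside) / len(values)

print(f"mean = {mean:.2f}, standard deviation = {sd:.2f}")
for k in (1, 2, 3):
    share = share_within(breweries, mean, sd, k)
    print(f"within {k} standard deviation(s) of the mean: {share:.0%}")
```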

Minitab output for P/E ratios

Descriptive Statistics: P/E ratios

Variable         N    Mean   Median   StDev
banks           14  12.829   12.450   1.483
breweries etc   19   12.45    10.40    5.02

Variable       Minimum  Maximum      Q1      Q3
banks           10.800   15.400  11.675  14.500
breweries etc     5.60    23.90    8.90   14.80

Frequency table

Purpose: to show how often each of a number of different categories occurs in a sample.

Bar charts

Used to display a frequency table

A rectangular bar with its height representing the value is plotted for each category
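A bar chart is just a frequency table drawn to scale. The sketch below prints a text version using the male student numbers for 2000-01 from the worksheet data in these notes; the pairing of each count with a faculty is read from that worksheet and should be treated as an assumption. The function name text_bar_chart is hypothetical.

```python
# Frequency table: category -> count (male student numbers 2000-01).
male_2000_01 = {"A&SS": 689, "A&D": 649, "L&A": 520, "MD&N": 736, "S&E": 1152}

def text_bar_chart(freq, width=40):
    """Return one text line per category; bar length is proportional to the count."""
    largest = max(freq.values())
    lines = []
    for category, count in freq.items():
        bar = "#" * round(count / largest * width)
        lines.append(f"{category:<5} {bar} {count}")
    return lines

chart = text_bar_chart(male_2000_01)
for line in chart:
    print(line)
```

Because every bar starts from zero and is scaled by the same factor, the bar lengths can be compared directly, which is the property a bar chart relies on.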

[Figure: bar chart, Male student numbers 2000-01]

Worksheet data (Sheet1)

Faculty   male   female
A&SS       689     1033
A&D        649      829
L&A        520      468
MD&N       736     1745
S&E       1152      674

Faculty   number
MD&N        4518
S&E         1972
L&A         1025
A&SS        2647
A&D         1575

[Figure: pie chart, Student numbers 1999-2000]

[Figure: pie chart example. Do both 25% slices appear the same size?]

[Figures: pie charts, Female student numbers 2000-01 and Student numbers 2000-01 (male, female)]

"A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them."

Edward Tufte, The Visual Display of Quantitative Information

Pie Charts

Use only for very simple data showing how a total is split into distinct categories.

Deficiencies

Not easy to compare two or more pie charts

Can have circles with sizes (areas) proportional to numbers but many viewers do not appreciate the significance of this.

Become messy and so less informative with more than about 5 categories.

Not very easy to compare sizes of different slices

Dotplots

exactly what the name suggests

a dot is plotted for each data value

Useful for comparing two or more sets of values.

They help us to see

Where the values are on the scale,

How spread out they are

Any atypical values

P/E ratios for banks are fairly tightly clustered

P/E ratios for breweries are much more spread out

About half the breweries have P/E ratios below the lowest bank P/E ratio

The Histogram

Use when the data are continuous

Construct a frequency table

Plot each frequency as a rectangle whose base is the class interval.

Rectangles for neighbouring class intervals should touch.
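The construction steps for a histogram (group continuous values into class intervals, then draw one rectangle per interval) can be sketched as follows. The brewery P/E ratios are used as the data; the class intervals of width 5 starting at 5 are an assumption made for illustration, and frequency_table is a hypothetical helper name.

```python
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

def frequency_table(values, start, width, n_classes):
    """Count how many values fall in each class interval [lower, lower + width)."""
    counts = [0] * n_classes
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < n_classes:  # values outside the covered range are ignored
            counts[index] += 1
    return counts

# Assumed class intervals: 5-10, 10-15, 15-20, 20-25.
start, width, n_classes = 5.0, 5.0, 4
counts = frequency_table(breweries, start, width, n_classes)

# Text histogram: one row per class interval, neighbouring intervals share a boundary.
for i, count in enumerate(counts):
    lower = start + i * width
    print(f"{lower:4.0f} to under {lower + width:<4.0f} {'#' * count} ({count})")
```

The choice of class width changes the appearance of a histogram, so it is worth trying more than one width before drawing conclusions about shape.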

Use:

to show graphically the way a set of values is spread out, e.g. daily changes in FTSE index

What can we see?

Near symmetry of the distribution of changes

The most common occurrence was for the daily change to be between -7.5 and +7.5 points

A daily change of between +7.5 and +15 points was somewhat rarer

Changes in excess of 60 points up (+) or down (-) were very infrequent

Stem and leaf display

Like a histogram

Gives a visual impression of:

Where the data are on the scale

Spread

Symmetry or asymmetry

Cross-Tabulation

Used to investigate relationships between pairs of variables

A cross-tabulation gives you a basic picture of how two variables inter-relate.

May be presented in the form of a contingency table.
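As an illustration, the 2000-01 student numbers in these notes already form a small contingency table of faculty against sex. The sketch below adds row and column totals and the percentage of female students in each faculty; the pairing of counts with faculties is read from the worksheet data and is an assumption, and the variable names are illustrative.

```python
# Rows: faculty; columns: sex. Counts are the 2000-01 student numbers from the worksheet.
table = {
    "A&SS": {"male": 689, "female": 1033},
    "A&D": {"male": 649, "female": 829},
    "L&A": {"male": 520, "female": 468},
    "MD&N": {"male": 736, "female": 1745},
    "S&E": {"male": 1152, "female": 674},
}

column_totals = {"male": 0, "female": 0}
row_totals = {}

print(f"{'Faculty':<8}{'male':>7}{'female':>8}{'total':>7}{'% female':>10}")
for faculty, row in table.items():
    row_totals[faculty] = row["male"] + row["female"]
    for sex, count in row.items():
        column_totals[sex] += count
    pct_female = 100 * row["female"] / row_totals[faculty]
    print(f"{faculty:<8}{row['male']:>7}{row['female']:>8}"
          f"{row_totals[faculty]:>7}{pct_female:>10.1f}")

grand_total = sum(column_totals.values())
print(f"{'Total':<8}{column_totals['male']:>7}{column_totals['female']:>8}{grand_total:>7}")
```

Comparing the row percentages across faculties is the simplest way to see whether the two variables are related.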

Skewness

Skewness is a measure of symmetry, or the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set.

The Fisher-Pearson coefficient is a formula used to calculate skewness.

The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.
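A minimal sketch of the Fisher-Pearson coefficient, assuming the moment form g1 = m3 / m2^1.5 with divisor n for both moments (some packages apply a small-sample adjustment, which is not shown). The three small data sets are illustrative values, not taken from these notes.

```python
def skewness(values):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5, with divisor n for the moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 2, 2, 3, 3, 3, 4, 10]      # illustrative values
long_left_tail = [-x for x in long_right_tail]   # mirror image

print(skewness(symmetric))        # zero for symmetric data
print(skewness(long_right_tail))  # positive: skewed right
print(skewness(long_left_tail))   # negative: skewed left
```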


Kurtosis

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.

The kurtosis for a standard normal distribution is 3. A kurtosis value greater than 3 indicates a leptokurtic distribution, that is, one with fatter tails than the normal distribution. This is characteristic of financial market time series (Mittal and Goyal, 2012). A fat-tailed distribution also means that the return series contains large positive and negative returns that are unlikely to be observed in normally distributed returns (Danielsson, 2011).
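A minimal sketch of the kurtosis calculation, assuming the moment form m4 / m2^2 with divisor n, for which the normal distribution gives 3. Some software reports "excess kurtosis", which subtracts 3 from this value. The two small data sets are illustrative values, not taken from these notes.

```python
def kurtosis(values):
    """Moment-based kurtosis m4 / m2**2 (not excess kurtosis), with divisor n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2

flat = [1, 2, 3, 4, 5]                          # evenly spread, no tails
heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 1, 10]  # sharp peak with two extreme values

print(kurtosis(flat))          # below 3: flatter than the normal
print(kurtosis(heavy_tailed))  # above 3: leptokurtic
```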

Examples: data sets with different skewness and kurtosis

Normal Distribution

The first histogram is a sample from a normal distribution. The normal distribution is a symmetric distribution with well-behaved tails. This is indicated by the skewness of 0.03. The kurtosis of 2.96 is near the expected value of 3. The histogram verifies the symmetry.

Double Exponential Distribution

The second histogram is a sample from a double exponential distribution. The double exponential is a symmetric distribution. Compared to the normal, it has a stronger peak, more rapid decay, and heavier tails. That is, we would expect a skewness near zero and a kurtosis higher than 3. The skewness is 0.06 and the kurtosis is 5.9.

Cauchy Distribution

The third histogram is a sample from a Cauchy distribution. For better visual comparison with the other data sets, we restricted the histogram of the Cauchy distribution to values between -10 and 10. The full data set for the Cauchy data in fact has a minimum of approximately -29,000 and a maximum of approximately 89,000.

The Cauchy distribution is a symmetric distribution with heavy tails and a single peak at the center of the distribution. Since it is symmetric, we would expect a skewness near zero. Due to the heavier tails, we might expect the kurtosis to be larger than for a normal distribution. In fact the skewness is 69.99 and the kurtosis is 6,693. These extremely high values can be explained by the heavy tails. Just as the mean and standard deviation can be distorted by extreme values in the tails, so too can the skewness and kurtosis measures.

Weibull Distribution

The fourth histogram is a sample from a Weibull distribution with shape parameter 1.5. The Weibull distribution is a skewed distribution with the amount of skewness depending on the value of the shape parameter. The degree of decay as we move away from the center also depends on the value of the shape parameter. For this data set, the skewness is 1.08 and the kurtosis is 4.46, which indicates moderate skewness and kurtosis.

Dealing with Skewness and Kurtosis

Many classical statistical tests and intervals depend on normality assumptions. Significant skewness and kurtosis clearly indicate that data are not normal. If a data set exhibits significant skewness or kurtosis (as indicated by a histogram or the numerical measures), what can we do about it?

One approach is to apply some type of transformation to try to make the data normal, or more nearly normal. The Box-Cox transformation is a useful technique for trying to normalize a data set. In particular, taking the log or square root of a data set is often useful for data that exhibit moderate right skewness.
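The effect of the log and square root transformations can be checked on the brewery P/E ratios, which are moderately right-skewed. This is a sketch only: it applies the two fixed transformations mentioned above rather than a full Box-Cox fit (the log corresponds to the Box-Cox family with parameter 0), and the skewness helper assumes the moment form m3 / m2^1.5 with divisor n.

```python
import math

breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

def skewness(values):
    """Moment-based skewness m3 / m2**1.5, with divisor n for the moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

raw_skew = skewness(breweries)
sqrt_skew = skewness([math.sqrt(x) for x in breweries])
log_skew = skewness([math.log(x) for x in breweries])

print(f"raw data:    skewness = {raw_skew:.2f}")
print(f"square root: skewness = {sqrt_skew:.2f}")
print(f"log:         skewness = {log_skew:.2f}")
```

Both transformations require positive values, which holds for P/E ratios listed here; the log pulls in the right tail more strongly than the square root.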