
2 Descriptive Statistics

Date post: 07-Nov-2015
Upload: syed-ali
Description:
econometrics
Descriptive statistics (Statistical Analysis)
Transcript
  • Descriptive statistics (Statistical Analysis)

  • Data

    Facts and figures collected for presentation and interpretation.

    Analysis of data yields information.

    Variable: a characteristic of interest.

  • Population and sample

    Inference: using the sample to draw conclusions about the population.

  • Census: data collected on every member of the population

    Expensive

    Time consuming

    Sample survey: data collected from a representative sample of the population

    Less expensive

    Takes much less time

    Statistical analysis is used for this purpose

  • Approaches

    Exploratory/descriptive

    what does the data tell me?

    Inferential

    formulate a theory

    construct a hypothesis

    collect evidence to test the hypothesis

  • Why are descriptive statistics important?

    Measures of variability (standard deviation) and central tendency (mean and median) give us an idea of which tests are appropriate.

    For example, for symmetric (usually normal) data the mean is a good estimator of central tendency, which points towards parametric tests.

    For asymmetric (usually non-normal) data the median is a good estimator of central tendency, which points towards non-parametric tests.

    Descriptive statistics give only a crude idea of the nature of the data; further tests are used to verify it.

  • A fundamental task in many statistical analyses is to characterize the location and variability of a data set.

  • Comparison of data sets

    Descriptive statistics

    Typical value

    Spread or variability

    Symmetry / asymmetry

    Tables and figures

    Skewness

    Kurtosis

  • Frequency tables, bar charts, pie charts, dot plots, histograms, stem and leaf displays, cross-tabulations

  • A measure of location summarises where the values are on the measuring or counting scale

    LOCATION

  • Sample mean

    Very well known but often misunderstood.

    Many users are not aware of its limitations.

    Median

    Less well known, so less often used, but overcomes some of the limitations of the mean.

  • Sample mean

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

  • [Figure: Worksheet 1, dotplot for banks and breweries etc]

  • The median

    The median is the middle value in a set of figures when they have been sorted into ascending order.

  • P/E ratios for the 19 companies quoted in the Breweries, Hotels and Restaurants sector

    Rank    1     2     3     4     5     6     7
    P/E     5.6   7.0   7.3   8.6   8.9   9.4   9.5

    Rank    8     9     10    11    12    13    14
    P/E     9.8   10.3  10.4  12.7  13.2  13.8  14.0

    Rank    15    16    17    18    19
    P/E     14.8  16.4  18.9  22.0  23.9

    With 19 values the median is the 10th value, which is 10.4.

  • P/E ratios for the 14 companies in the Banking sector

    Rank    1     2     3     4     5     6     7
    P/E     10.8  11.2  11.3  11.8  11.9  12.0  12.2

    Rank    8     9     10    11    12    13    14
    P/E     12.7  12.8  13.9  14.5  14.5  14.6  15.4

    With 14 values the median is the average of the 7th and 8th values.

    So median = (12.2 + 12.7)/2 = 12.45
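The two worked examples follow the same rule: sort the values, then take the middle one when the count is odd, or the average of the two middle ones when it is even. A minimal Python sketch of that rule, using the P/E ratios listed above (the function name `median_by_hand` is only illustrative):

```python
def median_by_hand(values):
    """Middle value for an odd count, mean of the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]
banks = [10.8, 11.2, 11.3, 11.8, 11.9, 12.0, 12.2,
         12.7, 12.8, 13.9, 14.5, 14.5, 14.6, 15.4]

breweries_median = median_by_hand(breweries)  # 19 values: the 10th value
banks_median = median_by_hand(banks)          # 14 values: mean of the 7th and 8th

print(f"breweries etc: {breweries_median:.2f}")  # 10.40
print(f"banks:         {banks_median:.2f}")      # 12.45
```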

  • Median very easy to interpret

    There are as many values below the median as there are above it.

    In fact there will be very close to 50% of values either side of the median when the data set contains an odd number of values and exactly 50% when there is an even number.

  • Variable        N    Mean     Median
    banks          14   12.829   12.450
    breweries etc  19   12.45    10.40

  • Limitations of the mean

    What does the mean really mean?

    Are 50% of values below the mean?

    Not as easy to interpret as people think.

    Mean is most easily interpreted when the data are (nearly) symmetrically spread.

    In this case approximately 50% of values lie on either side of the mean.

  • 10 people

    Annual income before tax

    25000, 19000, 22000, 23000, 19500, 27000, 25000, 22000, 24500, 250000

    Mean income is 45700

    All except one have incomes below this figure!

  • Median salary is

    (23000 + 24500)/2 = 23750

    Annual income before tax

    25000, 19000, 22000, 23000, 19500, 27000, 25000, 22000, 24500, 250000

    The same values sorted

    19000, 19500, 22000, 22000, 23000, 24500, 25000, 25000, 27000, 250000
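The effect of the single very large income can be checked directly. The sketch below uses Python's standard `statistics` module on the ten incomes from the slide; the variable names are illustrative.

```python
from statistics import mean, median

incomes = [25000, 19000, 22000, 23000, 19500, 27000, 25000, 22000, 24500, 250000]

mean_income = mean(incomes)      # pulled upwards by the single 250000 value
median_income = median(incomes)  # average of the 5th and 6th sorted values
below_mean = sum(1 for x in incomes if x < mean_income)

print(f"mean   = {mean_income:.0f}")    # mean   = 45700
print(f"median = {median_income:.0f}")  # median = 23750
print(f"{below_mean} of {len(incomes)} incomes are below the mean")  # 9 of 10
```

The median sits in the middle of the nine typical incomes, whereas the mean describes none of the ten people well.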

  • Quartiles

    They divide a set of values into quarters.

    Lower quartile, symbol Q1: 25% of values are no greater than Q1.

    Upper quartile, symbol Q3: 75% of values are no greater than Q3.

    The median is Q2.

  • [Figure: scale divided at Q1, Q2 and Q3 into four parts, each holding 25% of the values]

    Easy interpretation

    Always valid; does not require symmetry

  • Measures of variability

    Interquartile range: IQR = Q3 - Q1

    It measures the spread of the middle 50% of the values.
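As a rough illustration, the quartiles and the interquartile range of the two P/E data sets can be computed with `statistics.quantiles`. Several conventions exist for interpolating quartiles; this sketch assumes the function's default "exclusive" method, which for these data gives the same Q1 and Q3 as the Minitab output shown elsewhere in these slides. Other software may give slightly different values.

```python
from statistics import quantiles

banks = [10.8, 11.2, 11.3, 11.8, 11.9, 12.0, 12.2,
         12.7, 12.8, 13.9, 14.5, 14.5, 14.6, 15.4]
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

def quartiles_and_iqr(values):
    """Return (Q1, Q2, Q3, IQR) using the default 'exclusive' quantile method."""
    q1, q2, q3 = quantiles(values, n=4)
    return q1, q2, q3, q3 - q1

banks_q1, banks_q2, banks_q3, banks_iqr = quartiles_and_iqr(banks)
brew_q1, brew_q2, brew_q3, brew_iqr = quartiles_and_iqr(breweries)

print(f"banks:         Q1={banks_q1:.3f}  Q3={banks_q3:.3f}  IQR={banks_iqr:.3f}")
print(f"breweries etc: Q1={brew_q1:.3f}  Q3={brew_q3:.3f}  IQR={brew_iqr:.3f}")
```

The middle 50% of brewery P/E ratios spans roughly twice the range of the middle 50% of bank P/E ratios.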

  • Sample variance

    s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

    Sample standard deviation

    s = \sqrt{s^2}

  • Interpretation

    Standard deviation is a summary statistic relating to spread.

    Small standard deviation implies values fairly tightly clustered around the mean.

    Large standard deviation implies large spread of values.

  • Practical interpretation of standard deviation?

    Useful to think in terms of intervals centred on the mean.

    If the values have a more or less symmetrical spread around the mean:

    About 60-70% will be within 1 standard deviation of the mean

    About 95% will be within 2 standard deviations of the mean

    Almost all will be within 3 standard deviations of the mean
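These rules of thumb can be tried out on simulated data. The sketch below assumes a hypothetical, roughly symmetric sample drawn from a normal distribution (mean 100, standard deviation 15, both chosen arbitrarily) and counts the share of values within 1, 2 and 3 standard deviations of the sample mean.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable
sample = [random.gauss(100, 15) for _ in range(10_000)]  # hypothetical symmetric data

centre = mean(sample)
spread = stdev(sample)

def share_within(values, centre, spread, k):
    """Fraction of values lying within k standard deviations of the mean."""
    return sum(1 for v in values if abs(v - centre) <= k * spread) / len(values)

within_1 = share_within(sample, centre, spread, 1)
within_2 = share_within(sample, centre, spread, 2)
within_3 = share_within(sample, centre, spread, 3)

print(f"within 1 SD: {within_1:.1%}")
print(f"within 2 SD: {within_2:.1%}")
print(f"within 3 SD: {within_3:.1%}")
```

For strongly skewed data the same intervals can hold quite different shares, which is why the rule is stated only for a more or less symmetrical spread.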

  • Minitab output for P/E ratios

    Descriptive Statistics: P/E ratios

    Variable        N    Mean     Median   StDev
    banks          14   12.829   12.450   1.483
    breweries etc  19   12.45    10.40    5.02

  • Descriptive Statistics

    Variable        N    Mean     Median   StDev
    banks          14   12.829   12.45    1.483
    breweries etc  19   12.45    10.40    5.02

    Variable        Minimum   Maximum   Q1       Q3
    banks           10.800    15.400    11.675   14.500
    breweries etc   5.60      23.90     8.90     14.80
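The mean, median, standard deviation, minimum and maximum in this output can be reproduced from the raw P/E ratios. The following is a sketch using Python's standard `statistics` module rather than Minitab; `stdev` is the sample standard deviation with divisor n - 1, and the helper name `summarise` is illustrative.

```python
from statistics import mean, median, stdev

banks = [10.8, 11.2, 11.3, 11.8, 11.9, 12.0, 12.2,
         12.7, 12.8, 13.9, 14.5, 14.5, 14.6, 15.4]
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

def summarise(values):
    """Collect basic descriptive statistics for one variable."""
    return {
        "N": len(values),
        "Mean": mean(values),
        "Median": median(values),
        "StDev": stdev(values),  # sample standard deviation, divisor n - 1
        "Minimum": min(values),
        "Maximum": max(values),
    }

banks_summary = summarise(banks)
breweries_summary = summarise(breweries)

for name, summary in (("banks", banks_summary), ("breweries etc", breweries_summary)):
    cells = "  ".join(f"{key}={value:.3f}" for key, value in summary.items() if key != "N")
    print(f"{name:<14} N={summary['N']}  {cells}")
```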

  • Frequency table

    Purpose: to show how often each of a number of different categories occurs in a sample.
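A frequency table is straightforward to build in code. The sketch below uses `collections.Counter` on a small made-up sample (the faculty of each of 12 students); the data are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical sample: the faculty of each of 12 students (made-up data)
sample = ["S&E", "MD&N", "A&SS", "S&E", "L&A", "MD&N",
          "S&E", "A&D", "A&SS", "MD&N", "S&E", "L&A"]

frequency = Counter(sample)       # category -> number of occurrences
total = sum(frequency.values())

print(f"{'Faculty':<8}{'Count':>6}{'Share':>8}")
for category, count in frequency.most_common():
    print(f"{category:<8}{count:>6}{count / total:>8.1%}")
```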

  • Bar charts

    Used to display a frequency table

    A rectangular bar with its height representing the value is plotted for each category

    [Figure: bar chart of male student numbers by faculty, 2000-01]

    Student numbers 2000-01

    Faculty   Male   Female
    A&SS       689     1033
    A&D        649      829
    L&A        520      468
    MD&N       736     1745
    S&E       1152      674

    Student numbers 1999-2000

    Faculty   Number
    MD&N        4518
    S&E         1972
    L&A         1025
    A&SS        2647
    A&D         1575

    [Figure: pie charts. Do both 25% slices appear the same size?]

  • "A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them." Edward Tufte, The Visual Display of Quantitative Information

    Pie Charts

    Use only for very simple data showing how a total is split into distinct categories.

  • Deficiencies

    Not easy to compare two or more pie charts

    Can have circles with sizes (areas) proportional to numbers but many viewers do not appreciate the significance of this.

    Become messy and so less informative with more than about 5 categories.

    Not very easy to compare sizes of different slices

  • Dotplots

    exactly what the name suggests

    a dot is plotted for each data value

  • Useful for comparing two or more sets of values.

    They help us to see

    Where the values are on the scale,

    How spread out they are

    Any atypical values

  • P/E ratios for banks are fairly tightly clustered

    P/E ratios for breweries are much more spread out

    About half the breweries have P/E ratios below the lowest bank P/E ratio
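A rough text version of the dotplot makes these three observations easy to check. This is only a sketch: each row covers one unit of the P/E scale and shows one dot per company, and the function name `text_dotplot` is illustrative.

```python
import math
from collections import Counter

banks = [10.8, 11.2, 11.3, 11.8, 11.9, 12.0, 12.2,
         12.7, 12.8, 13.9, 14.5, 14.5, 14.6, 15.4]
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

def text_dotplot(values, low, high):
    """One text row per unit interval [tick, tick + 1), with one dot per data value in it."""
    counts = Counter(math.floor(v) for v in values)
    return [f"{tick:>2} | {'.' * counts[tick]}" for tick in range(low, high)]

low, high = 5, 24  # rows for 5 up to 23, which covers every value in both data sets
for name, values in (("banks", banks), ("breweries etc", breweries)):
    print(name)
    print("\n".join(text_dotplot(values, low, high)))

below_lowest_bank = sum(1 for v in breweries if v < min(banks))
print(f"{below_lowest_bank} of {len(breweries)} breweries are below the lowest bank P/E ratio")
```

Plotting both data sets on the same scale, as here, is what makes the comparison of location and spread possible.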

  • The histogram

    Use: to show graphically the way a set of values is spread out, e.g. daily changes in the FTSE index

    Use when data is continuous

    Construct a frequency table

    Plot each frequency as a rectangle whose base is the class interval.

    Rectangles for neighbouring class intervals should touch.
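The frequency table behind a histogram can be sketched as follows. The daily changes below are made-up values, not the FTSE data, and the class width of 15 points with a class centred on zero is an assumption chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical daily index changes in points (made-up values, not the FTSE data)
changes = [-31.2, -18.4, -12.9, -6.1, -4.0, -2.3, -0.8, 0.5, 1.7,
           3.2, 5.9, 6.8, 9.4, 11.0, 16.3, 24.7, 38.1]

WIDTH = 15.0   # class interval width (an assumption for this sketch)
ORIGIN = -7.5  # lower edge of the class interval that contains zero

def class_index(x):
    """Index of the class interval [ORIGIN + k*WIDTH, ORIGIN + (k+1)*WIDTH) containing x."""
    return math.floor((x - ORIGIN) / WIDTH)

table = Counter(class_index(x) for x in changes)

# Print every class between the lowest and highest, including empty ones,
# so that neighbouring rectangles stay adjacent.
for k in range(min(table), max(table) + 1):
    lower = ORIGIN + k * WIDTH
    upper = lower + WIDTH
    print(f"[{lower:+6.1f}, {upper:+6.1f})  {'#' * table[k]}")
```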

  • What can we see?

    near symmetry of the distribution of changes

    the most common occurrence was for the daily change to be between -7.5 and +7.5 points

  • What can we see?

    a daily change of between +7.5 and +15 points was somewhat rarer

    changes in excess of 60 points up ( + ) or down ( - ) were very infrequent

  • Stem and leaf display

    Like a histogram

    Gives a visual impression of:

    Where the data are on the scale

    Spread

    Symmetry or asymmetry
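As an illustration, a stem and leaf display for the 19 brewery P/E ratios can be produced in a few lines. This sketch assumes values recorded to one decimal place, with the whole-number part as the stem and the first decimal digit as the leaf; the function name is illustrative.

```python
from collections import defaultdict

breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

def stem_and_leaf(values):
    """Map each stem (whole-number part) to its sorted leaves (first decimal digit)."""
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(round(v * 10), 10)
        display[stem].append(leaf)
    return dict(display)

display = stem_and_leaf(breweries)

# Print every stem from the smallest to the largest, including empty ones,
# so the shape of the display matches the scale.
for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Unlike a histogram, the display keeps the individual values: the row `9 | 458` stands for 9.4, 9.5 and 9.8.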

  • Cross-tabulation

    Used to investigate relationships between pairs of variables.

    A cross-tabulation gives you a basic picture of how two variables inter-relate.

    May be presented in the form of a contingency table.
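A cross-tabulation simply counts how often each pair of categories occurs together. The sketch below uses hypothetical survey records (employment status and car ownership for 10 respondents); both variables and all values are made up for illustration.

```python
from collections import Counter

# Hypothetical records, one per respondent: (employment status, owns a car)
records = (
    [("employed", "yes")] * 4
    + [("employed", "no")] * 2
    + [("unemployed", "yes")] * 1
    + [("unemployed", "no")] * 3
)

table = Counter(records)  # cell counts for each pair of categories
row_labels = sorted({row for row, _ in records})
col_labels = sorted({col for _, col in records})

# Contingency table with row totals
print(f"{'':<12}" + "".join(f"{col:>6}" for col in col_labels) + f"{'total':>7}")
for row in row_labels:
    cells = [table[(row, col)] for col in col_labels]
    print(f"{row:<12}" + "".join(f"{c:>6}" for c in cells) + f"{sum(cells):>7}")
```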

  • Skewness

    Skewness is a measure of symmetry, or the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

    The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set.

  • Skewness

    The Fisher-Pearson coefficient is a formula used to calculate skewness.

    The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right.

    By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.

  • Skewness

  • Kurtosis

    Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.

  • The kurtosis for a standard normal distribution is 3.

    A kurtosis value greater than 3 indicates a leptokurtic, or fat-tailed, distribution relative to the normal distribution, which is characteristic of financial market time series (Mittal and Goyal, 2012).

    A fat-tailed distribution also implies that the return series contains large positive and negative returns that are unlikely to be observed in normally distributed returns (Danielsson, 2011).
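Both measures can be computed from central moments. The sketch below assumes the simple moment-based definitions with divisor n: skewness as m3 / m2^(3/2) (the Fisher-Pearson coefficient) and kurtosis as m4 / m2^2, for which a normal distribution gives 3. Statistical packages often apply small-sample adjustments or report excess kurtosis (kurtosis minus 3), so their figures may differ. The function names are illustrative.

```python
def central_moment(values, k):
    """k-th central moment with divisor n."""
    n = len(values)
    centre = sum(values) / n
    return sum((v - centre) ** k for v in values) / n

def skewness(values):
    """Fisher-Pearson coefficient of skewness: m3 / m2^(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Moment-based kurtosis: m4 / m2^2 (equal to 3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

print(f"symmetric data: skewness={skewness(symmetric):.2f}  kurtosis={kurtosis(symmetric):.2f}")
print(f"breweries etc:  skewness={skewness(breweries):.2f}  kurtosis={kurtosis(breweries):.2f}")
```

The brewery P/E ratios come out with a positive skewness, in line with their long right tail.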

  • Example: various forms of data set based on skewness and kurtosis

  • Normal Distribution

    The first histogram is a sample from a normal distribution. The normal distribution is a symmetric distribution with well-behaved tails. This is indicated by the skewness of 0.03. The kurtosis of 2.96 is near the expected value of 3. The histogram verifies the symmetry.

  • Double Exponential Distribution

    The second histogram is a sample from a double exponential distribution. The double exponential is a symmetric distribution. Compared to the normal, it has a stronger peak, more rapid decay, and heavier tails. That is, we would expect a skewness near zero and a kurtosis higher than 3. The skewness is 0.06 and the kurtosis is 5.9.

  • Cauchy Distribution

    The third histogram is a sample from a Cauchy distribution. For better visual comparison with the other data sets, we restricted the histogram of the Cauchy distribution to values between -10 and 10. The full data set for the Cauchy data in fact has a minimum of approximately -29,000 and a maximum of approximately 89,000.

    The Cauchy distribution is a symmetric distribution with heavy tails and a single peak at the center of the distribution. Since it is symmetric, we would expect a skewness near zero. Due to the heavier tails, we might expect the kurtosis to be larger than for a normal distribution. In fact the skewness is 69.99 and the kurtosis is 6,693.

    These extremely high values can be explained by the heavy tails. Just as the mean and standard deviation can be distorted by extreme values in the tails, so too can the skewness and kurtosis measures.

  • Weibull Distribution

    The fourth histogram is a sample from a Weibull distribution with shape parameter 1.5. The Weibull distribution is a skewed distribution with the amount of skewness depending on the value of the shape parameter. The degree of decay as we move away from the center also depends on the value of the shape parameter. For this data set, the skewness is 1.08 and the kurtosis is 4.46, which indicates moderate skewness and kurtosis.

  • Dealing with Skewness and Kurtosis

    Many classical statistical tests and intervals depend on normality assumptions. Significant skewness and kurtosis clearly indicate that data are not normal. If a data set exhibits significant skewness or kurtosis (as indicated by a histogram or the numerical measures), what can we do about it?

    One approach is to apply some type of transformation to try to make the data normal, or more nearly normal. The Box-Cox transformation is a useful technique for trying to normalize a data set. In particular, taking the log or square root of a data set is often useful for data that exhibit moderate right skewness.
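The effect of a log or square root transformation on right-skewed data can be seen with the brewery P/E ratios from earlier in these slides. This is a sketch only: it applies the two fixed transformations mentioned above rather than a full Box-Cox fit (which estimates the power from the data), and it measures skewness with the simple moment-based coefficient m3 / m2^(3/2).

```python
import math

def skewness(values):
    """Moment-based (Fisher-Pearson) skewness with divisor n."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

skew_raw = skewness(breweries)
skew_sqrt = skewness([math.sqrt(v) for v in breweries])  # square root transformation
skew_log = skewness([math.log(v) for v in breweries])    # natural log transformation

print(f"raw data:    skewness = {skew_raw:.2f}")
print(f"square root: skewness = {skew_sqrt:.2f}")
print(f"log:         skewness = {skew_log:.2f}")
```

Both transformations require positive values, and any later analysis is then on the transformed scale, so results need to be interpreted accordingly.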

