Descriptive statistics (Statistical Analysis)
Data
Facts, figures collected for presentation and interpretation.
Analysis of data yields information.
Variable: a characteristic of interest
Population and sample
Inference: conclusions about the population are drawn from the sample.
Census: data collected on every member of the population
Expensive
Time consuming
Sample survey: data collected from a representative sample of the population
Less expensive
Takes much less time
Statistical analysis is used for this purpose
Approaches
Exploratory/descriptive
what does the data tell me?
Inferential
formulate a theory
construct a hypothesis
collect evidence to test the hypothesis
Why are descriptive statistics important?
The variability (standard deviation) and central tendency (mean and median) values give us an idea of which tests are appropriate.
For example, for symmetric (usually normal) data, the mean is a good estimator of central tendency, which points to parametric tests.
For asymmetric (usually non-normal) data, the median is a good estimator of central tendency, which points to non-parametric tests.
Descriptive statistics give only a crude idea of the nature of the data; further tests are used to verify it.
A fundamental task in many statistical analyses is to characterize the location and variability of a data set.
Comparison of data sets
Descriptive statistics
Typical value
Spread or variability
Symmetry / asymmetry
Tables and figures
Skewness
Kurtosis
Tables and figures
Frequency tables
Bar charts
Pie charts
Dot plots
Histograms
Stem and leaf displays
Cross-tabulations
A measure of location summarises where the values are on the measuring or counting scale
LOCATION
Sample mean
Very well known but often misunderstood.
Many users are not aware of its limitations.
Median
Less well known so less often used, but overcomes some of the limitations of the mean
Sample mean
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
[Figure: dotplot for banks and breweries etc]
The median
The median is the middle value in a set of figures when they have been sorted into ascending order.
P/E ratios for the 19 companies quoted in the Breweries, Hotels and Restaurants sector
Rank   1     2     3     4     5     6     7
P/E    5.6   7.0   7.3   8.6   8.9   9.4   9.5

Rank   8     9     10    11    12    13    14
P/E    9.8   10.3  10.4  12.7  13.2  13.8  14.0

Rank   15    16    17    18    19
P/E    14.8  16.4  18.9  22.0  23.9
With 19 values the median is the 10th value which is 10.4.
P/E ratios for the 14 companies in the Banking sector
Rank   1     2     3     4     5     6     7
P/E    10.8  11.2  11.3  11.8  11.9  12.0  12.2

Rank   8     9     10    11    12    13    14
P/E    12.7  12.8  13.9  14.5  14.5  14.6  15.4
With 14 values the median is the average of the 7th and 8th values, so median = (12.2 + 12.7)/2 = 12.45
The median is very easy to interpret
There are as many values below the median as there are above it.
In fact there will be very close to 50% of values either side of the median when the data set contains an odd number of values and exactly 50% when there is an even number.
Variable        N    Mean     Median
banks          14    12.829   12.450
breweries etc  19    12.45    10.40
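The figures in this summary can be checked directly from the two P/E listings. The sketch below uses Python's standard `statistics` module; the helper name `summarise` is only an illustrative choice, not part of the original material.

```python
from statistics import mean, median

# P/E ratios copied from the two listings earlier in this section.
banks = [10.8, 11.2, 11.3, 11.8, 11.9, 12.0, 12.2,
         12.7, 12.8, 13.9, 14.5, 14.5, 14.6, 15.4]
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]


def summarise(values):
    """Return (n, mean, median) for a list of numbers (hypothetical helper)."""
    return len(values), mean(values), median(values)


for name, values in [("banks", banks), ("breweries etc", breweries)]:
    n, m, med = summarise(values)
    print(f"{name:<14} N={n:<3} mean={m:.3f} median={med:.2f}")
```

For the banks the mean and median are close together; for the breweries the mean sits well above the median, which already hints at a longer right-hand tail.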
Limitations of the mean
What does the mean really mean?
Are 50% of values below the mean?
Not as easy to interpret as people think.
Mean is most easily interpreted when the data are (nearly) symmetrically spread.
In this case approximately 50% of values lie either side of the mean
10 people: annual income before tax
25000, 19000, 22000, 23000, 19500, 27000, 25000, 22000, 24500, 250000
The same values sorted
19000, 19500, 22000, 22000, 23000, 24500, 25000, 25000, 27000, 250000
Mean income is 45700
All except one have incomes below this figure!
Median income is (23000 + 24500)/2 = 23750
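The effect of the single very large income can be reproduced in a few lines. This is a minimal sketch using the ten incomes listed above; the variable names are illustrative.

```python
from statistics import mean, median

incomes = [25000, 19000, 22000, 23000, 19500,
           27000, 25000, 22000, 24500, 250000]

mean_income = mean(incomes)        # pulled upwards by the single 250000
median_income = median(incomes)    # average of the 5th and 6th sorted values
below_mean = sum(1 for x in incomes if x < mean_income)

print("mean:", mean_income)
print("median:", median_income)
print("incomes below the mean:", below_mean, "of", len(incomes))
```

Nine of the ten incomes fall below the mean, while the median still splits the values five and five.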
Quartiles
They divide a set of values into quarters
Lower quartile, symbol Q1: 25% of values are no greater than Q1
Upper quartile, symbol Q3: 75% of values are no greater than Q3
The median is Q2
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
Easy interpretation: always valid, does not require symmetry
Measures of variability
Interquartile range: IQR = Q3 - Q1
It measures the spread of the middle 50% of the values.
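Quartiles and the IQR can be computed with the standard library. Statistical packages differ in how they interpolate quartiles, so the choice of `method="exclusive"` here is an assumption: it places quartile k at position k(n+1)/4 in the sorted data, and for the banking P/E ratios it gives the same Q1 and Q3 as the Minitab output quoted in this section.

```python
from statistics import quantiles

# P/E ratios for the 14 banking-sector companies.
banks = [10.8, 11.2, 11.3, 11.8, 11.9, 12.0, 12.2,
         12.7, 12.8, 13.9, 14.5, 14.5, 14.6, 15.4]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(banks, n=4, method="exclusive")
iqr = q3 - q1   # spread of the middle 50% of the values

print(f"Q1={q1:.3f}  Q2={q2:.3f}  Q3={q3:.3f}  IQR={iqr:.3f}")
```

With a different interpolation rule the quartiles can shift slightly, so it is worth stating which convention was used when reporting them.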
Sample variance
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, where \bar{x} is the sample mean
Sample standard deviation
s = \sqrt{s^2}
Interpretation
Standard deviation is a summary statistic relating to spread.
Small standard deviation implies values fairly tightly clustered around the mean.
Large standard deviation implies large spread of values.
Practical interpretation of standard deviation?
Useful to think in terms of intervals centred on the mean.
If the values have a more or less symmetrical spread around the mean:
About 60-70% will be within 1 standard deviation of the mean
About 95% will be within 2 standard deviations of the mean
Almost all will be within 3 standard deviations of the mean
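The interval rule can be tried on the breweries P/E ratios. The sketch below assumes the usual sample definitions (division by n - 1), which is what `statistics.variance` and `statistics.stdev` implement; `count_within` is a hypothetical helper introduced only for this illustration.

```python
from statistics import mean, stdev, variance

# P/E ratios for the 19 companies in the breweries etc sector.
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

xbar = mean(breweries)
s2 = variance(breweries)   # sample variance, divides by n - 1
s = stdev(breweries)       # square root of the sample variance


def count_within(values, centre, spread, k):
    """Count the values lying within k * spread of centre."""
    return sum(1 for x in values if abs(x - centre) <= k * spread)


print(f"mean={xbar:.2f}  variance={s2:.2f}  sd={s:.2f}")
for k in (1, 2, 3):
    c = count_within(breweries, xbar, s, k)
    print(f"within {k} sd: {c} of {len(breweries)} ({100 * c / len(breweries):.0f}%)")
```

Even though these data are not perfectly symmetric, the counts come out close to the rule of thumb.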
Minitab output for P/E ratios
Descriptive Statistics: P/E ratios
Variable        N    Mean     Median   StDev
banks          14    12.829   12.450   1.483
breweries etc  19    12.45    10.40    5.02

Variable       Minimum   Maximum   Q1       Q3
banks          10.800    15.400    11.675   14.500
breweries etc   5.60     23.90      8.90    14.80
Frequency table
Purpose: to show how often each of a number of different categories occurs in a sample.
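A frequency table is simply a count per category. The sketch below uses `collections.Counter` on a small made-up sample; the twelve faculty labels are hypothetical illustration data, not figures from this document.

```python
from collections import Counter

# Hypothetical sample: the faculty of each of 12 surveyed students.
sample = ["S&E", "A&D", "S&E", "L&A", "MD&N", "S&E",
          "A&D", "MD&N", "S&E", "L&A", "MD&N", "A&SS"]

freq = Counter(sample)          # category -> number of occurrences
total = sum(freq.values())

print(f"{'Category':<10}{'Count':>6}{'Percent':>9}")
for category, count in freq.most_common():
    print(f"{category:<10}{count:>6}{100 * count / total:>8.1f}%")
```

Adding a percentage column, as here, makes tables from samples of different sizes easier to compare.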
Bar charts
Used to display a frequency table
A rectangular bar with its height representing the value is plotted for each category
[Figure: bar chart, Male student numbers 2000-01]

Student numbers 2000-01
Faculty   Male   Female
A&SS       689    1033
A&D        649     829
L&A        520     468
MD&N       736    1745
S&E       1152     674

Student numbers 1999-2000
Faculty   Number
MD&N       4518
S&E        1972
L&A        1025
A&SS       2647
A&D        1575

[Figure: pie charts, Student numbers 1999-2000; Male student numbers 2000-01; Female student numbers 2000-01]
Do both 25% slices appear the same size?
"A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them."
Edward Tufte, The Visual Display of Quantitative Information
Pie Charts
Use only for very simple data showing how a total is split into distinct categories.
Deficiencies
Not easy to compare two or more pie charts
Can have circles with sizes (areas) proportional to numbers but many viewers do not appreciate the significance of this.
Become messy and so less informative with more than about 5 categories.
Not very easy to compare sizes of different slices
Dotplots
exactly what the name suggests
a dot is plotted for each data value
Useful for comparing two or more sets of values.
They help us to see
Where the values are on the scale,
How spread out they are
Any atypical values
P/E ratios for banks are fairly tightly clustered
P/E ratios for breweries are much more spread out
About half the breweries have P/E ratios below the lowest bank P/E ratio
The Histogram
Use when data are continuous
Construct a frequency table
Plot each frequency as a rectangle whose base is the class interval.
Rectangles for neighbouring class intervals should touch.
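The construction steps can be sketched in code: group the values into class intervals, count each class, then draw one bar per class. The class width of 5 and the starting point of 5 are assumptions chosen for this illustration, and the text bars stand in for the touching rectangles of a real histogram.

```python
# P/E ratios for the 19 companies in the breweries etc sector.
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]

START = 5.0      # lower limit of the first class (assumed)
WIDTH = 5.0      # class width (assumed)
N_CLASSES = 4    # classes: 5-10, 10-15, 15-20, 20-25


def class_frequencies(values, start, width, n_classes):
    """Count how many values fall in each class [lower, lower + width)."""
    counts = [0] * n_classes
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < n_classes:      # ignore values outside the classes
            counts[index] += 1
    return counts


counts = class_frequencies(breweries, START, WIDTH, N_CLASSES)
for i, c in enumerate(counts):
    lower = START + i * WIDTH
    print(f"{lower:4.0f} to under {lower + WIDTH:<4.0f} {'#' * c} ({c})")
```

The longer bars at the low end and the thin tail towards 25 show the same right skew that the mean and median suggested.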
Use:
to show graphically the way a set of values is spread out, e.g. daily changes in FTSE index
What can we see?
near symmetry of the distribution of changes
the most common occurrence was for the daily change to be between -7.5 and +7.5 points
a daily change of between +7.5 and +15 points was somewhat rarer
changes in excess of 60 points up (+) or down (-) were very infrequent
Stem and leaf display
Like a histogram
Gives a visual impression of:
Where the data are on the scale
Spread
Symmetry or asymmetry
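A stem and leaf display can be built by splitting each value into a stem (here the whole-number part) and a leaf (the first decimal digit). This is a minimal sketch for data recorded to one decimal place; the function name `stem_and_leaf` is an illustrative choice.

```python
from collections import defaultdict

# P/E ratios for the 19 companies in the breweries etc sector.
breweries = [5.6, 7.0, 7.3, 8.6, 8.9, 9.4, 9.5, 9.8, 10.3, 10.4,
             12.7, 13.2, 13.8, 14.0, 14.8, 16.4, 18.9, 22.0, 23.9]


def stem_and_leaf(values):
    """Map each stem (units) to the sorted list of its leaves (tenths)."""
    display = defaultdict(list)
    for x in sorted(values):
        tenths = round(x * 10)          # e.g. 10.4 -> 104
        display[tenths // 10].append(tenths % 10)
    return dict(display)


display = stem_and_leaf(breweries)
# Print every stem in the range, including empty ones, so gaps stay visible.
for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Unlike a histogram, the display keeps every original value readable while still showing the overall shape.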
Cross-Tabulation
Used to investigate relationships between pairs of variables
A cross-tabulation gives you a basic picture of how two variables inter-relate.
May be presented in the form of a contingency table.
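A contingency table is a count for every combination of two categorical variables. The sketch below builds one from a list of paired observations; the ten (faculty, sex) records are hypothetical illustration data.

```python
from collections import Counter

# Hypothetical records: (faculty, sex) for each of 10 students.
records = [("S&E", "male"), ("S&E", "male"), ("S&E", "female"),
           ("L&A", "female"), ("L&A", "male"), ("A&D", "female"),
           ("A&D", "female"), ("A&D", "male"), ("S&E", "male"),
           ("L&A", "female")]

cells = Counter(records)                    # (row, column) -> count
rows = sorted({r for r, _ in records})      # row categories
cols = sorted({c for _, c in records})      # column categories

print(f"{'':<8}" + "".join(f"{c:>8}" for c in cols) + f"{'total':>8}")
for r in rows:
    counts = [cells[(r, c)] for c in cols]
    print(f"{r:<8}" + "".join(f"{n:>8}" for n in counts) + f"{sum(counts):>8}")
```

Row and column totals are worth including because they let the reader turn the counts into percentages.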
Skewness
Skewness is a measure of symmetry, or the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set.
The Fisher-Pearson coefficient is a formula used to calculate skewness.
The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.
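The formula itself is not shown here, so the sketch below assumes the usual definition of the Fisher-Pearson coefficient, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments computed with n in the denominator. Some packages apply an additional sample-size adjustment, so their output may differ slightly.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (not n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** k for x in values) / n


def skewness(values):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5 (assumed definition)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


symmetric = [1, 2, 3, 4, 5]
# Annual incomes from the mean-versus-median example: one long right tail.
incomes = [25000, 19000, 22000, 23000, 19500,
           27000, 25000, 22000, 24500, 250000]

print("symmetric data:", skewness(symmetric))
print("incomes:", round(skewness(incomes), 2))
print("mirrored incomes:", round(skewness([-x for x in incomes]), 2))
```

The symmetric data give zero, the incomes give a clearly positive value (skewed right), and mirroring the data flips the sign (skewed left).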
Kurtosis
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.
The kurtosis for a standard normal distribution is 3.
A kurtosis value greater than 3 indicates that the distribution is leptokurtic, that is, fat-tailed in contrast to the normal distribution, which is characteristic of financial-market time series (Mittal and Goyal, 2012). A fat-tailed distribution also means that a return series contains large positive and negative returns that are unlikely to be observed in normally distributed returns (Danielsson, 2011).
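As a rough illustration, kurtosis can be computed as m4 / m2^2, the fourth central moment divided by the squared second central moment, both with n in the denominator. This definition is an assumption (it is the one for which the normal distribution scores 3); many packages instead report excess kurtosis, which subtracts 3.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (not n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** k for x in values) / n


def kurtosis(values):
    """Kurtosis m4 / m2**2 (assumed definition; normal distribution = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


flat = [1, 2, 3, 4, 5]                       # evenly spread, no tails
# Annual incomes from the mean-versus-median example: one extreme value.
incomes = [25000, 19000, 22000, 23000, 19500,
           27000, 25000, 22000, 24500, 250000]

for name, data in [("flat", flat), ("incomes", incomes)]:
    k = kurtosis(data)
    print(f"{name:<8} kurtosis={k:.2f}  excess={k - 3:.2f}")
```

The evenly spread values fall below 3, while the single extreme income pushes the kurtosis well above 3.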
Examples: data sets with different skewness and kurtosis
Normal Distribution
The first histogram is a sample from a normal distribution. The normal distribution is a symmetric distribution with well-behaved tails. This is indicated by the skewness of 0.03. The kurtosis of 2.96 is near the expected value of 3. The histogram verifies the symmetry.
Double Exponential Distribution
The second histogram is a sample from a double exponential distribution. The double exponential is a symmetric distribution. Compared to the normal, it has a stronger peak, more rapid decay, and heavier tails. That is, we would expect a skewness near zero and a kurtosis higher than 3. The skewness is 0.06 and the kurtosis is 5.9.
Cauchy Distribution
The third histogram is a sample from a Cauchy distribution. For better visual comparison with the other data sets, we restricted the histogram of the Cauchy distribution to values between -10 and 10. The full data set for the Cauchy data in fact has a minimum of approximately -29,000 and a maximum of approximately 89,000.
The Cauchy distribution is a symmetric distribution with heavy tails and a single peak at the center of the distribution. Since it is symmetric, we would expect a skewness near zero. Due to the heavier tails, we might expect the kurtosis to be larger than for a normal distribution. In fact the skewness is 69.99 and the kurtosis is 6,693. These extremely high values can be explained by the heavy tails. Just as the mean and standard deviation can be distorted by extreme values in the tails, so too can the skewness and kurtosis measures.
Weibull Distribution
The fourth histogram is a sample from a Weibull distribution with shape parameter 1.5. The Weibull distribution is a skewed distribution with the amount of skewness depending on the value of the shape parameter. The degree of decay as we move away from the center also depends on the value of the shape parameter. For this data set, the skewness is 1.08 and the kurtosis is 4.46, which indicates moderate skewness and kurtosis.
Dealing with Skewness and Kurtosis
Many classical statistical tests and intervals depend on normality assumptions. Significant skewness and kurtosis clearly indicate that data are not normal. If a data set exhibits significant skewness or kurtosis (as indicated by a histogram or the numerical measures), what can we do about it?
One approach is to apply some type of transformation to try to make the data normal, or more nearly normal. The Box-Cox transformation is a useful technique for trying to normalize a data set. In particular, taking the log or square root of a data set is often useful for data that exhibit moderate right skewness.
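The effect of a log or square-root transformation can be seen on a small example. The data below are hypothetical (each value is double the previous one, so the log makes them evenly spaced), and skewness is computed with the moment-based coefficient m3 / m2^(3/2), which is an assumed definition. The base of the logarithm does not affect the skewness; base 2 is used only to keep the numbers tidy.

```python
import math


def skewness(values):
    """Moment-based skewness m3 / m2**1.5, moments divided by n."""
    n = len(values)
    xbar = sum(values) / n
    m2 = sum((x - xbar) ** 2 for x in values) / n
    m3 = sum((x - xbar) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed data: each value is double the previous one.
raw = [1, 2, 4, 8, 16, 32, 64]
rooted = [math.sqrt(x) for x in raw]    # square-root transformation
logged = [math.log2(x) for x in raw]    # log transformation

for name, data in [("raw", raw), ("sqrt", rooted), ("log", logged)]:
    print(f"{name:<5} skewness = {skewness(data):.2f}")
```

The square root reduces the right skew and the log removes it entirely for these values; real data will rarely behave this neatly, so the transformed data should be checked again with a histogram.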