
Analysis of Data Session 1

Date post: 09-Apr-2018
Upload: yogaladdo

Transcript
  • 8/8/2019 Analysis of Data Session 1

    1/26

    Prof. Arnab K Laha - Analysis of Data

    (PGP-X)

    Analysis of Data

    Session I


    Summarizing Data

    Raw data is often voluminous and difficult to handle.

    Decision makers want a few numbers to summarize the entire data set.

    Summarization leads to loss of information but can help focus on key aspects of the data set.


    Five Number Summary of Data

    (Min, Q1, Median, Q3, Max) is called the five-number summary of the data.

    25% of the observations are below Q1 and 75% of the observations are above Q1.

    Q1 is called the first quartile.

    50% of the observations are below the Median and 50% are above the Median.

    75% of the observations are below Q3 and 25% are above Q3.

    Q3 is called the third quartile.

    Each of the segments Min-Q1, Q1-Median, Median-Q3 and Q3-Max contains 25% of the data.
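The five-number summary can be sketched in a few lines of Python. This is only an illustration: the function name `five_number_summary` and the sample values are hypothetical, and the quartiles use the "inclusive" interpolation method of the standard library, which is one of several conventions and may differ slightly from the one used in the course.

```python
import statistics

def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max) for a list of at least two numbers."""
    # Assumption: quartiles use the "inclusive" interpolation method of
    # statistics.quantiles; other conventions give slightly different values.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, median, q3, max(data))

# Hypothetical data, chosen only for illustration.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]
print(five_number_summary(sample))
```

For small data sets the choice of quartile convention can move Q1 and Q3 noticeably, so it is worth stating which one is used when reporting a summary.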


    Box Plot

    [Figure: Boxplot of C1. Vertical axis from 0 to 160; the plot is annotated from top to bottom with Max, Q3, Median, Q1 and Min.]


    Identifying Outliers

    The interquartile range (IQR) is defined as IQR = Q3 - Q1.

    An observation x is a soft (possible) outlier if x > Q3 + 1.5 × IQR or x < Q1 - 1.5 × IQR.

    An observation x is a hard (confirmed) outlier if x > Q3 + 3 × IQR or x < Q1 - 3 × IQR.

    Note: All hard outliers are also soft outliers, but not vice versa.
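The soft and hard outlier rules can be sketched as a small classifier. The function name `classify_outlier` and the quartile values in the usage lines are hypothetical; the function reports only the strongest label, so an observation reported as "hard" is also a soft outlier in the sense of the note on this slide.

```python
def classify_outlier(x, q1, q3):
    """Return 'hard', 'soft' or 'none' for observation x, given Q1 and Q3."""
    iqr = q3 - q1
    # Check the wider (3 x IQR) fences first: every hard outlier is also
    # beyond the 1.5 x IQR fences, so only the strongest label is returned.
    if x > q3 + 3 * iqr or x < q1 - 3 * iqr:
        return "hard"
    if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
        return "soft"
    return "none"

# Hypothetical quartiles: Q1 = 10, Q3 = 20, so IQR = 10.
# Soft fences are -5 and 35; hard fences are -20 and 50.
for value in (20, 36, 51, -6, -21):
    print(value, classify_outlier(value, q1=10, q3=20))
```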


    Dealing with outliers (if present)

    Accommodative Approach: Use methods which are resistant to the presence of outliers (robust methods).

    Example: 5% Trimmed Mean. The 5% trimmed mean is computed as follows:

    Arrange the data in increasing order, then delete the lowest 5% and the highest 5% of the observations. Compute the simple average of the remaining observations.

    Deletion Approach: Delete the outliers and work with the remaining data set.
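The trimmed-mean recipe from the accommodative approach can be sketched as follows. The name `trimmed_mean` is hypothetical, and rounding the number of deleted observations down to a whole number is an assumption, since the slide does not say what to do when 5% of n is not an integer.

```python
def trimmed_mean(data, percent=5):
    """Average the data after dropping `percent`% from each end (percent < 50)."""
    xs = sorted(data)
    # Assumption: the count dropped from each end is rounded down.
    k = len(xs) * percent // 100
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

# Hypothetical data: the integers 1..19 plus one extreme value.
sample = list(range(1, 20)) + [1000]
print(sum(sample) / len(sample))   # ordinary mean, pulled up by 1000
print(trimmed_mean(sample))        # 5% trimmed mean
```

With 20 observations, one value is removed from each end, so the extreme value 1000 no longer influences the average.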


    Two number summary of data

    Often data sets are summarized by giving only two numbers:

    - a measure of central tendency and

    - a measure of spread (around the measure of central tendency)


    Mean and Standard Deviation

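    Arithmetic Mean:

    $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

    Standard Deviation:

    $$s = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$$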
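A minimal sketch of the mean and standard deviation, assuming the divisor n that appears in the formula on this slide (many packages default to n - 1 instead). The function name `mean_and_sd` and the sample values are hypothetical. The code also evaluates the shortcut form, the square root of the mean of squares minus the squared mean, to show that both expressions agree.

```python
import math

def mean_and_sd(data):
    """Return (arithmetic mean, standard deviation with divisor n)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return mean, sd

# Hypothetical data, chosen so that the results are round numbers.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mean, sd = mean_and_sd(sample)

# Shortcut form: sqrt((1/n) * sum of x_i^2 - mean^2).
shortcut_sd = math.sqrt(sum(x * x for x in sample) / len(sample) - mean ** 2)
print(mean, sd, shortcut_sd)
```

The standard library function `statistics.pstdev` computes the same divisor-n quantity, while `statistics.stdev` uses n - 1.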


    Mean and Standard Deviation

    Note that the Standard Deviation (SD) = 0 only when all the observations are equal.

    The higher the SD, the greater the spread around the mean value.

    A lower SD indicates that the mean value represents the data set more reliably.


    Chebyshev's Inequality

    In most situations of common occurrence, Chebyshev's inequality asserts that the proportion of observations outside the interval (mean - t SD, mean + t SD) is at most t^(-2) = 1/t^2.

    Using Chebyshev's inequality, the proportion of observations outside

    i) (mean - 2 SD, mean + 2 SD) is at most 0.25

    ii) (mean - 3 SD, mean + 3 SD) is at most 1/9 ≈ 0.11

    iii) (mean - 4 SD, mean + 4 SD) is at most 1/16 ≈ 0.06
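The bound can be checked numerically on any data set. The sketch below uses hypothetical names (`proportion_outside`, `chebyshev_bound`) and hypothetical data, takes the SD with divisor n, and counts an observation as outside when it does not lie strictly inside the open interval.

```python
import math

def chebyshev_bound(t):
    """Upper bound 1/t^2 on the proportion outside (mean - t SD, mean + t SD)."""
    return 1 / t ** 2

def proportion_outside(data, t):
    """Observed proportion of data outside the open interval around the mean."""
    # Assumes the observations are not all equal, so that SD > 0.
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    outside = sum(1 for x in data if abs(x - mean) >= t * sd)
    return outside / n

# Hypothetical data with mean 5 and SD 2.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
for t in (2, 3, 4):
    print(t, proportion_outside(sample, t), chebyshev_bound(t))
```

In this example the observed proportions are well below the bounds, which is typical: the inequality is a worst-case guarantee rather than an estimate.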


    Impact of Outliers

    Both the mean and the SD are quite sensitive to the presence of outliers.

    If the mean and SD are to be used for summarizing a data set, it is better to delete the outliers first and then calculate the mean and SD.


    Median and MAD

    An alternative to using the mean and SD as a summary of the data is to use the Median and MAD.

    MAD is the acronym for Median Absolute Deviation about the Median.

    If med is the median of the data set x1, ..., xn, MAD is the median of the set of numbers {|x1 - med|, |x2 - med|, ..., |xn - med|}.

    Usually 1.4826 × MAD is used as a measure of spread.

    The Median and MAD are both far less sensitive to the presence of outliers than the mean and SD. No deletion of data is required if these are used.
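The definition of MAD translates almost directly into code. In this sketch the names `mad` and `scaled_mad` and the sample values are hypothetical, and the median is the standard library's version, which averages the two middle values when n is even (an assumption that may differ from the course's convention).

```python
import statistics

def mad(data):
    """Median absolute deviation about the median."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])

def scaled_mad(data):
    """1.4826 x MAD, the measure of spread mentioned on the slide."""
    return 1.4826 * mad(data)

# Hypothetical data with one extreme value.
sample = [1, 2, 3, 4, 100]
print(statistics.median(sample), mad(sample), scaled_mad(sample))
print(statistics.mean(sample), statistics.pstdev(sample))
```

The extreme value 100 drags the mean to 22 and inflates the SD, while the median stays at 3 and the MAD at 1.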


    Empirical Cumulative Distribution Function

    Let $\{x_1, \ldots, x_n\}$ be the given data set. The Empirical Cumulative Distribution Function (ECDF) is defined as

    $$F_n(t) = \frac{\#\{\text{observations} \le t\}}{n}$$

    $$F_n(t) - F_n(s) = \text{proportion of observations greater than } s \text{ but less than or equal to } t$$

    The smallest number $Q_1$ satisfying $F_n(Q_1) \ge 0.25$ is called the first quartile.

    The smallest number $\tilde{m}$ satisfying $F_n(\tilde{m}) \ge 0.5$ is called the median.

    The smallest number $Q_3$ satisfying $F_n(Q_3) \ge 0.75$ is called the third quartile.
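The ECDF and the quartiles it defines can be sketched directly from the definition, reading each quartile as the smallest number at which the ECDF reaches the required level. The function names `ecdf` and `ecdf_quantile` and the sample values are hypothetical. The search over the sorted data is deliberately naive so that it mirrors the "smallest number satisfying" wording.

```python
def ecdf(data, t):
    """Proportion of observations less than or equal to t."""
    return sum(1 for x in data if x <= t) / len(data)

def ecdf_quantile(data, p):
    """Smallest observation x with ecdf(data, x) >= p, for 0 < p <= 1."""
    for x in sorted(data):
        if ecdf(data, x) >= p:
            return x

# Hypothetical data set with n = 8.
sample = [3, 1, 4, 1, 5, 9, 2, 6]
print(ecdf(sample, 4))                    # proportion of observations <= 4
print(ecdf(sample, 5) - ecdf(sample, 2))  # proportion in the interval (2, 5]
print([ecdf_quantile(sample, p) for p in (0.25, 0.5, 0.75)])
```

Under this definition every quartile is an actual observation, so for even n the median is the lower of the two middle values rather than their average.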


    Frequency Distribution

    Often, particularly for large data sets, it is advantageous to summarize data using a frequency distribution.

    The entire range (Range = Max - Min) of the data is divided into a few disjoint classes, each of which is an interval.

    A frequency distribution lists the classes along with the number of observations in each class (called the frequency of the class).


    Example of Frequency Distribution

    Duration of Pregnancy (days)    Cases (frequency)
    256 - 258                              76
    259 - 265                             121
    266 - 272                             334
    273 - 279                             348
    280 - 286                             205
    287 - 289                              10
    Total                                1094

    From: Bhat & Khustagi, Singapore Med J 2006; 47(12)


    Relative Frequency and Frequency Density

    The Relative Frequency of a class is the frequency of the class divided by the total number of observations.

    The Frequency Density of a class is the relative frequency of the class divided by the class width.


    Example

    Duration of Pregnancy (days)    Frequency    Relative Frequency    Frequency Density
    255 - 258.5                          76            0.07                 0.028
    258.5 - 265.5                       121            0.11                 0.044
    265.5 - 272.5                       334            0.31                 0.122
    272.5 - 279.5                       348            0.32                 0.127
    279.5 - 286.5                       205            0.19                 0.075
    286.5 - 289                          10            0.01                 0.004
    Total                              1094
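As a rough check of the two definitions, the sketch below recomputes relative frequency and frequency density from the class boundaries and frequencies in the table. The variable names are illustrative. The relative frequencies reproduce the table. The densities computed with each class's own width agree with the table only for the last class; the other table entries appear to correspond to dividing every relative frequency by 2.5, the width of the last class.

```python
# Class boundaries (lower, upper) and frequencies, taken from the table.
classes = [
    (255.0, 258.5, 76),
    (258.5, 265.5, 121),
    (265.5, 272.5, 334),
    (272.5, 279.5, 348),
    (279.5, 286.5, 205),
    (286.5, 289.0, 10),
]

total = sum(freq for _, _, freq in classes)

rows = []
for lower, upper, freq in classes:
    relative = freq / total                # relative frequency
    density = relative / (upper - lower)   # frequency density (per day)
    rows.append((lower, upper, freq, relative, density))

for lower, upper, freq, relative, density in rows:
    print(f"{lower:6.1f} - {upper:6.1f}  {freq:5d}  {relative:.2f}  {density:.3f}")
```

Dividing by the class width is what makes classes of unequal width comparable: the areas of the histogram bars (density times width) then add up to 1.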


    Histogram

    [Figure: Histogram of Duration of Pregnancy. Horizontal axis: classes 255 - 258.5, 258.5 - 265.5, 265.5 - 272.5, 272.5 - 279.5, 279.5 - 286.5 and 286.5 - 289; vertical axis: Frequency Density, from 0.000 to 0.140.]


    Histogram: How many classes?

    For construction of the histogram it is important to decide on the number of classes or, equivalently, the class width.

    The shape of the histogram depends heavily on the choice of the number of classes / class width.

    Two popular approaches are:

    a) Sturges' rule

    b) Freedman-Diaconis rule

    Both these rules usually (but not always) give similar results if the number of observations is less than 200.

    The Freedman-Diaconis rule is the preferred rule for determining the class width of a histogram.


    Sturges' Rule

    The number of classes is k = [1 + log2 n] + 1, where [1 + log2 n] is the greatest integer less than or equal to 1 + log2 n.

    In other words, choose k such that

    2^(k-3) [...] n [...] k = 8

    n = 83 => k = 9

    The class width (h) is computed as the Range divided by the number of classes:

    h = (Max - Min) / k
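A minimal sketch of Sturges' rule, following the formula for k exactly as it is printed on this slide. The function name `sturges` and the minimum and maximum in the usage line are hypothetical. Taken literally, the printed formula gives k = 8 for n = 83, whereas the example line on the slide reads k = 9, so the slide may intend a slightly different rounding convention.

```python
import math

def sturges(n, data_min, data_max):
    """Return (k, h): number of classes and class width."""
    # k = [1 + log2 n] + 1, with [.] the greatest integer function,
    # as printed on the slide (an assumption about the intended rule).
    k = math.floor(1 + math.log2(n)) + 1
    h = (data_max - data_min) / k
    return k, h

# Hypothetical range 0 to 16 with n = 83 observations.
print(sturges(83, data_min=0, data_max=16))
```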


    Freedman-Diaconis Rule

    The class width is given by

    $$h = \frac{2\,\mathrm{IQR}}{n^{1/3}}$$

    The number of classes is

    $$k = \left[\frac{\mathrm{Range}}{h}\right] + 1$$

    where $[\,\cdot\,]$ is the greatest integer function.


    Example

    For the pregnancy duration data set we have n = 1094, Range = 289 - 255 = 34.

    Q1 = 267.1, Q3 = 278.3, IQR = 11.2

    The Freedman-Diaconis rule gives the class width to be 2.18.

    The number of class intervals is therefore 16.

    The class intervals are 255 - 257.18, 257.18 - 259.36, ..., 285.52 - 287.7 and 287.7 - 289

    (note that the last interval is shorter than the others).
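The calculation in this example can be reproduced with a short sketch of the Freedman-Diaconis rule, taking the class width as 2 × IQR / n^(1/3) and the number of classes as the integer part of Range / h plus one. The function name `freedman_diaconis` is hypothetical. With the rounded quartiles quoted on the slide, the width comes out at about 2.17 rather than exactly 2.18, presumably because the slide used unrounded quartiles; the number of classes is 16 either way.

```python
import math

def freedman_diaconis(n, iqr, data_range):
    """Return (h, k): class width and number of classes."""
    h = 2 * iqr / n ** (1 / 3)            # class width
    k = math.floor(data_range / h) + 1    # greatest integer of Range/h, plus 1
    return h, k

# Values quoted on the slide for the pregnancy duration data.
n = 1094
iqr = 278.3 - 267.1
data_range = 289 - 255

h, k = freedman_diaconis(n, iqr, data_range)
print(round(h, 2), k)
```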


    Example: Stock Returns

    55 daily returns

    Five number summary:

    Min = -8.213, Q1 = -1.413, Median = -0.1364

    Q3 = 0.817, Max = 4.071
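As an illustration, the soft and hard outlier rules (fences at 1.5 and 3 times the IQR beyond the quartiles) can be applied to this five-number summary. The worked arithmetic below uses only the values quoted on this slide.

```latex
\begin{align*}
\mathrm{IQR} &= Q_3 - Q_1 = 0.817 - (-1.413) = 2.230 \\
Q_1 - 1.5\,\mathrm{IQR} &= -1.413 - 3.345 = -4.758 \\
Q_3 + 1.5\,\mathrm{IQR} &= 0.817 + 3.345 = 4.162 \\
Q_1 - 3\,\mathrm{IQR} &= -1.413 - 6.690 = -8.103 \\
Q_3 + 3\,\mathrm{IQR} &= 0.817 + 6.690 = 7.507
\end{align*}
```

Since Min = -8.213 is below -8.103, the smallest return is a hard outlier under these rules, while Max = 4.071 is below 4.162 and is not even a soft outlier.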


    Histogram using Sturges' rule

    [Figure: Histogram of tm. Horizontal axis: tm, from -10 to 5; vertical axis: Frequency, from 0 to 20.]


    Histogram using Freedman-Diaconis rule

    [Figure: Histogram of tm. Horizontal axis: tm, from -8 to 4; vertical axis: Frequency, from 0 to 14.]

