
(4) Condensation of Data

Date post: 14-Apr-2018
Upload: asclabisb
  • 7/30/2019 (4) Condensation of Data

    1/22

    Applied Statistics and Computing Lab

    CONDENSATION OF DATA

    Applied Statistics and Computing Lab

    Indian School of Business


    Learning goals

    • Understanding a possible approach to data analysis
    • Studying three data representation techniques:
        • Stem and leaf plot
        • Frequency table
        • Dot plot


    Data Analysis

    • Exploratory
        • Cleaning
        • Summarization
        • Exploration of salient features
            • Location
            • Variability (spread)
            • Concentration
            • Shape
                • Skewness
                • Tail information
    • Inferential


    Dataset

    The percentage of employees involved in a certain worker-involvement-in-decision-making program, in 30 companies:

    (5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57, 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

    Arranged in ascending order:

    (5, 32, 33, 34, 35, 37, 38, 42, 42, 43, 44, 45, 46, 47, 47, 48, 48, 49, 49, 50, 51, 52, 52, 53, 55, 56, 57, 58, 63, 78)

    Stem and leaf plot of these values:

    0 | 5
    1 |
    2 |
    3 | 234578
    4 | 223456778899
    5 | 012235678
    6 | 3
    7 | 8

    Data taken from Aczel A., Sounderpandian J., Complete Business Statistics


    Stem and leaf plot

    • The most basic and an easy method of visualizing data in its original form
    • A stem and leaf plot displays the actual values of all the data points
    • Each value is split into a stem and a leaf, separated by |, with the stem on the left side and the leaf on the right side of the vertical line
    • Which part of the number qualifies as the stem and which part as the leaf is decided separately for each dataset
    • For example, for data consisting of 2-digit values, the tens digit may be taken as the stem and the units digit as the leaf, as in the previous diagram
    • The leaves generally consist of the last (units) digit of a number, and the other digits may be considered as the stem
    • The numbers can sometimes be rounded to a particular number of digits, and the last digit may be considered to be the leaf
    • A common format applies to all the values of a dataset
    • All the stems must be listed, irrespective of whether any leaf follows or not
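The splitting rule described above (tens digit as stem, units digit as leaf, every stem listed even when it has no leaves) can be sketched in R. This is only an illustration on the 30-company dataset from the Dataset slide; the helper name `stem_leaf` is hypothetical (it is not a base R function) and it assumes non-negative whole numbers.

```r
# Percentages for the 30 companies, as listed on the Dataset slide
pct <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
         50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# Hypothetical helper: build one text line per stem
stem_leaf <- function(x) {
  x <- sort(x)
  stems  <- x %/% 10   # tens digit (and above) is the stem
  leaves <- x %% 10    # units digit is the leaf
  # List every stem between the smallest and the largest, even if empty
  all_stems <- seq(min(stems), max(stems))
  vapply(all_stems, function(s) {
    paste0(s, " | ", paste(leaves[stems == s], collapse = ""))
  }, character(1))
}

plot_lines <- stem_leaf(pct)
cat(plot_lines, sep = "\n")
```

Base R's own `stem(pct)` draws a comparable display, although it chooses the stems itself and its layout may differ from the hand-built version.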


    Example

    • GPA of 50 students in the first semester exam for their second course in Quantitative Methods
    • The GPA range is 0-10
    • The numbers have 7 digits after the decimal point
    • They are rounded to 1 digit after the decimal point


    Stem and leaf plot (contd.)

    The decimal point is at the |

     0 | 3446
     1 | 1145677
     2 | 22224599
     3 | 1344488
     4 | 3556
     5 | 23578899
     6 | 4
     7 | 14789
     8 | 13
     9 |
    10 |
    11 |
    12 |
    13 |
    14 |
    15 |
    16 |
    17 |
    18 | 6

    • The row 0 | 3446 represents 4 values: 0.3, 0.4, 0.4, 0.6
    • The row 2 | 22224599 represents 8 values: 2.2, 2.2, 2.2, 2.2, 2.4, 2.5, 2.9, 2.9
    • The row 6 | 4 represents the only value with 6 in its units place: 6.4
    • For negative values, a minus sign (-) is put in front of the stem

    A stem and leaf plot is a powerful tool for studying data:

    • Gives an idea about the distribution of values, their spread and density
    • Useful in detecting unusual values and the value occurring with the highest frequency
    • Easy to read and understand
    • Not very informative if there are too few or too many values
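Reading a row back into data values is simple arithmetic: attach each leaf digit to the stem and scale by the leaf unit. The sketch below assumes a leaf unit of 0.1 (the decimal point is at the |); `decode_row` is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper: turn one row of a stem and leaf plot back into values.
# leaf_unit = 0.1 means the decimal point sits at the "|".
decode_row <- function(stem, leaves, leaf_unit = 0.1) {
  digits <- as.integer(strsplit(leaves, "")[[1]])  # one leaf per character
  (stem * 10 + digits) * leaf_unit
}

row_0  <- decode_row(0, "3446")       # the four values on stem 0
row_2  <- decode_row(2, "22224599")   # the eight values on stem 2
row_18 <- decode_row(18, "6")         # the single value on stem 18

print(row_0)
print(row_2)
print(row_18)

# A value above the stated GPA range of 0-10 stands out as unusual
print(row_18 > 10)
```

With `leaf_unit = 1` the same helper decodes the 30-company plot, where the stem is the tens digit.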


    Frequency table

    • A table listing the frequency counts for each value of a variable
    • Useful tool to give a basic idea about the data at a quick glance
    • Very easy to construct and mostly self-explanatory
    • Can accommodate many types of data, whether categorical or numerical. Both types of numerical data, discrete and continuous, can be represented in a frequency table


    Cars dataset

    • Consists of data on 804 used cars in the USA
    • Data is collected on 12 features, such as the price, make and model of the car, the number of cylinders, the number of doors, etc.
    • Collected from the Kelley Blue Book


    Frequency table (contd.)

    For the Cars data, let us take a look at various frequency tables:

    Car make     Frequency of each make
    Buick         80
    Cadillac      80
    Chevrolet    320
    Pontiac      150
    SAAB         114
    Saturn        60

    No. of cylinders   Frequency of cars with corresponding no. of cylinders
    4                  394
    6                  310
    8                  100

    Price      Frequency
    8638.93    1
    8769       1
    8870.95    1
    9041.91    1
    9220.83    1
    9482.22    1
    9506.05    1
    9563.79    1
    9654.06    1
    9665.85    1
    9720.98    1

    There are 798 unique prices!
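A frequency table of this kind comes straight out of `table()` in R. The raw Cars file is not part of these slides, so the sketch below rebuilds the make column from the counts shown above with `rep()`; that reconstruction is an assumption made purely for illustration, as is the short `prices` vector.

```r
# Counts per make, copied from the frequency table on the slide
make_counts <- c(Buick = 80, Cadillac = 80, Chevrolet = 320,
                 Pontiac = 150, SAAB = 114, Saturn = 60)

# Rebuild one entry per car (stand-in for the real 'make' column)
make <- rep(names(make_counts), times = make_counts)

freq <- table(make)   # frequency count for each distinct value
print(freq)
print(length(make))   # 804 cars in total

# For a nearly continuous variable, almost every value has frequency 1
prices <- c(8638.93, 8769, 8870.95, 9041.91)  # first few prices from the slide
price_freq <- table(prices)
print(price_freq)
```

The second table shows why tabulating raw prices is uninformative: the table is as long as the list of distinct values.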


    Frequency table (contd.)

    • Is there a better way of tabulating the prices?
    • What if we split the prices into bands and calculate the frequencies?
    • Would such a table be useful?
    • The prices of cars range from $8639 to $70760


    Frequency table for class intervals

    Price range          Number of cars
    [$8000, $13000)      135
    [$13000, $18000)     265
    [$18000, $23000)     150
    [$23000, $28000)      75
    [$28000, $33000)      76
    [$33000, $38000)      45
    [$38000, $43000)      33
    [$43000, $48000)      11
    [$48000, $53000)       5
    [$53000, $58000)       2
    [$58000, $63000)       1
    [$63000, $68000)       3
    [$68000, $73000)       3
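Bands like these can be produced in R with `cut()` followed by `table()`. The sketch below uses the same $5000-wide bands from $8000 to $73000; the `prices` vector is a small illustrative sample (a few prices from the earlier slide, the stated maximum, and two made-up values), not the real dataset.

```r
# Illustrative sample: five prices from the slide, two hypothetical values
# (13000 and 17999.99) and the maximum price quoted on the slide
prices <- c(8638.93, 8769, 8870.95, 9041.91, 9220.83,
            13000, 17999.99, 70760)

# Band edges: 8000, 13000, ..., 73000
breaks <- seq(8000, 73000, by = 5000)

# right = FALSE gives intervals closed on the left and open on the right,
# i.e. [8000, 13000), [13000, 18000), ...
bands <- cut(prices, breaks = breaks, right = FALSE, dig.lab = 5)

band_freq <- table(bands)
print(band_freq)
```

Note that the hypothetical price of exactly 13000 is counted in the second band only, because the first band excludes its upper limit.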


    Determining class intervals

    • Each band of prices, or group of values of a variable, is referred to as a class or a class interval
    • The number of class intervals and the size of each interval are best determined by the researcher or analyst, who has prior knowledge of the behaviour of the variable
    • Classes must be determined keeping the range of values in mind
    • A few wide class intervals may not be very informative, as most of the information may get hidden inside the large intervals
    • Many small intervals may capture a detailed picture, but such a table will be sparse and its sheer length may take away its usefulness
    • As far as possible, class intervals of equal width make the table easier to understand
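The trade-off between few wide classes and many narrow ones can be seen by tabulating the same data at different widths. The sketch below does this for the 30-company percentages from the Dataset slide; the three widths (40, 10 and 2) are arbitrary choices made for illustration.

```r
# Percentages for the 30 companies, as listed on the Dataset slide
pct <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
         50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# Helper: equal-width classes [0, w), [w, 2w), ... covering 0 to 80
class_table <- function(x, width) {
  table(cut(x, breaks = seq(0, 80, by = width), right = FALSE))
}

wide   <- class_table(pct, 40)  # 2 classes: most detail is hidden
medium <- class_table(pct, 10)  # 8 classes: shape is visible
narrow <- class_table(pct, 2)   # 40 classes: long and sparse

print(wide)
print(medium)
print(sum(narrow == 0))         # number of empty classes in the narrow table
```

The width-10 table reproduces the row lengths of the stem and leaf plot, which is one way to check that the classes were built as intended.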


    Determining class intervals (contd.)

    • The class limits, i.e. the lowest and highest values of a class interval, must be chosen carefully
    • Classes must be determined such that no value of the dataset can belong to more than one class interval
    • Two types of brackets are used: closed [ ] and open ( ). A class interval can have one open and one closed bracket
        • Closed bracket => include the number on that side of the interval
        • Open bracket => all numbers up to or starting from, but excluding, the number on that side of the interval
    • For discrete data, limits of class intervals can be easily determined in a non-overlapping manner
    • For continuous data, the same limit value can appear in adjacent classes, so the brackets decide which class it belongs to

    Interval   Meaning
    [1,3]      Includes every number from 1 to 3, including the limits
               e.g. 1, 1.3, 1.8, 2.24, 2.6, 2.98, 2.999999, 3
    [1,3)      Includes every number starting from 1 and reaching up to but not including 3
               e.g. 1, 1.01, 1.3, 1.78, 2.4, 2.9, 2.99, 2.999, 2.9999, 2.99999 (there can be as many 9s after the decimal point as you like)
    (1,3]      Includes every number starting after 1 (but not 1) and reaching up to and including 3
               e.g. 1.000000000001, 1.0000001, 1.1, 1.24, 1.7, 2.3, 2.69, 2.99, 3 (there can be as many zeroes after the decimal point as you like, as long as a nonzero digit follows)
    (1,3)      Includes every number in between 1 and 3, excluding 1 and 3
               e.g. 1.0000000000000000001, 1.15, 1.6, 1.92, 2.3, 2.89, 2.99999999999999999999999
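The four bracket combinations translate directly into comparisons. The sketch below is a minimal membership check; `in_interval` and its arguments are hypothetical names introduced for illustration.

```r
# Hypothetical helper: is x inside the interval with the given brackets?
# closed_left = TRUE  means "["   closed_left = FALSE  means "("
# closed_right = TRUE means "]"   closed_right = FALSE means ")"
in_interval <- function(x, lower, upper,
                        closed_left = TRUE, closed_right = FALSE) {
  left_ok  <- if (closed_left)  x >= lower else x > lower
  right_ok <- if (closed_right) x <= upper else x < upper
  left_ok & right_ok
}

x <- c(1, 2, 3)
print(in_interval(x, 1, 3, closed_left = TRUE,  closed_right = TRUE))   # [1,3]
print(in_interval(x, 1, 3, closed_left = TRUE,  closed_right = FALSE))  # [1,3)
print(in_interval(x, 1, 3, closed_left = FALSE, closed_right = TRUE))   # (1,3]
print(in_interval(x, 1, 3, closed_left = FALSE, closed_right = FALSE))  # (1,3)

# A shared limit belongs to exactly one of two adjacent half-open classes
in_first  <- in_interval(13000, 8000, 13000)    # [8000, 13000)
in_second <- in_interval(13000, 13000, 18000)   # [13000, 18000)
print(c(in_first, in_second))
```

Using the same bracket pattern for every class, such as closed on the left and open on the right, is what keeps a limit value from being counted twice.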


    Dot plot

    • A simple tool to depict the frequencies of values in a dataset
    • The X-axis denotes the value and the corresponding frequency is denoted on the Y-axis
    • Gives an idea about the distribution of values
    • Indicates the intervals within which the variable may not take any values
    • The value with the highest frequency is easily determined
    • To create a dot plot in R, the variable has to be numeric
    • In case of a categorical variable or a variable with class intervals, an equivalent variable assigning a numeric value to each category or class must be created
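A dot plot of this kind can be sketched with base R's `stripchart()`, which stacks one dot per observation above its value. This is an alternative to the `dots()` function from the TeachingDemos package listed on the R-codes slide, chosen here only because it needs no extra package. The `make` vector is a tiny made-up sample used to show the numeric recoding of a categorical variable.

```r
# Percentages for the 30 companies, as listed on the Dataset slide
pct <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
         50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# Values on the X-axis; repeated values are stacked, so height = frequency
stripchart(pct, method = "stack", pch = 16,
           xlab = "Percentage of employees involved")

# Categorical variable: assign a numeric code to each category first
make <- c("Buick", "Saturn", "Buick", "SAAB")            # illustrative sample
make_levels <- c("Buick", "SAAB", "Saturn")
make_code <- as.integer(factor(make, levels = make_levels))
print(make_code)

stripchart(make_code, method = "stack", pch = 16, xlab = "Make (coded)")
```

Fixing the `levels` explicitly keeps the mapping from category to number stable, so the codes can be translated back when reading the plot.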


    Comparison

    Techniques compared: Stem and leaf plot, Frequency table, Dot plot

    Discrete data

    Continuous data: Constructing class intervals can be useful; Need to create class intervals

    Categorical data

    Advantages: Depicts actual values; Can detect unusual observations; Most informative with large data; Best depiction if there are many values but only a few of them have a high frequency

    Disadvantages: Not very informative for a large dataset; Gives less information than a stem and leaf plot


    In this case a stem plot is not at all a good idea. Most importantly, for this variable, we do not need to know the exact values. Knowing the range within which they lie might be sufficient.

    Height is in cm.

    Height (in cm)   Frequency
    147.2            1
    149.5            1
    149.9            1
    151.1            1
    198.1            1

    Clearly, for this continuous data we need to make class intervals!

    Out of the 507 total data points, 147 have unique height values

    Height (in cm)   Frequency
    [146, 152)         5
    [152, 158)        31
    [158, 164)        92
    [164, 170)       101
    [170, 176)       118
    [176, 182)        92
    [182, 188)        40
    [188, 194)        26
    [194, 200)         2


    Conclusion

    • Easy to construct
    • Important tools for getting a feel of the data!
    • Must use the appropriate representation based on the characteristics of the data
    • Helpful in determining the further course of data analysis


    R-codes

    Function             R-code
    Stem and leaf plot   stem(variable_name)
                         Note: scale is an important parameter to explore in R's stem function
    Frequency table      table(variable_name)
    Dot plot             install.packages("TeachingDemos")
                         library(TeachingDemos)
                         dots(variable_name)
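To explore the `scale` parameter mentioned in the note, the same data can be plotted at two scales and the results compared. The sketch below uses the 30-company percentages from the Dataset slide and `capture.output()` to count the printed lines; the exact layout that `stem()` chooses is not asserted here.

```r
# Percentages for the 30 companies, as listed on the Dataset slide
pct <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
         50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

stem(pct)             # default scale = 1
stem(pct, scale = 2)  # roughly twice as long: each stem is split further

# Capture the printed plots to compare their lengths
out_default <- capture.output(stem(pct))
out_scale2  <- capture.output(stem(pct, scale = 2))
print(c(default = length(out_default), scale2 = length(out_scale2)))
```

A larger `scale` spreads the leaves over more rows, which can help when a single stem is overcrowded.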


    Thank you

