
Lecture PPT Section1-1

Date post: 05-Apr-2018
Category:
Upload: engr-shahbaz-siddiqui

Transcript
  • 8/2/2019 Lecture PPT Section1-1

    1/22

    8/16/20

    Variables

    In a study, we collect information (data) from individuals. Individuals can be people, animals, plants, or any object of interest.

    A variable is any characteristic of an individual. A variable varies among individuals.

    Example: age, height, blood pressure, ethnicity, leaf length, first language

    The distribution of a variable tells us what values the variable takes and how often it takes these values.

    Two types of variables

    Variables can be either quantitative

    Something that takes numerical values for which arithmetic operations, such as adding and averaging, make sense.

    Example: How tall you are, your age, your blood cholesterol level, the number of credit cards you own.

    or categorical.

    Something that falls into one of several categories. What can be counted is the count or proportion of individuals in each category.

    Example: Your blood type (A, B, AB, O), your hair color, your ethnicity, whether you paid income tax last tax year or not.


    How do you know if a variable is categorical or quantitative?

    Ask:

    What are the n individuals/units in the sample (of size n)?

    What is being recorded about those n individuals/units?

    Is that a number (quantitative) or a statement (categorical)?

    Individuals in sample   DIAGNOSIS       AGE AT DEATH
    Patient A               Heart disease   56
    Patient B               Stroke          70
    Patient C               Stroke          75
    Patient D               Lung cancer     60
    Patient E               Heart disease   80
    Patient F               Accident        73
    Patient G               Diabetes        69

    Categorical (DIAGNOSIS): Each individual is assigned to one of several categories.

    Quantitative (AGE AT DEATH): Each individual is attributed a numerical value.
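    As a rough sketch of the questions above, the patient table can be written out in Python and each recorded variable checked for whether it holds numbers or category labels. The helper name `variable_type` and its numeric-type test are assumptions made for this illustration, not part of the lecture; a numeric code such as a zip code would still be categorical, so the check is only a first pass.

```python
from collections import Counter
from statistics import mean

# Patient table from the slide: (individual, diagnosis, age at death).
patients = [
    ("Patient A", "Heart disease", 56),
    ("Patient B", "Stroke", 70),
    ("Patient C", "Stroke", 75),
    ("Patient D", "Lung cancer", 60),
    ("Patient E", "Heart disease", 80),
    ("Patient F", "Accident", 73),
    ("Patient G", "Diabetes", 69),
]


def variable_type(values):
    """Hypothetical first-pass rule: all numbers -> quantitative, else categorical."""
    if all(isinstance(v, (int, float)) for v in values):
        return "quantitative"
    return "categorical"


diagnoses = [row[1] for row in patients]
ages = [row[2] for row in patients]

print("DIAGNOSIS is", variable_type(diagnoses))
print("AGE AT DEATH is", variable_type(ages))

# For a categorical variable, we count the individuals in each category.
diagnosis_counts = Counter(diagnoses)
print(diagnosis_counts)

# For a quantitative variable, arithmetic such as averaging makes sense.
mean_age = mean(ages)
print("mean age at death:", mean_age)
```

    Averaging the diagnoses would be meaningless, which is the practical reason to identify the variable type before choosing a summary or a graph.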

    Ways to chart categorical data

    Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.)

    Bar graphs

    Each category is represented by a bar.

    Pie charts

    The slices must represent the parts of one whole.


    Example: Top 10 causes of death in the United States 2001

    Rank  Causes of death       Counts    % of top 10s  % of total deaths
    1     Heart disease         700,142   37%           29%
    2     Cancer                553,768   29%           23%
    3     Cerebrovascular       163,538   9%            7%
    4     Chronic respiratory   123,013   6%            5%
    5     Accidents             101,537   5%            4%
    6     Diabetes mellitus     71,372    4%            3%
    7     Flu and pneumonia     62,034    3%            3%
    8     Alzheimer's disease   53,852    3%            2%
    9     Kidney disorders      39,480    2%            2%
    10    Septicemia            32,238    2%            1%
          All other causes      629,967                 26%

    For each individual who died in the United States in 2001, we record what was the cause of death. The table above is a summary of that information.
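    The "% of top 10s" column can be reproduced from the counts. The sketch below is one way to do that in Python, under the assumption that each percent is the count divided by the top-10 total and rounded to a whole number; the variable names are illustrative.

```python
# Counts for the top 10 causes of death (United States, 2001), from the table.
top10 = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}

top10_total = sum(top10.values())

# Share of the top-10 total, rounded to whole percents as in the table.
pct_of_top10 = {
    cause: round(100 * count / top10_total) for cause, count in top10.items()
}

for cause, pct in pct_of_top10.items():
    print(f"{cause:<20} {pct:>3}%")

# Parts of one whole should add up to 100 (rounding can shift this by a point or two).
pct_sum = sum(pct_of_top10.values())
print("sum of percents:", pct_sum)
```

    Checking the sum is a quick sanity test before drawing a pie chart, since the slices only make sense as parts of a single whole.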

    [Figure: bar graph, "Top 10 causes of deaths in the United States 2001". Vertical axis: Counts (x1000), 0 to 800. One bar each for Heart diseases, Cancers, Cerebrovascular, Chronic respiratory, Accidents, Diabetes mellitus, Flu & pneumonia, Alzheimer's disease, Kidney disorders, Septicemia.]

    Bar graphs

    Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that particular category.

    The number of individuals who died of an accident in 2001 is approximately 100,000.


    [Figure: two bar graphs of the same data, "Top 10 causes of deaths in the United States 2001". Vertical axis: Counts (x1000), 0 to 800.]

    Bar graph sorted by rank: easy to analyze.

    Bar graph sorted alphabetically: much less useful.
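    The effect of ordering can be tried out with a plain-text bar chart. This is a minimal sketch using only the counts from the table; the function name `text_bar_chart` and the scale of one mark per 25,000 deaths are choices made for this illustration.

```python
# Counts from the 2001 table (top 10 causes of death, United States).
deaths = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}


def text_bar_chart(counts, sort_key, per_mark=25_000):
    """Return one line per category; each '#' stands for about per_mark deaths."""
    width = max(len(name) for name in counts)
    rows = sorted(counts.items(), key=sort_key)
    return [f"{name:<{width}} {'#' * round(n / per_mark)}" for name, n in rows]


# Same data, two orderings: largest count first, then alphabetical by category.
by_rank = text_bar_chart(deaths, sort_key=lambda item: -item[1])
alphabetical = text_bar_chart(deaths, sort_key=lambda item: item[0])

print("\n".join(by_rank))
print()
print("\n".join(alphabetical))
```

    Both outputs carry the same information, but the ranked version lets the eye follow a steadily decreasing bar length, while the alphabetical version forces a comparison of every bar against every other.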

    Pie charts

    Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

    [Figure: pie chart, "Percent of people dying from top 10 causes of death in the United States in 2000".]


    [Figure: two pie charts, "Percent of deaths from top 10 causes" and "Percent of deaths from all causes".]

    Make sure your labels match the data.

    Make sure all percents add up to 100.

    Child poverty before and after government intervention (UNICEF, 1996)

    What does this chart tell you?

    The United States has the highest rate of child poverty among developed nations (22% of under 18).

    Its government does the least, through taxes and subsidies, to remedy the problem (size of orange bars and percent difference between orange/blue bars).

    Could you transform this bar graph to fit in 1 pie chart? In two pie charts? Why?

    The poverty line is defined as 50% of national median income.


    Graphing Data with Excel: Bar Graph

    Graphing Data with Excel: Pie Chart


    Tire model for 2969 accidents that involved Firestone tires

    Exercise: Display this set of data graphically using a bar graph and a pie chart.

    Exercise Solution: Bar Graph


    Exercise Solution: Pie Chart

    Using the TI-83/84: Bar Graph


    Using the TI-83/84



    Ways to chart quantitative data

    Histograms and stemplots

    These are summary graphs for a single variable. They are very useful to understand the pattern of variability in the data.

    Line graphs: time plots

    Use when there is a meaningful sequence, like time. The line connecting the points helps emphasize any change over time.

    Histograms

    The range of values that a variable can take is divided into equal-size intervals. The histogram shows the number of individual data points that fall in each interval.

    The first column represents all states with a Hispanic percent in their population between 0% and 4.99%. The height of the column shows how many states (27) have a percent in this range.

    The last column represents all states with a Hispanic percent in their population between 40% and 44.99%. There is only one such state: New Mexico, at 42.1% Hispanic.
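    The column heights described above can be checked by counting. The sketch below uses the 50 state percents listed in this lecture's state table and bins them into intervals 5 percentage points wide; the variable names and the text rendering are assumptions for illustration.

```python
# Percent of Hispanic residents in each of the 50 states (from the lecture's table,
# in alphabetical state order).
percents = [
    1.5, 4.1, 25.3, 2.8, 32.4, 17.1, 9.4, 4.8, 16.8, 5.3,
    7.2, 7.9, 10.7, 3.5, 2.8, 7.0, 1.5, 2.4, 0.7, 4.3,
    6.8, 3.3, 2.9, 1.3, 2.1, 2.0, 5.5, 19.7, 1.7, 13.3,
    42.1, 15.1, 4.7, 1.2, 1.9, 5.2, 8.0, 3.2, 8.7, 2.4,
    1.4, 2.0, 32.0, 9.0, 0.9, 4.7, 7.2, 0.7, 3.6, 6.4,
]

BIN_WIDTH = 5  # each interval covers 5 percentage points: [0, 5), [5, 10), ...

# One counter per interval, from 0% up to the interval holding the largest value.
n_bins = int(max(percents) // BIN_WIDTH) + 1
counts = [0] * n_bins
for p in percents:
    counts[int(p // BIN_WIDTH)] += 1

# Draw the histogram sideways: one row per interval, one star per state.
for i, count in enumerate(counts):
    low = i * BIN_WIDTH
    print(f"{low:>2}% to under {low + BIN_WIDTH:>2}%: {'*' * count} ({count})")
```

    The first row holds 27 states and the last row holds a single state, matching the description of the first and last columns.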


    IQ Scores - Example

     81  101  109  114  123  131
     82  102  110  115  124  133
     89  102  110  116  124  134
     90  102  110  117  124  134
     94  103  112  117  125  136
     96  105  112  117  126  137
     97  106  113  118  127  139
    100  108  113  118  127  139
    101  109  114  122  128  142
    101  109  114  122  130  145

    IQ Scores Example - Histogram

    Interval     Frequency
    ≤ 85         2
    85 - 95      3
    95 - 105     11
    105 - 115    16
    115 - 125    13
    125 - 135    9
    > 135        6
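    The frequency table can be rebuilt from the 60 scores. The sketch below assumes each interval includes its upper endpoint (so a score of 105 is counted in 95 - 105), which is the convention that reproduces the frequencies shown; the names `upper_edges` and `frequencies` are illustrative.

```python
from bisect import bisect_left

# The 60 IQ scores from the example, read row by row.
iq_scores = [
    81, 101, 109, 114, 123, 131,
    82, 102, 110, 115, 124, 133,
    89, 102, 110, 116, 124, 134,
    90, 102, 110, 117, 124, 134,
    94, 103, 112, 117, 125, 136,
    96, 105, 112, 117, 126, 137,
    97, 106, 113, 118, 127, 139,
    100, 108, 113, 118, 127, 139,
    101, 109, 114, 122, 128, 142,
    101, 109, 114, 122, 130, 145,
]

# Upper edge of each interval; scores above the last edge go in the "> 135" interval.
upper_edges = [85, 95, 105, 115, 125, 135]
labels = ["<= 85", "85 - 95", "95 - 105", "105 - 115", "115 - 125", "125 - 135", "> 135"]

frequencies = [0] * len(labels)
for score in iq_scores:
    # bisect_left gives the index of the first edge that is >= score,
    # so a score equal to an edge is counted in the lower interval.
    frequencies[bisect_left(upper_edges, score)] += 1

for label, freq in zip(labels, frequencies):
    print(f"{label:<10} {freq:>2} {'*' * freq}")
```

    Treating the intervals as open on the left and closed on the right matters only for scores that fall exactly on an edge, such as 105, 115 and 125 here.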


    How to Create a Histogram - Excel

    1. Select Histogram & click OK

    2. Input data range

    3. Check Labels (if applicable)

    4. Check Chart Output

    5. Select New Worksheet Ply (default)

    6. Click OK

    How to Create a Histogram - Excel

    Excel creates a frequency table by dividing the data range into intervals of equal width automatically and charts the data.

    Intervals:

    [x ≤ 81], [81 < x ≤ 90.14286], [90.14286 < x ≤ 99.28571], ..., [x > 135.8571]
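    The interval edges listed above are consistent with splitting the range of the IQ scores (81 to 145) into seven equal parts, each 64/7 ≈ 9.14286 wide. The sketch below reproduces them; taking the number of intervals as the integer part of the square root of the sample size is an assumption that happens to give seven here, since the lecture does not say how Excel chooses the count.

```python
import math

n = 60                 # number of IQ scores in the example
low, high = 81, 145    # smallest and largest score

# Assumed rule for the number of equal-width intervals between low and high.
n_intervals = math.isqrt(n)            # integer part of sqrt(60), i.e. 7
width = (high - low) / n_intervals     # 64 / 7

# Edges start at the minimum and step up by one width at a time.
edges = [low + k * width for k in range(n_intervals)]
print([round(edge, 5) for edge in edges])
```

    The first edge is the minimum itself, which is why the first interval in the list is "x ≤ 81" and contains only the smallest score.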


    How to Create a Histogram - Excel

    Excel creates a bar graph (with gaps between intervals). Right-click on any of the bars, select Format Data Point, and move the Gap Width slider to No Gap (0%).

    How to Create a Histogram - Excel

    You can instruct Excel how to divide the data into equal intervals by entering your own bin ranges into a column and then entering that column as the Bin Range in the Histogram dialog window.


    How to Create a Histogram - TI-83/84


    Stem plots

    How to make a stemplot:

    1) Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is that remaining final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.

    2) Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this column.

    3) Write each leaf in the row to the right of its stem, in increasing order out from the stem.

    [Figure: example stemplot with its STEM and LEAVES columns labeled.]
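    The three steps above can be sketched in a few lines of Python. The example reuses the 60 IQ scores from the histogram example, which is a choice made for this illustration; the function name `stemplot` is hypothetical, and the sketch handles whole-number observations only.

```python
from collections import defaultdict

# The 60 IQ scores from the histogram example, read row by row.
iq_scores = [
    81, 101, 109, 114, 123, 131,
    82, 102, 110, 115, 124, 133,
    89, 102, 110, 116, 124, 134,
    90, 102, 110, 117, 124, 134,
    94, 103, 112, 117, 125, 136,
    96, 105, 112, 117, 126, 137,
    97, 106, 113, 118, 127, 139,
    100, 108, 113, 118, 127, 139,
    101, 109, 114, 122, 128, 142,
    101, 109, 114, 122, 130, 145,
]


def stemplot(values):
    """Return the rows of a stemplot for whole-number observations."""
    leaves_by_stem = defaultdict(list)
    # Step 1: the stem is everything but the last digit, the leaf is the last digit.
    # Sorting first means the leaves are appended in increasing order (step 3).
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves_by_stem[stem].append(leaf)
    # Step 2: list every stem from smallest to largest, followed by a vertical bar.
    rows = []
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}")
    return rows


rows = stemplot(iq_scores)
print("\n".join(rows))
```

    Stems with no observations are still printed (with no leaves), so that gaps in the distribution stay visible, just as empty intervals do in a histogram.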


    Percent of Hispanic residents in each of the 50 states

    State            Percent
    Alabama          1.5
    Alaska           4.1
    Arizona          25.3
    Arkansas         2.8
    California       32.4
    Colorado         17.1
    Connecticut      9.4
    Delaware         4.8
    Florida          16.8
    Georgia          5.3
    Hawaii           7.2
    Idaho            7.9
    Illinois         10.7
    Indiana          3.5
    Iowa             2.8
    Kansas           7.0
    Kentucky         1.5
    Louisiana        2.4
    Maine            0.7
    Maryland         4.3
    Massachusetts    6.8
    Michigan         3.3
    Minnesota        2.9
    Mississippi      1.3
    Missouri         2.1
    Montana          2.0
    Nebraska         5.5
    Nevada           19.7
    New Hampshire    1.7
    New Jersey       13.3
    New Mexico       42.1
    New York         15.1
    North Carolina   4.7
    North Dakota     1.2
    Ohio             1.9
    Oklahoma         5.2
    Oregon           8.0
    Pennsylvania     3.2
    Rhode Island     8.7
    South Carolina   2.4
    South Dakota     1.4
    Tennessee        2.0
    Texas            32.0
    Utah             9.0
    Vermont          0.9
    Virginia         4.7
    Washington       7.2
    West Virginia    0.7
    Wisconsin        3.6
    Wyoming          6.4

    Step 1: Sort the data

    Step 2: Assign the values to stems and leaves

    State            Percent
    Maine            0.7
    West Virginia    0.7
    Vermont          0.9
    North Dakota     1.2
    Mississippi      1.3
    South Dakota     1.4
    Alabama          1.5
    Kentucky         1.5
    New Hampshire    1.7
    Ohio             1.9
    Montana          2.0
    Tennessee        2.0
    Missouri         2.1
    Louisiana        2.4
    South Carolina   2.4
    Arkansas         2.8
    Iowa             2.8
    Minnesota        2.9
    Pennsylvania     3.2
    Michigan         3.3
    Indiana          3.5
    Wisconsin        3.6
    Alaska           4.1
    Maryland         4.3
    North Carolina   4.7
    Virginia         4.7
    Delaware         4.8
    Oklahoma         5.2
    Georgia          5.3
    Nebraska         5.5
    Wyoming          6.4
    Massachusetts    6.8
    Kansas           7.0
    Hawaii           7.2
    Washington       7.2
    Idaho            7.9
    Oregon           8.0
    Rhode Island     8.7
    Utah             9.0
    Connecticut      9.4
    Illinois         10.7
    New Jersey       13.3
    New York         15.1
    Florida          16.8
    Colorado         17.1
    Nevada           19.7
    Arizona          25.3
    Texas            32.0
    California       32.4
    New Mexico       42.1

    Stem Plot

    To compare two related distributions, a back-to-back stem plot with common stems is useful.

    Stem plots do not work well for large datasets.

    When the observed values have too many digits, trim the numbers before making a stem plot.

    When plotting a moderate number of observations, you can split each stem.


    Stemplots versus histograms

    Stemplots are quick and dirty histograms that can easily be done by hand, and therefore are very convenient for back-of-the-envelope calculations. However, they are rarely found in scientific or lay publications.

    Interpreting histograms

    When describing the distribution of a quantitative variable, we look for the overall pattern and for striking deviations from that pattern. We can describe the overall pattern of a histogram by its shape, center, and spread.

    [Figure: two versions of the same histogram. One has a line connecting each column: too detailed. The other has a smoothed curve highlighting the overall pattern of the distribution.]


    Most common distribution shapes

    Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

    Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

    Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

    Outliers

    An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

    The overall pattern is fairly symmetrical except for 2 states that clearly do not belong to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

    A large gap in the distribution is typically a sign of an outlier.


    How to create a histogram

    It is an iterative process: try and try again.

    What bin size should you use?

    Not too many bins with either 0 or 1 counts

    Not so summarized that you lose all the information

    Not so detailed that it is no longer a summary

    Rule of thumb: start with 5 to 10 bins

    Look at the distribution and refine your bins

    (There isn't a unique or perfect solution)

    [Figure: histograms of the same data set with different bin sizes, one labeled "Not summarized enough" and one labeled "Too summarized".]
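    The trial-and-error process can be made concrete by binning one data set several ways and counting how many bins hold 0 or 1 observations. The sketch below reuses the 60 IQ scores from the earlier example; the helper `bin_counts`, the starting points, and the three bin widths are assumptions chosen for illustration.

```python
# The 60 IQ scores from the histogram example, read row by row.
iq_scores = [
    81, 101, 109, 114, 123, 131,
    82, 102, 110, 115, 124, 133,
    89, 102, 110, 116, 124, 134,
    90, 102, 110, 117, 124, 134,
    94, 103, 112, 117, 125, 136,
    96, 105, 112, 117, 126, 137,
    97, 106, 113, 118, 127, 139,
    100, 108, 113, 118, 127, 139,
    101, 109, 114, 122, 128, 142,
    101, 109, 114, 122, 130, 145,
]


def bin_counts(values, width, start):
    """Count values in equal-width bins [start, start + width), [start + width, ...)."""
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts


def sparse_bins(counts):
    """Number of bins holding 0 or 1 observations."""
    return sum(1 for c in counts if c <= 1)


too_detailed = bin_counts(iq_scores, width=1, start=81)
reasonable = bin_counts(iq_scores, width=10, start=80)
too_summarized = bin_counts(iq_scores, width=40, start=80)

for name, counts in [("width 1", too_detailed),
                     ("width 10", reasonable),
                     ("width 40", too_summarized)]:
    print(f"{name}: {len(counts)} bins, {sparse_bins(counts)} with 0 or 1 counts")
```

    With a width of 1 most bins are nearly empty, with a width of 40 only two bins remain and the shape is lost, and a width of 10 gives seven bins, inside the 5 to 10 range of the rule of thumb.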


    IMPORTANT NOTE:

    Your data are the way they are. Do not try to force them into a particular shape.

    It is a common misconception that if you have a large enough data set, the data will eventually turn out nice and symmetrical.

    [Figure: "Histogram of Drydays in 1995".]

    Line graphs: time plots

    In a time plot, time always goes on the horizontal, x axis.

    We describe time series by looking for an overall pattern and for striking deviations from that pattern. In a time series:

    A trend is a rise or fall that persists over time, despite small irregularities.

    A pattern that repeats itself at regular intervals of time is called seasonal variation.


    [Figure: time plot, "Retail price of fresh oranges over time".]

    Time is on the horizontal, x axis. The variable of interest (here, retail price of fresh oranges) goes on the vertical, y axis.

    This time plot shows a regular pattern of yearly variations. These are seasonal variations in fresh orange pricing, most likely due to similar seasonal variations in the production of fresh oranges.

    There is also an overall upward trend in pricing over time. It could simply be reflecting inflation trends or a more fundamental change in this industry.

    1918 influenza epidemic

    Date       # Cases   # Deaths
    week 1     36        0
    week 2     531       0
    week 3     4233      130
    week 4     8682      552
    week 5     7164      738
    week 6     2229      414
    week 7     600       198
    week 8     164       90
    week 9     57        56
    week 10    722       50
    week 11    1517      71
    week 12    1828      137
    week 13    1539      178
    week 14    2416      194
    week 15    3148      290
    week 16    3465      310
    week 17    1440      149

    [Figure: time plot of the 1918 influenza epidemic, week 1 to week 17. Left axis: # cases diagnosed (0 to 10,000). Right axis: # deaths reported (0 to 800). Two series: # Cases and # Deaths.]

    A time plot can be used to compare two or more data sets covering the same time period.

    The pattern over time for the number of flu diagnoses closely resembles that for the number of deaths from the flu, indicating that about 8% to 10% of the people diagnosed that year died shortly afterward, from complications of the flu.
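    The "about 8% to 10%" figure can be checked against the weekly table. The sketch below totals the two series and also finds the week in which each one peaks; the variable names are illustrative, and the overall ratio of deaths to cases is only a rough stand-in for the share of diagnosed people who died.

```python
# Weekly counts from the 1918 influenza table, week 1 through week 17.
cases = [36, 531, 4233, 8682, 7164, 2229, 600, 164, 57,
         722, 1517, 1828, 1539, 2416, 3148, 3465, 1440]
deaths = [0, 0, 130, 552, 738, 414, 198, 90, 56,
          50, 71, 137, 178, 194, 290, 310, 149]

weeks = range(1, len(cases) + 1)

# Week with the largest count in each series (weeks are numbered from 1).
peak_cases_week = max(weeks, key=lambda w: cases[w - 1])
peak_deaths_week = max(weeks, key=lambda w: deaths[w - 1])

# Total deaths as a share of total diagnosed cases over the 17 weeks.
overall_ratio = sum(deaths) / sum(cases)

print("cases peak in week", peak_cases_week)
print("deaths peak in week", peak_deaths_week)
print(f"deaths / cases over all weeks: {100 * overall_ratio:.1f}%")
```

    Deaths peak one week after cases, which fits the reading that people died shortly after being diagnosed, and the overall ratio falls inside the 8% to 10% range quoted above.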


    Scales matter

    How you stretch the axes and choose your scales can give a different impression.

    [Figure: four time plots of the same data, "Death rates from cancer (US, 1945-95)". Horizontal axis: Years (1940 to 2000). Vertical axis: Death rate (per thousand). Three plots use a vertical scale from 0 to 250; the fourth uses a vertical scale from 120 to 220.]

    A picture is worth a thousand words, BUT there is nothing like hard numbers.

    Look at the scales.

    Lesson Summary

    Key Concepts

    Individuals

    Variables

    o Categorical Variables

    o Quantitative Variables

    Graphs for Categorical Variables

    o Bar graphs

    o Pie charts

    Graphs for Quantitative Variables

    o Stemplots

    o Histograms

    o Timeplots

    Examining distributions

    o Overall pattern (Shape, center,

    spread)

    o Deviations from the pattern (Outliers)

    o Trends

    o Seasonal Variation

    Skills Learned

    Displaying data graphically

    o Bar graphs

    o Pie charts

    o Stemplots

    o Histograms

    o Timeplots

    Interpreting graphs

    Describing distributions

