Date post: | 05-Apr-2018 |
Category: |
Documents |
Upload: | engr-shahbaz-siddiqui |
8/2/2019 Lecture PPT Section1-1
8/16/20
Variables
In a study, we collect information (data) from individuals. Individuals
can be people, animals, plants, or any object of interest.
A variable is any characteristic of an individual. A variable varies among
individuals.
Example: age, height, blood pressure, ethnicity, leaf length, first language
The distribution of a variable tells us what values the variable takes and
how often it takes these values.
Two types of variables
Variables can be either quantitative
Something that takes numerical values for which arithmetic operations,
such as adding and averaging, make sense.
Example: How tall you are, your age, your blood cholesterol level, the
number of credit cards you own.
or categorical.
Something that falls into one of several categories. What we can report
is the count or proportion of individuals in each category.
Example: Your blood type (A, B, AB, O), your hair color, your ethnicity,
whether you paid income tax last tax year or not.
How do you know if a variable is categorical or quantitative?
Ask:
What are the n individuals/units in the sample (of size n)?
What is being recorded about those n individuals/units?
Is that a number (quantitative) or a statement (categorical)?
Individuals
in sample     DIAGNOSIS        AGE AT DEATH
Patient A     Heart disease    56
Patient B     Stroke           70
Patient C     Stroke           75
Patient D     Lung cancer      60
Patient E     Heart disease    80
Patient F     Accident         73
Patient G     Diabetes         69
Categorical (DIAGNOSIS): Each individual is assigned to one of several categories.
Quantitative (AGE AT DEATH): Each individual is attributed a numerical value.
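The difference between the two columns can be sketched in a few lines of Python. This is only an illustration using the seven patients from the table above; the variable names are made up for the example. Categories are counted, while numerical values can be averaged.

```python
from collections import Counter
from statistics import mean

# Rows copied from the table above: (individual, diagnosis, age at death).
patients = [
    ("Patient A", "Heart disease", 56),
    ("Patient B", "Stroke", 70),
    ("Patient C", "Stroke", 75),
    ("Patient D", "Lung cancer", 60),
    ("Patient E", "Heart disease", 80),
    ("Patient F", "Accident", 73),
    ("Patient G", "Diabetes", 69),
]

diagnoses = [diagnosis for _, diagnosis, _ in patients]
ages = [age for _, _, age in patients]

# Categorical variable: arithmetic on labels is meaningless, so count per category.
diagnosis_counts = Counter(diagnoses)

# Quantitative variable: adding and averaging make sense.
mean_age = mean(ages)

print(diagnosis_counts)
print(mean_age)  # 69
```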
Ways to chart categorical data
Because the variable is categorical, the data in the graph can be
ordered any way we want (alphabetical, by increasing value, by year,
by personal preference, etc.)
Bar graphs
Each category is represented by a bar.
Pie charts
The slices must represent the parts of one whole.
Example: Top 10 causes of death in the United States 2001
Rank  Causes of death       Counts    % of top 10   % of total deaths
1     Heart disease         700,142   37%           29%
2     Cancer                553,768   29%           23%
3     Cerebrovascular       163,538   9%            7%
4     Chronic respiratory   123,013   6%            5%
5     Accidents             101,537   5%            4%
6     Diabetes mellitus     71,372    4%            3%
7     Flu and pneumonia     62,034    3%            3%
8     Alzheimer's disease   53,852    3%            2%
9     Kidney disorders      39,480    2%            2%
10    Septicemia            32,238    2%            1%
      All other causes      629,967                 26%
For each individual who died in the United States in 2001, we record what was
the cause of death. The table above is a summary of that information.
[Figure: bar graph, "Top 10 causes of deaths in the United States 2001"; vertical axis: Counts (x1000), 0 to 800; one bar per cause of death]
Bar graphs
Each category is represented by one bar. The bar's height shows the count (or
sometimes the percentage) for that particular category.
The number of individuals
who died of an accident in
2001 is approximately
100,000.
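As a rough sketch of how such a bar graph is built, the snippet below draws one text bar per category from the counts in the table of top 10 causes. The function name `text_bar_graph` and the choice of one `#` per 20,000 deaths are assumptions made for this illustration. The bars are sorted by count, which makes the ranking easy to read.

```python
# Counts copied from the table of the top 10 causes of death (US, 2001).
causes = {
    "Heart disease": 700142,
    "Cancer": 553768,
    "Cerebrovascular": 163538,
    "Chronic respiratory": 123013,
    "Accidents": 101537,
    "Diabetes mellitus": 71372,
    "Flu and pneumonia": 62034,
    "Alzheimer's disease": 53852,
    "Kidney disorders": 39480,
    "Septicemia": 32238,
}


def text_bar_graph(counts, unit=20000):
    """Return one line per category; each '#' stands for `unit` individuals."""
    lines = []
    # Sort by count, largest first (the "sorted by rank" layout).
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(count / unit)
        # The trailing number is the count in thousands, like the graph's axis.
        lines.append(f"{name:<20} {bar} {count / 1000:.0f}")
    return lines


for line in text_bar_graph(causes):
    print(line)
```

The line for accidents ends in 102, that is, roughly 100,000 deaths, which is what the bar height shows in the graph.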
Top 10 causes of deaths in the United States 2001
[Figure: bar graph sorted by rank; vertical axis: Counts (x1000), 0 to 800; one bar per cause of death]
Bar graph sorted by rank: easy to analyze
[Figure: the same bar graph with the causes sorted alphabetically]
Sorted alphabetically: much less useful
Percent of people dying from
top 10 causes of death in the United States in 2000
Pie charts
Each slice represents a piece of one whole. The size of a slice depends on what
percent of the whole this category represents.
Percent of deaths from top 10 causes
Percent of
deaths from
all causes
Make sure your
labels match
the data.
Make sure
all percents
add up to 100.
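The check that all percents add up to 100 can be sketched as follows, using the counts from the top 10 causes table as the "whole". The variable names are illustrative. Note that rounded percents do not always sum to exactly 100 in general; for these counts they do.

```python
# Counts copied from the table of the top 10 causes of death (US, 2001).
top10 = {
    "Heart disease": 700142,
    "Cancer": 553768,
    "Cerebrovascular": 163538,
    "Chronic respiratory": 123013,
    "Accidents": 101537,
    "Diabetes mellitus": 71372,
    "Flu and pneumonia": 62034,
    "Alzheimer's disease": 53852,
    "Kidney disorders": 39480,
    "Septicemia": 32238,
}

# The whole pie is the total over the ten categories shown.
whole = sum(top10.values())

# Each slice is the category's share of the whole, rounded to a whole percent.
slices = {name: round(100 * count / whole) for name, count in top10.items()}

for name, percent in slices.items():
    print(f"{name:<20} {percent}%")

print("Sum of slices:", sum(slices.values()))  # 100
```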
Child poverty before and after government
intervention (UNICEF, 1996)
What does this chart tell you?
The United States has the highest rate of child
poverty among developed nations (22% of those under 18).
Its government does the least, through taxes and
subsidies, to remedy the problem (size of orange
bars and percent difference between orange/blue
bars).
Could you transform this bar graph to fit in 1 pie
chart? In two pie charts? Why?
The poverty line is defined as 50% of national median income.
Graphing Data with Excel - Bar Graph
Graphing Data with Excel - Pie Chart
Tire model for 2969 accidents that involved Firestone tires
Exercise: Display this set of data graphically using a bar graph and a
pie chart.
Exercise Solution - Bar Graph
Exercise Solution - Pie Chart
Using the TI-83/84 - Bar Graph
Using the TI-83/84
Using the TI-83/84
Ways to chart quantitative data
Histograms and stemplots
These are summary graphs for a single variable. They are very useful to
understand the pattern of variability in the data.
Line graphs: time plots
Use when there is a meaningful sequence, like time. The line connecting
the points helps emphasize any change over time.
Histograms
The range of values that a
variable can take is divided
into equal size intervals.
The histogram shows the
number of individual data
points that fall in each
interval.
The first column represents all states with a Hispanic percent in their
population between 0% and 4.99%. The height of the column shows how
many states (27) have a percent in this range.
The last column represents all states with a Hispanic percent in their
population between 40% and 44.99%. There is only one such state: New
Mexico, at 42.1% Hispanics.
IQ Scores - Example
81 101 109 114 123 131
82 102 110 115 124 133
89 102 110 116 124 134
90 102 110 117 124 134
94 103 112 117 125 136
96 105 112 117 126 137
97 106 113 118 127 139
100 108 113 118 127 139
101 109 114 122 128 142
101 109 114 122 130 145
IQ Scores Example - Histogram
Interval     Frequency
≤ 85         2
85 - 95      3
95 - 105     11
105 - 115    16
115 - 125    13
125 - 135    9
> 135        6
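The frequency table above can be reproduced from the 60 IQ scores with a short sketch. One assumption is made explicit here: each interval includes its upper endpoint (for example, a score of 105 is counted in 95 - 105), which is the convention that matches the frequencies listed. The helper name `frequency_table` is made up for this example.

```python
from bisect import bisect_left

# The 60 IQ scores from the example, copied row by row.
iq_scores = [
    81, 101, 109, 114, 123, 131,
    82, 102, 110, 115, 124, 133,
    89, 102, 110, 116, 124, 134,
    90, 102, 110, 117, 124, 134,
    94, 103, 112, 117, 125, 136,
    96, 105, 112, 117, 126, 137,
    97, 106, 113, 118, 127, 139,
    100, 108, 113, 118, 127, 139,
    101, 109, 114, 122, 128, 142,
    101, 109, 114, 122, 130, 145,
]

# Upper endpoints of the intervals; anything above the last one goes in "> 135".
upper_bounds = [85, 95, 105, 115, 125, 135]


def frequency_table(values, bounds):
    """Count values per interval, each interval including its upper endpoint."""
    counts = [0] * (len(bounds) + 1)
    for value in values:
        # bisect_left returns the index of the first bound >= value.
        counts[bisect_left(bounds, value)] += 1
    return counts


counts = frequency_table(iq_scores, upper_bounds)
print(counts)  # [2, 3, 11, 16, 13, 9, 6]
```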
How to Create a Histogram - Excel
1. Select Histogram & click OK
2. Input data range
3. Check Labels (if applicable)
4. Check Chart Output
5. Select New Worksheet Ply (default)
6. Click OK
How to Create a Histogram - Excel
Excel creates a frequency table by automatically dividing the data range
into intervals of equal width, and charts the data.
Intervals:
[x ≤ 81], [81 < x ≤ 90.14286], [90.14286 < x ≤ 99.28571], ..., [x > 135.8571]
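The interval boundaries listed above are consistent with splitting the range of the IQ scores (minimum 81, maximum 145) into 7 equal steps. The sketch below reproduces them under that assumption; the step count of 7 is inferred from the listed numbers, not taken from Excel's documentation, and the function name is hypothetical.

```python
def equal_width_boundaries(lowest, highest, n_steps):
    """Return the boundaries lowest, lowest + w, ..., with w = range / n_steps."""
    width = (highest - lowest) / n_steps
    return [lowest + k * width for k in range(n_steps)]


# 81 and 145 are the smallest and largest IQ scores in the example;
# 7 steps is an assumption that matches the boundaries shown on the slide.
boundaries = equal_width_boundaries(81, 145, 7)

print([round(b, 5) for b in boundaries])
```

The first boundary is the minimum itself, and the last one (about 135.8571) is where the final "more than" interval starts.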
How to Create a Histogram - Excel
Excel creates a bar graph (with gaps between intervals). Right-click on
any of the bars, select Format Data Point, and move the Gap
Width slider to No Gap (0%).
How to Create a Histogram - Excel
You can instruct Excel how to divide the data into equal intervals by
entering your own bin ranges in a column and then entering that column as the
Bin Range in the Histogram dialog window.
How to Create a Histogram - TI-83/84
How to Create a Histogram - TI-83/84
How to Create a Histogram - TI-83/84
Stem plots
How to make a stemplot:
1) Separate each observation into a stem, consisting of
all but the final (rightmost) digit, and a leaf, which is
that remaining final digit. Stems may have as many
digits as needed, but each leaf contains only a single
digit.
2) Write the stems in a vertical column with the smallest
value at the top, and draw a vertical line at the right
of this column.
3) Write each leaf in the row to the right of its stem, in
increasing order out from the stem.
STEM LEAVES
Percent of Hispanic residents in each of the 50 states
State            Percent
Alabama          1.5
Alaska           4.1
Arizona          25.3
Arkansas         2.8
California       32.4
Colorado         17.1
Connecticut      9.4
Delaware         4.8
Florida          16.8
Georgia          5.3
Hawaii           7.2
Idaho            7.9
Illinois         10.7
Indiana          3.5
Iowa             2.8
Kansas           7
Kentucky         1.5
Louisiana        2.4
Maine            0.7
Maryland         4.3
Massachusetts    6.8
Michigan         3.3
Minnesota        2.9
Mississippi      1.3
Missouri         2.1
Montana          2
Nebraska         5.5
Nevada           19.7
New Hampshire    1.7
New Jersey       13.3
New Mexico       42.1
New York         15.1
North Carolina   4.7
North Dakota     1.2
Ohio             1.9
Oklahoma         5.2
Oregon           8
Pennsylvania     3.2
Rhode Island     8.7
South Carolina   2.4
South Dakota     1.4
Tennessee        2
Texas            32
Utah             9
Vermont          0.9
Virginia         4.7
Washington       7.2
West Virginia    0.7
Wisconsin        3.6
Wyoming          6.4
Step 1: Sort the data
Step 2: Assign the values to stems and leaves
State            Percent
Maine            0.7
West Virginia    0.7
Vermont          0.9
North Dakota     1.2
Mississippi      1.3
South Dakota     1.4
Alabama          1.5
Kentucky         1.5
New Hampshire    1.7
Ohio             1.9
Montana          2
Tennessee        2
Missouri         2.1
Louisiana        2.4
South Carolina   2.4
Arkansas         2.8
Iowa             2.8
Minnesota        2.9
Pennsylvania     3.2
Michigan         3.3
Indiana          3.5
Wisconsin        3.6
Alaska           4.1
Maryland         4.3
North Carolina   4.7
Virginia         4.7
Delaware         4.8
Oklahoma         5.2
Georgia          5.3
Nebraska         5.5
Wyoming          6.4
Massachusetts    6.8
Kansas           7
Hawaii           7.2
Washington       7.2
Idaho            7.9
Oregon           8
Rhode Island     8.7
Utah             9
Connecticut      9.4
Illinois         10.7
New Jersey       13.3
New York         15.1
Florida          16.8
Colorado         17.1
Nevada           19.7
Arizona          25.3
Texas            32
California       32.4
New Mexico       42.1
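The two steps (sort the data, then assign each value to a stem and a leaf) can be sketched in Python for the Hispanic percent data. In this sketch the stem is the whole-number part and the leaf is the tenths digit; whether the slide's own figure used exactly these stems cannot be recovered from the text, so treat the layout as an assumption. The function name `stemplot` is made up for the example.

```python
from collections import defaultdict

# Percent of Hispanic residents, one value per state (alphabetical order).
percents = [
    1.5, 4.1, 25.3, 2.8, 32.4, 17.1, 9.4, 4.8, 16.8, 5.3,
    7.2, 7.9, 10.7, 3.5, 2.8, 7, 1.5, 2.4, 0.7, 4.3,
    6.8, 3.3, 2.9, 1.3, 2.1, 2, 5.5, 19.7, 1.7, 13.3,
    42.1, 15.1, 4.7, 1.2, 1.9, 5.2, 8, 3.2, 8.7, 2.4,
    1.4, 2, 32, 9, 0.9, 4.7, 7.2, 0.7, 3.6, 6.4,
]


def stemplot(values):
    """Return the lines of a stemplot: stem = whole part, leaf = tenths digit."""
    rows = defaultdict(list)
    # Step 1: sort, so leaves end up in increasing order within each stem.
    for value in sorted(values):
        # Step 2: split off the final digit (tenths) as the leaf.
        stem, leaf = divmod(round(value * 10), 10)
        rows[stem].append(leaf)
    lines = []
    # Keep empty stems so that gaps in the distribution stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


for line in stemplot(percents):
    print(line)
```

With 43 stems for 50 observations the plot is very stretched out, which illustrates why trimming the numbers or choosing coarser stems is often needed.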
Stem Plot
To compare two related distributions, a back-to-back stem plot with
common stems is useful.
Stem plots do not work well for large datasets.
When the observed values have too many digits, trim the numbers
before making a stem plot.
When plotting a moderate number of observations, you can split
each stem.
Stemplots versus histograms
Stemplots are quick and dirty histograms that can easily be done by
hand, and therefore are very convenient for back-of-the-envelope
calculations. However, they are rarely found in scientific or lay
publications.
Interpreting histograms
When describing the distribution of a quantitative variable, we look for the
overall pattern and for striking deviations from that pattern. We can describe
the overall pattern of a histogram by its shape, center, and spread.
Histogram with a line connecting
each column: too detailed
Histogram with a smoothed curve
highlighting the overall pattern of
the distribution
Most common distribution shapes
A distribution is symmetric if the right and left
sides of the histogram are approximately mirror
images of each other.
Symmetric
distribution
Complex,
multimodal
distribution
Not all distributions have a simple overall shape,
especially when there are few observations.
Skewed
distribution
A distribution is skewed to the right if the right
side of the histogram (side with larger values)
extends much farther out than the left side. It is
skewed to the left if the left side of the histogram
extends much farther out than the right side.
Alaska Florida
Outliers
An important kind of deviation is an outlier. Outliers are observations
that lie outside the overall pattern of a distribution. Always look for
outliers and try to explain them.
The overall pattern is fairly
symmetrical except for 2
states that clearly do not
belong to the main trend.
Alaska and Florida have
unusual representation of
the elderly in their population.
A large gap in the
distribution is typically a
sign of an outlier.
How to create a histogram
It is an iterative process: try and try again.
What bin size should you use?
Not too many bins with either 0 or 1 counts
Not so summarized that you lose all the information
Not so detailed that it is no longer a summary
Rule of thumb: start with 5 to 10 bins
Look at the distribution and refine your bins
(There isn't a unique or perfect solution)
Not
summarized
enough
Too summarized
Same data set
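The "try and try again" process can be sketched by counting the same data set with different numbers of bins. The example below reuses the 60 IQ scores from the earlier example (not the data behind the figures on this slide) and a hypothetical helper `histogram_counts`. Unlike the Excel example, each bin here includes its lower endpoint, and the maximum is placed in the last bin.

```python
# The 60 IQ scores from the earlier example, copied row by row.
iq_scores = [
    81, 101, 109, 114, 123, 131,
    82, 102, 110, 115, 124, 133,
    89, 102, 110, 116, 124, 134,
    90, 102, 110, 117, 124, 134,
    94, 103, 112, 117, 125, 136,
    96, 105, 112, 117, 126, 137,
    97, 106, 113, 118, 127, 139,
    100, 108, 113, 118, 127, 139,
    101, 109, 114, 122, 128, 142,
    101, 109, 114, 122, 130, 145,
]


def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values)..max(values)."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum would fall just past the last bin, so clamp it in.
        index = min(int((value - lowest) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Too summarized, a reasonable starting point, and not summarized enough.
for n_bins in (3, 7, 30):
    print(n_bins, histogram_counts(iq_scores, n_bins))
```

With 3 bins the counts are [14, 27, 19], which hides most of the shape; with 30 bins many bins hold only 0 or 1 observations.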
IMPORTANT NOTE:
Your data are the way they are.
Do not try to force them into a
particular shape.
It is a common misconception
that if you have a large enough
data set, the data will eventually
turn out nice and symmetrical.
Histogram of Drydays in 1995
Line graphs: time plots
In a time plot, time always goes on the horizontal, x axis.
We describe time series by looking for an overall pattern and for striking
deviations from that pattern. In a time series:
A trend is a rise or fall that persists over time, despite
small irregularities.
A pattern that repeats itself at regular intervals of time is
called seasonal variation.
Retail price of fresh oranges
over time
This time plot shows a regular pattern of yearly variations. These are seasonal
variations in fresh orange pricing most likely due to similar seasonal variations in
the production of fresh oranges.
There is also an overall upward trend in pricing over time. It could simply be
reflecting inflation trends or a more fundamental change in this industry.
Time is on the horizontal, x axis.
The variable of interest (here,
retail price of fresh oranges)
goes on the vertical, y axis.
1918 influenza epidemic
Date      # Cases   # Deaths
week 1    36        0
week 2    531       0
week 3    4233      130
week 4    8682      552
week 5    7164      738
week 6    2229      414
week 7    600       198
week 8    164       90
week 9    57        56
week 10   722       50
week 11   1517      71
week 12   1828      137
week 13   1539      178
week 14   2416      194
week 15   3148      290
week 16   3465      310
week 17   1440      149
[Figure: time plot of the 1918 influenza epidemic, week 1 to week 17; left axis: # cases diagnosed, 0 to 10,000; right axis: # deaths reported, 0 to 800; one line for # Cases and one for # Deaths]
A time plot can be used to compare two or more
data sets covering the same time period.
The pattern over time for the number of flu diagnoses closely resembles that for the
number of deaths from the flu, indicating that about 8% to 10% of the people
diagnosed that year died shortly afterward, from complications of the flu.
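The 8% to 10% figure can be checked against the weekly table with a short calculation. This sketch simply totals the two columns over the 17 weeks and divides; the variable names are illustrative.

```python
# (cases, deaths) for weeks 1 to 17, copied from the table above.
weekly = [
    (36, 0), (531, 0), (4233, 130), (8682, 552), (7164, 738),
    (2229, 414), (600, 198), (164, 90), (57, 56), (722, 50),
    (1517, 71), (1828, 137), (1539, 178), (2416, 194), (3148, 290),
    (3465, 310), (1440, 149),
]

total_cases = sum(cases for cases, _ in weekly)
total_deaths = sum(deaths for _, deaths in weekly)

# Overall share of diagnosed cases that ended in a reported death.
death_share = total_deaths / total_cases

print(total_cases, total_deaths)      # 39771 3557
print(f"{100 * death_share:.1f}%")    # 8.9%
```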
[Figures: three time plots titled "Death rates from cancer (US, 1945-95)", showing the same data with differently stretched axes; horizontal axis: Years, 1940 to 2000; vertical axis: Death rate (per thousand), 0 to 250]
A picture is worth a
thousand words,
BUT
There is nothing like
hard numbers.
Look at the scales.
Scales matter
How you stretch the axes and choose your
scales can give a different impression.
[Figure: time plot, "Death rates from cancer (US, 1945-95)"; horizontal axis: Years, 1940 to 2000; vertical axis: Death rate (per thousand), 120 to 220]
Lesson Summary
Key Concepts
Individuals
Variables
o Categorical Variables
o Quantitative Variables
Graphs for Categorical Variables
o Bar graphs
o Pie charts
Graphs for Quantitative Variables
o Stemplots
o Histograms
o Timeplots
Examining distributions
o Overall pattern (Shape, center, spread)
o Deviations from the pattern (Outliers)
o Trends
o Seasonal Variation
Skills Learned
Displaying data graphically
o Bar graphs
o Pie charts
o Stemplots
o Histograms
o Timeplots
Interpreting graphs
Describing distributions