Biostatistics Introduction & Data Presentation (Short Version)
Intro - 1
1
Biostatistics
Andy Chang
Youngstown State University
2
Statistics in Broader Sense
Statistics is a field of study concerned with
1) data collection, [Producing Data]
2) organization, summarization, examination, and providing an overview of the general features of data, [Exploring Data]
3) and the drawing of inferences about a body of data (population) based on the properties of a part of the data (sample) observed. [Statistical Inference]
3
In health and medical (or clinical) studies, researchers investigate a sample of subjects to understand the effectiveness of a treatment or an intervention on a target population.
Therefore, statistical inference is very important in health and medical research.
4
Public health is fundamentally concerned with preventing disease, disability, and premature death in a human population or community.
Therefore, statistical inference is very important in public health research.
Goal 2 of the Healthy People 2010: Eliminate Health Disparities
5
Basic Terms In Statistics
Individuals (subjects, experimental units): the entities on which data are collected.
Variable: a characteristic of interest for the individual, which takes on different values for different individuals.
Purpose of Statistics
Examine a variable
Examine correlation between two or more variables
6
Variable Types
Quantitative Variables (numeric) [height, number of subscriptions, ...]
– Continuous: a variable that has an uncountable number of possible values. (measurements)
– Discrete: a variable that has a countable number of possible values. (counts)
Qualitative (Categorical) Variables [hair color, gender, ...]
7
Measurement Scales
• Nominal: consists of labels, names, or categories.
• Ordinal: data for which the order or rank is meaningful.
• Interval: numerical data for which arithmetic operations such as differences are meaningful.
• Ratio: data for which the ratio of two values is meaningful.
8
Producing Data
9
Data in Public Health :
Vital Statistics and the Census
Public Health Surveillance
Survey
Registries
Epidemic Investigations
Research
Program Evaluations
10
Exploratory Data Analysis
Descriptive Statistics
Data Presentation
– Grouping tables
– Graphical summary
Numerical Summary
– Center
– Dispersion (Spread)
11
Data Presentation
What type of statistical technique is appropriate for data presentation?
Categorical variable?
Quantitative variable?
12
Data Sheet (Raw data)
ID   Height(in)   Weight(lb)   BirthMonth   Exp.   Gender
 1        6          135            4        H       F
 2       63          119            9        H       F
 3       72          175           11        T       M
 4       60          106            9        H       F
 5       65          135            8        T       F
 6       72          170           10        H       M
 7       64          180            8        H       F
 8       71          205           10        H       M
...
13
(A complete list)
ID   Height   Weight   BirthMonth   Exp.   Gender
 1      6      135         4         H       F
 2     63      119         9         H       F
 3     72      175        11         T       M
 4     60      106         9         H       F
 5     65      135         8         T       F
 6     72      170        10         H       M
 7     64      180         8         H       F
 8     71      205        10         H       M
 9     75      195         6         T       M
10     71      185         8         H       M
11     71      182         6         T       M
12     65      108         8         T       F
13     73      150         4         H       M
14     67      128         6         T       F
15     74      175         6         H       M
16     66      160         9         H       F
17     65      143         9         T       F
18     72      190        11         T       M
19     64      180         2         H       M
20     61      195         5         T       M
21     72      220         7         H       M
22     69      285         7         H       M
14
Grouping and Displaying A Categorical Variable
15
Frequency Table and Charts(One Categorical Variable)
Class     Frequency   Relative Frequency
Female        9       9/22 = .409 = 40.9%
Male         13       13/22 = .591 = 59.1%
Total        22       100%
[Figure: bar chart of percent by sex: Female 40.9%, Male 59.1%]
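The frequency table above can be reproduced with a short script. This is only a sketch: the `genders` string is the Gender column of the 22-row data sheet typed in by hand, and `frequency_table` is a hypothetical helper name, not something from the slides.

```python
from collections import Counter

# Gender column of the data sheet, IDs 1 to 22 (typed in by hand).
genders = list("FFMFFMFMMMMFMFMFFMMMMM")

def frequency_table(values):
    """Return {class: (frequency, relative frequency)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in sorted(counts.items())}

table = frequency_table(genders)
for label, (count, rel) in table.items():
    # One row per class: frequency, relative frequency, and percent.
    print(f"{label}: {count}  {rel:.3f}  {rel:.1%}")
```

The relative frequencies always add to 1, which is a quick check that no observation was dropped.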
16
[Figure: Pareto chart for the survey item "How do you describe yourself", with one bar per response category, a Count axis (0 to 20000), and a cumulative Percent axis (0 to 100)]
Pareto chart
* Bars arranged according to their frequencies.
17
Grouping and Displaying Quantitative Data
18
Frequency Distribution Table
Data: 135, 119, 175, 106, 135, 170, 180, 205, 195, …
Class         Tally   Frequency
100 - <120
120 - <140
140 - <160
160 - <180
180 - <200
200 - <220
220 - <240
240 - <260
260 - <280
280 - <300
Total
("280 - <300" means 280 to less than 300.)
19
Frequency Distribution Table
Data: 135, 119, 175, 106, 135, 170, 180, 205, 195, …
Class         Tally   Frequency
100 - <120                3
120 - <140                3
140 - <160                2
160 - <180                4
180 - <200                7
200 - <220                1
220 - <240                1
240 - <260                0
260 - <280                0
280 - <300                1
Total                    22
("280 - <300" means 280 to less than 300.)
20
Frequency Distribution Table
(From data sheet)
Class         Frequency   Relative Freq.   Cumulative R.F.
100 - <120        3       3/22 = .136          3/22
120 - <140        3       3/22 = .136          6/22
140 - <160        2       2/22 = .091          8/22
160 - <180        4       4/22 = .182         12/22
180 - <200        7       7/22 = .318         19/22
200 - <220        1       1/22 = .045         20/22
220 - <240        1       1/22 = .045         21/22
240 - <260        0       0/22 = .000         21/22
260 - <280        0       0/22 = .000         21/22
280 - <300        1       1/22 = .045         22/22
Total            22       1.000
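As a rough sketch of how such a table can be tallied, the snippet below groups the 22 weights from the data sheet into classes of the form "lower to less than upper". The function name `frequency_distribution` and its parameters are illustrative choices, not part of the slides.

```python
# Weights (lb) from the complete data sheet, IDs 1 to 22.
weights = [135, 119, 175, 106, 135, 170, 180, 205, 195, 185, 182,
           108, 150, 128, 175, 160, 143, 190, 180, 195, 220, 285]

def frequency_distribution(data, start, width, n_classes):
    """Tally data into classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_classes
    for x in data:
        counts[int((x - start) // width)] += 1  # index of the class containing x
    rows, cumulative = [], 0
    for i, count in enumerate(counts):
        cumulative += count
        lower = start + i * width
        rows.append((lower, lower + width, count,
                     count / len(data), cumulative / len(data)))
    return rows

rows = frequency_distribution(weights, start=100, width=20, n_classes=10)
for lower, upper, freq, rel, cum in rows:
    print(f"{lower} - <{upper}: {freq:2d}  {rel:.3f}  {cum:.3f}")
```

Each value lands in exactly one class because the classes are half-open intervals, which is the "mutually exclusive" requirement for grouping.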
21
Classes: Categories for grouping data.
Frequency (class frequency): The number of data values in a class.
Relative frequency: The ratio of the frequency of a class to the total number of pieces of data.
Frequency distribution: A listing of classes and their frequencies.
Relative Frequency distribution: A listing of classes and their relative frequencies.
Upper class limit: The largest value that can go in a class.
Lower class limit: The smallest value that can go in a class.
Class width: The difference between the lower class limit of the given class and the lower class limit of the next higher class.
Class midpoint (class mark): The midpoint of a class.
22
Guidelines for grouping data: (for quantitative variable)
There should be between five and twenty classes.
Each piece of data must belong to one, and only one, class. (Mutually Exclusive)
Whenever feasible, all classes should have the same width.
23
To build a Frequency Table:
Find the range of the data: Range = largest value – smallest value
Use the range and try different class widths to determine how many classes you need to make a frequency table or histogram.
Student data example:
Range = 285 – 106 = 179, and 179/20 ≈ 9
If using a class width of 20, there will be about 9 classes, which is good.
24
Frequency Distribution Table
(From data sheet)
Class         Frequency   Relative Freq.   Cumulative R.F.
100< - 120        3       3/22 = .136          3/22
120< - 140        3       3/22 = .136          6/22
140< - 160        3       3/22 = .136          9/22
160< - 180        5       5/22 = .227         14/22
180< - 200        5       5/22 = .227         19/22
200< - 220        2       2/22 = .091         21/22
220< - 240        0       0/22 = .000         21/22
240< - 260        0       0/22 = .000         21/22
260< - 280        0       0/22 = .000         21/22
280< - 300        1       1/22 = .045         22/22
Total            22       1.000
25
Histogram (SPSS)
26
Polygon (SPSS)
27
Polygon (SPSS)
28
Cumulative R. F. Histogram
[Figure: cumulative relative frequency histogram; horizontal axis from 100 to 300 in steps of 20, vertical axis up to 100%]
29
Cumulative R. F. Polygon (Ogive)
[Figure: cumulative relative frequency polygon; horizontal axis from 90 to 290 in steps of 20, vertical axis up to 100%]
30
What to observe in Histograms?
Outliers: observations that stand out from the rest for some reason.
Center: the “middle” of the data.
Spread: the range; the extent of the data; how far the values are from each other.
Shape: distribution pattern. [Skewness, symmetry, uniform, Normal, ...]
31
[Figure: example histogram shapes]
Symmetric (Bell) shape
Uniform
Skewed to the left, or negatively skewed
Bimodal
Skewed to the right, or positively skewed
32
Histogram & Density Curve
Density function, f(x): a smooth curve that describes the distribution.
Use a mathematical model to describe the variable.
[Figure: histogram (Percent axis) with a density curve drawn over it]
33
Stemplots (or Stem-and-leaf plots)
-- leading digits are called stems
-- final digits are called leaves
34
Example: (number of hysterectomies performed by 15 male doctors)
27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28
Stems:
2 |
3 |
4 |
5 |
6 |
7 |
8 |
Stemplot
2 | 75508
3 | 31764
4 | 4
5 | 09
6 |
7 |
8 | 65
35
Example: (number of hysterectomies performed by 15 male doctors)
27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28
2 | 05578
3 | 13467
4 | 4
5 | 09
6 |
7 |
8 | 56
Ordered Stemplot
36
Example: [Back-to-back stem-plot]
Number of hysterectomies performed by 15 male doctors:
27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28
by 10 female doctors, the numbers are:
5, 7, 10, 14, 18, 19, 25, 29, 31, 33
(Female)       (Male)
      75 | 0 |
    9840 | 1 |
      95 | 2 | 05578
      31 | 3 | 13467
         | 4 | 4
         | 5 | 09
         | 6 |
         | 7 |
         | 8 | 56
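An ordered stemplot for two-digit data can be generated mechanically: the tens digit is the stem and the ones digit is the leaf. The sketch below assumes non-negative integers and uses a hypothetical helper name, `ordered_stemplot`.

```python
from collections import defaultdict

def ordered_stemplot(data):
    """Return the lines of an ordered stemplot (stem = tens digit, leaf = ones digit)."""
    leaves = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(str(leaf))
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | {''.join(leaves.get(stem, []))}".rstrip())
    return lines

male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
lines = ordered_stemplot(male)
print("\n".join(lines))
```

Keeping the empty stems (6 and 7 here) matters, because the gap before the two values in the 80s is part of what the plot is meant to show.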
37
Box Plot
[Figure: box plot of HEIGHT, N = 21]
38
Examine Bivariate Data (Bivariate Analysis)
Examine the relation between two variables
39
Two Categorical Variables
Contingency Table
                Cancer      No cancer   Row Total
Smoker          20 (20%)    30 (30%)    50 (50%)
Non-Smoker       5 (5%)     45 (45%)    50 (50%)
Column Total    25 (25%)    75 (75%)    100
Odds of a smoker having cancer: 20/30 = 6/9
Odds of a nonsmoker having cancer: 5/45 = 1/9
Odds Ratio = (6/9)/(1/9) = 6
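The odds ratio arithmetic can be checked with exact fractions. This is a minimal sketch using the four cell counts from the contingency table; the function name `odds` is an illustrative choice.

```python
from fractions import Fraction

def odds(events, non_events):
    """Odds = (number with the outcome) / (number without the outcome)."""
    return Fraction(events, non_events)

smoker_odds = odds(20, 30)     # smokers: 20 with cancer, 30 without
nonsmoker_odds = odds(5, 45)   # nonsmokers: 5 with cancer, 45 without
odds_ratio = smoker_odds / nonsmoker_odds

print(smoker_odds, nonsmoker_odds, odds_ratio)
```

`Fraction` reduces 20/30 to 2/3 (the same value as the 6/9 on the slide) and avoids floating-point rounding in the final ratio.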
40
Two Categorical Variables
Cluster bar chart
41
Two Quantitative Variables
Data:
Temperature   Mortality Index
    34             52
    40             68
    42             63
    42             83
    43             72
    44             81
    45             89
    46             77
    47             88
    48             94
    49             86
    50             95
    51            105
    51            100
    52            102
42
Two Quantitative Variables
[Figure: scatterplot of Mortality Index (axis from 50 to 110) against Average Temperature (axis from 30 to 60)]
Average annual temperature and the mortality index for a type of breast cancer in women in a certain region of Europe.
43
Response/Explanatory Variables
Response (Dependent, Outcome) Variable
– Lung Cancer, Mortality Index
Explanatory (Independent, Predictor) Variable
– Smoking, Average Temperature
44
A Categorical and a Quantitative Variable
[Figure: side-by-side box plots of HEIGHT by sex (Female, Male)]
Side-by-side Boxplot
45
Time Plot
Time   Rate
1995   15.2
1996   15.1
1997   14.9
1998   16.2
1999   14.3
2000   13.2
2001   13.5
[Figure: time plot of Rate against Time, 1995 to 2001]
46
Youngstown Homicide Rate by Year
[Figure: time plot of the homicide rate (HRATE, axis from 0 to 100) by YEAR, 1969 to 1995]
Time Plot
47
Time Plot
National Homicide Rate By Year
[Figure: two time plots of the National Homicide Rate by YEAR (1964 to 1992), one with a vertical axis from 0 to 100 and one with a vertical axis from 0 to 12]
Different Scales
48
Misleading Chart
[Figure: two bar charts of Count by Gender (Female, Male), one with a vertical axis from 210 to 270 and one with a vertical axis from 0 to 300]
Different Scales
49
Incorrect and Misleading Chart
50
Type of Statistical Studies
Observational Study: conditions to which subjects are exposed are not controlled by the investigator. (no attempt is made to control or influence the variables of interest)
Experimental (Controlled) Study: conditions to which subjects are exposed are controlled by the investigator. (treatments are used in order to observe the response) (Randomization, Replications)
51
Results from observing behavior and outcomes from the use of medicine for 200 randomly selected patients. (Patients chose their medicine)
              Hypertension
Treatment     Yes     No    Total
Drug A         44     56     100
Drug B         29     71     100
Total          73    127     200
Drug A: 44/100 = 44%
Drug B: 29/100 = 29%
52
Simpson’s Paradox
              Hypertension
              Below 65              65+
Treatment     Yes   No   Total     Yes   No   Total
Drug A          5   18     23       39   38     77
Drug B         17   60     77       12   11     23
Below 65: Drug A: 5/23 = 22%, Drug B: 17/77 = 22%
65+: Drug A: 39/77 = 51%, Drug B: 12/23 = 52%
* Older patients prefer Drug A
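The reversal can be verified directly from the counts. The sketch below recomputes the hypertension rate for each drug, overall and within each age group, using the numbers in the two tables; the dictionary layout and the helper name `rate` are illustrative assumptions.

```python
# (patients with hypertension, total patients) by drug and age group, from the tables.
counts = {
    ("Drug A", "below 65"): (5, 23),
    ("Drug A", "65+"): (39, 77),
    ("Drug B", "below 65"): (17, 77),
    ("Drug B", "65+"): (12, 23),
}

def rate(drug, age=None):
    """Hypertension rate for a drug, within one age group or pooled over both."""
    cells = [v for (d, a), v in counts.items() if d == drug and (age is None or a == age)]
    yes = sum(y for y, _ in cells)
    total = sum(t for _, t in cells)
    return yes / total

for drug in ("Drug A", "Drug B"):
    print(drug,
          f"overall {rate(drug):.0%},",
          f"below 65 {rate(drug, 'below 65'):.0%},",
          f"65+ {rate(drug, '65+'):.0%}")
```

Pooled over age, Drug A looks much worse (44% against 29%), yet within each age group the two drugs are nearly identical. The gap comes from who chose each drug: 77 of the 100 Drug A patients are 65 or older.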
53
Confounding variables
[Diagram relating Treatment 1 & Treatment 2, Patient’s Survival (Cause?), and Patient’s Age & Health Condition]
54
Confounding Effect
Variables, whether part of a study or not, are said to be confounded when their effects on the outcome cannot be distinguished from each other.
Age may affect the reaction to a drug and may also affect the choice of drug.
Descriptive Statistics (Short Version)
Descriptive Stat - 1
1
Example: Birth weights (in lb) of 5 babies born from two groups of women under different care programs.
Group 1: 7, 6, 8, 7, 7
Group 2: 3, 4, 8, 9, 11
[Figure: birth weights marked on a number line]
2
Describe Distribution with Numbers
Numerical Summary Measures
• Measure of Center
• Measure of Variation
• Measure of Position
3
Measure of Central Tendency
Mean: the average value of the data.
If n observations are denoted by x1, x2, ..., xn, their (sample) mean is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
4
Example: Birth weights (in lb) of 5 babies born from a group of women under certain diet.
7, 6, 8, 7, 7
Sol:
\[ \text{mean} = \frac{7 + 6 + 8 + 7 + 7}{5} = \frac{35}{5} = 7 \]
[near the center of the data set]
5
Median: of a data set is
the data value exactly in the middle of its ordered list if the number of pieces of data is odd,
the mean of the two middle data values in its ordered list if the number of pieces of data is even.
[median is not strongly influenced by outliers and is best for non-symmetric distributions]
6
Example: (number of hysterectomies performed by 15 doctors)
27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28
ordered list => 20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86
median = 34
7
Example: (Birth weights for 6 infants.)
5, 7, 6, 8, 5, 9
ordered list => 5, 5, 6, 7, 8, 9
median = (6+7) / 2 = 6.5
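The mean and median examples above can be checked with Python's standard `statistics` module. This is just a verification sketch; the variable names are illustrative.

```python
from statistics import mean, median

birth_weights = [7, 6, 8, 7, 7]
hysterectomies = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
infant_weights = [5, 7, 6, 8, 5, 9]

# median() sorts the data itself, so the lists can be given in any order.
print(mean(birth_weights))      # mean of the five birth weights
print(median(hysterectomies))   # odd n: the middle value of the ordered list
print(median(infant_weights))   # even n: the mean of the two middle values
```

With an odd number of values the median is an actual data value; with an even number it is the average of the two middle values and need not appear in the data.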
8
Mode: of a data set is the observation that occurs most frequently.
9
Example 1: (number of times visited class website by 15 students)
27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28
ordered list => 20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86
Mode = 25
Example 2: (Blood type of 15 students)
A, B, A, A, O, AB, A, A, B, B, O, O, A, A, A
A – 8
B – 3
O – 3
AB – 1
Mode = A
10
Mean ?
Median ?
Mode ?
Skewed to the Right
11
Measure of Dispersion (Variability)
Does the mother’s diet program affect the birth weights of babies?
Sample from group I (diet program I): 7, 6, 8, 7, 7 => mean = (7 + 6 + 8 + 7 + 7) / 5 = 35/5 = 7
Sample from group II (diet program II): 3, 4, 8, 9, 11 => mean = (3 + 4 + 8 + 9 + 11) / 5 = 35/5 = 7
Range = largest data value − smallest data value
12
Is there any difference between the two samples?
range of sample I = 8 - 6 = 2
range of sample II = 11 - 3 = 8
13
Variance and Standard Deviation
Measure the spread of the data around the center of the data.
14
Example: Birth weights (in lb) of 5 babies born from a group of women under diet program II.
3, 4, 8, 9, 11 ⇒ mean x̄ = 7
Data Value, xi   Deviation from mean, xi − x̄   Squared Dev., (xi − x̄)²
      3              3 – 7 = –4                    16
      4              4 – 7 = –3                     9
      8              8 – 7 = 1                      1
      9              9 – 7 = 2                      4
     11             11 – 7 = 4                     16
   Total                 0                         46
Sample Variance = 46/4 = 11.5 lb², Sample Standard Deviation = √(46/4) = 3.39 lb.
15
A Short Cut formula:
\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{291 - \frac{35^2}{5}}{5 - 1} = 11.5 \]
Data, x    x²
   3        9
   4       16
   8       64
   9       81
  11      121
Total: 35   291
16
What is the standard deviation of the weights of babies from the sample of mothers who received diet program I?
Data: 7, 6, 8, 7, 7
s² = (0+1+1+0+0)/(5−1) = ½
s = √(½) = 0.71
Does the mother’s diet program affect the birth weights of babies?
Diet I: mean = 7, s = 0.71
Diet II: mean = 7, s = 3.39
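The two ways of computing the sample variance (squared deviations from the mean, and the short-cut formula) can be compared in a few lines. This is a sketch with illustrative function names; both use the divisor n − 1.

```python
import math

def sample_variance(data):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_variance_shortcut(data):
    """Short-cut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(data)
    return (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

diet_1 = [7, 6, 8, 7, 7]
diet_2 = [3, 4, 8, 9, 11]

for name, data in (("Diet I", diet_1), ("Diet II", diet_2)):
    s2 = sample_variance(data)
    print(name, s2, sample_variance_shortcut(data), round(math.sqrt(s2), 2))
```

Both samples have mean 7, so only the spread separates them: the standard deviation under diet II is nearly five times that under diet I.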
17
If n observations are denoted by x1, x2, ..., xn, their variance and standard deviation are
Sample Variance:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
(unbiased estimator for variance of an infinite population.)
Sample Standard Deviation:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Sample Mean:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
18
Population Parameters
If x1, x2, ..., xN are all the observations in a finite population of size N, their mean μ, variance σ², and standard deviation σ are
Population Mean:
\[ \mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i \]
Population Variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
Population Standard Deviation:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]
19
About s (sample standard deviation) :
s measures the spread around the mean.
The larger s is, the more spread out the data are.
If s = 0, then all the observations must be equal.
s is strongly influenced by outliers.
20
The Use of Mean and Standard Deviation
Describe distribution
Understand the center and the spread of the distribution
21
Bone Density Data
Unit: mg/ml
          Mean, x̄    Standard Deviation, s
Female     102.2          25
Male       118.4          20
22
Many distributions can be described by a mathematical function with specific parameters, such as mean and standard deviation.
Example: Normal Distribution (Bell-shaped)
[Figure: bell-shaped Normal curve with mean μ and standard deviation σ]
23
Empirical Rule
Properties of a symmetric and bell-shaped (Normal) distribution:
The distribution is symmetric about its mean (μ),
about 68% of the area lies between μ − σ and μ + σ,
about 95% of the area lies between μ − 2σ and μ + 2σ,
about 99.7% of the area lies between μ − 3σ and μ + 3σ.
[Figure: bell-shaped curve marked at μ − 3σ, μ, and μ + 3σ]
24
Heart rates for a certain population at a certain condition follow a bell-shaped symmetric distribution with mean 70 and standard deviation 2.
[Figure: bell-shaped curve marked at 66, 70, and 74]
What percentage of people in this population will have heart rates between 66 and 74?
About 95%, since 66 = 70 − 2(2) and 74 = 70 + 2(2) are two standard deviations below and above the mean.
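The 95% figure is a rounded value. As a sketch, and assuming the heart rates are exactly Normal with mean 70 and standard deviation 2, the area between 66 and 74 can be computed with `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Assumed model: heart rates ~ Normal(mean 70, standard deviation 2).
heart_rate = NormalDist(mu=70, sigma=2)

def area_within(dist, k):
    """Area under the curve within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

p_66_to_74 = heart_rate.cdf(74) - heart_rate.cdf(66)
print(round(p_66_to_74, 4))
for k in (1, 2, 3):
    print(k, round(area_within(heart_rate, k), 4))
```

The exact areas are close to, but not exactly, the 68%, 95%, and 99.7% quoted in the rule.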
25
Chebychev’s Rule (Chebychev’s inequality)
At least 1 – (1/k²) of the data in any data set lie within k standard deviations of their mean (k > 1).
26
Example: Heart rates for asthmatic patients in a state of respiratory arrest have a mean of 140 beats per minute and a standard deviation of 35.5 beats per minute. What percentage of this population of patients have heart rates within two standard deviations of the mean?
It will be at least 75%, because k = 2 and 1 – (1/2²) = ¾ = 75%.
27
Heart rates example: mean = 140, s.d. = 35.5
k = 2
140 − 2 × 35.5 = 69 and 140 + 2 × 35.5 = 211
At least 75% of the heart rates lie between 69 and 211, since 1 − (1/2²) = 75%.
28
What about within three standard deviations? Heart rates example: mean = 140, s.d. = 35.5
k = 3
140 − 3 × 35.5 = 33.5 and 140 + 3 × 35.5 = 246.5
At least ?% of the heart rates lie between 33.5 and 246.5.
At least 89%, since 1 − (1/3²) = 8/9 ≈ 89%.
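The bound and the interval it applies to are easy to tabulate. The sketch below uses the heart rate example (mean 140, s.d. 35.5); the function names are illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def interval(mean, sd, k):
    """The interval mean - k*sd to mean + k*sd."""
    return (mean - k * sd, mean + k * sd)

for k in (2, 3):
    low, high = interval(140, 35.5, k)
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.0%} between {low} and {high}")
```

Unlike the Empirical Rule, this bound makes no assumption about the shape of the distribution, which is why it is weaker (at least 75% rather than about 95% for k = 2).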
29
Measure of Position
Standard Score, Percentile, Quartile
30
Z-score (Standard Score)
Population z-score
If x is an observation from a distribution that has mean μ and standard deviation σ, the standardized value of x is the z-score of x:
\[ z = \frac{x - \mu}{\sigma} = \frac{x - \text{mean}}{\text{standard deviation}} \]
“μ + 3σ” has a z-score of 3, since it is 3 s.d. above the mean.
31
If a distribution has a mean of 10 and a s.d. of 2, the value 7 has a z-score of –1.5.
z-score = (7 – 10)/2 = –1.5.
[Figure: number line marked at 6, 8, 10, 12, 14, with 7 lying 1.5 s.d. below the mean]
32
Sample z-score
\[ z = \frac{x - \bar{x}}{s} \]
Example: If the mean of a random sample is 5 and the standard deviation is 2, what would be the sample z-score of the value 6?
\[ \bar{x} = 5, \quad s = 2, \quad x = 6 \]
\[ z = \frac{6 - 5}{2} = \frac{1}{2} = 0.5 \]
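Both z-score examples follow the same one-line computation. A minimal sketch, with `z_score` as an illustrative function name:

```python
def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the center."""
    return (x - center) / spread

print(z_score(7, 10, 2))   # population example: mean 10, s.d. 2
print(z_score(6, 5, 2))    # sample example: sample mean 5, s = 2
```

The sign carries the direction: a negative z-score means the value is below the mean.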
33
Example: Bone Mineral Density
The WHO Working Group defines osteoporosis according to measurements of bone mineral density (BMD) using dual-energy X-ray absorptiometry (DEXA).
Osteoporosis is defined as a bone density 2.5 or more standard deviations below the normal peak value for young adults (a T score at or below −2.5).
34
These criteria were initially established for the assessment of osteoporosis in Caucasian women.
BMD reports may include a “Z score” which is the number of standard deviations by which the subject of interest differs from the mean for their age.
DEXA BMD Values                                          Definition
T score > –1.0 SD                                        Normal bone mineral density
T score between –1.0 and –2.5 SD                         Osteopaenia
T score < –2.5 SD                                        Osteoporosis
T score < –2.5 SD with 1 or more fragility fractures     Severe osteoporosis
35
Quartiles: (Measure of Position)
• The first quartile, Q1, or 25th percentile, is the median of the lower half of the list of ordered observations.
• The third quartile, Q3, or 75th percentile, is the median of the upper half of the list of ordered observations.
36
Example: [odd number of data values] (n = 21)
60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75
Q1 = 64.5 Median = 69 Q3 = 72
Measure of spread: Interquartile range (IQR) = Q3 − Q1
IQR = 72 – 64.5 = 7.5
37
Example: [even number of data] (n = 22)
6, 60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75
Q1 = 64 Median = 68 Q3 = 72
Measure of spread: Interquartile range (IQR) = Q3 − Q1
IQR = 72 - 64 = 8
38
The five-number summary
• Minimum value
• Q1
• Median
• Q3
• Maximum value
39
Example: (data sheet without outlier “6”)
60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75
Min = 60, Q1 = 64.5, Median = 69, Q3 = 72, Max = 75.
[Figure: box plot of HEIGHT, N = 21]
40
With 6 in the data:
6, 60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75
Q1 = 64 Median = 68 Q3 = 72
IQR = 72 - 64 = 8
[Figure: box plot of HEIGHT, N = 22, with case 1 marked as an outlier]
41
Inner and outer fences for outliers
• The inner fences are located at a distance of 1.5 IQR below Q1 (lower inner fence = Q1 − 1.5 × IQR) and at a distance of 1.5 IQR above Q3 (upper inner fence = Q3 + 1.5 × IQR).
• The outer fences are located at a distance of 3 IQR below Q1 (lower outer fence = Q1 − 3 × IQR) and at a distance of 3 IQR above Q3 (upper outer fence = Q3 + 3 × IQR).
42
• The inner fences are located at a distance of 1.5 IQR below Q1 (lower inner fence = 64 − 1.5 × 8 = 52) and at a distance of 1.5 IQR above Q3 (upper inner fence = 72 + 1.5 × 8 = 84).
• The outer fences are located at a distance of 3 IQR below Q1 (lower outer fence = 64 − 3 × 8 = 40) and at a distance of 3 IQR above Q3 (upper outer fence = 72 + 3 × 8 = 96).
IQR = 72 – 64 = 8; Q1 = 64; Q3 = 72
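The quartile and fence calculations can be sketched as follows, using the "median of the lower half / median of the upper half" definition of Q1 and Q3 given earlier. The function names are illustrative, and statistical packages may use slightly different quartile conventions, so their answers can differ a little from these.

```python
from statistics import median

def quartiles(data):
    """Q1, median, Q3, with Q1 and Q3 as medians of the lower and upper halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + n % 2:]   # for odd n the middle value is in neither half
    return median(lower), median(ordered), median(upper)

def fences(q1, q3):
    """Inner fences (1.5 IQR) and outer fences (3 IQR) around the quartiles."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr), (q1 - 3 * iqr, q3 + 3 * iqr)

def classify(x, inner, outer):
    if x < outer[0] or x > outer[1]:
        return "extreme outlier"
    if x < inner[0] or x > inner[1]:
        return "mild outlier"
    return "not an outlier"

heights = [6, 60, 61, 63, 64, 64, 65, 65, 65, 66, 67,
           69, 71, 71, 71, 72, 72, 72, 72, 73, 74, 75]
q1, med, q3 = quartiles(heights)
inner, outer = fences(q1, q3)
print(q1, med, q3, inner, outer, classify(6, inner, outer))
```

With the fences at 52 and 84 (inner) and 40 and 96 (outer), the recorded height of 6 falls below the lower outer fence, so by the definitions on the slides it counts as an extreme outlier.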
43
[Figure: box plot of HEIGHT, N = 22, with case 1 marked as an outlier and the inner fences drawn]
Q1 = 64; Q3 = 72; IQR = 72 – 64 = 8
LIF (lower inner fence): 64 - 1.5 x 8 = 52
UIF (upper inner fence): 72 + 1.5 x 8 = 84
44
[Figure: box plot of HEIGHT, N = 22, with case 1 marked as an outlier and both the inner and outer fences drawn]
LOF (lower outer fence): 64 - 3 x 8 = 40
UOF (upper outer fence): 72 + 3 x 8 = 96
45
Mild and Extreme outliers
Data values falling between the inner and outer fences are considered mild outliers.
Data values falling outside the outer fences are considered extreme outliers.
When outliers exist, the whiskers extend to the smallest and largest data values within the inner fences.
46
Side-by-side Box Plot
[Figure: side-by-side box plots of HEIGHT by sex (Female, Male)]
47
Remarks:
If the distribution of the data is symmetric, then the mean and median will be about the same.
The five-number summary is best for non-symmetric data.
The median, quartiles, and interquartile range are not strongly influenced by outliers.
The mean and standard deviation are most appropriate only if the data are symmetric, because both of these measures are easily influenced by outliers.
48
Boxplot
For the following data:
13 72 78 40 50 56 50 52 57 69 130 142 51 52
Find the five-number summary & IQR
Make a boxplot
Find the 60th percentile.
Probability (Short Version)
Probability - 1
1
Probability and Counting Rules
A researcher claims that 10% of a large population have disease H.
A random sample of 100 people is taken from this population and examined.
If 20 people in this random sample have the disease, what does it mean? How likely would this happen if the researcher is right?
2
Sample Space and Probability
Random Experiment: (Probability Experiment) an experiment whose outcomes depend on chance.
Sample Space (S): collection of all possible outcomes in a random experiment.
Event (E): a collection of outcomes of interest in a random experiment.
3
Sample Space and Event
Sample Space:
S = {Head, Tail}
S = {Life span of a human} = {x | x ≥ 0, x ∈ R}
Event:
E = {Head}
E = {Life span of a human is less than 3 years}
4
A Simple Example
What’s the probability of getting a head on the toss of a single fair coin? Use a scale from 0 (no way) to 1 (sure thing).
So toss a coin twice. Do it! Did you get one head & one tail? What’s it all mean?
5
Definition of Probability
A rough definition: (frequentist definition)
The probability of a certain outcome in a random experiment is the proportion of times that this outcome would occur in a very long series of repetitions of the random experiment.
[Figure: plot of Total Heads / Number of Tosses (axis from 0.00 to 1.00) against Number of Tosses (axis from 0 to 125)]
6
Determining Probability
How to determine probability?
Empirical Probability
Theoretical Probability
(Subjective approach)
7
Empirical Probability Assignment
Empirical study: (Don’t know if it is a balanced coin?)
Outcome   Frequency
Head         512
Tail         488
Total       1000
8
Empirical Probability Assignment
Empirical probability assignment:
\[ P(E) = \frac{\text{Number of times event } E \text{ occurs}}{\text{Number of times experiment is repeated}} = \frac{m}{n} \]
Probability of Head:
P(Head) = 512/1000 = .512 = 51.2%
9
Empirical Probability Distribution
Empirical study:
Outcome   Frequency   Probability
Head         512         .512
Tail         488         .488
Total       1000         1.0
Empirical Probability Distribution
10
Theoretical Probability Assignment
Make a reasonable assumption:
What is the probability distribution in tossing a coin?
Assumption:
We have a balanced coin!
11
Theoretical Probability Assignment
Theoretical probability assignment:
\[ P(E) = \frac{\text{Number of equally likely outcomes in event } E}{\text{Size of the sample space}} = \frac{n(E)}{n(S)} \]
Probability of Head:
P(Head) = 1/2 = .5 = 50%
12
Theoretical Probability Distribution (Model)
Outcome   Probability
Head         .50
Tail         .50
Total        1.0
Theoretical Probability Distribution
13
Relative Frequency and Probability Distributions
Number of times visited a doctor from a random sample of 300 individuals from a community
Class   Frequency   Relative Frequency
  0        54            .18
  1       117            .39
  2        72            .24
  3        42            .14
  4        12            .04
  5         3            .01
Total     300           1.00
P(0) = .18
P(1) = .39
P(2) = .24
P(3) = .14
P(4) = .04
P(5) = .01
14
Relative Frequency Distribution
[Figure: bar chart of relative frequency (axis from 0 to 0.5) for 0 to 5 visits]
Discrete Distribution
15
Relative Frequency and Probability
When selecting one individual at random from a population, the probability distribution and the relative frequency distribution are the same.
16
Probability for the Discrete Case
If an individual is randomly selected from this group of 300, what is the probability that this person visited a doctor 3 times?
Class   Frequency   Relative Frequency
  0        54            .18
  1       117            .39
  2        72            .24
  3        42            .14
  4        12            .04
  5         3            .01
Total     300           1.00
P(3 times) = 42/300 = .14 or 14%
17
Discrete Distribution
If an individual is randomly selected from this group of 300, what is the probability that this person visited a doctor 4 or 5 times?
Class   Frequency   Relative Frequency
  0        54            .18
  1       117            .39
  2        72            .24
  3        42            .14
  4        12            .04
  5         3            .01
Total     300           1.00
P(4 or 5 times) = P(4) + P(5) = .04 + .01 = .05
It would be an empirical probability distribution if the sample of 300 individuals is used for understanding a large population.
18
Properties of Probability
• Probability is always a value between 0 and 1.
• Total probability (all outcomes together) equals 1.
• Probability of either one of the disjoint events A or B to occur is the sum of their individual probabilities. P(A or B) = P(A) + P(B)
19
Complementation Rule
For any event E,
P(E does not occur) = 1 – P(E)
Complement of E = Ē
* Some places use E^c or E′
[Venn diagram: event E with probability P(E), and its complement Ē with probability P(Ē)]
20
Complementation Rule
If an unbalanced coin has a probability of 0.7 of turning up Head on each toss, what is the probability of not getting a Head on a random toss?
P(not getting Head) = 1 – 0.7 = 0.3
21
Complementation Rule
If the chance of a randomly selected individual living in community A to have disease H is .001, what is the probability that this person does not have disease H?
P(having disease H) = .001
P(not having disease H) = 1 – P(having disease H) = 1 – 0.001 = 0.999
22
Birthday Problem
In a group of 23 randomly selected people, what is the probability that at least two people have the same birth date? (Assume there are 365 days in a year.)
P(at least two people have the same birth date) [Too hard to compute directly!]
= 1 – P(everybody has a different birth date)
= 1 – [365 × 364 × … × (365 − 23 + 1)] / 365^23
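The complement expression can be evaluated numerically. The sketch below multiplies the factors 365/365, 364/365, and so on, one person at a time; the function name and the `days` parameter are illustrative, and it keeps the slide's assumption of 365 equally likely birth dates.

```python
def p_shared_birthday(n, days=365):
    """P(at least two of n people share a birth date), via the complement rule."""
    p_all_different = 1.0
    for i in range(n):
        # Person i + 1 must avoid the i birth dates already taken.
        p_all_different *= (days - i) / days
    return 1 - p_all_different

p23 = p_shared_birthday(23)
print(round(p23, 4))
```

For 23 people the probability comes out slightly above one half, which is why this problem is a standard illustration of the complementation rule.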
23
Intersection of events: A ∩ B <=> A and B
Union of events: A ∪ B <=> A or B
Venn Diagram (with elements listed)
S = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}
B = {3, 6}
Example: A ∩ B = {3}
Example: A ∪ B = {1, 2, 3, 6}
24
Venn Diagram (with counts)
Given total of 100 subjects
A = Smokers, n(A) = 50
B = Lung Cancer, n(B) = 25
Joint Event: A ∩ B, n(A ∩ B) = 20
[Venn diagram with counts: A only = 30, A ∩ B = 20, B only = 5, neither = 45]
n(A ∪ B) = ? 55
25
Venn Diagram (with relative frequencies)
Given a sample space
A = Smokers, P(A) = .50
B = Lung Cancer, P(B) = .25
Joint Event: A ∩ B, P(A ∩ B) = .20
[Venn diagram with relative frequencies: A only = .3, A ∩ B = .20, B only = .05, neither = .45]
P(A ∪ B) = ? .55
26
Contingency Table
                   Cancer, B   No Cancer, B^c   Total
Smoke, A              20            30            50
Not Smoke, A^c         5            45            50
Total                 25            75           100
[Venn diagram: events A and B]
27
Conditional Probability
The conditional probability of event A occurring given that event B has occurred (or given the condition B) is denoted P(A|B) and, if P(B) is not zero, is
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]
or
\[ P(A \mid B) = \frac{n(A \cap B)}{n(B)} \]
where n(E) = # of equally likely outcomes in E.
[Venn diagram: events A and B]
28
Conditional Probability
                  Cancer, C   No Cancer, C’   Total
Smoke, S             20            30           50
Not Smoke, S’         5            45           50
Total                25            75          100
\[ P(A \mid B) = \frac{n(A \cap B)}{n(B)} \]
P(C|S) = 20/50 = .4
P(C|S’) = 5/50 = .1
29
Conditional Probability
                  Cancer, C            No Cancer, C’         Total
Smoke, S          20 (.2)              30 (.3)               50, P(S) = (.5)
Not Smoke, S’      5 (.05)             45 (.45)              50, P(S’) = (.5)
Total             25, P(C) = (.25)     75, P(C’) = (.75)     100 (1.0)
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]
P(C|S) = .2/.5 = .4
P(C|S’) = .05/.5 = .1
What is P(C|S) / P(C|S’)? 4 (Relative Risk)
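The conditional probabilities and the relative risk can be computed directly from the cell counts. This sketch uses exact fractions; the dictionary keys and the helper name `conditional` are illustrative.

```python
from fractions import Fraction

# Cell counts keyed by (smoking status, cancer status), from the contingency table.
table = {
    ("S", "C"): 20, ("S", "C'"): 30,
    ("S'", "C"): 5, ("S'", "C'"): 45,
}

def conditional(table, outcome, given):
    """P(outcome | given) = n(outcome and given) / n(given)."""
    joint = table[(given, outcome)]
    marginal = sum(count for (row, _), count in table.items() if row == given)
    return Fraction(joint, marginal)

p_c_given_s = conditional(table, "C", "S")
p_c_given_not_s = conditional(table, "C", "S'")
relative_risk = p_c_given_s / p_c_given_not_s
print(p_c_given_s, p_c_given_not_s, relative_risk)
```

Note the contrast with the odds ratio of 6 computed earlier from the same table: relative risk compares probabilities (.4 against .1), while the odds ratio compares odds.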
30
Independent Events
Events A and B are independent if
P(A|B) = P(A)
or P(B|A) = P(B)
or P(A and B) = P(A) · P(B)
31
Example
If a balanced die is rolled twice, what is the probability of having two 6’s?
6₁ = the event of getting a 6 on the 1st trial
6₂ = the event of getting a 6 on the 2nd trial
P(6₁) = 1/6,
P(6₂) = 1/6, 6₁ and 6₂ are independent events
P(6₁ and 6₂) = P(6₁) P(6₂) = (1/6)(1/6) = 1/36
32
Independent Events
“10%” of the people in a large population have disease H. If a random sample of two subjects is selected from this population, what is the probability that both subjects have disease H?
Hi : Event that the i-th randomly selected subject has disease H.
P(H2|H1) ≈ P(H2) [Events are almost independent]
P(H1 ∩ H2) ≈ P(H1) P(H2) = .1 x .1 = .01
33
Independent Events
If events A1, A2, …, Ak are independent, then
P(A1 and A2 and … and Ak) = P(A1) · P(A2) · … · P(Ak)
What is the probability of getting all heads when tossing a balanced coin four times?
P(H1) · P(H2) · P(H3) · P(H4) = (.5)^4 = .0625
34
Binomial Probability
What is the probability of getting two 6’s when casting a balanced die 5 times?
(S = a 6 is rolled, S’ = a 6 is not rolled)
P(S∩S∩S’∩S’∩S’) = (1/6)² × (5/6)³ = 0.016
P(S∩S’∩S∩S’∩S’) = (1/6)² × (5/6)³
P(S∩S’∩S’∩S∩S’) = (1/6)² × (5/6)³
···
How many of them?
\[ \binom{5}{2} = \frac{5!}{2! \cdot 3!} = 10 \]
Probability (two 6’s) = 0.016 × 10 = 0.16
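The counting step and the final probability can be checked with `math.comb` from the Python standard library. A minimal sketch, with `binomial_probability` as an illustrative function name:

```python
from math import comb

def binomial_probability(n, k, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

arrangements = comb(5, 2)                  # orderings of two 6's among five rolls
one_arrangement = (1 / 6) ** 2 * (5 / 6) ** 3
p_two_sixes = binomial_probability(5, 2, 1 / 6)
print(arrangements, round(one_arrangement, 3), round(p_two_sixes, 2))
```

Every arrangement has the same probability because the rolls are independent, so the total is just that probability times the number of arrangements.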
35
Multiplication Rule 2 (General Multiplication Rule)
For any two events A and B,
P(A and B) = P(A|B) P(B) = P(B|A) P(A)
since
P(B|A) = P(A and B) / P(A)
P(A|B) = P(A and B) / P(B)
36
Multiplication Rule 2
If in the population 50% of the people smoked, and 40% of the smokers have lung cancer, what percentage of the population are smokers and have lung cancer?
P(S) = .5 (50% of the subjects smoked)
P(C|S) = .4 (40% of the smokers have cancer)
P(C and S) = P(C|S) P(S) = .4 x .5 = .2