
Biostatistics Introduction & Data Presentation (Short Version)

Intro - 1

1

Biostatistics

Andy Chang, Youngstown State University

2

Statistics in a Broader Sense

Statistics is a field of study concerned with

1) data collection, [Producing Data]

2) organization, summarization, and examination of data, providing an overview of the general features of the data, [Exploring Data]

3) and the drawing of inferences about a body of data (population) based on the properties of a part of the data (sample) observed. [Statistical Inference]

3

In health and medical (or clinical) studies, researchers investigate a sample of subjects to understand the effectiveness of a treatment or an intervention on a target population.

Therefore, statistical inference is very important in health and medical research.

4

Public health is fundamentally concerned with preventing disease, disability, and premature death in a human population or community.

Therefore, statistical inference is very important in public health research.

Goal 2 of the Healthy People 2010: Eliminate Health Disparities

5

Basic Terms in Statistics

Individuals (subjects, experimental units): the entities on which data are collected.

Variable: a characteristic of interest that takes on different values for different individuals.

Purpose of Statistics

• Examine a variable
• Examine the correlation between two or more variables

6

Variable Types

Quantitative Variables (numeric) [height, number of subscriptions, ...]
– Continuous: a variable that has an uncountable number of possible values. (measurements)
– Discrete: a variable that has a countable number of possible values. (counts)

Qualitative (Categorical) Variables [hair color, gender, ...]



7

Measurement Scales

• Nominal: data that consist of labels, names, or categories.
• Ordinal: data for which the order or rank is meaningful.
• Interval: numerical data for which differences between values are meaningful.
• Ratio: data for which the ratio of two values is meaningful.

8

Producing Data

9

Data in Public Health:

• Vital Statistics and the Census
• Public Health Surveillance
• Survey
• Registries
• Epidemic Investigations
• Research
• Program Evaluations

10

Exploratory Data Analysis

Descriptive Statistics

Data Presentation
– Grouping tables
– Graphical summary

Numerical Summary
– Center
– Dispersion (Spread)

11

Data Presentation

What type of statistical technique is appropriate for data presentation?

• Categorical variable?
• Quantitative variable?

12

Data Sheet (Raw data)

ID   Height(in)   Weight(lb)   BirthMonth   Exp.   Gender
1    6            135          4            H      F
2    63           119          9            H      F
3    72           175          11           T      M
4    60           106          9            H      F
5    65           135          8            T      F
6    72           170          10           H      M
7    64           180          8            H      F
8    71           205          10           H      M
...


(A complete list)

ID   Height   Weight   BirthMonth   Exp.   Gender
1    6        135      4            H      F
2    63       119      9            H      F
3    72       175      11           T      M
4    60       106      9            H      F
5    65       135      8            T      F
6    72       170      10           H      M
7    64       180      8            H      F
8    71       205      10           H      M
9    75       195      6            T      M
10   71       185      8            H      M
11   71       182      6            T      M
12   65       108      8            T      F
13   73       150      4            H      M
14   67       128      6            T      F
15   74       175      6            H      M
16   66       160      9            H      F
17   65       143      9            T      F
18   72       190      11           T      M
19   64       180      2            H      M
20   61       195      5            T      M
21   72       220      7            H      M
22   69       285      7            H      M

14

Grouping and Displaying A Categorical Variable

15

Frequency Table and Charts (One Categorical Variable)

Class    Frequency   Relative Frequency
Female   9           9/22 = .409 = 40.9%
Male     13          13/22 = .591 = 59.1%
Total    22          100%

[Figure: bar chart of percent by sex (vertical axis 0 to 70): Female 40.9%, Male 59.1%]

16

Pareto chart

[Figure: Pareto chart of counts and cumulative percent for the survey item "How do you describe yourself"]

* Bars are arranged according to their frequencies.

17

Grouping and Displaying Quantitative Data

18

Frequency Distribution Table

Data: 135, 119, 175, 106, 135, 170, 180, 205, 195, …

Class        Tally   Frequency
100 - <120
120 - <140
140 - <160
160 - <180
180 - <200
200 - <220
220 - <240
240 - <260
260 - <280
280 - <300
Total

("280 - <300" means 280 to less than 300.)


Data: 135, 119, 175, 106, 135, 170, 180, 205, 195, …

Class        Frequency
100 - <120   3
120 - <140   3
140 - <160   2
160 - <180   4
180 - <200   7
200 - <220   1
220 - <240   1
240 - <260   0
260 - <280   0
280 - <300   1
Total        22

("280 - <300" means 280 to less than 300.)


(From data sheet)

Class        Frequency   Relative Freq.   Cumulative R.F.
100 - <120   3           3/22 = .136      3/22
120 - <140   3           3/22 = .136      6/22
140 - <160   2           2/22 = .091      8/22
160 - <180   4           4/22 = .182      12/22
180 - <200   7           7/22 = .318      19/22
200 - <220   1           1/22 = .045      20/22
220 - <240   1           1/22 = .045      21/22
240 - <260   0           0/22 = .000      21/22
260 - <280   0           0/22 = .000      21/22
280 - <300   1           1/22 = .045      22/22
Total        22          1.000

21

Classes: Categories for grouping data.

Frequency (class frequency): The number of data values in a class.

Relative frequency: The ratio of the frequency of a class to the total number of pieces of data.

Frequency distribution: A listing of classes and their frequencies.

Relative frequency distribution: A listing of classes and their relative frequencies.

Upper class limit: The largest value that can go in a class.

Lower class limit: The smallest value that can go in a class.

Class width: The difference between the lower class limit of the given class and the lower class limit of the next higher class.

Class midpoint (class mark): The midpoint of a class.

Guidelines for grouping data: (for a quantitative variable)

• There should be between five and twenty classes.
• Each piece of data must belong to one, and only one, class. (Mutually Exclusive)
• Whenever feasible, all classes should have the same width.

23

To build a Frequency Table:

• Find the range of the data: Range = Largest value – smallest value
• Use the range and try different class widths to determine how many classes you need to make the frequency table or histogram.

Student data example:

Range = 285 – 106 = 179, and 179/20 ≈ 9.
With a class width of 20, there will be about 9 classes, which is good.
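The grouping steps can be sketched in Python. This is only an illustration: the weights are copied from the data sheet, and the function name `frequency_table` and its parameters are made up for this example. Classes are taken as "lower to less than upper", starting at 100 with width 20.

```python
# Weights (lb) of the 22 students, copied from the data sheet.
weights = [135, 119, 175, 106, 135, 170, 180, 205, 195, 185, 182,
           108, 150, 128, 175, 160, 143, 190, 180, 195, 220, 285]

def frequency_table(data, start, width, n_classes):
    """Group data into classes [lower, lower + width) and return one row per class.

    Each row is (lower, upper, frequency, relative frequency, cumulative R.F.).
    """
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)  # index of the class that contains x
        counts[k] += 1
    rows, cumulative = [], 0
    for k, freq in enumerate(counts):
        cumulative += freq
        lower = start + k * width
        rows.append((lower, lower + width, freq,
                     freq / len(data), cumulative / len(data)))
    return rows

table = frequency_table(weights, start=100, width=20, n_classes=10)
for lower, upper, freq, rel, cum in table:
    print(f"{lower} - <{upper}: {freq:2d}  {rel:.3f}  {cum:.3f}")
```

Trying a different `width` (for example 10 or 40) shows how the number of classes changes the picture of the data.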

24


(From data sheet)

Class        Frequency   Relative Freq.   Cumulative R.F.
100< - 120   3           3/22 = .136      3/22
120< - 140   3           3/22 = .136      6/22
140< - 160   3           3/22 = .136      9/22
160< - 180   5           5/22 = .227      14/22
180< - 200   5           5/22 = .227      19/22
200< - 220   2           2/22 = .091      21/22
220< - 240   0           0/22 = .000      21/22
240< - 260   0           0/22 = .000      21/22
260< - 280   0           0/22 = .000      21/22
280< - 300   1           1/22 = .045      22/22
Total        22          1.000


[Figure: Histogram (SPSS)]

[Figure: Polygon (SPSS)]

[Figure: Cumulative R. F. Histogram; horizontal axis 100 to 300 in steps of 20, vertical axis up to 100%]

[Figure: Cumulative R. F. Polygon (Ogive); horizontal axis 90 to 290 in steps of 20, vertical axis up to 100%]

30

What to observe in Histograms?

• Outliers: observations that stand out from the rest for some reason.
• Center: the "middle" of the data.
• Spread: the range; the extent of the data; how far the values are from each other.
• Shape: distribution pattern. [Skewness, symmetry, uniform, Normal, ...]


[Figure: example histogram shapes: symmetric (bell) shape; uniform; skewed to the left, or negatively skewed; bimodal; skewed to the right, or positively skewed]

32

Histogram & Density Curve

Density function, f(x): a smooth curve that describes the distribution.

Use a mathematical model to describe the variable.

[Figure: histogram (percent) with a density curve overlaid]

33

Stemplots (or Stem-and-leaf plots)

-- leading digits are called stems
-- final digits are called leaves

34

Example: (number of hysterectomies performed by 15 male doctors)

27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28

Stemplot

2 | 75508
3 | 31764
4 | 4
5 | 09
6 |
7 |
8 | 65

35

Example: (number of hysterectomies performed by 15 male doctors)

27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28

Ordered Stemplot

2 | 05578
3 | 13467
4 | 4
5 | 09
6 |
7 |
8 | 56

Example: [Back-to-back stemplot]

Number of hysterectomies performed by 15 male doctors:
27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28

By 10 female doctors, the numbers are:
5, 7, 10, 14, 18, 19, 25, 29, 31, 33

(Female)       (Male)
      75 | 0 |
    9840 | 1 |
      95 | 2 | 05578
      31 | 3 | 13467
         | 4 | 4
         | 5 | 09
         | 6 |
         | 7 |
         | 8 | 56
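As a rough illustration, an ordered stemplot for two-digit data can be built by splitting each value into its tens digit (stem) and units digit (leaf). The function name `stemplot` is an assumption made for this sketch, not something from the slides.

```python
from collections import defaultdict

def stemplot(data):
    """Return the lines of an ordered stemplot for non-negative integers below 100."""
    leaves = defaultdict(list)
    for x in sorted(data):
        leaves[x // 10].append(x % 10)  # stem = tens digit, leaf = units digit
    lines = []
    # Walk every stem from smallest to largest so that empty stems are kept.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {leaf_text}")
    return lines

male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
lines = stemplot(male)
print("\n".join(lines))
```

Keeping the empty stems 6 and 7 matters: they show the gap between the bulk of the data and the two large values 85 and 86.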


Box Plot

[Figure: box plot of HEIGHT (N = 21); vertical axis from 50 to 80]

38

Examine Bivariate Data (Bivariate Analysis)

Examine the relation between two variables

39

Two Categorical Variables

Contingency Table

               Cancer     No cancer   Row Total
Smoker         20 (20%)   30 (30%)    50 (50%)
Non-Smoker     5 (5%)     45 (45%)    50 (50%)
Column Total   25 (25%)   75 (75%)    100

Odds of a smoker having cancer: 20/30 = 6/9
Odds of a nonsmoker having cancer: 5/45 = 1/9
Odds Ratio = (6/9)/(1/9) = 6
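The odds and odds ratio from the contingency table can be reproduced with a few lines of Python. The variable and function names here are illustrative assumptions; only the four counts come from the table.

```python
# Counts from the contingency table (rows: smoking status, columns: cancer status).
smoker_cancer, smoker_no_cancer = 20, 30
nonsmoker_cancer, nonsmoker_no_cancer = 5, 45

def odds(events, non_events):
    """Odds = number with the outcome divided by number without it."""
    return events / non_events

odds_smoker = odds(smoker_cancer, smoker_no_cancer)           # 20/30
odds_nonsmoker = odds(nonsmoker_cancer, nonsmoker_no_cancer)  # 5/45
odds_ratio = odds_smoker / odds_nonsmoker

print(f"odds (smoker)    = {odds_smoker:.3f}")
print(f"odds (nonsmoker) = {odds_nonsmoker:.3f}")
print(f"odds ratio       = {odds_ratio:.1f}")
```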

40

Two Categorical Variables

[Figure: cluster bar chart]

41

Two Quantitative Variables

Data:

Temperature   Mortality Index
34            52
40            68
42            63
42            83
43            72
44            81
45            89
46            77
47            88
48            94
49            86
50            95
51            105
51            100
52            102

42

Two Quantitative Variables

[Figure: scatterplot of Mortality Index (50 to 110) against Average Temperature (30 to 60)]

Average annual temperature and the mortality index for a type of breast cancer in women in a certain region of Europe.



43

Response/Explanatory Variables

Response (Dependent, Outcome) Variable
– Lung Cancer, Mortality Index

Explanatory (Independent, Predictor) Variable
– Smoking, Average Temperature

44

A Categorical and a Quantitative Variable

Side-by-side Boxplot

[Figure: side-by-side boxplots of HEIGHT by sex (Female, Male); vertical axis from 50 to 80]

45

Time Plot

Time   Rate
1995   15.2
1996   15.1
1997   14.9
1998   16.2
1999   14.3
2000   13.2
2001   13.5

[Figure: time plot of Rate against Time, 1995 to 2001]

46

Time Plot

[Figure: Youngstown Homicide Rate by Year; HRATE (0 to 100) against YEAR (1969 to 1995)]

47

Time Plot: Different Scales

[Figure: National Homicide Rate by Year (1964 to 1992), drawn twice: once with the vertical axis from 0 to 100 and once with the vertical axis from 0 to 12]

Misleading Chart: Different Scales

[Figure: bar chart of Count by Gender (Female, Male), drawn twice: once with the vertical axis from 210 to 270 and once with the vertical axis from 0 to 300]



49

Incorrect and Misleading Chart

50

Types of Statistical Studies

Observational Study: conditions to which subjects are exposed are not controlled by the investigator. (No attempt is made to control or influence the variables of interest.)

Experimental (Controlled) Study: conditions to which subjects are exposed are controlled by the investigator. (Treatments are used in order to observe the response.) (Randomization, Replications)

51

Results from observing behavior and outcomes from the use of medicine for 200 randomly selected patients. (Patients chose their medicine.)

             Hypertension
Treatment    Yes    No     Total
Drug A       44     56     100
Drug B       29     71     100
Total        73     127    200

Drug A: 44/100 = 44%
Drug B: 29/100 = 29%

52

Simpson’s Paradox

             Hypertension
             Below 65               65+
Treatment    Yes    No    Total     Yes    No    Total
Drug A       5      18    23        39     38    77
Drug B       17     60    77        12     11    23

Below 65: Drug A: 5/23 = 22%; Drug B: 17/77 = 22%
65+: Drug A: 39/77 = 51%; Drug B: 12/23 = 52%

* Older patients prefer Drug A
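A short Python sketch makes the paradox concrete: within each age group the two drugs have nearly the same hypertension rate, yet the pooled rates differ sharply because most Drug A patients are in the older group. The dictionary layout and function names are assumptions for illustration; the counts are those in the hypertension tables.

```python
# (yes, total) hypertension counts by drug and age group, taken from the tables.
counts = {
    "Drug A": {"below 65": (5, 23), "65+": (39, 77)},
    "Drug B": {"below 65": (17, 77), "65+": (12, 23)},
}

def rate(yes, total):
    """Proportion of patients with hypertension."""
    return yes / total

def overall_rate(groups):
    """Pool the age groups and compute the combined proportion."""
    yes = sum(y for y, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return rate(yes, total)

for drug, groups in counts.items():
    by_age = {age: f"{rate(y, n):.0%}" for age, (y, n) in groups.items()}
    print(drug, f"overall {overall_rate(groups):.0%}", by_age)
```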

53

Confounding variables

[Diagram: Treatment 1 & Treatment 2 → Patient’s Survival (Cause?), with Patient’s Age & Health Condition linked to both]

54

Confounding Effect

Variables, whether part of a study or not, are said to be confounded when their effects on the outcome cannot be distinguished from each other.

Age may affect the reaction to a drug and may also affect the decision about which drug to choose.

Descriptive Statistics (Short Version)

Descriptive Stat - 1

1

Example: Birth weights (in lb) of 5 babies born from two groups of women under different care programs.

Group 1: 7, 6, 8, 7, 7
Group 2: 3, 4, 8, 9, 11

2

Describe Distribution with Numbers

Numerical Summary Measures

• Measure of Center
• Measure of Variation
• Measure of Position

3

Measure of Central Tendency

Mean: the average value of the data.

If n observations are denoted by x_1, x_2, ..., x_n, their (sample) mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

4

Example: Birth weights (in lb) of 5 babies born from a group of women under a certain diet.

7, 6, 8, 7, 7

Sol:

$$\text{mean} = \frac{7 + 6 + 8 + 7 + 7}{5} = \frac{35}{5} = 7$$

[near the center of the data set]

5

Median: the median of a data set is

• the data value exactly in the middle of its ordered list if the number of pieces of data is odd,
• the mean of the two middle data values in its ordered list if the number of pieces of data is even.

[median is not influenced by outliers and is best for non-symmetric distribution]

6

Example: (number of hysterectomies performed by 15 doctors)

27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28

ordered list => 20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86

median = 34



7

Example: (Birth weights for 6 infants.)

5, 7, 6, 8, 5, 9

ordered list => 5, 5, 6, 7, 8, 9

median = (6+7) / 2 = 6.5

8

Mode: of a data set is the observation that occurs most frequently.

9

Example 1: (number of times visited class website by 15 students)

27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28

ordered list => 20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86

Mode = 25

Example 2: (Blood type of 15 students)

A, B, A, A, O, AB, A, A, B, B, O, O, A, A, A

A – 8
B – 3
O – 3
AB – 1

Mode = A
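The three measures of center can be checked with Python's standard `statistics` module. This is only a verification sketch using the example data from the slides; the variable names are chosen for illustration.

```python
import statistics

diet_weights = [7, 6, 8, 7, 7]                      # birth weights (lb)
hysterectomies = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
infant_weights = [5, 7, 6, 8, 5, 9]                 # even number of values
blood_types = ["A", "B", "A", "A", "O", "AB", "A", "A",
               "B", "B", "O", "O", "A", "A", "A"]

print("mean   =", statistics.mean(diet_weights))
print("median =", statistics.median(hysterectomies))   # odd n: middle value
print("median =", statistics.median(infant_weights))   # even n: mean of two middle values
print("mode   =", statistics.mode(hysterectomies))
print("mode   =", statistics.mode(blood_types))        # mode also works for categories
```

Note that the mode is the only one of the three that applies to a categorical variable such as blood type.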

Skewed to the Right

[Figure: right-skewed distribution; where do the mean, median, and mode fall?]

11

Measure of Dispersion (Variability)

Does the mother’s diet program affect the birth weights of babies?

Sample from group I (diet program I): 7, 6, 8, 7, 7
=> mean = (7 + 6 + 8 + 7 + 7) / 5 = 35/5 = 7

Sample from group II (diet program II): 3, 4, 8, 9, 11
=> mean = (3 + 4 + 8 + 9 + 11) / 5 = 35/5 = 7

Range = largest data value − smallest data value

12

Is there any difference between the two samples?

range of sample I = 8 - 6 = 2

range of sample II = 11 - 3 = 8



13

Variance and Standard Deviation

Measure the spread of the data around the center of the data.

14

Example: Birth weights (in lb) of 5 babies born from a group of women under diet program II.

3, 4, 8, 9, 11 ⇒ mean = $\bar{x}$ = 7

Data Value, x_i   Deviation from mean, x_i − x̄   Squared Dev., (x_i − x̄)²
3                 3 – 7 = – 4                     16
4                 4 – 7 = – 3                     9
8                 8 – 7 = 1                       1
9                 9 – 7 = 2                       4
11                11 – 7 = 4                      16
Total             0                               46

Sample Variance = 46/4 = 11.5 lb², Sample Standard Deviation = $\sqrt{46/4}$ = 3.39 lb.

A Short Cut formula:

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{291 - \frac{35^2}{5}}{4} = 11.5$$

Data, x   x²
3         9
4         16
8         64
9         81
11        121
Total 35  291
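The definition of the sample variance and the shortcut formula should give the same number. The sketch below checks this for the diet program II data; the function names are illustrative assumptions.

```python
import math

def variance_definition(xs):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_shortcut(xs):
    """Shortcut: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

diet_2 = [3, 4, 8, 9, 11]
s2 = variance_definition(diet_2)
print("definition:", s2)
print("shortcut:  ", variance_shortcut(diet_2))
print("s =", round(math.sqrt(s2), 2))
```

The shortcut needs only the two totals (35 and 291), which is convenient by hand, although with very large values it can lose precision in floating-point arithmetic.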

16

What is the standard deviation of the weights of babies from the sample of mothers who received diet program I?

Data: 7, 6, 8, 7, 7

s² = (0+1+1+0+0)/(5-1) = ½

s = $\sqrt{1/2}$ = 0.71

Does the mother’s diet program affect the birth weights of babies?

Diet I: mean = 7, s = 0.71
Diet II: mean = 7, s = 3.39

17

If n observations are denoted by x_1, x_2, ..., x_n, their mean, variance, and standard deviation are

Sample Mean:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Sample Variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

(unbiased estimator for variance of an infinite population.)

Sample Standard Deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

18

Population Parameters

If x_1, x_2, ..., x_N are all the observations in a finite population of size N, their mean μ, variance σ², and standard deviation σ are

Population Mean:

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Population Variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Population Standard Deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$



19

About s (sample standard deviation) :

• s measures the spread around the mean.
• The larger s is, the more spread out the data are.
• If s = 0, then all the observations must be equal.
• s is strongly influenced by outliers.

20

The Use of Mean and Standard Deviation

• Describe the distribution
• Understand the center and the spread of the distribution

21

Bone Density Data (Unit: mg/ml)

         Mean, x̄   Standard Deviation, s
Female   102.2      25
Male     118.4      20

22

Many distributions can be described by a mathematical function with specific parameters, such as mean and standard deviation.

Example: Normal Distribution (Bell-shaped)

[Figure: normal curve centered at μ with standard deviation σ]

23

Empirical Rule

Properties of a symmetric and bell-shaped (Normal) distribution:

• The distribution is symmetric about its mean (μ).
• About 68% of the area lies between μ − σ and μ + σ.
• About 95% of the area lies between μ − 2σ and μ + 2σ.
• About 99.7% of the area lies between μ − 3σ and μ + 3σ.

[Figure: normal curve marked at μ − 3σ, μ, and μ + 3σ]

Heart rates for a certain population under a certain condition follow a bell-shaped, symmetric distribution with mean 70 and standard deviation 2.

What percentage of people in this population will have heart rates between 66 and 74?

Since 66 = 70 − 2(2) and 74 = 70 + 2(2), the interval is within two standard deviations of the mean, so the answer is about 95%.
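The 68%, 95%, and 99.7% figures are rounded areas under the normal curve. As a check, the area within k standard deviations of the mean equals erf(k/√2), which Python's `math` module can evaluate. The function name `normal_area_within` is an assumption for this sketch.

```python
import math

def normal_area_within(k):
    """Area under a normal curve within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} s.d.: {normal_area_within(k):.4f}")

# Heart-rate example: mean 70, s.d. 2, interval 66 to 74.
mean, sd = 70, 2
k = (74 - mean) / sd          # both ends are 2 s.d. from the mean
print(f"between 66 and 74: about {normal_area_within(k):.0%}")
```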



25

Chebychev’s inequality

For any k > 1, at least 1 – (1/k²) of the data in a data set lie within k standard deviations of their mean.

Chebychev’s Rule

26

Example: Heart rates for asthmatic patients in a state of respiratory arrest have a mean of 140 beats per minute and a standard deviation of 35.5 beats per minute. What percentage of this population of patients have heart rates within two standard deviations of the mean?

It will be at least 75%, because k = 2, and 1 – (1/2²) = ¾ = 75%.

27

Heart rates example: mean = 140, s.d. = 35.5

k = 2

140 - 2 x 35.5 = 69
140 + 2 x 35.5 = 211

At least 75% = 1 − (1/2²) of the heart rates lie between 69 and 211.

28

What about within three standard deviations?

Heart rates example: mean = 140, s.d. = 35.5

k = 3

140 - 3 x 35.5 = 33.5
140 + 3 x 35.5 = 246.5

At least 89% ≈ 1 − (1/3²) of the heart rates lie between 33.5 and 246.5.
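The bound and the corresponding interval can be computed together. This is a small sketch with made-up function names, using the mean 140 and standard deviation 35.5 stated in the example.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def interval(mean, sd, k):
    """Endpoints of the interval mean ± k standard deviations."""
    return mean - k * sd, mean + k * sd

mean, sd = 140, 35.5
for k in (2, 3):
    low, high = interval(mean, sd, k)
    print(f"k = {k}: at least {chebyshev_bound(k):.0%} between {low} and {high}")
```

Unlike the Empirical Rule, this bound makes no assumption about the shape of the distribution, which is why it is weaker (75% instead of about 95% for k = 2).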

29

Measure of Position

Standard Score, Percentile, Quartile

30

Z-score (Standard Score)

Population z-score

If x is an observation from a distribution that has mean μ and standard deviation σ, the standardized value of x is

z-score of x:

$$z = \frac{x - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$$

The value μ + 3σ has a z-score of 3, since it is 3 s.d. above the mean.



31

If a distribution has a mean 10 and a s.d. 2, the value 7 has a z-score –1.5.

z-score = (7 – 10)/2 = – 1.5.

[Figure: number line from 6 to 14 with the value 7 marked 1.5 s.d. below the mean of 10]

32

Sample z-score

$$z = \frac{x - \bar{x}}{s}$$

Example: If the mean of a random sample is 5 and the standard deviation is 2, what would be the sample z-score of the value 6?

$$\bar{x} = 5,\quad s = 2,\quad x = 6$$

$$z = \frac{6 - 5}{2} = \frac{1}{2} = 0.5$$
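Both z-score examples follow the same arithmetic, which a one-line function captures. The name `z_score` and its parameter names are illustrative; the same function serves the population form (μ, σ) and the sample form (x̄, s).

```python
def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the center."""
    return (x - center) / spread

print(z_score(7, 10, 2))   # population example: mean 10, s.d. 2
print(z_score(6, 5, 2))    # sample example: mean 5, s.d. 2
```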

33

Example: Bone Mineral Density

The WHO Working Group defines osteoporosis according to measurements of bone mineral density (BMD) using dual-energy X-ray absorptiometry (DEXA).

Thus osteoporosis is defined as a bone density T score of –2.5 or lower, that is, 2.5 or more standard deviations below normal peak values for young adults.

34

These criteria were initially established for the assessment of osteoporosis in Caucasian women.

BMD reports may include a “Z score” which is the number of standard deviations by which the subject of interest differs from the mean for their age.

DEXA BMD Values                                        Definition
T score > –1.0 SD                                      Normal bone mineral density
T score between –1.0 and –2.5 SD                       Osteopaenia
T score < –2.5 SD                                      Osteoporosis
T score < –2.5 SD with 1 or more fragility fractures   Severe osteoporosis

35

Quartiles: (Measure of Position)

• The first quartile, Q1, or 25th percentile, is the median of the lower half of the list of ordered observations.

• The third quartile, Q3, or 75th percentile, is the median of the upper half of the list of ordered observations.

36

Example: [odd number of data values] (n = 21)

60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75

Q1 = 64.5, Median = 69, Q3 = 72

Measure of spread: Interquartile range (IQR) = Q3 − Q1

IQR = 72 – 64.5 = 7.5



37

Example: [even number of data values] (n = 22)

6, 60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75

Q1 = 64, Median = 68, Q3 = 72

Measure of spread: Interquartile range (IQR) = Q3 − Q1

IQR = 72 - 64 = 8
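The quartile definition used here (Q1 and Q3 as medians of the lower and upper halves, leaving out the middle value when n is odd) can be sketched as below. The function name `quartiles` is an assumption, and statistical software often uses other quartile conventions, so results from other tools may differ slightly.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                # values below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]    # skip the middle value when n is odd
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

heights_21 = [60, 61, 63, 64, 64, 65, 65, 65, 66, 67, 69,
              71, 71, 71, 72, 72, 72, 72, 73, 74, 75]
heights_22 = [6] + heights_21

for data in (heights_21, heights_22):
    q1, med, q3 = quartiles(data)
    print(f"n = {len(data)}: Q1 = {q1}, Median = {med}, Q3 = {q3}, IQR = {q3 - q1}")
```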

38

The five-number summary

• Minimum value
• Q1
• Median
• Q3
• Maximum value

39

Example: (data sheet without outlier “6”)

60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75

Min = 60, Q1 = 64.5, Median = 69, Q3 = 72, Max = 75.

[Figure: box plot of HEIGHT (N = 21); vertical axis from 50 to 80]

With 6 in the data:

6, 60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75

Q1 = 64, Median = 68, Q3 = 72

IQR = 72 - 64 = 8

[Figure: box plot of HEIGHT (N = 22); vertical axis from 0 to 80, with observation 1 flagged as an outlier]

41

Inner and outer fences for outliers

• The inner fences are located at a distance of 1.5 IQR below Q1 (lower inner fence = Q1 - 1.5 x IQR) and at a distance of 1.5 IQR above Q3 (upper inner fence = Q3 + 1.5 x IQR).

• The outer fences are located at a distance of 3 IQR below Q1 (lower outer fence = Q1 – 3 x IQR) and at a distance of 3 IQR above Q3 (upper outer fence = Q3 + 3 x IQR).

IQR = 72 – 64 = 8; Q1 = 64; Q3 = 72

• The inner fences are located at a distance of 1.5 IQR below Q1 (lower inner fence = 64 - 1.5 x 8 = 52) and at a distance of 1.5 IQR above Q3 (upper inner fence = 72 + 1.5 x 8 = 84).

• The outer fences are located at a distance of 3 IQR below Q1 (lower outer fence = 64 – 3 x 8 = 40) and at a distance of 3 IQR above Q3 (upper outer fence = 72 + 3 x 8 = 96).



43

Q1 = 64; Q3 = 72; IQR = 72 – 64 = 8

LIF: 64 - 1.5 x 8 = 52
UIF: 72 + 1.5 x 8 = 84
LOF: 64 - 3 x 8 = 40
UOF: 72 + 3 x 8 = 96

[Figure: box plot of HEIGHT (N = 22) with the IQR, the inner fences (52 and 84), and the outer fences (40 and 96) marked; observation 1 lies below the lower outer fence]

45

Mild and Extreme outliers

Data values falling between the inner and outer fences are considered mild outliers.

Data values falling outside the outer fences are considered extreme outliers.

When outliers exist, the whiskers extend to the smallest and largest data values within the inner fences.
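The fence rules translate directly into a small classifier. The sketch below takes Q1 and Q3 as given (64 and 72 for the height data with the value 6 included); the function name `classify_outliers` is an assumption for illustration.

```python
def classify_outliers(data, q1, q3):
    """Return the inner fences, outer fences, mild outliers, and extreme outliers."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    mild, extreme = [], []
    for x in data:
        if x < outer[0] or x > outer[1]:
            extreme.append(x)            # outside the outer fences
        elif x < inner[0] or x > inner[1]:
            mild.append(x)               # between the inner and outer fences
    return inner, outer, mild, extreme

heights = [6, 60, 61, 63, 64, 64, 65, 65, 65, 66, 67, 69,
           71, 71, 71, 72, 72, 72, 72, 73, 74, 75]
inner, outer, mild, extreme = classify_outliers(heights, q1=64, q3=72)
print("inner fences:", inner)
print("outer fences:", outer)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

For these data the lower whisker would stop at 60, the smallest value inside the inner fences.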

46

Side-by-side Box Plot

[Figure: side-by-side box plots of HEIGHT by sex (Female, Male); vertical axis from 50 to 80]

47

Remarks:

• If the distribution of the data is symmetric, then the mean and median will be about the same.
• The five-number summary is best for non-symmetric data.
• The median, quartiles, and interquartile range are not influenced by outliers.
• The mean and standard deviation are most appropriate only if the data are symmetric, because both of these measures are easily influenced by outliers.

48

Boxplot

For the following data:

13 72 78 40 50 56 50 52 57 69 130 142 51 52

• Find the five-number summary and the IQR.
• Make a boxplot.
• Find the 60th percentile.

Probability (Short Version)

Probability - 1

1

Probability and Counting Rules

A researcher claims that 10% of a large population have disease H.

A random sample of 100 people is taken from this population and examined.

If 20 people in this random sample have the disease, what does it mean? How likely would this happen if the researcher is right?

2

Sample Space and Probability

Random Experiment (Probability Experiment): an experiment whose outcomes depend on chance.

Sample Space (S): the collection of all possible outcomes in a random experiment.

Event (E): a collection of outcomes of interest in a random experiment.

3

Sample Space and Event

Sample Space:
S = {Head, Tail}
S = {Life span of a human} = {x | x ≥ 0, x ∈ R}

Event:
E = {Head}
E = {Life span of a human is less than 3 years}

A Simple Example

What’s the probability of getting a head on the toss of a single fair coin? Use a scale from 0 (no way) to 1 (sure thing).

So toss a coin twice. Do it! Did you get one head & one tail? What’s it all mean?

5

Definition of Probability

A rough definition: (frequentist definition)

The probability of a certain outcome in a random experiment is the proportion of times that this outcome would occur in a very long series of repetitions of the random experiment.

[Figure: Total Heads / Number of Tosses (0.00 to 1.00) plotted against Number of Tosses (0 to 125)]
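The frequentist idea can be illustrated by simulating coin tosses and tracking the running proportion of heads. This is a sketch only: the function name, the seed, and the number of tosses are arbitrary choices, and a pseudo-random generator stands in for a real coin.

```python
import random

def running_proportion_of_heads(n_tosses, seed=0):
    """Simulate fair-coin tosses; return the proportion of heads after each toss."""
    rng = random.Random(seed)       # fixed seed so the run is reproducible
    heads = 0
    proportions = []
    for i in range(1, n_tosses + 1):
        if rng.random() < 0.5:      # treat this as "Head"
            heads += 1
        proportions.append(heads / i)
    return proportions

props = running_proportion_of_heads(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>6} tosses: {props[n - 1]:.4f}")
```

Early proportions can be far from 0.5, but they settle near 0.5 as the number of tosses grows.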

Determining Probability

How to determine probability?

• Empirical Probability
• Theoretical Probability
• (Subjective approach)



7

Empirical Probability Assignment

Empirical study: (Don’t know if it is a balanced coin?)

Outcome   Frequency
Head      512
Tail      488
Total     1000

8

Empirical Probability Assignment

Empirical probability assignment:

$$P(E) = \frac{\text{Number of times event } E \text{ occurs}}{\text{Number of times experiment is repeated}} = \frac{m}{n}$$

Probability of Head:

P(Head) = 512/1000 = .512 = 51.2%

9

Empirical Probability Distribution

Empirical study:

Outcome   Frequency   Probability
Head      512         .512
Tail      488         .488
Total     1000        1.0

Theoretical Probability Assignment

Make a reasonable assumption:

What is the probability distribution in tossing a coin?

Assumption:

We have a balanced coin!

11

Theoretical Probability Assignment

Theoretical probability assignment:

$$P(E) = \frac{\text{Number of equally likely outcomes in event } E}{\text{Size of the sample space}} = \frac{n(E)}{n(S)}$$

Probability of Head:

P(Head) = 1/2 = .5 = 50%

12

Theoretical Probability Distribution (Model)

Outcome   Probability
Head      .50
Tail      .50
Total     1.0



13

Relative Frequency and Probability Distributions

Number of times visited a doctor, from a random sample of 300 individuals from a community

Class   Frequency   Relative Frequency
0       54          .18
1       117         .39
2       72          .24
3       42          .14
4       12          .04
5       3           .01
Total   300         1.00

P(0) = .18
P(1) = .39
P(2) = .24
P(3) = .14
P(4) = .04
P(5) = .01

14

Relative Frequency Distribution

[Figure: bar chart of relative frequency (0 to 0.5) for 0 to 5 visits]

Discrete Distribution

15

Relative Frequency and Probability

When selecting one individual at random from a population, the probability distribution and the relative frequency distribution are the same.

16

Probability for the Discrete Case

If an individual is randomly selected from this group of 300, what is the probability that this person visited a doctor 3 times?

Class   Frequency   Relative Frequency
0       54          .18
1       117         .39
2       72          .24
3       42          .14
4       12          .04
5       3           .01
Total   300         1.00

P(3 times) = 42/300 = .14 or 14%

17

Discrete Distribution

If an individual is randomly selected from this group of 300, what is the probability that this person visited a doctor 4 or 5 times?

Class   Frequency   Relative Frequency
0       54          .18
1       117         .39
2       72          .24
3       42          .14
4       12          .04
5       3           .01
Total   300         1.00

P(4 or 5 times) = P(4) + P(5) = .04 + .01 = .05

It would be an empirical probability distribution if the sample of 300 individuals is used to understand a large population.

18

Properties of Probability

• Probability is always a value between 0 and 1.

• Total probability (all outcomes together) equals 1.

• Probability of either one of the disjoint events A or B to occur is the sum of their individual probabilities. P(A or B) = P(A) + P(B)



19

Complementation Rule

For any event E,

P(E does not occur) = 1 – P(E)

Complement of E = $\bar{E}$

* Some places use E^c or E’

[Venn diagram: event E with probability P(E), and its complement with probability P($\bar{E}$)]

20


If an unbalanced coin has a probability of 0.7 of turning up Head on each toss, what is the probability of not getting a Head on a random toss?

P(not getting Head) = 1 – 0.7= 0.3

21


If the chance of a randomly selected individual living in community A to have disease H is .001, what is the probability that this person does not have disease H?

P(having disease H) = .001

P(not having disease H) = 1 – P(having disease H) = 1 – 0.001 = 0.999

22

Birthday Problem

In a group of 23 randomly selected people, what is the probability that at least two people have the same birth date? (Assume there are 365 days in a year.)

P(at least two people have the same birth date) [Too hard to compute directly!]

= 1 – P(everybody has a different birth date)

= 1 – [365 x 364 x … x (365 - 23 + 1)] / 365^23
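The complement-rule expression is tedious by hand but easy to evaluate in a loop. The sketch below assumes, as the problem does, 365 equally likely birth dates; the function name is made up for illustration.

```python
def prob_shared_birthday(n_people, days=365):
    """P(at least two of n_people share a birth date) via the complement rule."""
    p_all_different = 1.0
    for i in range(n_people):
        # The (i + 1)-th person must avoid the i birth dates already taken.
        p_all_different *= (days - i) / days
    return 1 - p_all_different

p23 = prob_shared_birthday(23)
print(f"P(shared birth date among 23 people) = {p23:.4f}")
```

With 23 people the probability is already a little over one half.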

23

Intersection of events: A ∩ B <=> A and B

Union of events: A ∪ B <=> A or B

S = {1, 2, 3, 4, 5, 6}

A = {1, 2, 3}

B = {3, 6}

Example: A ∩ B = {3}

Example: A ∪ B = {1, 2, 3, 6}

[Venn diagram (with elements listed): sample space S containing the events A and B]

Venn Diagram (with counts)

Given a total of 100 subjects

A = Smokers, n(A) = 50

B = Lung Cancer, n(B) = 25

Joint Event A ∩ B: n(A ∩ B) = 20

[Venn diagram: 30 in A only, 20 in A ∩ B, 5 in B only, 45 in neither]

n(A ∪ B) = ? 55



25

Venn Diagram (with relative frequencies)

Given a sample space

A = Smokers, P(A) = .50

B = Lung Cancer, P(B) = .25

Joint Event A ∩ B: P(A ∩ B) = .20

[Venn diagram: .30 in A only, .20 in A ∩ B, .05 in B only, .45 in neither]

P(A ∪ B) = .55

26

Contingency Table

                 Cancer, B   No Cancer, B^c   Total
Smoke, A         20          30               50
Not Smoke, A^c   5           45               50
Total            25          75               100

[Venn diagram of A and B]

27

Conditional Probability

The conditional probability of event A given that event B has occurred (or given the condition B) is denoted P(A|B) and, if P(B) is not zero, is

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

or

$$P(A|B) = \frac{n(A \cap B)}{n(B)}$$

where n(E) = number of equally likely outcomes in E.

28

Conditional Probability

                 Cancer, C   No Cancer, C’   Total
Smoke, S         20          30              50
Not Smoke, S’    5           45              50
Total            25          75              100

$$P(A|B) = \frac{n(A \cap B)}{n(B)}$$

P(C|S) = 20/50 = .4
P(C|S’) = 5/50 = .1

29

Conditional Probability

                 Cancer, C          No Cancer, C’       Total
Smoke, S         20 (.2)            30 (.3)             50, P(S) = (.5)
Not Smoke, S’    5 (.05)            45 (.45)            50, P(S’) = (.5)
Total            25, P(C) = (.25)   75, P(C’) = (.75)   100 (1.0)

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

P(C|S) = .2/.5 = .4
P(C|S’) = .05/.5 = .1

What is P(C|S) / P(C|S’)? It is 4. (Relative Risk)
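The conditional probabilities and the relative risk can be computed straight from the table counts. In this sketch the dictionary layout and the function name `p_cancer_given` are assumptions; the last lines also compare P(C|S) with P(C) as an informal check of whether smoking and cancer look independent in this table.

```python
# Counts from the contingency table: rows S (smoke) and S' (not smoke).
table = {
    "S":  {"C": 20, "C'": 30},
    "S'": {"C": 5,  "C'": 45},
}
total = sum(sum(row.values()) for row in table.values())

def p_cancer_given(smoking):
    """P(C | smoking status) = n(C and status) / n(status)."""
    row = table[smoking]
    return row["C"] / sum(row.values())

p_c_given_s = p_cancer_given("S")
p_c_given_not_s = p_cancer_given("S'")
relative_risk = p_c_given_s / p_c_given_not_s
p_c = (table["S"]["C"] + table["S'"]["C"]) / total

print(f"P(C|S)  = {p_c_given_s:.2f}")
print(f"P(C|S') = {p_c_given_not_s:.2f}")
print(f"relative risk = {relative_risk:.1f}")
print(f"P(C) = {p_c:.2f}  (differs from P(C|S), so C and S are not independent here)")
```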

30

Independent Events

Events A and B are independent if

P(A|B) = P(A)

or P(B|A) = P(B)

or P(A and B) = P(A) · P(B)



31

Example

If a balanced die is rolled twice, what is the probability of having two 6’s?

6_1 = the event of getting a 6 on the 1st trial

6_2 = the event of getting a 6 on the 2nd trial

P(6_1) = 1/6,

P(6_2) = 1/6, 6_1 and 6_2 are independent events

P(6_1 and 6_2) = P(6_1) P(6_2) = (1/6)(1/6) = 1/36

Independent Events

10% of the people in a large population have disease H. If a random sample of two subjects is selected from this population, what is the probability that both subjects have disease H?

H_i: Event that the i-th randomly selected subject has disease H.

P(H_2|H_1) ≈ P(H_2) [Events are almost independent]

P(H_1 ∩ H_2) ≈ P(H_1) P(H_2) = .1 x .1 = .01

33

Independent Events

If events A_1, A_2, …, A_k are independent, then

P(A_1 and A_2 and … and A_k) = P(A_1) · P(A_2) · … · P(A_k)

What is the probability of getting all heads when tossing a balanced coin four times?

P(H_1) · P(H_2) · P(H_3) · P(H_4) = (.5)^4 = .0625

34

Binomial Probability

What is the probability of getting two 6’s when casting a balanced die 5 times? (S = a 6 is rolled, S’ = a 6 is not rolled)

P(S∩S∩S’∩S’∩S’) = (1/6)^2 x (5/6)^3 = 0.016

P(S∩S’∩S∩S’∩S’) = (1/6)^2 x (5/6)^3 = 0.016

P(S∩S’∩S’∩S∩S’) = (1/6)^2 x (5/6)^3 = 0.016

···

How many of them?

$$\binom{5}{2} = \frac{5!}{2! \cdot 3!} = 10$$

Probability (two 6’s) = 0.016 x 10 = 0.16
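The count of arrangements times the probability of one arrangement is the binomial probability. A brief sketch using `math.comb` from the standard library (the function name `binomial_probability` is an illustrative choice):

```python
import math

def binomial_probability(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

one_arrangement = (1 / 6) ** 2 * (5 / 6) ** 3    # e.g. S, S, S', S', S'
arrangements = math.comb(5, 2)                    # ways to place the two 6's
two_sixes = binomial_probability(2, 5, 1 / 6)

print(f"one arrangement: {one_arrangement:.3f}")
print(f"arrangements:    {arrangements}")
print(f"P(two 6's):      {two_sixes:.2f}")
```

The same function reproduces the four-heads example: `binomial_probability(4, 4, 0.5)` gives .0625.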

35

Multiplication Rule 2 (General Multiplication Rule)

For any two events A and B,

P(A and B) = P(A|B) P(B) = P(B|A) P(A)

$$P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$$

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$

36

Multiplication Rule 2

If, in the population, 50% of the people smoke and 40% of the smokers have lung cancer, what percentage of the population are smokers who have lung cancer?

P(S) = .5 (50% of the subjects smoked)
P(C|S) = .4 (40% of the smokers have cancer)

P(C and S) = P(C|S) P(S) = .4 x .5 = .2, that is, 20% of the population.