
Post on 17-Dec-2015


1

Introduction to Biostatistics

(BIO/EPI 540)

Data Presentation Graphs and Tables

Acknowledgement: Thanks to Professor Pagano (Harvard School of Public Health) for lecture material

2

Class Plan

• Data Presentation (Lec 2 overview)

• Example (hand/SAS)

• Mean and variance

• Describing Data (and in next class)

• Simulating Data (and in next class)

3

Outline

• Descriptive Statistics – means of organizing and summarizing observations

• Types of data

• Data presentation and numerical summary measures

4

Types of data

•Nominal Data

•Ordinal Data (including Rank Data)

•Discrete Data

•Continuous Data

5

Types of data

•Nominal Data

1: male, 0: female

•Nominal data values fall into unordered categories or classes

6

Types of data

•Ordinal Data

•Observations with order among categories are referred to as ordinal

1. Mild

2. Moderate

3. Severe

7

8

Cause                    1999   1998
Floodgates/Canal Lock      15      9
Human Related               8      6
Natural                    43     21
Perinatal                  52     53
Watercraft                 82     66
Undetermined               69     76
Total                     269    231

Example: Death of Manatees in Florida

Florida Fish and Wildlife Conservation Commission

Nominal categories

9

Cause                    1999   1998   Rank
Floodgates/Canal Lock      15      9      4
Human Related               8      6      5
Natural                    43     21      3
Perinatal                  52     53      2
Watercraft                 82     66      1
Undetermined               69     76
Total                     269    231

Example: Death of Manatees in Florida

Florida Fish and Wildlife Conservation Commission

Ranked data
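
The ranks in the table appear to follow the 1999 counts, with rank 1 given to the cause with the most deaths. A minimal sketch of that ranking is below. It assumes the "Undetermined" row is deliberately left unranked, as on the slide; the variable names are illustrative only.

```python
# Deaths by cause in 1999, copied from the table above.
deaths_1999 = {
    "Floodgates/Canal Lock": 15,
    "Human Related": 8,
    "Natural": 43,
    "Perinatal": 52,
    "Watercraft": 82,
    "Undetermined": 69,
}

# Assumption: "Undetermined" is not a real category of cause, so it is
# excluded from the ranking (the slide leaves its rank blank).
ranked_causes = {c: n for c, n in deaths_1999.items() if c != "Undetermined"}

# Rank 1 goes to the cause with the largest count.
ordered = sorted(ranked_causes, key=ranked_causes.get, reverse=True)
ranks = {cause: position for position, cause in enumerate(ordered, start=1)}

for cause in ordered:
    print(ranks[cause], cause, ranked_causes[cause])
```

Once the counts are replaced by ranks, only the order is retained: the gap between rank 1 and rank 2 no longer says how many more deaths occurred.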

10

Types of data

•Discrete Data

•Both order & magnitude important

•Data consists of restricted set of values

e.g. Data on number of children per subject

Subject   Number of children
1         2
2         3
3         1
4         2
5         4

11

Types of data

•Continuous Data

•Data represent measurable quantities and are not restricted to taking on specific values

•US adult heights

•US adult individual cholesterol measurements

12

Outline

• Descriptive Statistics – means of organizing and summarizing observations

• Types of data

• Data presentation and numerical summary measures

13

Data Presentation

• Nominal / Ordinal Data:
  – Frequency (relative frequency) tables
  – Bar charts

• Discrete / Continuous Data:
  – Histogram (Frequency Polygon)
  – One way scatter plot

• Continuous Data:
  – Box plot
  – 2 way scatter plot
  – Line Graph

14

Example: Serum cholesterol level of men aged 25-34 years.

Cholesterol Level (mg/100 ml)   Number of Men
80-119                                     13
120-159                                   150
160-199                                   442
200-239                                   299
240-279                                   115
280-319                                    34
320-359                                     9
360-399                                     5
Total                                   1,067

Frequency Table

15

Example: Serum cholesterol level of men aged 25-34 years.

Cholesterol Level (mg/100 ml)   Number of Men   Relative Frequency (%)
80-119                                     13                      1.2
120-159                                   150                     14.1
160-199                                   442                     41.4
200-239                                   299                     28.0
240-279                                   115                     10.8
280-319                                    34                      3.2
320-359                                     9                      0.8
360-399                                     5                      0.5
Total                                   1,067                    100.0

Frequency Table
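
The relative frequency column is each count divided by the total, expressed as a percentage. A minimal sketch of that arithmetic, using the counts from the table above (the variable names are illustrative, not from the lecture):

```python
# Number of men in each cholesterol interval (mg/100 ml), from the table.
counts = {
    "80-119": 13,
    "120-159": 150,
    "160-199": 442,
    "200-239": 299,
    "240-279": 115,
    "280-319": 34,
    "320-359": 9,
    "360-399": 5,
}

total = sum(counts.values())

# Relative frequency (%) = 100 * count / total, rounded to one decimal.
relative_frequency = {
    level: round(100 * n / total, 1) for level, n in counts.items()
}

for level, pct in relative_frequency.items():
    print(f"{level:>8}  {counts[level]:>5}  {pct:>5.1f}")
print(f"{'Total':>8}  {total:>5}")
```

Relative frequencies make groups of different sizes comparable, which is why they are used when two age groups are shown side by side.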

16

Bar Chart

http://www.ncsu.edu/labwrite/res/gh/gh-bargraph.html#horizbar

Label axes; Leave space between bars

Car defects in three factories

17

Data Presentation

• Nominal / Ordinal Data:
  – Frequency (relative frequency) tables
  – Bar charts

• Discrete / Continuous Data:
  – Histogram (Frequency Polygon)

• Continuous Data:
  – Box plot
  – 2 way scatter plot
  – Line Graph

18

Histogram

Example

19

Histogram

• The number of bins to use depends on the range of the data

• Equal bin widths are recommended

• When the data demand unequal bin widths, take care to plot each bar with area proportional to its relative frequency

Key points

20

Histogram

• A histogram represents percentages by areas*

• Density scale (Y axis): the height of each block (bin) equals the percentage in that block (bin) divided by the bin width

• Total area of histogram = 100%

• When bin widths are equal – it is common for the histogram to show just the counts in each bin

Source: http://www.stat.berkeley.edu/users/rice/Stat2/Chapt3.pdf

Key points
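
The density-scale rule above (height = percentage in the bin divided by the bin width) can be sketched as follows. The percentages and bin edges here are hypothetical, chosen only to illustrate unequal bin widths, and `density_heights` is an illustrative helper name, not something from the lecture.

```python
def density_heights(percentages, edges):
    """Return block heights (percent per unit of x) for a density-scale histogram."""
    widths = [right - left for left, right in zip(edges, edges[1:])]
    if len(widths) != len(percentages):
        raise ValueError("need exactly one more edge than percentages")
    return [p / w for p, w in zip(percentages, widths)]


# Hypothetical data: four bins with widths 10, 10, 20 and 40.
percentages = [10.0, 40.0, 30.0, 20.0]
edges = [0, 10, 20, 40, 80]

heights = density_heights(percentages, edges)

# Area of each block = height * width; the areas recover the percentages.
areas = [h * (right - left) for h, left, right in zip(heights, edges, edges[1:])]
total_area = sum(areas)

print(heights)
print(total_area)
```

With equal bin widths every height is divided by the same number, so plotting raw counts gives the same shape; the division only matters when widths differ.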

21

Source: http://www.stat.berkeley.edu/users/rice/Stat2/Chapt3.pdf

Histogram - example

22

Percent

Source: http://www.stat.berkeley.edu/users/rice/Stat2/Chapt3.pdf

Histogram - example

23

Histogram

Source: http://www.stat.berkeley.edu/users/rice/Stat2/Chapt3.pdf

24

Histogram

Constructing a 100% area histogram

Source: http://www.stat.berkeley.edu/users/rice/Stat2/Chapt3.pdf

25

Histogram

Constructing a 100% area histogram

Source: http://www.stat.berkeley.edu/users/rice/Stat2/Chapt3.pdf

26

Histogram

[Figure: histogram on a density scale; y-axis: density; x-axis ticks: -2.0, -0.4, 0.40, 2.0]

Constructing a 100% area histogram

Source: http://www.stat.berkeley.edu/users/rice/Stat2/Chapt3.pdf

27

Serum cholesterol level of men (1976-1980 survey)

Cholesterol Level (mg/100 ml)   Relative Frequency, 25-34 yrs (%)   Relative Frequency, 55-64 yrs (%)
80-119                                                        1.2                                 0.4
120-159                                                      14.1                                 3.9
160-199                                                      41.4                                21.6
200-239                                                      28.0                                37.3
240-279                                                      10.8                                22.9
280-319                                                       3.2                                10.4
320-359                                                       0.8                                 2.9
360-399                                                       0.5                                 0.6
Total                                                       100.0                               100.0

Frequency Polygon - Example

28

Frequency polygon of cholesterol

[Figure: frequency polygons for men aged 25-34 and 55-64; x-axis: cholesterol levels, 80-119 to 360-399; y-axis: percent, 0 to 45]

Frequency Polygon - Example

29

Serum cholesterol level of men aged 25-34 years.

Cholesterol Level (mg/100 ml)   Relative Frequency (%)   Cumulative Relative Frequency (%)
80-119                                             1.2                                 1.2
120-159                                           14.1                                15.3
160-199                                           41.4                                56.7
200-239                                           28.0                                84.7
240-279                                           10.8                                95.5
280-319                                            3.2                                98.7
320-359                                            0.8                                99.5
360-399                                            0.5                               100.0
Total                                            100.0

Frequency Polygon - Example
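
Each cumulative value is the running total of the relative frequencies up to and including that interval. A minimal sketch using the 25-34 year percentages from the table above (rounding to one decimal is an assumption made to match the table's precision):

```python
from itertools import accumulate

# Relative frequencies (%) for men aged 25-34, in interval order.
relative_frequency = [1.2, 14.1, 41.4, 28.0, 10.8, 3.2, 0.8, 0.5]

# Running total; rounded to one decimal to suppress floating-point noise.
cumulative = [round(c, 1) for c in accumulate(relative_frequency)]

print(cumulative)
```

The cumulative column reads off percentiles directly: for example, 56.7% of these men have a level below 200 mg/100 ml, so the median falls in the 160-199 interval.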

30

Cumulative frequency polygon of cholesterol

[Figure: cumulative frequency polygon for men aged 25-34; x-axis: cholesterol levels, 80-119 to 360-399; y-axis: percent, 0 to 100]

Frequency Polygon - Example

31

Cumulative frequency polygon of cholesterol

[Figure: cumulative frequency polygons for men aged 25-34 and 55-64; x-axis: cholesterol levels, 80-119 to 360-399; y-axis: percent, 0 to 100]

Frequency Polygon - Example

32

Data Presentation

• Nominal / Ordinal Data:
  – Frequency (relative frequency) tables
  – Bar charts

• Discrete / Continuous Data:
  – Histogram (Frequency Polygon)

• Continuous Data:
  – Box plot
  – 2 way scatter plot
  – Line Graph

33

Example - Dyslipidemia in HIV Cohort

Histogram reveals an asymmetric, skewed distribution

34

Example - Dyslipidemia in HIV Cohort

Natural log transformation of the data results in a more symmetric distribution
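
The effect of a log transformation on a right-skewed variable can be sketched with simulated values. The data below are NOT the HIV cohort measurements: they are drawn from a lognormal distribution with arbitrary parameters, purely as an assumption for illustration, and `skewness` is an illustrative helper.

```python
import math
import random


def skewness(values):
    """Sample skewness: mean cubed z-score (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n


random.seed(540)  # fixed seed so the sketch is reproducible

# Simulated right-skewed "triglyceride-like" values (hypothetical parameters).
raw = [random.lognormvariate(5.0, 0.6) for _ in range(1000)]

# Natural log transformation of every observation.
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

A skewness near zero after the transformation is what "more symmetric" means numerically; a histogram of `logged` would show the same thing visually.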

35

Box plot

Dyslipidemia in HIV Cohort

[Figure: box plot; y-axis: natural log transformed triglyceride measurements; labelled features: 25th percentile, 50th percentile, 75th percentile, LB, UB, outliers]

UB (LB) = most extreme data point that is within 1.5 times box width (IQR) of the 75th (25th) percentile
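
The UB/LB rule can be sketched in a few lines. The values below are hypothetical, and the quartiles come from Python's `statistics.quantiles` with `method="inclusive"`; other packages (SAS included) may use a different percentile convention and give slightly different quartiles, so treat this as one reasonable choice rather than the lecture's exact method.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, whisker ends (LB, UB) and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lb": min(inside),
        "ub": max(inside),
        "outliers": outliers,
    }


# Hypothetical measurements with one unusually large value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(values)
print(summary)
```

Note that UB and LB are actual data points, not the fences themselves: here the upper fence is 14.5, but the whisker stops at 9, the largest observation below it.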

36

Box plot

Dyslipidemia in HIV Cohort

37

2 way scatter plot

Dyslipidemia in HIV Cohort

Reveals relationship between 2 continuous variables

38

Summary

• Data Types:
  – Nominal
  – Ordinal
  – Discrete
  – Continuous

• Data presentation (Nominal/Ordinal data):
  – Tables (Frequency, Relative Frequency)
  – Bar charts

• Data presentation (Discrete/Continuous):
  – Histogram (Frequency Polygon)

• Data presentation (Continuous):
  – Box plot, shapes of distributions
  – 2 way scatter plot

39

In-Class Example

Distance willing to Travel to a Household Hazardous waste site:

Distance        Freq
< 1 mile          75
>1-2 miles        90
>2-5 miles        45
>5-10 miles       90
Total            300

Histogram, Polygon, Cum % Dist.

40

In-Class Example

Distance willing to Travel to a Household Hazardous waste site:

Distance        Freq     %   %/mile
< 1 mile          75    25       25
>1-2 miles        90    30       30
>2-5 miles        45    15        5
>5-10 miles       90    30        6
Total            300

Histogram, Polygon, Cum % Dist.
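
The % and %/mile columns, and the cumulative percentages needed for the third plot, can be reproduced as below. This is a sketch that assumes the "< 1 mile" category starts at 0 miles, so the bin edges are 0, 1, 2, 5 and 10.

```python
from itertools import accumulate

# (lower edge, upper edge, frequency); the 0-mile lower edge is an assumption.
bins = [(0, 1, 75), (1, 2, 90), (2, 5, 45), (5, 10, 90)]

total = sum(freq for _, _, freq in bins)

# Percentage of respondents in each distance bin.
percent = [100 * freq / total for _, _, freq in bins]

# Density-scale height: percent per mile = percent / bin width.
percent_per_mile = [
    p / (upper - lower) for p, (lower, upper, _) in zip(percent, bins)
]

# Cumulative percent at the upper edge of each bin.
cumulative = list(accumulate(percent))

for (lower, upper, freq), p, d, c in zip(bins, percent, percent_per_mile, cumulative):
    print(f"{lower:>2}-{upper:<2} miles  freq={freq:>3}  %={p:>4.0f}  %/mile={d:>4.0f}  cum%={c:>5.0f}")
```

The last bin holds 30% of respondents, the same as the 1-2 mile bin, yet its height is only 6 %/mile because it is five miles wide; plotting raw percentages instead would overstate it.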

41

Histogram of Travel Distance (miles) for n=300

[Figure: histogram; x-axis: Distance (Miles), ticks at 0, 1, 2, 3, 4, 5, 10; y-axis: Density]

42

Polygon of Travel Distance (miles) for n=300

[Figure: frequency polygon; x-axis: Distance (Miles), ticks at 0, 1, 2, 3, 4, 5, 10; y-axis: Density]

43

Cumulative % of Travel Distance (miles) for n=300

[Figure: cumulative percent polygon; x-axis: Distance (Miles), ticks at 0, 1, 2, 3, 4, 5, 10; y-axis: Cum. Percent, ticks at 0, 25, 50, 75, 100]