Examining Distributions - University of Virginia

Post on 15-Mar-2018

Examining Distributions - Introduction

Chapter 1

Variables

• A variable records characteristics of individuals (i.e., objects of interest) in its values.

• A variable’s distribution describes the counts or relative proportions of its values.
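As a small illustration of a distribution as counts and relative proportions, the sketch below tallies a categorical variable. The blood-type values are made up for illustration and do not come from the slides.

```python
from collections import Counter

# Hypothetical values of a categorical variable (made up for illustration).
blood_types = ["O", "A", "A", "B", "O", "O", "AB", "A", "O", "B"]

counts = Counter(blood_types)          # value -> count
n = len(blood_types)
proportions = {value: count / n for value, count in counts.items()}

# Print the distribution: each value with its count and relative proportion.
for value, count in counts.items():
    print(f"{value:>2}  count={count}  proportion={proportions[value]:.2f}")
```

The proportions always sum to 1, which is a quick check that no value was dropped.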

Examining Distributions - Describing Distributions with Graphs

Section 1.1

Some graphical statistics

• Bar graphs and pie charts describe the distribution of a categorical variable.

• A Pareto chart is a bar graph with categories ordered by decreasing frequency.

• Histograms are essentially bar graphs of a quantitative variable.

• Stemplots are back-of-the-envelope histograms drawn with the digits of quantitative values.

• Time plots graph time series values against time.

Histograms

Use equal bar widths and “eyeball” the result for the best picture.

December 2004 state unemployment rates.

(Raw data in Table 1.1 of text.)
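The equal-width rule can be sketched as follows. The rates are hypothetical (they are not the Table 1.1 data), and the function name, class width, and starting point are illustrative choices.

```python
import math

def equal_width_histogram(values, bin_width, start):
    """Count values in consecutive classes [start + k*w, start + (k+1)*w)."""
    n_bins = math.floor((max(values) - start) / bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[math.floor((v - start) / bin_width)] += 1
    return counts

# Hypothetical rates in percent, NOT the Table 1.1 data.
rates = [3.1, 3.4, 4.0, 4.2, 4.3, 4.8, 5.0, 5.1, 5.5, 5.9, 6.2, 7.4]
hist_counts = equal_width_histogram(rates, bin_width=1.0, start=3.0)

# Crude text histogram: one star per observation in each class.
for k, c in enumerate(hist_counts):
    low = 3.0 + k * 1.0
    print(f"{low:.1f}-{low + 1.0:.1f} | {'*' * c}")
```

Rerunning with a different bin_width is the programmatic version of “eyeballing”: too narrow shows too much detail, too wide hides the shape.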

Interpreting histograms

Avoid too much detail: visualize a smooth curve highlighting the overall pattern.

Look for shape, center, and spread.

Distribution shapes

• Symmetric distribution

• Right-skewed distribution

• Complex, multimodal distribution

Interpreting histograms

Look for deviations, like outliers.

(Figure labels: Alaska, Florida.)

Stemplot

December 2004 state unemployment rates.

(Raw data in Table 1.1 of text.)

Stem Leaves

Split stem
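A stemplot, with and without split stems, can be sketched in a few lines. This assumes non-negative values recorded to one decimal place; the rates are hypothetical, not the Table 1.1 data, and the function name is illustrative.

```python
from collections import defaultdict

def stemplot(values, split=False):
    """Return stemplot lines: stem = integer part, leaf = tenths digit.
    With split=True each stem gets two lines: leaves 0-4, then leaves 5-9."""
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # e.g. 5.4 -> 54
        stem, leaf = divmod(tenths, 10)   # 54 -> stem 5, leaf 4
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))
    stems = [s for s, _ in rows]
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        for half in ((0, 1) if split else (0,)):
            lines.append(f"{stem} | {''.join(rows.get((stem, half), []))}")
    return lines

# Hypothetical rates in percent, NOT the Table 1.1 data.
rates = [3.1, 3.4, 4.0, 4.2, 4.3, 4.8, 5.0, 5.1, 5.5, 5.9, 6.2, 7.4]
plain = stemplot(rates)
split_stems = stemplot(rates, split=True)
print("\n".join(plain))
```

Splitting stems doubles the number of rows, which helps when most leaves pile up on only a few stems.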

Examining Distributions - Describing Distributions with Numbers

Section 1.2

Measure of center: the mean

x̄ = (x1 + … + xn) / n

Example: heights (in.) of 25 women

Measure of center: the median

Step 1: Sort x1, …, xn.

Step 2.a: If n is odd, M = middle value.

Step 2.b: If n is even, M = average of the two middle values.

M = 3.4

M = (3.3+3.4)/2 = 3.35
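The median steps translate directly into code. The two samples below are hypothetical; they were chosen only so that the results match the values 3.4 and 3.35 shown above, and they are not the slide's data.

```python
def median(values):
    xs = sorted(values)                    # Step 1: sort
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:                         # Step 2.a: odd n, middle value
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2     # Step 2.b: even n, average of two middle values

# Hypothetical samples (not the slide's data).
odd_sample = [3.4, 0.6, 6.1, 2.2, 4.5]
even_sample = [3.4, 0.6, 6.1, 3.3]

m_odd = median(odd_sample)
m_even = median(even_sample)
print(m_odd, round(m_even, 2))
```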

Comparisons

(Figure panels: left skew, symmetry, right skew.)

Observe:

• The mean is “pulled” by outliers.

• The median is resistant to outliers.
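A quick numerical check of these two observations, using a made-up sample and a single made-up outlier:

```python
import statistics

# Hypothetical sample, then the same sample with one extreme value added.
sample = [2, 3, 3, 4, 5]
with_outlier = sample + [50]

mean_shift = statistics.mean(with_outlier) - statistics.mean(sample)
median_shift = statistics.median(with_outlier) - statistics.median(sample)

print(f"mean moves by {mean_shift:.2f}, median moves by {median_shift:.2f}")
```

One extreme value drags the mean far from the bulk of the data, while the median barely moves.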

Measure of spread: the quartiles

The first quartile, Q1, is the median of values below M.

The third quartile, Q3, is the median of values above M.

Example: Q1 = 2.2, M = 3.4, Q3 = 4.35

Five-number summary and boxplot

Min = 0.6, Q1 = 2.2, M = 3.4, Q3 = 4.35, Max = 6.1
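A minimal sketch of the five-number summary, using the quartile rule stated above (Q1 and Q3 are medians of the values below and above the median's position). The data are hypothetical, and statistical software often uses slightly different quartile rules, so results may differ from a package's output.

```python
def median_sorted(xs):
    """Median of an already sorted list."""
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 == 1 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(values):
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                    # values below the median's position
    upper = xs[half + 1:] if n % 2 == 1 else xs[half:]   # values above it
    return xs[0], median_sorted(lower), median_sorted(xs), median_sorted(upper), xs[-1]

# Hypothetical data (not the slide's data).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(data)
print("Min, Q1, M, Q3, Max =", summary)
```

A boxplot draws exactly these five numbers: the box spans Q1 to Q3 with a line at M, and the whiskers reach Min and Max.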

Measure of spread: the standard deviation

s = √(s²), where s² = [1 / (n – 1)] Σ (xi – x̄)² and x̄ is the mean

Example: heights (in.) of 25 women

Note: Calculate by computer
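“By computer” might look like the sketch below, which computes the sample standard deviation by hand (dividing by n – 1) and compares it with Python's `statistics.stdev`. The five heights are made up for illustration; they are not the 25 heights from the slide.

```python
import math
import statistics

# Hypothetical heights in inches (not the slide's 25 values).
heights = [62, 64, 65, 66, 68]

n = len(heights)
mean = sum(heights) / n
variance = sum((x - mean) ** 2 for x in heights) / (n - 1)   # s squared
s = math.sqrt(variance)                                      # s

print(f"mean = {mean}, s^2 = {variance}, s = {s:.3f}")
print("library value:", round(statistics.stdev(heights), 3))
```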

Summarizing distributions

• Five-number summary (Min, Q1, M, Q3, Max): resistant

• Error bars: not resistant

Examining Distributions - The Normal Distributions

Section 1.3

Density curves

A density curve is a mathematical idealization of a histogram.

(Figure: actual histogram and its idealization.)

“Area under the curve” ≈ proportion of observations.

Other idealizations

(Figure: histogram and density curve.)

• The median halves the “area under the curve.”

• The mean is the balance point.

Examples

Have easy mathematical formulas

No easy formula

Normal distributions

The normal curves:

f(x) = [1 / (σ √(2π))] exp( –(1/2) ((x – µ) / σ)² ), where exp is the “exponential” function

Properties:

• Symmetric, single-peaked, and bell-shaped.

• Indexed by µ and σ, denoted N(µ, σ).

• µ ± σ mark the inflection points.

Impact of µ and σ

Same µ, different σ

Different µ, same σ

The 68-95-99.7 Rule

If x is N(µ, σ):

• About 68% of observations fall within µ ± σ

• About 95% of observations fall within µ ± 2σ

• About 99.7% of observations fall within µ ± 3σ
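The three percentages can be checked numerically. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` is available; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()   # N(0, 1)

# Proportion within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sigma: {p:.4f}")
```

The same proportions hold for any N(µ, σ), because the rule is stated in units of σ around µ.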

Standardization

A z-score measures the location of x relative to µ in units of σ: z = (x – µ) / σ.

Key property: If x is N(µ, σ) then z is N(0, 1), the “standard Normal” distribution.

Benefit: To calculate an “area under the curve” for N(µ, σ), translate to a z-score and use N(0, 1).

Example calculation: heights

Problem: Height, x, is N(64.5, 2.5).

For what proportion of individuals is x < 67?

Solution:

Ask: How far is c = 67 from µ = 64.5 in units of σ = 2.5?

(c – µ) / σ = (67 – 64.5) / 2.5 = 1

Translate: z = (x – µ) / σ is N(0, 1)

For what proportion of individuals is z < 1?

Calculate: normsdist(1) = 0.84
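The same calculation can be reproduced outside a spreadsheet. In this sketch, `NormalDist().cdf` from Python's standard library (3.8 or later) is assumed to play the role of normsdist.

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5
c = 67

z = (c - mu) / sigma                      # distance from the mean in units of sigma
p_via_z = NormalDist().cdf(z)             # proportion with z < 1 under N(0, 1)
p_direct = NormalDist(mu, sigma).cdf(c)   # same proportion without standardizing

print(f"z = {z}, proportion = {p_via_z:.2f}")
```

Both routes give the same answer, which is the point of the key property: standardizing changes the scale, not the area.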

Example calculation: heights (cont.)

68-95-99.7 rule:

Proportion with -1 < z < 1 is 0.68

Equally divide the remaining 0.32 between z < -1 and z > 1, giving 0.16 each

Proportion with z < 1 is 0.16 + 0.68 = 0.84

Calculation of “area between”

Problem: Proportion with c1 < z < c2

Solution: (prop. with z < c2) – (prop. with z < c1)

Example: Proportion with 1.4 < z < 2.2.

normsdist(2.2) – normsdist(1.4)

= 0.9861 – 0.9192 = 0.0669
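A sketch of the “area between” subtraction, again assuming `statistics.NormalDist` (Python 3.8 or later) in place of normsdist; the helper name is illustrative.

```python
from statistics import NormalDist

def proportion_between(c1, c2):
    """Proportion of N(0, 1) with c1 < z < c2, as (area below c2) - (area below c1)."""
    std_normal = NormalDist()
    return std_normal.cdf(c2) - std_normal.cdf(c1)

area = proportion_between(1.4, 2.2)
print(f"{area:.4f}")
```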

Backward calculations

Problem: For what c is p the proportion with z < c?

Solution: c = normsinv(p)

Examples:

normsinv(0.84) = 1

normsinv(0.16) = -1

Example calculation: mpg

Problem: MPG, x, of compact cars is N(25.7, 5.88).

For what c do 10% of compact cars have x > c?

Solution: First, normsinv(0.90) = 1.28

Translate: z = (x – µ) / σ is N(0, 1)

10% of compact cars have z > 1.28 = (c – µ) / σ

Solve: 1.28 = (c – 25.7) / 5.88

⇒ c = 25.7 + (1.28)(5.88)

= 33.2
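The backward calculation can be sketched with `NormalDist.inv_cdf` from Python's standard library (3.8 or later), assumed here to play the role of normsinv.

```python
from statistics import NormalDist

mu, sigma = 25.7, 5.88

# 10% above c means 90% below c, so look up the 0.90 point of N(0, 1).
z_cut = NormalDist().inv_cdf(0.90)
c = mu + z_cut * sigma                          # undo the standardization

# Same answer directly from N(mu, sigma).
c_direct = NormalDist(mu, sigma).inv_cdf(0.90)

print(f"z = {z_cut:.2f}, c = {c:.1f}")

# The slide's normsinv(0.84) = 1 is a rounded value.
z_84 = NormalDist().inv_cdf(0.84)
```

Unrounded, the 0.84 point is slightly below 1, because 0.84 is itself a rounded proportion.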

Examining Relationships - Scatterplots

Section 2.1

Examining relationships

Often, individuals are measured on more than one variable.

Follow the same approach as before:

• Plot data and calculate numerical summaries

• Look for overall patterns and deviations

• Consider suitability of mathematical models (later)

Additional considerations:

• Do some variables tend to vary together?

• Do some variables explain variability in another?

Definitions:

• A response variable measures or records an outcome of a study. (Also: y, dependent variable.)

• An explanatory variable explains changes in the response variable. (Also: x, independent variable.)

Scatterplots

A scatterplot is a graph of two quantitative variables measured on the same set of individuals.

If appropriate:

• response variable on y-axis

• explanatory variable on x-axis

Example

Beers Drunk    Blood Alcohol
5              0.10
2              0.03
9              0.19
7              0.10
3              0.07
3              0.02
4              0.07
5              0.09
8              0.12
3              0.04
5              0.06
5              0.05
6              0.10
7              0.09
1              0.01
4              0.05

Interpretation: form

(Figure panels: linear, nonlinear, no relationship.)

Interpretation: direction

Negative: high x ↔ low y, low x ↔ high y

Positive: high x ↔ high y, low x ↔ low y

Interpretation: strength

A stronger relationship has points falling more closely along a clear form.

(Figure panels: perfect linear, less strong.)

An outlier (of the relationship) is a point that falls off the trend.

(Figure panels: outlier in x and y but not of the relationship; outlier of the relationship.)

Examining Relationships - Correlation

Section 2.2

Measure of direction and strength: correlation

Data: the beers drunk and blood alcohol values for the 16 individuals in the Section 2.1 example.

Note: Calculate by computer
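A sketch of the computation for the beer and blood alcohol data. It assumes the usual textbook definition, r = [1 / (n – 1)] Σ ((xi – x̄) / sx) ((yi – ȳ) / sy), which the transcript itself does not show; the function name is illustrative.

```python
import math

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.10, 0.07, 0.02, 0.07, 0.09,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

def correlation(xs, ys):
    """Average product of standardized x and y values (dividing by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / sx) * ((y - mean_y) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

r = correlation(beers, bac)
r_swapped = correlation(bac, beers)                       # roles interchanged
r_rescaled = correlation(beers, [1000 * y for y in bac])  # different units for y

print(f"r = {r:.3f}")
```

Swapping the two variables or changing the units of blood alcohol leaves r unchanged up to rounding, because r is built from standardized values.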

Examples

Properties

• -1 ≤ r ≤ 1, always

• Response and explanatory variables are interchangeable

• Unitless, and independent of the variables’ units

• r is not resistant

Properties (cont.)

• Measures only linear relationships

Linear: r is appropriate

Non-linear: r may mislead
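A made-up example of how r can mislead: below, y is completely determined by x through a curve, yet r comes out as essentially zero. The data and function name are illustrative, and r is computed with the usual textbook definition.

```python
import math

def correlation(xs, ys):
    """Average product of standardized x and y values (dividing by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / sx) * ((y - mean_y) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: a perfect but non-linear (quadratic) relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_curve = correlation(x, y)
r_line = correlation(x, [2 * v + 1 for v in x])   # perfect linear relationship

print(f"quadratic: r = {r_curve:.3f}, linear: r = {r_line:.3f}")
```

Plotting the data first is the safeguard: the scatterplot shows the curve that r alone cannot.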