Ch. 1: Data and Distributions - Purdue Universityxuanyaoh/stat350/xyFinal.pdf · 2012-04-24 · Ch....


Ch. 1: Data and Distributions
•  Populations vs. Samples
•  How to graphically display data
    –  Histograms, dot plots, stem plots, etc.
    –  Helps to show how samples are distributed
•  Distributions of both continuous and discrete variables
    –  Density functions and Mass functions
        •  Three basic properties
    –  Shows the distribution of the entire population or process
•  Some important distributions and associated Probability
    –  Continuous: Exponential, Normal, Uniform …
    –  Discrete: Binomial, Poisson …

4/24/12 1 H.X. Lecture 30: Final Summary

Ch. 2: Numerical Summary Measures

•  Measures of center for Data (Sample)
    –  Sample mean
    –  Sample median, midpoint
    –  Trimmed means
•  Measures of variability for Data (Sample)
    –  Sample variance
    –  Sample Standard deviation
•  Quartiles; Five-Number Summary; IQR and Outliers
•  Graphical Display: Boxplots; Modified Version; Side-By-Side Boxplots


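Sample mean: $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Sample variance: $s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Sample standard deviation: $s = \sqrt{s^2}$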
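As a quick check on the sample formulas, here is a minimal Python sketch that computes the mean, variance (with the n - 1 divisor) and standard deviation by hand. The function name `sample_summary` and the data values are made up for illustration.

```python
import math
import statistics


def sample_summary(xs):
    """Return (mean, variance, standard deviation) using the n - 1 divisor."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, variance, math.sqrt(variance)


data = [2, 4, 4, 5, 10]  # made-up illustrative sample
mean, variance, sd = sample_summary(data)
print(mean, variance, sd)

# The standard library uses the same n - 1 definition.
print(statistics.variance(data), statistics.stdev(data))
```

The hand computation and the `statistics` module should agree, since both divide by n - 1.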

Ch. 2 (Cont.): Numerical Summary Measures

•  Measures of Center (Distributions)
    –  Continuous: $\mu_X = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$
    –  Discrete: $\mu_X = \sum x \cdot p(x)$
•  Measures of Variability (Distributions)
    –  Continuous: $\sigma_X^2 = \int_{-\infty}^{\infty} (x - \mu_X)^2 \cdot f(x)\,dx$
    –  Discrete: $\sigma_X^2 = \sum (x - \mu_X)^2 \cdot p(x)$
•  Normal Quantile (QQ) plot
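The discrete formulas can be sketched in a few lines of Python. The mass function below is a made-up example, and `discrete_mean_variance` is a hypothetical helper name, not something from the course materials.

```python
import math


def discrete_mean_variance(pmf):
    """Mean and variance of a discrete distribution given as {x: p(x)}."""
    if not math.isclose(sum(pmf.values()), 1.0):
        raise ValueError("probabilities must sum to 1")
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    return mu, var


pmf = {0: 0.2, 1: 0.5, 2: 0.3}  # made-up mass function
mu, var = discrete_mean_variance(pmf)
print(mu, var, math.sqrt(var))
```

For this mass function the mean is 1.1 and the variance is 0.49, up to floating-point rounding.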


Ch.3: Bivariate Data
•  Scatterplots: Visually Display Bivariate data, y vs. x
•  Pearson’s Correlation Coefficient (between X and Y, both quantitative), r:
    –  r measures the strength and direction of the linear relationship
    –  $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, other convenient formulas for Sxy, Sxx and Syy
    –  Takes values between -1 and 1, inclusive
        •  Sign indicates type/direction of relationship (positive, negative)
        •  Value indicates strength: farther from 0 is stronger
    –  If the roles of X and Y are switched → r doesn’t change
    –  Unit free: unaffected by linear transformations
    –  Affected by Outliers; Not a resistant measure
    –  Correlation ≠ Causation
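The properties of r listed above can be checked numerically. This is a minimal sketch with made-up data; `pearson_r` is a hypothetical helper that uses the sums Sxy, Sxx and Syy of centered products.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                         # roles of X and Y switched
r_rescaled = pearson_r([10 * v + 3 for v in x], y)  # change of units for x
print(r, r_swapped, r_rescaled)
```

All three printed values should match, which illustrates the symmetry and unit-free properties (the rescaling here has a positive slope).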


Ch. 3: LS (Least Square) Regression Line

•  Estimated straight line Equation: $\hat{y} = a + bx$
    –  a is the intercept (where it crosses the y-axis)
    –  b is the slope (rate)
    –  $b = r\left(\frac{s_y}{s_x}\right)$
    –  Predicted value of y
    –  Residual from the fit (or regression line)
    –  Breaking up Sum of Squares: SSR, SSE, SST
•  Coefficient of Determination: $r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
    –  Percent of variation explained by the linear regression between Y and X
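A minimal Python sketch of the least-squares fit, using the slope formula b = r(s_y/s_x) from this slide. The intercept a = ȳ - b·x̄ is the standard least-squares result and is an assumption here, since the slide does not state it; the data and function names are made up.

```python
import math
import statistics


def fit_line(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    r = s_xy / math.sqrt(s_xx * s_yy)
    b = r * statistics.stdev(ys) / statistics.stdev(xs)  # slope from the slide
    a = y_bar - b * x_bar  # assumption: standard LS intercept, not on the slide
    return a, b, r


def sums_of_squares(xs, ys, a, b):
    """Return (SSR, SSE, SST) for the fitted line."""
    y_bar = statistics.mean(ys)
    preds = [a + b * x for x in xs]
    ssr = sum((p - y_bar) ** 2 for p in preds)
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return ssr, sse, sst


x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
a, b, r = fit_line(x, y)
ssr, sse, sst = sums_of_squares(x, y, a, b)
print(a, b, ssr / sst, 1 - sse / sst, r ** 2)
```

The last three printed values should agree, matching r² = SSR/SST = 1 - SSE/SST.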

Ch. 3 (Cont.): MSE and Residual Plot

•  Mean Squared Error about the LS line: $MSE = \frac{SSE}{n - 2}$
•  Standard Deviation about the LS line: $\sqrt{MSE}$
    –  Also called “root MSE” in SAS output.
•  Residual: $e_i = y_i - \hat{y}_i$
•  A residual plot plots the residuals against x.
    –  The residual plot should show no pattern, only a random scattering of points.
    –  If a pattern is observed, the linear regression model is probably not appropriate.
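The residual quantities on this slide can be computed directly once a line has been fitted. In this sketch the coefficients a = 2.2 and b = 0.6 are assumed to be the already-fitted least-squares values for the made-up data, and `residual_summary` is a hypothetical helper name.

```python
import math


def residual_summary(xs, ys, a, b):
    """Return (residuals, MSE, root MSE) for the line y-hat = a + b*x."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    mse = sse / (len(xs) - 2)  # n - 2 degrees of freedom
    return residuals, mse, math.sqrt(mse)


x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6      # assumed already-fitted LS coefficients for this data
residuals, mse, root_mse = residual_summary(x, y, a, b)
print(residuals, mse, root_mse)
```

Plotting `residuals` against `x` gives the residual plot; for a least-squares fit the residuals sum to zero up to rounding.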


Ch. 5: Probability and Sampling Distributions

•  Chance Experiments:
    –  Simple Events: individual outcomes
    –  Events: collections of simple events
    –  Sample Space: the collection of all simple events
    –  Venn Diagrams
    –  Tree Diagrams
•  Complex Events:
    –  Event A or B, Event A and B
    –  Event A’ (Complement of A)
    –  Disjoint Events (Mutually Exclusive)
    –  Independent Events


Probability Basic Rules
•  Probability Axioms:
    –  0 ≤ P(A) ≤ 1 for any event A
    –  P(S) = 1, where S is the sample space
•  Addition Rule: for any disjoint events A and B, P(A or B) = P(A) + P(B)
•  Complementary Events: P(A’) = 1 - P(A)
•  General Addition Rule (for any events A and B): P(A or B) = P(A) + P(B) - P(A and B)
•  Independence Rule: if A and B are independent, P(A and B) = P(A) P(B)
•  Conditional Probability: P(A|B) = P(A and B) / P(B), provided P(B) > 0
•  Bayes Rule for Calculation of Conditional Probability, Tree Diagrams
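The rules above can be verified on a small equally-likely sample space. The sketch below uses one roll of a fair die, with events chosen only for illustration; exact fractions avoid rounding issues.

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset(range(1, 7))  # one roll of a fair die


def prob(event):
    """Probability of an event (a set of outcomes) under equal likelihood."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


A = frozenset({2, 4, 6})  # roll is even
B = frozenset({4, 5, 6})  # roll is greater than 3

p_a_or_b = prob(A | B)                          # direct count
p_general = prob(A) + prob(B) - prob(A & B)     # general addition rule
p_a_given_b = prob(A & B) / prob(B)             # conditional probability
p_b_given_a = p_a_given_b * prob(B) / prob(A)   # Bayes rule
independent = prob(A & B) == prob(A) * prob(B)  # independence check

print(p_a_or_b, p_general, p_a_given_b, p_b_given_a, independent)
```

Here A and B are not independent, because P(A and B) = 1/3 while P(A) P(B) = 1/4.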


Random Variables and Sampling Distribution

•  Random Variables
    –  Discrete Distribution Table, Prob. Histogram
    –  Continuous Distribution Curve, density function
    –  Independent R.V.s
•  Sampling Distribution of a Sample Mean
•  Sampling Distribution of a Sample Proportion (rule of thumb for Normal Approx.)
•  Central Limit Theorem
•  Continuity Correction (from Binomial to Normal Approx.)
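To illustrate the continuity correction, the following sketch compares an exact binomial probability with its normal approximation, with and without the half-unit correction. The parameters n = 100, p = 0.5 and the cutoff 45 are made up, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist


def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p, continuity=True):
    """Normal approximation to P(X <= k), optionally using k + 0.5."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    cutoff = k + 0.5 if continuity else k
    return NormalDist(mu, sigma).cdf(cutoff)


exact = binom_cdf(45, 100, 0.5)
with_cc = normal_approx_cdf(45, 100, 0.5)
without_cc = normal_approx_cdf(45, 100, 0.5, continuity=False)
print(exact, with_cc, without_cc)
```

In this example the corrected approximation lands much closer to the exact value than the uncorrected one.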


Ch 7: Estimation and Statistical Inference by C.I.s

•  (Unbiased, Consistent) Point Estimation
•  Large-Sample C.I.s for a Population Mean (Normality Assumption): $\bar{X} \pm (z\ \text{critical value}) \frac{s}{\sqrt{n}}$
    –  One-sided C.I.s: Upper or Lower bound C.I.
    –  Interpretation of Confidence Level.
    –  Necessary sample size for a desired Bound B (round up): $n = \left(\frac{Z_{crit} \cdot s}{B}\right)^2$
•  Small-Sample C.I.: $\bar{X} \pm (t\ \text{critical value}) \frac{s}{\sqrt{n}}$
    –  t-crit is associated with d.f. = n - 1
    –  Normality Assumption still holds.
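A minimal sketch of the large-sample interval and the sample-size formula, using only the Python standard library for the z critical value. The summary statistics are made up and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided z critical value, e.g. about 1.96 for 95% confidence."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def large_sample_ci(x_bar, s, n, confidence=0.95):
    """x_bar +/- z_crit * s / sqrt(n)."""
    margin = z_critical(confidence) * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


def sample_size_for_bound(s, bound, confidence=0.95):
    """Smallest n with z_crit * s / sqrt(n) <= bound (rounded up)."""
    return math.ceil((z_critical(confidence) * s / bound) ** 2)


low, high = large_sample_ci(x_bar=50, s=10, n=100)  # made-up summary statistics
n_needed = sample_size_for_bound(s=10, bound=2)
print(low, high, n_needed)
```

With s = 10 and a desired bound of 2, the formula gives about 96.04, which rounds up to 97.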

C.I. for a Population Proportion
•  Point Estimation for a Population Proportion
•  Large-Sample C.I.s for a Population Proportion: $\hat{p} \pm Z_{crit}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
    –  Necessary sample size for a desired Bound B (round up if not an integer): $n = p^*(1 - p^*)\left(\frac{z_{critical}}{B}\right)^2$
        •  $p^* = \hat{p}$, or 0.5 if p-hat is unavailable.
•  Small-Sample C.I. replaces z-crit by t-crit
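The proportion formulas can be sketched the same way. The sample values below are made up, and `proportion_ci` and `proportion_sample_size` are hypothetical helper names; when no estimate of p is available, the sample-size helper falls back to 0.5 as the slide suggests.

```python
import math
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided z critical value for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def proportion_ci(p_hat, n, confidence=0.95):
    """p_hat +/- z_crit * sqrt(p_hat * (1 - p_hat) / n)."""
    margin = z_critical(confidence) * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def proportion_sample_size(bound, p_star=0.5, confidence=0.95):
    """n = p*(1 - p*) (z_crit / B)^2, rounded up; p* defaults to 0.5."""
    return math.ceil(p_star * (1 - p_star) * (z_critical(confidence) / bound) ** 2)


low, high = proportion_ci(p_hat=0.4, n=150)  # made-up sample
n_needed = proportion_sample_size(bound=0.03)
print(low, high, n_needed)
```

Using 0.5 for p* maximizes p*(1 - p*), so it yields the most conservative (largest) sample size.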

C.I. for two Population Means’ Difference

•  Large-Sample C.I.s for Difference between two Population Means (Normality Assumption): $\bar{X}_1 - \bar{X}_2 \pm Z_{crit}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
•  Small-Sample C.I.: Zcrit replaced by t-crit, with (round down for non-integer) $df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$
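The degrees-of-freedom formula is easy to get wrong by hand, so a short sketch may help. The inputs are made-up sample standard deviations and sizes, and `two_sample_df` is a hypothetical helper name.

```python
import math


def two_sample_df(s1, n1, s2, n2):
    """Degrees of freedom for the two-sample t interval, rounded down."""
    v1 = s1 ** 2 / n1
    v2 = s2 ** 2 / n2
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return math.floor(df)


df = two_sample_df(s1=2, n1=10, s2=3, n2=15)  # made-up summary statistics
print(df)
```

For these inputs the unrounded value is about 22.99, so the rounded-down d.f. is 22.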

t C.I. for Paired Data


Ch. 8: Hypotheses Testing
•  State Hypotheses
    –  Both Null and Alternative (one or two-sided)
•  Determine an appropriate α level. If not specified, use 5%
    •  Type I error; Significance Level. Understand it.
•  Calculate the appropriate test statistic
•  Find the P-value: the probability, computed assuming H0 is true, of a result as extreme as or more extreme than the observed test statistic
•  Reject H0 when the P-value is smaller than the significance level α.
    –  Otherwise: Fail to reject H0
•  State a conclusion in layman’s terms


One-sample t Test for a Population Mean:
•  The null hypothesis is H0: µ = µ0
•  The alternative hypothesis could be:
    Ha: µ ≠ µ0 (two-sided)
    Ha: µ > µ0 (one-sided)
    Ha: µ < µ0 (one-sided)
•  Test Statistic: $t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$
•  t ~ Student’s t-distribution
•  df = n – 1
•  If n is large (≥30), CLT guarantees an approximate normal distribution and the t can be replaced with z, where z follows a standard normal distribution.

P-value tied to Ha

•  Two-sided (both tails), Ha: µ ≠ µ0
•  One-sided (right tail), Ha: µ > µ0
•  One-sided (left tail), Ha: µ < µ0
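The link between the alternative hypothesis and the tail used for the P-value can be sketched in Python. This example assumes a large sample (n ≥ 30) so that the standard normal can stand in for the t distribution, because the standard library has no t distribution. The `alternative` labels and the data are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_p_value(x_bar, mu0, s, n, alternative="two-sided"):
    """Return (z, p_value) for a large-sample test of H0: mu = mu0."""
    z = (x_bar - mu0) / (s / math.sqrt(n))
    std_normal = NormalDist()
    if alternative == "greater":    # Ha: mu > mu0, right tail
        p = 1 - std_normal.cdf(z)
    elif alternative == "less":     # Ha: mu < mu0, left tail
        p = std_normal.cdf(z)
    elif alternative == "two-sided":  # Ha: mu != mu0, both tails
        p = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p


# Made-up summary statistics: x_bar = 52, mu0 = 50, s = 8, n = 64.
z, p_two = z_test_p_value(52, 50, 8, 64)
_, p_right = z_test_p_value(52, 50, 8, 64, alternative="greater")
_, p_left = z_test_p_value(52, 50, 8, 64, alternative="less")
print(z, p_two, p_right, p_left)
```

Here z = 2, so the two-sided P-value is twice the right-tail P-value and falls just below 0.05.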


Other Tests or Remarks
•  Two-Sample z (or t, depending on sample sizes) test for Two Population Means
    –  When using t, the d.f. calculation
•  One-Sample t Test with (Matched) Paired Data
    •  Focus on two population means’ difference
•  A two-sided significance test <-> A two-sided C.I. for the same parameter
    –  If the claimed value is in the CI → fail to reject H0
    –  If the claimed value is not in the CI → reject H0
    –  NOTE: must have “≠” in Ha!
•  Statistical Significance ≠ Practical Sig.


Cautions (for both C.I. and tests of significance):

•  Data: assume SRS (random sampling)
•  Population needs to be …
    –  If n < 30, have to check normality (by Normal QQ-plot)
    –  With n ≥ 30, CLT can give us approximate normality in most situations.


Ch. 9: One Way ANOVA
•  Hypotheses:
    –  H0: µ1 = µ2 = … = µk vs. Ha: At least one µi is different
•  F test statistic: $\text{test statistic} = \frac{\text{between-samples variation}}{\text{within-samples variation}}$
•  ANOVA table

Source           DF      SS               MS
Model (Between)  k – 1   SSM (formula)    SSM/(k – 1)
Error (Within)   n – k   SSE (formula)    SSE/(n – k)
Total            n – 1   SST = SSM + SSE

•  P-value is always the upper tail of the F distribution with (k – 1, n – k) degrees of freedom. Tables of critical values for F distribution: (Table VIII)
•  F statistic > F critical value <=> P-value < α => Reject H0
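The entries of the one-way ANOVA table can be filled in by hand with a few sums. The sketch below assumes the usual definitions of SSM (between groups) and SSE (within groups) and takes F as the ratio of the two mean squares; the three groups are made-up data.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA table entries for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-samples variation: group means around the grand mean.
    ssm = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-samples variation: observations around their own group mean.
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    msm = ssm / (k - 1)
    mse = sse / (n - k)
    return {"df_model": k - 1, "df_error": n - k,
            "SSM": ssm, "SSE": sse, "SST": ssm + sse,
            "MSM": msm, "MSE": mse, "F": msm / mse}


table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])  # made-up groups
print(table)
```

The resulting F statistic would then be compared with the F critical value at (k – 1, n – k) degrees of freedom from the table.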

Assumptions (prior to Running one-way ANOVA)

1.  Constant variance: The variances of the k populations are the same.
    –  Check this with the ratio of the largest to the smallest sample standard deviation; the ratio must be < 2
2.  Each of the k populations follows a normal distribution.
    –  Check this by looking at QQ plots for each group
•  Remark: statistical significance ≠ practical significance
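The constant-variance rule of thumb can be checked mechanically. This is a small sketch with made-up groups; `constant_variance_ok` is a hypothetical helper, and it assumes every group has a nonzero standard deviation.

```python
import statistics


def constant_variance_ok(groups, max_ratio=2.0):
    """True if largest SD / smallest SD is below max_ratio (assumes SDs > 0)."""
    sds = [statistics.stdev(g) for g in groups]
    return max(sds) / min(sds) < max_ratio


ok_similar = constant_variance_ok([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
ok_spread = constant_variance_ok([[1, 2, 3], [2, 4, 6], [5, 6, 7]])
print(ok_similar, ok_spread)
```

In the second call one group has twice the standard deviation of the others, so the ratio is not below 2 and the check fails.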

Ch. 9: Multiple Comparison

•  If insignificant in one-way ANOVA, we don’t have to try further steps…
•  Otherwise, run Multiple Comparison to see explicitly which means are different.
    –  Tukey’s Method (“cldiff” or “lines” format)
    –  Dunnett’s Method (only if there’s a control group)


9.4: Randomized Complete Block Design

•  RCBD (both treatment and block factor must be categorical)
•  In RCBD,
    –  we are only interested in the treatment factor
    –  The block factor might affect response but that’s not of interest.
•  Two F tests
    –  Blocking Effect? Use test statistic and P-value to conclude…
    –  Treatment Effect? Use test statistic and P-value to conclude…

Source                DF              SS   MS
Factor A (treatment)  a – 1           SSA  MSA
Factor B (block)      b – 1           SSB  MSB
Error                 (a – 1)(b – 1)  SSE  MSE
Total                 ab – 1          SST

Necessary Assumptions for RCBD

•  Similar to one-way ANOVA
    1.  Constant variance
    2.  Each of the k populations follows a normal distribution
•  One additional assumption
    3.  There is no interaction between the treatment and blocking variables
        •  Can assess just using common sense (Just ask: Do/should they interact?)
        •  OR check by a Two-way ANOVA model “Interaction Plot”…


Ch. 10: Two-Way ANOVA
•  Testing the effects of two factors and their interaction on the response variable…
•  Test
    –  First, Interaction (of the most interest).
    –  Then Factor A and B, respectively.
•  If “Interaction” significant, still run slicing for Factor A and B.
•  If “Interaction” insignificant while a single Factor significant, run one-way ANOVA and multiple comparison.

Source          DF              SS    MS
Factor A        a – 1           SSA   MSA
Factor B        b – 1           SSB   MSB
AB interaction  (a – 1)(b – 1)  SSAB  MSAB
Error           ab(r – 1)       SSE   MSE
Total           abr – 1         SST
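When completing a two-way ANOVA table by hand, it helps to confirm that the degrees of freedom add up. The sketch below assumes a levels of Factor A, b levels of Factor B and r replicates per cell, as in the table; the function name and the example design are made up.

```python
def two_way_anova_df(a, b, r):
    """Degrees of freedom for each row of the two-way ANOVA table."""
    df = {
        "Factor A": a - 1,
        "Factor B": b - 1,
        "AB interaction": (a - 1) * (b - 1),
        "Error": a * b * (r - 1),
    }
    df["Total"] = a * b * r - 1
    return df


df = two_way_anova_df(a=3, b=4, r=2)  # made-up design: 3 x 4 with 2 replicates
print(df)
```

The four component rows should always sum to the Total row, abr – 1.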

Ch. 10 (Cont.): Two-Way ANOVA
•  Interaction plot
    –  Roughly speaking, there’s no “Interaction” effect if all lines are parallel to each other
•  In summary, for Ch. 9 and 10 we should know:
    –  All of One-way ANOVA (Ch. 9)
        •  By hand and/or using SAS
    –  Most of randomized Blocking design (Sec 9.4), Two-way ANOVA (Ch. 10, Section 2)
        •  For both:
            –  Complete ANOVA tables, calculate DFs and F test statistic
            –  Perform F tests using F table
            –  Interpret SAS output
•  Know the general concept of a higher order (multi-way) ANOVA model.


Ch. 11: Inferential Methods in Regression and Slopes (Correlations)

•  Normal Error Regression Model
    –  Error term (3 assumptions: Independence, Normality and Constant Variance)
•  SSE, MSE, and Root MSE
•  Coefficient of Determination, R^2
    –  % of variation explained by the regression model
    –  Simply by squaring r
•  Statistical Inference about the slope in SLR Model:
    –  C.I. for β (the slope): b ± (t crit) * sb
    –  Hypotheses Testing w.r.t. the slope, i.e. test of Linear Relationship
    –  Remark: t ~ Student’s t-distribution with d.f. = n – 2


Using ANOVA table to test SLR

•  Remark: d.f. of F test statistic = (1, n – 2)

Source              DF      SS                MS
Model (Regression)  1       SSM (or SSR)      SSM/1 = MSM (or MSR)
Error               n – 2   SSE (or SSResid)  SSE/(n – 2) = MSE
Total               n – 1   SST = SSM + SSE

Multiple Linear Regression Model

•  MLR Model: $Y_i = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + e_i$
•  Test the above linear relationship
    –  H0: All βi’s = 0 vs. Ha: At least one βi ≠ 0
    –  A rejection of the null indicates that collectively the Xs do well at explaining Y; otherwise we don’t have to run the following step
    –  But it doesn’t show explicitly which Xi’s are doing “the explaining”
•  Model Selection, especially Backward Elimination
•  The Estimated Line, from SAS output
    –  Use it to Predict Yi;
    –  Get residual by “Actual Y_i – Predicted Value”
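Once the estimated coefficients are read off the software output, prediction and residuals are simple arithmetic. In this sketch the intercept, slopes and observation are made-up numbers, not values from any real output, and `predict` is a hypothetical helper name.

```python
def predict(intercept, slopes, xs):
    """Predicted Y for one observation: intercept + sum of slope_j * x_j."""
    if len(slopes) != len(xs):
        raise ValueError("need one x value per slope")
    return intercept + sum(b * x for b, x in zip(slopes, xs))


# Made-up estimated coefficients and one made-up observation.
intercept, slopes = 1.5, [2.0, -0.5]
x_values, actual_y = [3.0, 4.0], 6.0

predicted = predict(intercept, slopes, x_values)
residual = actual_y - predicted  # "Actual Y_i - Predicted Value"
print(predicted, residual)
```

For these numbers the predicted value is 5.5 and the residual is 0.5.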

After Class…
•  Review Notes, practices, Hw, Labs and previous tests.
•  Wed, Lab#8 (optional)
•  Final Exam (closed book, closed notes)
    –  Next Wed, 8-10am
    –  Student ID, a calculator (SAT policy, NO QWERTY keyboard) and pencils, two-page crib sheet (8” by 11”) handwritten by yourself, two-sided.
        •  SEE CALCULATOR POLICY and “crib sheet” (on Syllabus) from course website.
•  No electronics except a calculator. Not allowed to exchange calculator or crib sheet during the exam. Not allowed to type/print your crib sheet.
