
STAT 2010, Business Stat 2006 Jaimie Kwon

STAT 2010, Elements of Statistics for Business and Economics

Lecture Notes

Prof. Jaimie Kwon

Statistics Dept

Cal State East Bay

Disclaimer

These lecture notes are for the internal use of Prof. Jaimie Kwon, but are provided as potentially helpful material for students taking the course. A few things to note:

  The lecture in class always supersedes what’s in the notes.
  These notes are provided “as-is”, i.e. the accuracy and relevance of the contents are not guaranteed.
  The contents are fluid due to constant updates during the lecture.
  The contents may contain announcements etc. that are not relevant to the current quarter.
  Students are free to report typos or make suggestions on the notes, by email or in person, to improve the material, but they need to understand the above nature of the notes.
  Do not distribute these notes outside the class.

Best practice for note-taking in class

  I do not recommend that students rely on these lecture notes in place of the notes they write down themselves.
  Bring a notepad and write down the material that I go over in class, using these lecture notes as an independent reference; you don’t miss a thing by not having a printout of these notes in (or outside) the class.
  If you still want to print these notes, it’d be better to print them 4 pages per sheet (using the “pages per sheet” feature in MS Word), preferably double-sided (to save trees).



Some canonical examples:

  Benefit of low-fat diet (Jan 2006)
  # of supporters of Bush/Gore in Florida exit poll (Florida, 2000)
  Is driving an SUV more dangerous than driving a passenger car?
  To cash in now and retire or keep working, for GM workers (Mar 2006)?
  When do I have to leave home to be at school on time (this morning)?
  Has consumer confidence in the US increased or decreased from last month to this month (March 2006)?
  Where do I put this $1,000? Google stock? Coca-Cola stock? A mutual fund? Certificate of deposit (CD)? What are the expected returns and risks? (pay day)
  The number of mothers opting for cesarean birth is on the rise. On the other hand, cesarean babies have a higher risk of breathing problems. (March 30, 2006)
  Arnold is back (almost). The Californian governor’s approval rating is 47% now, a 7% increase in a single month. (March 30, 2006)
  What’s the daily number of reports related to statistics? Interval variable? Categorical?

What’s common in the above examples: decision under uncertainty



1 What is statistics?

Statistics: a way to extract information from data
  Descriptive statistics: methods of organizing, summarizing, and presenting data in such a way that useful information is produced
    Graphical methods
    Numerical summary of data
  Inferential statistics: a body of methods used to draw conclusions or inferences about characteristics of a population based on sample data

Key paradigm of statistics
  Population: the group of all items of interest
  Parameter: a descriptive measure of a population
  Sample: a set of data drawn from the population
  Statistic: a descriptive measure of a sample
  Statistical inference: the process of making an estimate, prediction, or decision about a population based on sample data

Exercises 1.3, 4

2 Graphical and tabular descriptive statistics

2.1 Types of data

Variable: some characteristic of a population or sample
The values of the variable are the possible observations of the variable. (Integers b/w 0-100, real numbers, M/F, A-F)
Data are the observed values of a variable (“data” is the plural of “datum”)
Types of data/variable
  Interval data/variable are real numbers, a.k.a. quantitative or numerical
  Nominal data/variable have categorical values without order, a.k.a. qualitative or categorical
  Ordinal data/variable are similar to nominal but their values can be ordered
  (“Categorical variable” is the generic name for nominal and ordinal variables)

Hierarchy? (Course grade: score to letter grade to pass/fail)
Exercises 2.1-2.3

2.2 Techniques for nominal data

Frequency distribution: a table of the categories and their counts
Relative frequency distribution: shows the proportion (not count) of each category
A bar chart is used to display frequencies
A pie chart shows relative frequencies
Exercises 2.11

2.3 Graphical techniques for interval data

How to visualize the data? Histogram
  E.g. Items with defects (Xr02-35)
    x=c(4, 9, 13, 7, 5, 8, 12, 15, 5, 7, 3, 8, 15, 17, 19, 6, 4, 10, 8, 22, 16, 9, 5, 3, 9, 19, 14, 13, 18, 7); hist(x)
  Example (recycle below): mean time spent on the internet; 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 (hrs/month)
    x=c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22); hist(x, nclass=4)
We’ve all seen histograms. Here’s how you draw one:
  Build class intervals: equally wide, non-overlapping intervals that cover the complete range of observations.
  Create a frequency distribution by counting the # of observations that fall into each class interval.
  Draw the histogram: rectangles whose bases are the class intervals and whose heights are the frequencies.
How many class intervals?
  More class intervals for {more, less} data points.
  Table 2.6 for the rule of thumb; Sturges’ formula: “1+3.3 log10(n)”
  My favorite: eyeballing
How wide is each interval? Round (range/# of classes) to something convenient.
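As a rough sketch of the recipe above, the following R snippet applies Sturges’ formula to the defect data and tabulates the frequency distribution. The starting point 2 and the rounded class width 4 are convenience choices assumed here for illustration; they are not values given in the notes.

```r
# Defect counts from the example above (Xr02-35)
x <- c(4, 9, 13, 7, 5, 8, 12, 15, 5, 7, 3, 8, 15, 17, 19,
       6, 4, 10, 8, 22, 16, 9, 5, 3, 9, 19, 14, 13, 18, 7)
n <- length(x)

# Sturges' formula with a base-10 log, rounded up to a whole number of classes
k <- ceiling(1 + 3.3 * log10(n))

# Raw width = range / # of classes, then rounded to something convenient
raw_width <- diff(range(x)) / k
width <- 4   # assumed convenient rounding of raw_width

# Assumed starting point 2, so the classes are (2,6], (6,10], ..., (22,26]
breaks <- seq(2, by = width, length.out = k + 1)

# plot = FALSE returns the frequency distribution without drawing;
# drop that argument to draw the histogram itself
h <- hist(x, breaks = breaks, plot = FALSE)
print(data.frame(lower = head(breaks, -1), upper = breaks[-1], freq = h$counts))
```

Other starting points or widths are equally defensible; the “eyeballing” approach amounts to trying a few and keeping the most readable picture.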

Reading histograms…
  Symmetry and skewness (positively/negatively skewed)
  How many peaks? Unimodal, bimodal
  Bell shape (symmetric, unimodal; important)
Which variables are likely to have
  A positively skewed distribution?
  A negatively skewed distribution?
  A symmetric distribution?
  A symmetric, bell-shaped distribution?
  A bimodal distribution?

Stem-and-leaf display
Ogive
Ex. 2.33, 35(a)(c)

2.4 Describing the relationship between two variables

Bivariate methods are used to study the relationship between two variables (cf. univariate methods)
Dependent variable (Y) vs. independent variable (X)
Four possible combinations: {categorical, interval} X with {categorical, interval} Y
Two categorical variables:
  E.g. Gender and choice of doctorate, 1998 (Ex. 2.56, Xr02-56)
  Example: Blue collar/white collar/professional vs. NYTimes/USA Today/SF Chronicle; ad targeting
  A contingency table lists the frequency of each combination of the values of two categorical variables
  To study the differences in the row variable among the values of the column variable, compute the column totals and divide each frequency by its column total to obtain column relative frequencies
Two interval variables:
  E.g. Size vs. price of a home (100 ft^2 vs. K dollars): which is the dependent and which the independent variable? Use of X and Y. (e.g. Xm02-09)
  Draw a scatter diagram using X and Y
Interpreting scatter diagrams:
  Linear relationship: most of the points fall close to a straight line through the points (cf. least squares method)

  Two main characteristics of a linear relationship:
    Strength (strong, medium, weak, none)
    Direction (positively linear, negatively linear)
  Nonlinear relationship
Ex. 2.55 (Xr02-55), 56 (Xr02-56)

2.5 Time series data

Bankrate, Hbrhomes graph (vs. cross-sectional data)
Ex 2.73 (Xr02-73)

3 Art and science of graphical presentations

Graphical excellence
Graphical deception
Presenting statistics: writing reports and oral presentations



4 Numerical descriptive techniques

4.1 Measures of central location

Label observations in a sample as \( x_1, x_2, \ldots, x_n \)
We typically use n for the sample size, N for the population size
Population quantities are usually not computable, especially when \( N = \infty \)
Example (recycle below): mean time spent on the internet; 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 (hrs/month)
  x=c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22); mean(x); hist(x)
Three measures of central location
  Arithmetic mean: sample mean \( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \); population mean \( \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \)
  Median: the observation that falls in the middle of the sorted data
  Mode: the value that occurs with the greatest frequency
Which to use?
  Mode is usually a poor measure.
  Compared to the mean, the median is less sensitive to extreme observations and in many cases more interpretable
Geometric mean: useful in finance, when averaging growth rates over years
  Let \( R_i \) be the rate of return in period i. The geometric mean \( R_g \) of the returns \( R_1, \ldots, R_n \) is defined by \( (1+R_g)^n = (1+R_1)\cdots(1+R_n) \)
  Solving for \( R_g \), we have \( R_g = \sqrt[n]{(1+R_1)\cdots(1+R_n)} - 1 \)
  Example with \( R_1 = 100\% \) and \( R_2 = -50\% \). ($1,000 -> $2,000 -> $1,000 again)
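A minimal sketch of the geometric mean for the two-period example (R1 = 100%, R2 = -50%), with the arithmetic mean computed alongside for contrast. Variable names are illustrative only.

```r
# Rates of return for the two periods: +100% and -50%
R <- c(1.00, -0.50)

# Geometric mean: solve (1 + Rg)^n = (1 + R1)...(1 + Rn) for Rg
Rg <- prod(1 + R)^(1 / length(R)) - 1

# Arithmetic mean of the same returns, for comparison
Ra <- mean(R)

print(c(geometric = Rg, arithmetic = Ra))
```

The geometric mean comes out to 0, matching the fact that $1,000 ends up as $1,000 again, while the arithmetic mean of 25% overstates the growth.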

Ex 4.3, 4.10 (geometric mean)

4.2 Measures of variability

Measure of spread or variability of the data
Example: 8, 4, 9, 11, 13 (# of hours the students spent studying stat last week)
Range = largest value observed - smallest value observed (too simple)
Variance: sample variance \( s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} \), population variance \( \sigma^2 = \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N} \)
  Why n-1? We will see in Chapter 10.1
  Compute the “deviations” first, then square, sum, and divide.
  Why squaring? (absolute value is also possible; MAD)
  The unit? (square of the original unit)
  Shortcut for sample variance: \( s^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right] \)
Standard deviation (SD): sample standard deviation \( s = \sqrt{s^2} \), population standard deviation \( \sigma = \sqrt{\sigma^2} \)
  Same unit as the original data; easy to interpret
  \( s^2 = 0 \) (or \( \sigma^2 = 0 \)) if and only if ___
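The study-hours example can be worked through as a short sketch: deviations first, then squaring, summing, and dividing by n-1. It assumes the usual n-1 definition of the sample variance and checks the result against the shortcut form and R’s built-in var() and sd().

```r
# Hours spent studying stat last week
x <- c(8, 4, 9, 11, 13)
n <- length(x)

# Definition: deviations from the mean, squared, summed, divided by n - 1
dev <- x - mean(x)
s2 <- sum(dev^2) / (n - 1)

# Shortcut: (sum of squares - (sum)^2 / n) / (n - 1)
s2_shortcut <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Standard deviation is back in the original unit (hours)
s <- sqrt(s2)

print(c(range = diff(range(x)), variance = s2, shortcut = s2_shortcut, sd = s))
```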

Empirical Rule: Given a set of n measurements that is approximately normal (bell-shaped), it follows that the interval with endpoints
  \( \bar{x} \pm s \) contains ~68% of the measurements
  \( \bar{x} \pm 2s \) contains ~95% of the measurements
  \( \bar{x} \pm 3s \) contains almost all of the measurements

E.g. Analysis of the monthly returns on an investment shows the distribution is approximately bell shaped and mean=10% and sd=4%. What can you say about the distribution of the return?

  hist(rnorm(240, 10, 4), col='red')
  How often is the return between 6% and 14%?
  How often is the return larger than 14%?
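For comparison with the Empirical Rule’s round figures, here is a small sketch that computes the same two quantities from an exactly normal model with mean 10 and SD 4. Treating the returns as exactly normal is an assumption; the notes only say the distribution is approximately bell shaped.

```r
mu <- 10     # mean monthly return, in %
sigma <- 4   # standard deviation, in %

# Share of months with a return within one SD of the mean (6% to 14%)
p_between <- pnorm(14, mean = mu, sd = sigma) - pnorm(6, mean = mu, sd = sigma)

# Share of months with a return above 14%
p_above <- pnorm(14, mean = mu, sd = sigma, lower.tail = FALSE)

print(round(c(between_6_and_14 = p_between, above_14 = p_above), 4))
```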

Coefficient of variation (CV): \( cv = s/\bar{x} \) or \( CV = \sigma/\mu \)
Ex 4.23, 24 ((b) and (c) only; also compute the standard deviations), 27, 28


4.3 Percentiles and box plots

Percentiles are everywhere (test scores…)
The p’th percentile: the value for which p percent of the observations are less than that value and (100-p)% are greater than that value
Quartiles are the 25th, 50th, and 75th percentiles (they divide the data into quarters), called the first/lower quartile, the median, and the third/upper quartile, and labeled Q1, Q2, Q3 (cf. quintiles and deciles)
Location of the p’th percentile in the sorted numbers is approximately \( L_p = (n+1)\frac{p}{100} \)
Recycle the internet data example:
  Simple, rounding approach
  Detailed approach
Relationship between the skewness and the distribution of quartiles
  If Q2 is closer to Q1 than Q3, then ____ skewed
  If Q2 is closer to Q3 than Q1, then ____ skewed
Inter-quartile range (IQR): Q3-Q1; spread of the middle 50% of the observations
(Horizontal) box plots:
  Q1, Q2, Q3 for the box boundaries
  Left and right ‘whiskers’ extend outward from the box boundaries to the outermost values that are within 1.5 * IQR of the box boundaries
  Points outside the whiskers are ‘outliers’ (>1.5*IQR outward from Q1 or Q3); interesting or incorrect points

Multiple box plots: Great tool for comparing distribution of multiple groups
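A sketch of the quartile and outlier calculations on the internet-usage data. It assumes the location rule L_p = (n+1)p/100 with linear interpolation, which corresponds to type = 6 in R’s quantile(); other software (including R’s own boxplot(), which uses hinges) may give slightly different quartiles.

```r
# Hours per month spent on the internet
x <- c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22)

# Quartiles located at (n + 1) * p / 100 in the sorted data (type 6)
q <- quantile(x, probs = c(0.25, 0.50, 0.75), type = 6)

# Inter-quartile range and the 1.5 * IQR limits for the whiskers
iqr <- unname(q[3] - q[1])
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Observations beyond the fences would be flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]

print(q)
print(c(IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```

With this rule the largest value, 33, falls just inside the upper fence, so no point is flagged as an outlier.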

Ex 4.37, 4.43, 4.48 (do only “describe your findings” part; the boxplot is provided in the handout; feel free to try Minitab to draw the boxplot per in class instruction but it’s not required)



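4.4 Measures of linear relationship

Numerical measure for the direction and strength of the linear relationship
Example: (which is X and which is Y?)
  baseball wins vs. home/road attendance (Baseball attendance)
  GMAT score vs. MBA GPA (xm04-16)
Covariance between variables X and Y:
  Population covariance: \( \sigma_{xy} = \frac{\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y)}{N} \)
  Sample covariance: \( s_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1} \)
  Shortcut for sample covariance: \( s_{xy} = \frac{1}{n-1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right] \)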

Manual calculation:

  i         x_i   y_i
  1         2     13
  …         6     20
  n         7     27
  Total
  Average

x_i = 2, 6, 7; y_i = 13, 20, 27. How about y_i = 27, 20, 13? How about y_i = 20, 27, 13?
Look at the sign (direction) and magnitude (strength). How do we judge the magnitude of a covariance?
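The three variations can be compared with a short sketch, using the n-1 sample covariance (which is what R’s cov() computes). The correlation is printed alongside because, unlike the covariance, its magnitude is on a fixed scale.

```r
x  <- c(2, 6, 7)
y1 <- c(13, 20, 27)   # y increases with x
y2 <- c(27, 20, 13)   # y decreases with x
y3 <- c(20, 27, 13)   # no clear pattern

# Manual sample covariance: sum of cross-deviations divided by n - 1
sample_cov <- function(x, y) sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

covs <- c(sample_cov(x, y1), sample_cov(x, y2), sample_cov(x, y3))
cors <- c(cor(x, y1), cor(x, y2), cor(x, y3))

print(data.frame(case = c("y1", "y2", "y3"), covariance = covs, correlation = round(cors, 3)))
```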

Coefficient of correlation
  Population correlation \( \rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \); sample correlation \( r = \frac{s_{xy}}{s_x s_y} \)
  Correlation is between -1 and 1
  Java Applet for correlation coefficient
Least squares method: an objective way of producing a straight line through the data points in a scatter diagram
  It produces a straight line such that the sum of squared deviations between the points and the line is minimized
  Equation for a line: \( \hat{y} = b_0 + b_1 x \), where
    \( b_0 \): intercept
    \( b_1 \): slope
    \( \hat{y} \): the (predicted) value of y determined by the line
  Use calculus to find the coefficients \( b_0, b_1 \) that minimize \( \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \)
  Least squares line coefficients are given by \( b_1 = \frac{s_{xy}}{s_x^2} \) and \( b_0 = \bar{y} - b_1\bar{x} \).
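A minimal sketch of the least squares coefficients on the small data set from the covariance calculation (x = 2, 6, 7; y = 13, 20, 27), assuming the standard formulas slope = s_xy / s_x^2 and intercept = y-bar minus slope times x-bar. The result is cross-checked against R’s lm().

```r
x <- c(2, 6, 7)
y <- c(13, 20, 27)

# Slope: sample covariance divided by the sample variance of x
b1 <- cov(x, y) / var(x)

# Intercept: the line passes through the point (mean of x, mean of y)
b0 <- mean(y) - b1 * mean(x)

# Cross-check with R's built-in least squares fit
fit <- coef(lm(y ~ x))

print(c(b0 = b0, b1 = b1))
print(fit)
```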

Ex 4.55, 56, 58 (xr04-58; computer use is OK but show your work)

4.5 Comparing graphical and numerical techniques

Comparing returns on two investments; centers=expected return; spreads=risks (low-risk vs. high-risk)
Business stat marks vs. math stat marks: unimodal, bimodal, …
Relationship b/w price and size of houses

4.6 General guidelines for exploring data

Look at the shape of the distribution; find center; spread; peaks; skewness (bell curve?)
Shapes guide which numerical techniques to use

Optional (won't be graded): Ex 4.84, 4.86 (you have to use the computer, preferably Minitab, for these two problems)


5 Data collection and sampling

5.1 Methods of collecting data

Direct observation (observational data): aspirin vs. heart attack example; limitations; inexpensive
Surveys: Gallup Poll example; market research; response rate
  Personal interview
  Telephone interview
  Self-administered survey
  Questionnaire design
Experiment (experimental data): same example
Ex 5.1

5.2 Sampling

The chief motive for using a sample rather than the population: cost
Use sample quantities as ‘estimates’ for the corresponding population quantities
  E.g. Nielsen ratings (what is watched by 1000 television viewers); quality control
“Target population” (the population about which we want to draw inferences) vs. “sampled population” (the actual population from which the sample has been taken)
  E.g. The Literary Digest predicted Alfred Landon’s 3 to 2 victory over the incumbent Franklin D. Roosevelt based on 10 million sample ballots
    That were sampled from phone directories
    Of which “only” 2.3 million were returned (‘self-selected samples’)
Ex. 5.6, 5.7

5.3 Sampling plans

A “simple random sample” is a sample selected in such a way that every possible sample with the same # of observations is equally likely to be chosen
  Simple and good (do it “randomly”!!)

  How to do it?? (random sample; jar; …)
A “stratified random sample” is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum
  To extract more information
  Criteria for separating a population into strata include: gender, age, occupation,…
  Sampling procedure and analysis can be complicated: plan ahead and consult stat pros!
A “cluster sample” is a simple random sample of groups or clusters of elements
  Reduces the geographic distances the surveyor must cover to gather data (reduces cost)
  Increases sampling error
Sample size and accuracy: The larger the sample size is, the more accurate the sample estimates become
  Details in Chapters 10 and 12

Ex 5.11, 14-16

5.4 Sampling and nonsampling errors

Sampling error: differences between the sample and the population that exist only because of the observations that happened to be selected for the sample
  E.g. the mean annual income of North American blue-collar workers
  Estimate the mean income \( \mu \) of the population by the mean \( \bar{x} \) of the sample. The value of \( \bar{x} \) will deviate from \( \mu \) simply by chance
  This deviation can be large simply due to bad luck
  The only way to reduce the expected size of this error is to take a larger sample
  Given a fixed sample size, we state the probability that the sampling error is less than a certain amount (Ch. 10)
Nonsampling error: more serious; taking a larger sample won’t help here; due to mistakes made in the acquisition of data or due to the sample observations being selected improperly

  Error in data acquisition
  “Non-response error”: error or bias introduced when responses are not obtained from some members of the sample
  Selection bias
Ex 5.17, 5.18


6 Probability

Probability is critical in statistical inference since it provides the link between the population and the sample

6.1 Assigning probability to events

A “random experiment” is a process that leads to one of several possible outcomes
  E.g. coin flipping; grade on a stat test; time to assemble a computer; party preference
A “sample space” of a random experiment is the set of all possible outcomes of the experiment (exhaustive and mutually exclusive)
Requirements of probabilities: given a sample space \( S = \{O_1, O_2, \ldots, O_k\} \), the probabilities assigned to the outcomes must satisfy two requirements:
  The probability of any outcome must be between 0 and 1, i.e. \( 0 \le P(O_i) \le 1 \) for each i
  The sum of the probabilities of all the outcomes in the sample space must be 1, i.e. \( \sum_{i=1}^{k} P(O_i) = 1 \)
Three approaches to assigning probabilities
  The classical approach
  The relative frequency approach
  The subjective approach
An “event” is a set of outcomes in a sample space
  A “simple event” is an individual outcome
  The “probability of an event” is the sum of the probabilities of the simple events that constitute the event
The most useful way to interpret probability is the relative frequency approach for a hypothetical, infinite number of experiments
Ex. 6.1-3 (in class), 8

6.2 Joint, marginal, and conditional probability

Want to consider ‘combinations’ of events

Example: relationship between whether a mutual fund outperforms the market and whether the manager of the fund has an MBA from a top-20 program
Consider a population of 1,000 mutual funds

                                  Mutual fund          Mutual fund does not   Totals
                                  outperforms market   outperform market
  The manager has MBA             110                  290
  The manager does not have MBA   60                   540
  Totals                                                                      1,000

The “intersection of events A and B,” denoted “A and B,” is the event that occurs when both A and B occur.
The probability of the intersection is called the “joint probability”
  P(A randomly selected mutual fund outperforms and its manager has an MBA degree) =
What is the joint probability if we sample a mutual fund from the above population?

                                  Mutual fund          Mutual fund does not   Totals
                                  outperforms market   outperform market
  The manager has MBA             .11                  .29
  The manager does not have MBA   .06                  .54
  Totals

“Marginal probabilities” are computed by adding across rows or down columns
  P(A randomly selected mutual fund manager has an MBA degree) = ?
    i.e., when a mutual fund is randomly selected, the probability that its manager has an MBA is ___
    i.e., ___ of all mutual fund managers have an MBA
  Try: P(A randomly selected mutual fund outperforms the market) = ?

“Given that a fund is managed by an MBA, what’s the probability that it outperforms the market?” Given A, what’s the probability of B?
The “conditional probability of B given A”, written P(B|A), is the probability of event B given the occurrence of another related event A. Formally, it can be computed as P(B|A)=P(A and B)/P(A)
Two events A and B are “independent” if P(A|B)=P(A) or P(B|A)=P(B)
  i.e., the probability of one event is not affected by the occurrence of the other event
  Checking dependence: for a table like the one above, we could check all four combinations, but showing it for only one of them [\( P(B) \ne P(B|A) \) for some A and B] is enough. On the other hand, showing independence would be more work
The “union” of events A and B is the event that occurs when either A or B or both occur. It is denoted as “A or B”
  E.g. determine P(A1 or B1)
  Approach #1: sum the components
  Approach #2: 1 - P(the other component)
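The joint, marginal, conditional, and union probabilities for the mutual fund table can be sketched as follows. The labels A1 = “manager has an MBA” and B1 = “fund outperforms the market” are an assumption made for this illustration; the notes do not spell out which events A1 and B1 denote.

```r
# Counts for the population of 1,000 mutual funds
counts <- matrix(c(110, 290,
                    60, 540), nrow = 2, byrow = TRUE,
                 dimnames = list(manager = c("MBA", "no MBA"),
                                 fund = c("outperforms", "does not")))

# Joint probabilities: each count divided by the population size
joint <- counts / sum(counts)

# Marginal probabilities: add across a row or down a column
p_A1 <- sum(joint["MBA", ])            # P(manager has an MBA)
p_B1 <- sum(joint[, "outperforms"])    # P(fund outperforms)

# Conditional probability P(B1 | A1) = P(A1 and B1) / P(A1)
p_B1_given_A1 <- joint["MBA", "outperforms"] / p_A1

# Dependence check: one unequal pair is enough to show dependence
dependent <- !isTRUE(all.equal(p_B1_given_A1, p_B1))

# Union, two ways: addition rule, and 1 - P(neither)
p_union <- p_A1 + p_B1 - joint["MBA", "outperforms"]
p_union_alt <- 1 - joint["no MBA", "does not"]

print(c(P_A1 = p_A1, P_B1 = p_B1, P_B1_given_A1 = p_B1_given_A1, P_A1_or_B1 = p_union))
```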

Ex 6.86



6.3 Probability rules and trees

Want to calculate the probability of more complex events from the probability of simpler events
Complement rule: the “complement” of event A is … and is denoted by \( A^C \). The rule says \( P(A^C) = 1 - P(A) \); e.g.
Multiplication rule: P(A and B) = P(A|B)P(B) or P(B|A)P(A)
  Proof:
  If independent,… it reduces to:
  The joint probability of any two independent events A and B is P(A and B)=P(A)P(B)
  Ex 6.5: 7 males and 3 females. P(two randomly selected students are both female)?
  Ex 6.5: 7 males and 3 females. P(two students, each randomly selected by a different professor to answer questions, are both female)?
Addition rule: P(A or B)=P(A)+P(B)-P(A and B) [revisit the above example]
  When two events are mutually exclusive (the two events cannot occur together), the joint prob is 0, thus the above reduces to…
  P(paper A)=?, P(paper B)=?, P(both papers)=?. Then P(either paper)=?
Probability trees
  First choice, second choice, joint probability
  {F,M}, {F,M}|F and {F,M}|M, {FF, FM, MF, MM} (for the two cases above)
Ex. 6.47, 51-55, 67, 68

6.4 Bayes’ Law

Skip

6.5 Identifying the correct method

Read



7 Random variables and discrete probability distributions

Motivation: Want to tell if a coin is fair. Throw it 100 times. Reject the null hypothesis that the coin is fair if the # of heads is too large or too small. But where do we draw the line? 90? 70? How extreme is the observed value? Need to know the probability distribution of the number of heads from a balanced coin.

7.1 Random variables and probability distributions

E.g. # of heads in flipping two coins; total of two dice
Random variable: a function or rule that assigns a number to each outcome of an experiment
Two types of random variable:
  Discrete random variable: takes on a countable number of values; e.g.
  Continuous random variable: takes on an uncountable number of values.
Probability distribution: a table, formula, or graph that describes the values of a random variable and the probability associated with these values.
X vs. x: X is the name of a random variable; x is a value of the random variable
  P(X=x) or P(x)
Requirements for a discrete probability distribution function (distribution of a discrete random variable):
  \( 0 \le P(x) \le 1 \) for all x
  \( \sum_{\text{all } x} P(x) = 1 \)
Example. Consider a game where the player draws a card from a deck of cards and wins $100 for the ace of spades, $5 for any heart, and $0 for anything else. If we let X be the winnings (in $), specify P(x).

  x       P(x)

Example. Consider investing money in a start-up company. After a year, it either fails, has moderate success, or has a big success with probabilities 0.8, 0.15 and __, respectively. In each case, the investment return is given by $0, $1,000 and $10,000. What’s the quantity to consider as a random variable X? What’s P(X>0)? What’s P(X=0)?

  x       P(x)

Population mean: \( \mu = E(X) = \sum_{\text{all } x} x P(x) \) (“the expected value of X”)
Population variance: \( \sigma^2 = V(X) = \sum_{\text{all } x} (x-\mu)^2 P(x) \)
Shortcut calculation for population variance: \( \sigma^2 = \sum_{\text{all } x} x^2 P(x) - \mu^2 \)
Population SD: \( \sigma = \sqrt{\sigma^2} \)
Note that we’re using the same terms as in Chapter 4. It’s not a coincidence. Consider a population consisting of N individuals and assume that for a variable X, the population relative frequency of the value x, (# of individuals whose value is x)/N, is given by P(x). Then the population mean \( \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \) as a descriptive measure for the population is the same as \( E(X) = \sum x P(x) \), the expected value of X. The same can be said for the population variance and standard deviation.
Laws of expected value and variance: for a random variable X and a constant c,
  E(c)=c
  E(X+c) = E(X)+c
  E(cX)=cE(X)
  V(c)=0
  V(X+c)=V(X)
  V(cX)=c^2 V(X)

Example. The monthly sales at a computer store have a mean of $25,000 and an SD of $4,000. Also,
  Profits = 30% of the sales - fixed costs of $6,000.
Find the mean and SD of monthly profits [conventional method vs. empirical rule method]
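A short sketch applying the laws of expected value and variance to the profit example, writing Profit = 0.30 * Sales - 6000. Only the laws E(cX + d) = cE(X) + d and V(cX + d) = c^2 V(X) are used; variable names are illustrative.

```r
mean_sales <- 25000   # E(Sales), in $
sd_sales   <- 4000    # SD(Sales), in $

rate  <- 0.30         # profit is 30% of sales ...
fixed <- 6000         # ... minus fixed costs

# E(cX + d) = c E(X) + d: the constant shifts the mean
mean_profit <- rate * mean_sales - fixed

# V(cX + d) = c^2 V(X): the constant drops out, the scale is squared
var_profit <- rate^2 * sd_sales^2
sd_profit  <- sqrt(var_profit)

print(c(mean = mean_profit, sd = sd_profit))
```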

Ex. 7.1(d), 2(d), 7, 19(a)(d), 39 (in answering 7.39, use the fact that the answers to 7.38 are E(X) = 4.00 and V(X) = 2.40)

7.2 Bivariate distributions

A bivariate distribution provides the probabilities of combinations of two random variables (cf. univariate distribution)
Joint probabilities are written P(x,y): again, a table or formula
X and Y are the # of houses sold per day by two agents, Xavier and Yvonne; P(x,y) = P(X=x,Y=y) are given below:

             x
             0      1
  y    0     .11    .29
       1     .06    .54

Requirements:
  \( 0 \le P(x,y) \le 1 \) for all x, y
  \( \sum_{\text{all } x}\sum_{\text{all } y} P(x,y) = 1 \)
Marginal probabilities: \( P(x) = \sum_{\text{all } y} P(x,y) \), \( P(y) = \sum_{\text{all } x} P(x,y) \)
  \( E(X) = \mu_X \), \( V(X) = \sigma_X^2 \)
  \( E(Y) = \mu_Y \), \( V(Y) = \sigma_Y^2 \)
Covariance: \( COV(X,Y) = \sigma_{XY} = \sum_{\text{all } x}\sum_{\text{all } y}(x-\mu_X)(y-\mu_Y)P(x,y) \)
  Shortcut calculation: \( COV(X,Y) = \sum_{\text{all } x}\sum_{\text{all } y} xy\,P(x,y) - \mu_X\mu_Y \)
Coefficient of correlation: \( \rho = \frac{COV(X,Y)}{\sigma_X \sigma_Y} \)
Two discrete random variables X and Y are independent if the two events {X=x} and {Y=y} are independent for any x and y
  In other words, if P(x,y)=P(x)P(y) for any x and y
  More informally, if X and Y don’t affect each other
Laws of expected value and variance of the sum of two variables
  X+Y, P(x+y)
  E(X+Y) = E(X) + E(Y)
  V(X+Y) = V(X) + V(Y) + 2COV(X,Y)
  If X and Y are indep, COV(X,Y)=0 and \( \rho = 0 \)
  Total # of houses sold by Xavier and Yvonne
Ex. 7.43-46, 55, 56
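A sketch of the bivariate calculations for the Xavier and Yvonne table. It assumes the table is laid out with x across the columns (0, 1) and y down the rows (0, 1); if the intended orientation is the reverse, the roles of X and Y simply swap.

```r
# Joint probabilities: rows are y = 0, 1; columns are x = 0, 1 (assumed layout)
P <- matrix(c(.11, .29,
              .06, .54), nrow = 2, byrow = TRUE)
xv <- c(0, 1)
yv <- c(0, 1)

# Marginal distributions: add down the columns for x, across the rows for y
px <- colSums(P)
py <- rowSums(P)

# Means and variances (shortcut form: E(X^2) - mu^2)
EX <- sum(xv * px); VX <- sum(xv^2 * px) - EX^2
EY <- sum(yv * py); VY <- sum(yv^2 * py) - EY^2

# Covariance (shortcut form) and coefficient of correlation
EXY   <- sum(outer(yv, xv) * P)
covXY <- EXY - EX * EY
rho   <- covXY / sqrt(VX * VY)

# Total # of houses sold, X + Y, via the laws for sums
E_sum <- EX + EY
V_sum <- VX + VY + 2 * covXY

print(round(c(EX = EX, EY = EY, COV = covXY, rho = rho, E_sum = E_sum, V_sum = V_sum), 4))
```

Because the covariance is nonzero, X and Y are not independent under this table, so the 2COV(X,Y) term cannot be dropped from V(X+Y).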



Quiz #1 scores (out of 36) Mean = 32.09 Median = 32 SD = 2.6

7.3 Binomial distribution

A binomial random variable is the number of successes in n independent trials with a constant success probability p.
We write X~bin(n,p) to describe that a random variable X follows such a binomial distribution.
Such an experiment is called a binomial experiment:
  Consists of a fixed # of trials (n)
  Two possible outcomes (‘success’ and ‘failure’)
  The success probability is p.
  The trials are indep.
  Each trial is called a ‘Bernoulli process’
E.g. flipping a coin; drawing cards (not binomial); political survey (not quite, but comes close)
E.g. a clueless student takes an exam consisting of 5 multiple choice (1 out of 4) questions.
  Delineate n and p
  What’s the probability that he gets no answers correct? P(X=0); two answers correct? P(X=2)=?
  What’s P(fail the quiz) = \( P(X \le 2) \) = ?
  For a class full of similar students, what’s the mean score? SD?
  hist(rbinom(20, 5, 1/4))

Mathematically, we can show that if X~bin(n,p),
  \( P(x) = P(X=x) = \binom{n}{x} p^x (1-p)^{n-x} \) for x=0,1,…,n
  Here, \( \binom{n}{x} = \frac{n!}{x!(n-x)!} \), which reads “n choose x,” is the number of different ways of choosing x objects from n objects.
\( P(X \le x) \): cumulative probability
  Binomial table: Table 1 in appendix B provides values of the cumulative probability for selected n and p. (x, P(X<=x))
\( P(X \ge 3) \) = ?
  Can compute by \( 1 - P(X \le 2) \)
  In general, \( P(X \ge x) = 1 - P(X \le x-1) \)
P(X=3)=?
  \( P(X=x) = P(X \le x) - P(X \le x-1) \)
General formula for the mean and variance of a binomial random variable: \( \mu = np \), \( \sigma^2 = np(1-p) \), \( \sigma = \sqrt{np(1-p)} \)
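The clueless-student example (n = 5 questions, p = 1/4) can be sketched with R’s binomial functions in place of the table. Treating “fail the quiz” as two or fewer correct answers is an assumption about the missing inequality sign in the example.

```r
n <- 5      # number of multiple choice questions
p <- 1 / 4  # chance of guessing one question correctly

# Individual probabilities from the binomial formula
p0 <- dbinom(0, size = n, prob = p)   # no answers correct
p2 <- dbinom(2, size = n, prob = p)   # exactly two correct

# The same P(X = 2) written out as "n choose x" * p^x * (1-p)^(n-x)
p2_manual <- choose(n, 2) * p^2 * (1 - p)^(n - 2)

# Cumulative probability (what the binomial table provides) and its complement
p_le2 <- pbinom(2, size = n, prob = p)   # P(X <= 2), assumed "fail"
p_ge3 <- 1 - p_le2                       # P(X >= 3) = 1 - P(X <= 2)

# Mean and SD from the general formulas
mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

print(round(c(P0 = p0, P2 = p2, P_le2 = p_le2, P_ge3 = p_ge3, mean = mu, sd = sigma), 4))
```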

Ex. 7.81-83, 89 (computer), 90



7.4 Poisson Distribution

Another useful discrete probability distribution: the # of occurrences of events in an interval of time or a specific region of space.
Some examples
Formula: \( P(x) = \frac{e^{-\mu}\mu^x}{x!} \) for x=0,1,2,…, where \( \mu \) is the mean number of occurrences and e=2.71828…
Skip


8 Continuous probability distributions

8.1 Probability density functions

Need a completely different approach to deal with continuous random variables since
  There are infinitely, uncountably many possible values
  The probability of an individual value is virtually zero, i.e. P(X=x) = 0 for any x
Example. Duration of a commute
  Table of (intervals: relative frequency)
  E.g. 0-10 min: .3, 10-20 min: .5, 20-30 min: .2
  We can only determine the probability of a range of values
  The probabilities sum up to 1
If we divide the relative frequency by the interval width, we have a set of rectangles whose areas equal the probabilities that the random variable will fall into each interval.
Imagine a very large # of small intervals. A function f(x) that approximates the curve is called a probability density function (pdf):
  Requirements for a pdf over a range a ≤ x ≤ b
    f(x)≥0 for all x
    The total area under the curve between a and b is 1
  Probability of an interval: the area under the curve
  Integral calculus helps… but we don’t want to do it.
Uniform distribution
  Uniform pdf is given by f(x) = 1/(b-a) where a ≤ x ≤ b
  P(x1 < X < x2) = (x2-x1)*(1/(b-a))
Ex. 8.1, 9, 10

8.2 Normal distribution

The most important distribution in probability and statistics
Normal pdf: \( f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \), \( -\infty < x < \infty \), where e=2.71828… and \( \pi \)=3.14159…

We write: \( X \sim N(\mu, \sigma^2) \), or X follows a normal distribution with mean \( \mu \) and standard deviation \( \sigma \)
Example: For a certain professor, the duration of the morning commute follows a normal distribution with mean 30 and standard deviation 10, i.e. the commute duration \( X \sim N(30, 10^2) \). Then we want to answer questions like:
  What’s the probability that the trip will take more than 50 minutes?

  What’s the probability that the trip will take between 20 and 50 minutes?

  On 2.5% of days, the trip will take longer than ___ minutes.

Example: For a certain population (say, a large school), the students’ SAT scores are normally distributed with mean 500 and standard deviation 50, i.e. the SAT score \( X \sim N(500, 50^2) \). Then we want to answer questions like:
  What’s the probability that a randomly selected student scores more than 600?

  What’s the probability that a randomly selected student scores between 400 and 550?

  To be in the top 5% of the population, how much does a student need to score?

  To be in the bottom 5% of the population, how much does a student need to score?

Symmetrical, bell shaped
Centered around the mean \( \mu \)
The spread is specified by the variance \( \sigma^2 \)
Try applets:
  Normal Distribution Parameters
  Normal Distribution Areas
Calculating normal probabilities
  Compute the area in the interval under the curve.
  Use the probability table
  Need a separate table for each different \( \mu \) and \( \sigma \)? No, by standardizing the random variable
  If \( X \sim N(\mu, \sigma^2) \), the transformed variable, denoted by Z, is called the “standard normal random variable”: \( Z = \frac{X-\mu}{\sigma} \sim N(0, 1) \)
  “probability statement about X \( \Leftrightarrow \) probability statement about Z”
If \( X \sim N(30, 10^2) \), what is P(25 < X < 40)?
  P(25 < X < 40) = \( P\left(\frac{25-30}{10} < \frac{X-30}{10} < \frac{40-30}{10}\right) \) = P(-.5 < Z < 1)
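As a sketch of how standardizing works in practice, the snippet below computes P(25 < X < 40) for the commute example both directly and through Z, and also answers the three commute questions posed earlier. pnorm() and qnorm() play the role of the probability table here.

```r
mu <- 30      # mean commute, minutes
sigma <- 10   # standard deviation, minutes

# Direct calculation on the X scale
p_direct <- pnorm(40, mean = mu, sd = sigma) - pnorm(25, mean = mu, sd = sigma)

# Same probability after standardizing: Z = (X - mu) / sigma
z_lo <- (25 - mu) / sigma   # -0.5
z_hi <- (40 - mu) / sigma   #  1.0
p_via_z <- pnorm(z_hi) - pnorm(z_lo)

# The earlier commute questions
p_over_50 <- pnorm(50, mean = mu, sd = sigma, lower.tail = FALSE)
p_20_to_50 <- pnorm(50, mean = mu, sd = sigma) - pnorm(20, mean = mu, sd = sigma)
longest_2.5pct <- qnorm(0.975, mean = mu, sd = sigma)

print(round(c(P_25_40 = p_direct, via_Z = p_via_z, P_over_50 = p_over_50,
              P_20_50 = p_20_to_50, cutoff_2.5pct = longest_2.5pct), 4))
```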



“Z=-.5 corresponds to a value of X that is one-half a standard deviation below the mean”

The table gives P(0 < Z < z) for positive z.

P(Z > 0) =

P(Z < 0) =

P(Z > 2) =

1-P(0 < Z < 2) =

P(Z < -3) =

P(Z>3) =

P(0 < Z < 1) =

P(-.5 < Z < 0) =

P(0 < Z < .5) =

P(-.5 < Z < 1) =

P(-.5 < Z < 0) + P(0 < Z < 1) = P(0 < Z < .5) + P(0 < Z < 1) =



P(1 < Z < 2) =

P(0 < Z < 2) – P(0 < Z < 1) =

P(-2 < Z < -1) =

P(1 < Z < 2) =

We sometimes need to compute \( z_A \), the value z such that the area to its right under the standard normal curve is A, i.e., such that \( P(Z > z_A) = A \)
  Use the table backward
  \( z_{0.025} \) =

  \( z_{0.05} \) =

  \( z_A \) = the 100(1-A)th percentile of a standard normal random variable
If \( X \sim N(\mu, \sigma^2) \), find x such that P(X > x) = A
  For example, if \( X \sim N(600, 50^2) \), find x such that P(X > x) = 0.05

Convert the problem to Z



Find z0.05

Convert back to X space
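The three steps (convert to Z, find the z value with area A to its right, convert back to X) can be sketched with qnorm(), which reads the table “backward”. The helper name z_A is hypothetical, introduced only for this illustration.

```r
# z value with area A to its right = the 100(1 - A)th percentile of Z
z_A <- function(A) qnorm(1 - A)

mu <- 600
sigma <- 50

# Step 1 and 2: find z_0.05 on the Z scale
z05 <- z_A(0.05)

# Step 3: convert back to the X scale, x = mu + z * sigma
x05 <- mu + z05 * sigma

# The same steps for the top 10 percent and top 1 percent
x10 <- mu + z_A(0.10) * sigma
x01 <- mu + z_A(0.01) * sigma

print(round(c(z_0.025 = z_A(0.025), z_0.05 = z05, top5 = x05, top10 = x10, top1 = x01), 3))
```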

How about top 10 percent? How about top 1 percent?
Ex. 8.19-24, 31-32, 37-41, 58

8.3 Exponential distribution

8.4 Other continuous distributions

Student-t distribution
  Very commonly used in statistical inference. (chapters 12, 13, 15, 17, 18, 19)
  We use the symbol \( T(\nu) \) to denote the random variable that follows the student-t distribution with \( \nu \) degrees of freedom.
  This we write as \( T(\nu) \sim t(\nu) \) (a la \( X \sim N(\mu, \sigma^2) \))
  We sometimes write \( T(\nu) \) as T, if \( \nu \) is clear from the context.
  Example: if a random variable T follows the student-t distribution with 10 degrees of freedom, then:
    What’s the probability that T will be greater than 1.812?

    What’s the value of t such that T is greater than t 5% of the time?

    What’s the value of t such that T is less than t 5% of the time?

  The distribution looks very close to the standard normal; the larger \( \nu \) is, the closer it is.
  E(T) = 0 and \( V(T) = \nu/(\nu-2) \) for \( \nu > 2 \)
  Computing student-t probabilities
    Student-t probabilities can be computed using a computer (TDIST in Excel)
    Finding student-t values \( t_{A,\nu} \) such that \( P(T > t_{A,\nu}) = A \) (TINV in Excel)
    Table 4 of the book
    \( t_{.05,10} = 1.812 \), \( t_{.05,25} = 1.708 \), \( -t_{.05,25} = -1.708 \)
Chi-squared distribution
  \( \chi^2 \sim \chi^2(\nu) \)
  Looks like …. for different \( \nu \)
  Finding chi-squared values \( \chi^2_{A,\nu} \) such that \( P(\chi^2 > \chi^2_{A,\nu}) = A \) (use table 5)
    \( \chi^2_{.05,8} = 15.5073 \)
    \( \chi^2_{.95,8} = 2.73264 \)
F distribution
  \( F \sim F(\nu_1, \nu_2) \)
  Finding \( F_{A,\nu_1,\nu_2} \)
Ex. 8.83, 84
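The table lookups for the student-t and chi-squared distributions can be reproduced with R’s quantile functions, as a sketch. The subscript convention assumed here is that the first subscript is the area to the right of the value, so the right-tail value for area A is the (1 - A) quantile.

```r
# Student-t: value with area 0.05 to its right, for 10 and 25 degrees of freedom
t_05_10 <- qt(1 - 0.05, df = 10)
t_05_25 <- qt(1 - 0.05, df = 25)

# P(T > 1.812) for 10 degrees of freedom (should be close to 0.05)
p_right <- pt(1.812, df = 10, lower.tail = FALSE)

# Value of t such that T is less than t 5% of the time (by symmetry, -t_05_10)
t_left <- qt(0.05, df = 10)

# Chi-squared with 8 degrees of freedom: area .05 and area .95 to the right
chi_05_8 <- qchisq(1 - 0.05, df = 8)
chi_95_8 <- qchisq(1 - 0.95, df = 8)

print(round(c(t_05_10 = t_05_10, t_05_25 = t_05_25, p_right = p_right,
              t_left = t_left, chi_05_8 = chi_05_8, chi_95_8 = chi_95_8), 4))
```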



Midterm score (Out of 60) mean(x) = 48.4

median(x) = 49



9 Sampling distributions

9.1 Sampling distribution of the mean

Example (same as above): For a certain population (say, a large school), the students’ SAT scores are normally distributed with mean 500 and standard deviation 50, i.e. the SAT score \( X \sim N(500, 50^2) \).
If we randomly sample 25 students from the school and have them take the SAT, what can we say about the distribution of the sample mean SAT score \( \bar{X} \)? In particular,
  What’s the mean of \( \bar{X} \)?

  What’s the standard deviation of \( \bar{X} \)?

  What’s the distribution of \( \bar{X} \)?

  How does the conclusion change if the original distribution of the individual scores was not normal?

  (exact; also, we don’t need a large n)

  In particular, what’s \( P(\bar{X} > 550) \)? What’s \( P(\bar{X} < 450) \)? What’s \( P(450 < \bar{X} < 550) \)?
  Compare this with P(X > 550), P(X < 450) and P(450 < X < 550) [the effect of the smaller standard deviation]

Fair die example; 1 die; 2 dice; 5? 10?
Sampling distribution of the mean of fair dice and CLT

Let X be the outcome of a single throw of a fair die.
Distribution of X
$\mu$ and $\sigma^2$ are computed to be 3.5 and 2.92.

The “sampling distribution of the mean” of two fair dice, $\bar{X} = (X_1 + X_2)/2$.
Takes on what values?
1.0, 1.5, 2.0, …, 5.5, 6.0

The “sampling distribution” of a statistic is created by repeated sampling from one population.

$\mu_{\bar{X}}$ and $\sigma^2_{\bar{X}}$ are computed to be 3.5 and 1.46 (the variance is half of the original).
Consistent with what the theory tells us. See below.
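The numbers quoted for one die and for the mean of two dice can be checked by brute-force enumeration. This is a minimal sketch using exact fractions; the helper name `mean_and_variance` is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def mean_and_variance(values):
    """Mean and variance of equally likely outcomes, as exact fractions."""
    values = list(values)
    m = sum(values, Fraction(0)) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

# One die: all six faces are equally likely.
mu, var = mean_and_variance(Fraction(f) for f in faces)

# Mean of two dice: all 36 ordered pairs are equally likely.
pair_means = [Fraction(a + b, 2) for a, b in product(faces, repeat=2)]
mu_bar, var_bar = mean_and_variance(pair_means)

print(float(mu), round(float(var), 2))          # 3.5 2.92
print(float(mu_bar), round(float(var_bar), 2))  # 3.5 1.46
```

The variance of the mean of two dice is exactly half the variance of one die (35/24 versus 35/12).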

Sampling distribution of $\bar{X}$ for larger n = 5, 10, 25.
$\sigma^2_{\bar{X}} = \sigma^2/n$ or $\sigma_{\bar{X}} = \sigma/\sqrt{n}$ (the distribution becomes narrower when n increases)
The sampling distribution of $\bar{X}$ becomes increasingly bell shaped.

To summarize: the sampling distribution of the sample mean $\bar{X}$ has
$\mu_{\bar{X}} = \mu$, and $\sigma^2_{\bar{X}} = \sigma^2/n$ or, equivalently, $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.

Also, the distribution is approximately normal regardless of the original population distribution, for a sufficiently large sample size (say, $n \ge 30$). The larger the sample size is, the more closely the sampling distribution of $\bar{X}$ will resemble a normal distribution. If the original distribution of X is normal, then $\bar{X}$ is exactly normal.

The result is called the Central Limit Theorem (CLT): The sampling distribution of the sample mean of a random sample drawn from any population is approximately normal for a sufficiently large sample size (say, $n \ge 30$). The larger the sample size is, the more closely the sampling distribution of $\bar{X}$ will resemble a normal distribution.

Implication for the inference?
A claim has been made that the SAT score for a private school has the distribution $X \sim N(550, 100^2)$. To check this claim, a sample of 25 people was surveyed and the sample mean was found to be 500. What is $P(\bar{X} < 500)$ if the dean's claim were true?

$P(\bar{X} < 500) = P(Z < (500-550)/(100/5)) = P(Z < -2.5) = …$

What's the conclusion? This is the precursor of hypothesis testing.

$z_{.025} = 1.96$, so $P(-1.96 < Z < 1.96) = .95$

In general, $P(\mu - z_{\alpha/2}\,\sigma/\sqrt{n} < \bar{X} < \mu + z_{\alpha/2}\,\sigma/\sqrt{n}) = 1 - \alpha$

For the above example ($\mu = 550$, $\sigma/\sqrt{n} = 20$),
$P(510.8 < \bar{X} < 589.2) = .95$
$P(498.5 < \bar{X} < 601.5) = .99$
This is the precursor of interval estimation.

Ex. 9.5, 6, 7, 9, 10, 11, 15, 16

9.2 Sampling distribution of a proportion
Among a very large population, 48% support a certain bill and 52% do not. If we randomly select 100 people, what can we say about the sampling distribution of the sample proportion of the people who support the bill? Among others,
What's the mean of the sample proportion?
What's the standard deviation of the sample proportion?
What's the distribution of the sample proportion?
What's the chance that the sample proportion is greater than 50%?

In a binomial experiment, the estimator of the population proportion of successes is the sample proportion $\hat{P} = X/n$, the number of successes divided by the sample size.

Normal approximation to the binomial experiment: the distribution of a sample proportion $\hat{P}$ has
$E(\hat{P}) = p$ and $V(\hat{P}) = \sigma^2_{\hat{P}} = p(1-p)/n$
or
$\sigma_{\hat{P}} = \sqrt{p(1-p)/n}$.

Also, the variable $Z = (\hat{P} - p)/\sqrt{p(1-p)/n}$ is approximately standard normal, provided that n is large (i.e. both $np \ge 5$ and $n(1-p) \ge 5$).
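For the bill example (p = 0.48, n = 100), the four questions can be answered with the normal approximation. This is a sketch only; the variable names are illustrative, and no continuity correction is applied.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.48, 100
assert n * p >= 5 and n * (1 - p) >= 5      # rule of thumb from the notes

mean_phat = p
sd_phat = sqrt(p * (1 - p) / n)

# Normal approximation to P(sample proportion > 0.50)
prob_majority = 1 - NormalDist(mean_phat, sd_phat).cdf(0.50)

print(mean_phat, round(sd_phat, 4), round(prob_majority, 3))
```

Even though only 48% of the population supports the bill, a sample of 100 shows majority support roughly a third of the time.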

Ex. 9.30, 34

9.3 Sampling distribution of the difference between two means
For two separate populations A and B (say, two large schools), the SAT score of an individual student follows the $N(550, 50^2)$ and $N(500, 50^2)$ distributions, respectively. In other words, if we let $X_1$ and $X_2$ denote the respective random variables, $X_1 \sim N(550, 50^2)$ and $X_2 \sim N(500, 50^2)$. If we randomly select 25 students each from populations A and B, what is the distribution of the difference between the two sample means, $\bar{X}_1 - \bar{X}_2$? In particular,
What's the mean of $\bar{X}_1 - \bar{X}_2$?
What's the standard deviation of $\bar{X}_1 - \bar{X}_2$?
What's the distribution of $\bar{X}_1 - \bar{X}_2$?
What's $P(\bar{X}_1 - \bar{X}_2 > 60)$?
How do the above change if $X_1$ and $X_2$ don't follow a normal distribution?

For independent random samples of size $n_1$ and $n_2$ drawn from two normal populations $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, respectively, the difference of sample means $\bar{X}_1 - \bar{X}_2$ has a normal distribution. Even when the two original distributions are not normal, the distribution of $\bar{X}_1 - \bar{X}_2$ is approximately normal if both $n_1$ and $n_2$ are large (say both $n_1 \ge 30$ and $n_2 \ge 30$). Also,
$E(\bar{X}_1 - \bar{X}_2) = \mu_1 - \mu_2$
and
$V(\bar{X}_1 - \bar{X}_2) = \sigma_1^2/n_1 + \sigma_2^2/n_2$
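For the two-school example (populations $N(550, 50^2)$ and $N(500, 50^2)$, 25 students each), the mean, standard deviation and tail probability of the difference of sample means can be sketched as follows. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu1, sigma1, n1 = 550, 50, 25
mu2, sigma2, n2 = 500, 50, 25

mean_diff = mu1 - mu2
sd_diff = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)

# P(X1-bar - X2-bar > 60); exact here because both populations are normal
prob = 1 - NormalDist(mean_diff, sd_diff).cdf(60)
print(mean_diff, round(sd_diff, 2), round(prob, 2))
```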

Ex. 9.45, 46

10 Intro to estimation
So far, we assumed known parameters and studied the sampling distribution of various statistics.

What if we don't know the value of the parameters but have observed a single value of a statistic? We want to say something about the parameters.

A certain school has a population mean score of 500 and a standard deviation of 50. If we randomly sample 25 students from the school and have them take the SAT, what can we say about the distribution of the sample mean SAT score? In particular, …

A certain school has an unknown population mean score $\mu$ and a standard deviation of 50. When we randomly sampled 25 students from the school and had them take the SAT, we observed $\bar{x} = 520$. What can we say about the distribution of the sample mean SAT score?

A certain school has an unknown population mean score $\mu$ and an unknown standard deviation $\sigma$. When we randomly sampled 100 students from the school and had them take the SAT, we observed $\bar{x} = 52$ and s = 45. What can we say about the distribution of the sample mean SAT score?

Two general procedures for inference: estimation and hypothesis testing

10.1 Concept of estimation
A “point estimator” draws inferences about a population by estimating the value of an unknown parameter using a single value or point.
An “interval estimator” draws inferences about a population by estimating the value of an unknown parameter using an interval.
E.g. the mean weekly income of a sample of 25 students is $400. (can also use $380-$420)
An “unbiased estimator” of a population parameter is an estimator whose expected value is equal to that parameter.
An unbiased estimator is said to be “consistent” if the difference between the estimator and the parameter grows smaller as the sample size grows larger.
$\bar{X}$ is a consistent estimator of $\mu$.

10.2 Estimating the population mean when the population SD is known

In general, a confidence interval is of the form:

(the estimate) $\pm$ (a constant) $\times$ (standard error of the estimate)

E.g. (SAT score) We believe $X \sim N(\mu, 50^2)$. For a certain school, if $\bar{x} = 520$ for n = 25, what is the 95% CI for $\mu$? 90% CI? 99% CI?

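The $100(1-\alpha)\%$ confidence interval estimator of the unknown population mean $\mu$ is $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$, or $(\bar{x} - z_{\alpha/2}\,\sigma/\sqrt{n},\ \bar{x} + z_{\alpha/2}\,\sigma/\sqrt{n})$.
The two end points are the “lower confidence limit” and the “upper confidence limit”.
The probability $1-\alpha$ is called the “confidence level”.

Table of ($100(1-\alpha)\%$, $\alpha$, $\alpha/2$, $z_{\alpha/2}$) for 90%, 95%, 99% confidence levels:
90% | .10 | .05 | 1.645
95% | .05 | .025 | 1.96
99% | .01 | .005 | 2.575

Why? $P[\text{the } 100(1-\alpha)\% \text{ confidence interval contains } \mu] = 1-\alpha$
The variable $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ is standard normal or approximately standard normal.
The CI is random, but $\mu$ is not.

Interpreting the CI: It's important to realize that we observe only one sample and only one value of $\bar{x}$. The interval cannot be correct all the time. We aim to be correct 95% of the time.

The sampling error of the $100(1-\alpha)\%$ confidence interval is $z_{\alpha/2}\,\sigma/\sqrt{n}$.

We want {larger, smaller} sampling error, or {wider, narrower} CI

Larger $\sigma$ leads to {narrower, wider} interval

Increasing the confidence level $100(1-\alpha)\%$ leads to {narrower, wider} interval

Increasing sample size n leads to {narrower, wider} interval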
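For the SAT example above ($\bar{x} = 520$, $\sigma = 50$, n = 25), the three intervals can be sketched as below. The function name `z_interval` is an illustrative helper; `NormalDist().inv_cdf` supplies $z_{\alpha/2}$ in place of the table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Return (LCL, UCL) for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{alpha/2}
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

for level in (0.90, 0.95, 0.99):
    lcl, ucl = z_interval(520, 50, 25, level)
    print(f"{level:.0%} CI: ({lcl:.1f}, {ucl:.1f})")
```

Trying a larger `sigma`, a higher `confidence`, or a larger `n` in the call is a quick way to check the three {narrower, wider} questions above.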

Ex. 10.9, 10, 11, 15, 21

10.3 Selecting the sample size
Sample size to estimate the mean within W at the $(1-\alpha)$ confidence level: $n = (z_{\alpha/2}\,\sigma/W)^2$
Ex. In estimating the population mean SAT score, I want to estimate it to within W = 10 at confidence level .90, i.e. $\alpha = .1$. How many samples do I need?
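A minimal sketch of the sample-size calculation, assuming $\sigma = 50$ as in the earlier SAT examples (the exercise itself does not restate $\sigma$). The function name `sample_size_for_mean` is illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, W, confidence):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= W."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / W) ** 2)

# sigma = 50 is an assumption carried over from the SAT examples
n_needed = sample_size_for_mean(sigma=50, W=10, confidence=0.90)
print(n_needed)
```

The result is rounded up, since rounding down would give a sampling error slightly larger than W.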

Ex. 10.41, 42, 51

11 Introduction to hypothesis testing
Is there enough statistical evidence to enable us to conclude that a belief or hypothesis about a parameter is supported by the data?

11.1 Concepts of hypothesis testing
E.g. Does a particular school A have a mean SAT score greater than the national average of $\mu_0 = 500$? We assume $X \sim N(\mu, 50^2)$ and just observed $\bar{x} = 510$ for n = 100.

The null hypothesis H0: “the private school has the same mean SAT score as the national average of 500” (usually specified as the status quo)

The alternative hypothesis H1: “the private school has a mean SAT score higher than 500”

There are two possible decisions: accept H0 or reject H0.

Truth \ Decision | Accept H0 | Reject H0
H0 true | |
H1 true | |

More common to say “cannot reject H0” than “accept H0”

The decision is either correct or wrong. When the decision is wrong, we commit either:
Type I error: wrongfully reject H0; $\alpha$ = P(Type I error)
Type II error: wrongfully accept H0; $\beta$ = P(Type II error)

The type I error probability of a certain testing procedure is called the significance level of the test, or sometimes just the level of the test, and is written as $\alpha$.

The test statistic: a statistic which we base our decision upon
“The observed sample mean score”
If the value of the test statistic is inconsistent with H0 (and more consistent with H1), we reject H0.
“Sufficient evidence” = “evidence beyond a reasonable doubt”
We use the “sampling distribution” of the test statistic to decide how sufficient the evidence is

The rejection region is a range of values such that if the test statistic falls into that range, we reject H0 in favor of H1

Testing begins by assuming H0 is true. We reject H0 if the test statistic has a value that is inconsistent with H0 but is consistent with H1. But how inconsistent does it have to be for us to reject it? That's up to us. How aggressive do we want to be in rejecting the null?

Aggressive
More likely to reject the correct H0
More likely to commit type I error
Test with a larger $\alpha$

Conservative
Less likely to reject the correct H0
Less likely to commit type I error
Test with a smaller $\alpha$

A particular decision rule (test) is obtained by deciding on the level $\alpha$, the type I error probability we are willing to accept. Typically, $\alpha = 0.05 = 5\%$ is used.

The conclusion of the test is stated either as:
“We reject H0 at 5% significance level”
“We cannot reject H0 at 5% significance level”, etc.

By convention, we don't say “accept H0 at 5% significance level” or “accept H1 at 5% significance level”.

The P-value of a test is the probability of observing a test statistic at least as extreme as the one observed given that H0 is true.

In the example, H0: $\mu = 500$. The alternative H1 could be of the form:
H1: $\mu > 500$
H1: $\mu < 500$
H1: $\mu \ne 500$

Testing either of the first two is called one-sided hypothesis testing and testing the third is called two-sided hypothesis testing.

11.2 Testing the population mean when the population standard deviation is known

Recall the example: Does a particular school A have a mean SAT score greater than the national average of $\mu_0 = 500$? We assume $X \sim N(\mu, 50^2)$ and just observed $\bar{x} = 510$ for n = 100.

For testing H0: $\mu = \mu_0$ vs. H1: $\mu > \mu_0$, the test at level $\alpha$ rejects H0 if

$z = (\bar{x} - \mu_0)/(\sigma/\sqrt{n}) > z_\alpha$

The p-value = P(Z > z)

That test has type I error probability = $\alpha$.

If H0 is rejected, we say “the result is statistically significant at significance level $\alpha$”

Can we reject H0 at $\alpha$ = .10?
At $\alpha$ = .05?
At $\alpha$ = .01?
($z_\alpha$ = 1.28, 1.64, 2.33 for 10%, 5%, 1%)

P-value = P(Z > observed z)

In the example, P-value = $P(\bar{X} > 510$, given that H0 is true) = P(Z > 2) = .0228
A large P-value means the data are consistent with H0
A small P-value is evidence in favor of H1
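The school A example can be worked through as a short sketch: compute the z statistic, its P-value, and the decision at each of the three levels. Critical values come from `NormalDist().inv_cdf` rather than the table; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar = 500, 50, 100, 510
z = (xbar - mu0) / (sigma / sqrt(n))
p_value = 1 - NormalDist().cdf(z)            # P(Z > z)

print(f"z = {z:.2f}, P-value = {p_value:.4f}")
for alpha in (0.10, 0.05, 0.01):
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    decision = "reject H0" if z > z_alpha else "cannot reject H0"
    print(f"alpha = {alpha:.2f}: z_alpha = {z_alpha:.2f}, {decision}")
```

The decisions agree with comparing the P-value to each $\alpha$ directly.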

The level $\alpha$ hypothesis testing is equivalent to “Reject H0 if p is { > or < } $\alpha$”

It is a better practice to report the P-value than just “accept” or “reject”

The test at level $\alpha$ rejects H0 if and only if P-value < $\alpha$. Equivalently, the P-value is the smallest significance level at which a test can reject H0.

For testing H0: $\mu = \mu_0$ vs. H1: $\mu < \mu_0$, the test at level $\alpha$ rejects H0 if

$z = (\bar{x} - \mu_0)/(\sigma/\sqrt{n}) < -z_\alpha$. The p-value = P(Z < z)

E.g. For another school B, the test score $X \sim N(\mu, 50^2)$. Is school B significantly worse than the national average? We observed $\bar{x} = 492.5$ for n = 100.
Can we reject H0 at $\alpha$ = 0.05?
What's the P-value?
P-value = P(Z < -1.5) = 0.0668

For testing H0: $\mu = \mu_0$ vs. H1: $\mu \ne \mu_0$, the test at level $\alpha$ rejects H0 if

$|z| > z_{\alpha/2}$. (Or, equivalently, if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$)

The p-value = 2 P(Z > |z|)

E.g. For another school C, the test score $X \sim N(\mu, 50^2)$. Is school C significantly different from the national average? Observed $\bar{x} = 492.5$ for n = 100.
Can we reject H0 at $\alpha$ = 0.05?
What's the P-value?
P-value = 2 P(Z > 1.5) = 0.1336
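The three forms of the alternative differ only in which tail area is reported. The sketch below wraps them in one function and applies it to schools B and C; the function name `z_test_p_value` and the `alternative` labels are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(xbar, mu0, sigma, n, alternative):
    """Return (z, P-value); alternative is 'greater', 'less' or 'two-sided'."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if alternative == "greater":
        return z, 1 - std_normal.cdf(z)
    if alternative == "less":
        return z, std_normal.cdf(z)
    if alternative == "two-sided":
        return z, 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative}")

z_b, p_b = z_test_p_value(492.5, 500, 50, 100, "less")        # school B
z_c, p_c = z_test_p_value(492.5, 500, 50, 100, "two-sided")   # school C
print(z_b, round(p_b, 4), round(p_c, 4))
```

With the same data, the two-sided P-value is twice the one-sided one, so neither test rejects H0 at the 5% level.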

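Interpreting the test results

Ex. 11.7-9, 13-15, 28

11.3 What about type II error probability?
Recall that a test at level $\alpha$ of H0: $\mu = \mu_0$ against H1: $\mu > \mu_0$ rejects H0 if $z = (\bar{x} - \mu_0)/(\sigma/\sqrt{n}) > z_\alpha$. Let's suppose that indeed H1 is true, specifically, the true $\mu = \mu_1 > \mu_0$. Then

P(Type II error when $\mu = \mu_1$)
= P(Not reject H0 when $\mu = \mu_1$)
= $P((\bar{X} - \mu_0)/(\sigma/\sqrt{n}) \le z_\alpha$ when $\mu = \mu_1)$
= $P(\bar{X} \le \mu_0 + z_\alpha\,\sigma/\sqrt{n}$ when $\mu = \mu_1)$
= $P((\bar{X} - \mu_1)/(\sigma/\sqrt{n}) \le z_\alpha - (\mu_1 - \mu_0)/(\sigma/\sqrt{n})$ when $\mu = \mu_1)$
= $P(Z \le z_\alpha - (\mu_1 - \mu_0)/(\sigma/\sqrt{n}))$: increases as $\mu_1$ (> $\mu_0$) gets closer to $\mu_0$ (i.e., as the problem becomes harder)

Decreasing Type I error (smaller $\alpha$) leads to larger type II error
There's no free lunch

If n increases, Type II error decreases for the given $\alpha$
Since there is more information

Power of the test = 1 − P(Type II error) = P(correctly rejecting the null)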
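The type II error probability of the upper-tail z test is $\beta = P(Z \le z_\alpha - (\mu_1 - \mu_0)/(\sigma/\sqrt{n}))$. The sketch below evaluates it for the school A setup ($\mu_0 = 500$, $\sigma = 50$, n = 100, $\alpha = .05$). The alternative means 505, 510 and 515 are hypothetical values chosen only for illustration, and `type_ii_error` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu1, sigma, n, alpha):
    """beta = P(do not reject H0 | mu = mu1) for the upper-tail z test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    return NormalDist().cdf(z_alpha - shift)

for mu1 in (505, 510, 515):   # hypothetical true means under H1
    beta = type_ii_error(500, mu1, 50, 100, 0.05)
    print(f"mu1 = {mu1}: beta = {beta:.3f}, power = {1 - beta:.3f}")
```

Changing `alpha` or `n` in the call shows the trade-offs listed above: a smaller $\alpha$ raises $\beta$, and a larger n lowers it.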

OC (operating characteristic) curve

Ex. 11.48, 49, 61 (??)

11.4 The road ahead
You're pretty much done for the quarter!

Three steps:
Define the problem
Identify the appropriate method
Interpret the results

Describe a population
Compare two populations
Compare two or more populations
Analyze the relationship between two variables


12 Inference about a population

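12.1 Inferences about population mean for a normal population, $\sigma$ unknown

Does a particular school have a mean SAT score greater than the national average of $\mu_0 = 500$? We assume $X \sim N(\mu, \sigma^2)$ with unknown $\sigma$. We just observed $\bar{x} = 510$ and s = 45.0 for n = 25.

$100(1-\alpha)\%$ confidence interval for $\mu$, $\sigma$ unknown: $\bar{x} \pm t_{\alpha/2,n-1}\, s/\sqrt{n}$

The tests below are based on $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$.

For H0: $\mu = \mu_0$ vs. H1: $\mu > \mu_0$, the test at significance level $\alpha$ rejects H0 if $t > t_{\alpha,n-1}$
The P-value is $P(T \ge \text{computed } t)$

For H0: $\mu = \mu_0$ vs. H1: $\mu < \mu_0$, the test at significance level $\alpha$ rejects H0 if $t < -t_{\alpha,n-1}$
The P-value is $P(T \le \text{computed } t)$

For H0: $\mu = \mu_0$ vs. H1: $\mu \ne \mu_0$, the test at significance level $\alpha$ rejects H0 if $|t| > t_{\alpha/2,n-1}$
(Or, equivalently, if $t < -t_{\alpha/2,n-1}$ or $t > t_{\alpha/2,n-1}$)
The P-value is $2P(T \ge |\text{computed } t|)$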
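For the example above ($\bar{x} = 510$, s = 45.0, n = 25, $\mu_0 = 500$), the t statistic and the 95% interval can be sketched as below. The Python standard library has no Student t quantile function, so the two critical values for 24 degrees of freedom are hard-coded from a t table; that is an assumption of this sketch, not a computation.

```python
from math import sqrt

xbar, s, n, mu0 = 510, 45.0, 25, 500

# Critical values for 24 degrees of freedom, read from a t table
# (not computed here).
T_05_24 = 1.711    # t_{.05,24}
T_025_24 = 2.064   # t_{.025,24}

se = s / sqrt(n)
t_stat = (xbar - mu0) / se

ci_95 = (xbar - T_025_24 * se, xbar + T_025_24 * se)
reject_at_5pct = t_stat > T_05_24   # upper-tail test of H0: mu = 500

print(round(t_stat, 3), reject_at_5pct)
print(tuple(round(v, 1) for v in ci_95))
```

With only 25 students and s = 45, a sample mean of 510 is not enough to reject H0 at the 5% level.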

The effect of non-normality on the inference based on the t distribution
What kind of non-normality? (skewed, heavy tailed)
Effect on the power, level of test, etc.
Consider “robust methods” of estimation and inference

Checking required condition
Normality by histogram; if n is large, OK.

Ex. 12.1, 2d, 3d, 4d, 8a, 9, 13, 21

12.2 Inference about a population variance
Skip

12.3 Inference about a population proportion
E.g. from an exit poll of n = 765 voters, x = 407 people were observed to have voted for a bill.
What's $\hat{p}$?
What's the 95% confidence interval?
For H0: p = .5 vs. H1: p > .5, can we reject H0 at $\alpha$ = 5%?

If $np \ge 5$ and $n(1-p) \ge 5$, the distribution of $\hat{P}$ can be approximated by $N(p, p(1-p)/n)$

The $100(1-\alpha)\%$ confidence interval for the population proportion p is given by $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$ where $\hat{p} = x/n$

For H0: $p = p_0$ vs. H1: $p > p_0$, the test at level $\alpha$ rejects H0 if
$z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n} > z_\alpha$.
P-value = P(Z > z)

For H0: $p = p_0$ vs. H1: $p < p_0$, the test at level $\alpha$ rejects H0 if
$z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n} < -z_\alpha$.
P-value = P(Z < z)

For H0: $p = p_0$ vs. H1: $p \ne p_0$, the test at level $\alpha$ rejects H0 if
$|z| > z_{\alpha/2}$.
(Or, equivalently, if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$)
P-value = 2 P(Z > |z|)
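The exit poll example (x = 407 out of n = 765, H0: p = .5 vs. H1: p > .5) can be sketched as follows. Note the two different standard errors: the interval uses the estimated proportion, while the test uses the hypothesized $p_0$. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x, n, p0 = 407, 765, 0.5
p_hat = x / n

# 95% confidence interval uses the estimated standard error.
z_025 = NormalDist().inv_cdf(0.975)
se_hat = sqrt(p_hat * (1 - p_hat) / n)
ci_95 = (p_hat - z_025 * se_hat, p_hat + z_025 * se_hat)

# The test uses the standard error under H0: p = p0.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 1 - NormalDist().cdf(z)

print(round(p_hat, 3), tuple(round(v, 3) for v in ci_95))
print(round(z, 2), round(p_value, 3))
```

Here the one-sided test rejects H0 at 5% even though the two-sided 95% interval just includes 0.5; the two procedures put different amounts of probability in the upper tail.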

Sample size for estimating p within W at confidence level $1-\alpha$: $n = (z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})}/W)^2$. A conservative estimate is given by the formula for $\hat{p}$ = ½.
We wish to estimate the above proportion within .03. What's the required n?

Ex. 12.54, 58, 66


13 Inference about comparing two populations
Keywords: pooled variance estimator; equal-variances test statistic
How to tell if two variances are equal? Formal methods exist, but an informal method would be fine for now.
If there is no strong evidence against equal variances, it's usually “better” to assume equal variances. (why?)
Checking required conditions
Draw a histogram to check normality; if the sample size is large, we're OK; this t-test is robust too; if not normal, there are nonparametric methods

E.g. comparing the mean SAT score for school 1 and school 2
$X_1 \sim N(\mu_1, \sigma_1^2)$, $X_2 \sim N(\mu_2, \sigma_2^2)$
For $n_1 = 25$ and $n_2 = 25$ samples, $\bar{x}_1 = 530$; $\bar{x}_2 = 500$ and $s_1 = 90$ and $s_2 = 120$
Assuming equal variances: $s_p = 106$; denominator = 30; 95% CI for $\mu_1 - \mu_2$? (-30, 90); df = 48; P-value = .16
Assuming unequal variances: $\nu$ = 44.5 (or 45 or 44); 95% CI for $\mu_1 - \mu_2$ = (-30, 90) (slightly larger than the previous one)

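13.1 Inference about the difference between two means: independent samples

Consider three cases
Case 1. Both population distributions are normally distributed with $\sigma_1 = \sigma_2$
Case 2. Both sample sizes $n_1$ and $n_2$ are large
Case 3. The sample sizes $n_1$ or $n_2$ are small and the population distributions are non-normal
Concentrate on Case 1 for now

Population distributions are normal with equal variances
Statistics: $\bar{x}_1$, $\bar{x}_2$, $s_1$, $s_2$

Confidence interval for $\mu_1 - \mu_2$, independent samples:
$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\, s_p \sqrt{1/n_1 + 1/n_2}$
where
$s_p = \sqrt{((n_1-1)s_1^2 + (n_2-1)s_2^2)/(n_1+n_2-2)}$ and df $= n_1 + n_2 - 2$

Why? (sampling distribution of $\bar{X}_1 - \bar{X}_2$)
Reasonably stable for mound-shaped distributions and approximately equal SDs

A statistical test for $\mu_1 - \mu_2$, independent samples:
H0: $\mu_1 - \mu_2 = D_0$
Ha: 1. $\mu_1 - \mu_2 > D_0$; 2. $\mu_1 - \mu_2 < D_0$; 3. $\mu_1 - \mu_2 \ne D_0$
($D_0$ is a specified value, often 0)
T.S.: $t = ((\bar{x}_1 - \bar{x}_2) - D_0)/(s_p\sqrt{1/n_1 + 1/n_2})$
R.R.: for a level $\alpha$, reject H0 if 1. $t > t_\alpha$; 2. $t < -t_\alpha$; 3. $|t| > t_{\alpha/2}$, with df $= n_1 + n_2 - 2$
Check assumptions and draw conclusions

Test whether the mean score of school 1 is higher than school 2. Use $\alpha$ = .05.
P-value of the test? 95% confidence interval on the difference of means? $s_p$ = ?

Three critical conditions
Two random samples are independent
Population distributions are normal or mound-shaped
Two population variances are equal

Approximate t-test for independent samples, unequal variance
T.S. $t' = ((\bar{x}_1 - \bar{x}_2) - D_0)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$ and d.f. is
[ $\nu = (s_1^2/n_1 + s_2^2/n_2)^2 / ((s_1^2/n_1)^2/(n_1-1) + (s_2^2/n_2)^2/(n_2-1))$
or
$\nu = (n_1-1)(n_2-1)/((1-c)^2(n_1-1) + c^2(n_2-1))$ where $c = (s_1^2/n_1)/(s_1^2/n_1 + s_2^2/n_2)$
(round down to the nearest integer) ]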
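The school 1 versus school 2 numbers quoted earlier ($n_1 = n_2 = 25$, $\bar{x}_1 = 530$, $\bar{x}_2 = 500$, $s_1 = 90$, $s_2 = 120$) can be reproduced with the sketch below, which computes the pooled and the unequal-variances versions side by side. Critical values and P-values still need a t table or software; variable names are illustrative.

```python
from math import sqrt

n1, xbar1, s1 = 25, 530, 90
n2, xbar2, s2 = 25, 500, 120

# Equal-variances (pooled) version
sp = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
se_pooled = sp * sqrt(1 / n1 + 1 / n2)
t_pooled = (xbar1 - xbar2) / se_pooled
df_pooled = n1 + n2 - 2

# Unequal-variances (approximate) version
a, b = s1 ** 2 / n1, s2 ** 2 / n2
se_unequal = sqrt(a + b)
t_unequal = (xbar1 - xbar2) / se_unequal
df_unequal = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

print(round(sp, 1), round(se_pooled, 1), round(t_pooled, 2), df_pooled)
print(round(se_unequal, 1), round(t_unequal, 2), round(df_unequal, 1))
```

Because the two sample sizes are equal, both versions give the same standard error of 30; only the degrees of freedom differ (48 versus about 44.5).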

Similar for confidence interval.

Ex. 13.1a, 2a, 3a, 5b, 7

13.2 Observational and experimental (controlled study) data
The latter is more expensive but can shed more light on causality
E.g. Slytherin may not be a better school than Gryffindor; it may be that simply there are more good students going there; what kind of experimental study would be possible?

13.3 Inference about the difference between two means: matched pairs experiment

What if each measurement in one sample is “matched” or “paired” with a particular measurement in the other sample?
E.g. comparing repair estimates from two garages for each of 15 cars damaged by accidents
The two-sample t-test gives a nonsense result. Why?

Also called ‘paired t-test’ (more common than ‘matched t-test’)

Ask: does some natural relationship exist between each pair of observations?

E.g. SAT score before and after attending a certain prep school

Not the two-sample t-test. But run the regular t-test on the differences

Solution: use the differences $d_i = x_{1i} - x_{2i}$ and obtain their sample mean and SD, $\bar{d}$ and $s_d$.

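Test hypotheses about $\mu_d = \mu_1 - \mu_2$

Paired t test
H0: $\mu_d = D_0$
Ha: 1. $\mu_d > D_0$; 2. $\mu_d < D_0$; 3. $\mu_d \ne D_0$
($D_0$ is a specified value, often 0)
T.S.: $t = (\bar{d} - D_0)/(s_d/\sqrt{n})$
R.R.: for a level $\alpha$, reject H0 if 1. $t > t_\alpha$; 2. $t < -t_\alpha$; 3. $|t| > t_{\alpha/2}$, with df $= n - 1$
Check assumptions and draw conclusions

Confidence interval for $\mu_d$ based on paired data: $\bar{d} \pm t_{\alpha/2}\, s_d/\sqrt{n}$

Of course, assuming
the distribution of the $d_i$s is (close to) a normal distribution
the differences are independent

13.4 Inference about the ratio of two variances
Skip

13.5 Inference about the difference between two population proportions

Use $\hat{p}_1 = x_1/n_1$ and $\hat{p}_2 = x_2/n_2$
Confidence intervals for $p_1 - p_2$ are given by
$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{p}_1 - \hat{p}_2}$ where $\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$
Statistical test for H0: $p_1 - p_2 \le 0$ etc. is based on
$z = (\hat{p}_1 - \hat{p}_2)/\hat{\sigma}_{\hat{p}_1 - \hat{p}_2}$.

E.g. accident rate of vehicles with ABS and those without ABS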
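As an illustration of the two-proportion procedure, the sketch below uses the ABS accident counts that appear in the practice final later in these notes (42 of 500 cars without ABS, 38 of 500 with ABS). It uses the unpooled standard error for both the interval and the test statistic, which is an assumption of this sketch; some textbooks pool the two samples when testing for no difference.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 42, 500   # accidents among cars without ABS
x2, n2 = 38, 500   # accidents among cars with ABS

p1, p2 = x1 / n1, x2 / n2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z_025 = NormalDist().inv_cdf(0.975)
ci_95 = (p1 - p2 - z_025 * se, p1 - p2 + z_025 * se)

z = (p1 - p2) / se
p_value = 1 - NormalDist().cdf(z)   # upper-tail: H1 says p1 > p2

print(round(p1 - p2, 3), tuple(round(v, 3) for v in ci_95))
print(round(z, 2), round(p_value, 2))
```

The interval contains 0 and the P-value is large, so these counts alone do not show a lower accident rate with ABS.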

14 Statistical inference: review of chapters 12 and 13

Graphical methods and numerical measures for univariate data
Variable type (X) | Methods
Categorical | Frequency, relative frequency; bar-chart, pie-chart
Interval | $\bar{x}$, median, $s$, percentiles; histogram, boxplot, (stem-and-leaf, ogive)

Graphical methods and numerical measures for bivariate data
X \ Y | Categorical | Interval
Categorical | contingency table; bar-chart | side-by-side boxplots
Interval | ? | Cov(X,Y); scatter plot

Univariate statistical inference techniques
Variable type (X) | Methods
Categorical | One-sample proportion
Interval | One-sample z-test; one-sample t-test

Bivariate statistical inference techniques
X \ Y | Categorical | Interval
Categorical | Two-sample proportion; Chi-squared analysis | Two-sample t-test; ANOVA
Interval | Logistic regression | Regression

15 Analysis of variance
Skip

16 Chi-squared tests
Skip


17 Simple linear regression and correlation
Regression analysis is used to predict the value of one variable (Y) on the basis of other variables ($X_1, \dots, X_k$)
E.g. midterm score vs. final score for a class

17.1 Model
Simple linear regression model: $y = \beta_0 + \beta_1 x + \varepsilon$, where $\varepsilon$ is the “error variable”

17.2 Estimating the coefficients
The least squares regression line is obtained by finding $b_0$, $b_1$ which minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $\hat{y}_i$, the (predicted) value of y, is determined by the line $\hat{y}_i = b_0 + b_1 x_i$

“Least squares line coefficients” are given by $b_1 = s_{xy}/s_x^2$ and $b_0 = \bar{y} - b_1\bar{x}$.
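A minimal sketch of the least squares computation, using the usual formulas $b_1 = s_{xy}/s_x^2$ and $b_0 = \bar{y} - b_1\bar{x}$. The midterm and final scores below are made up purely for illustration and are not class data; the helper `sse` is likewise an illustrative name.

```python
from statistics import mean, variance

# Hypothetical (midterm, final) scores, made up for illustration only.
midterm = [40, 45, 48, 50, 52, 55, 58]
final = [62, 70, 71, 78, 80, 83, 90]

n = len(midterm)
x_bar, y_bar = mean(midterm), mean(final)
s_xx = variance(midterm)                                   # s_x^2
s_xy = sum((x - x_bar) * (y - y_bar)
           for x, y in zip(midterm, final)) / (n - 1)      # sample covariance

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

def sse(intercept, slope):
    """Sum of squared errors of the line intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(midterm, final))

print(round(b0, 2), round(b1, 2), round(sse(b0, b1), 2))
```

Nudging either coefficient away from `b0` or `b1` and calling `sse` again gives a larger value, which is what "least squares" means.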

See the old notes for the formula for $s_{xy}$, etc.
E.g. Computing the regression line from basic statistics

17.3 Required conditions on the error variable
Conditions:
The probability distribution of $\varepsilon$ is normal
$E(\varepsilon) = 0$
The standard deviation of $\varepsilon$ is $\sigma_\varepsilon$, which is constant regardless of the value of x.
The values of $\varepsilon$ for different observations are independent
(Or, people say $\varepsilon \sim$ iid $N(0, \sigma_\varepsilon^2)$)

17.4 Assessing the model
How well does our model fit the data? (There may be no relationship at all!)

“The sum of squares for error (SSE)”, $\sum (y_i - \hat{y}_i)^2$, can also be computed as $SSE = (n-1)(s_y^2 - s_{xy}^2/s_x^2)$
The {smaller, larger} SSE suggests a more accurate model

“The standard error (SE) of estimate” is an estimate of $\sigma_\varepsilon$, given by $s_\varepsilon = \sqrt{SSE/(n-2)}$
The {smaller, larger} $s_\varepsilon$ suggests a more accurate model

To formally test whether the slope is non-zero, do the following:
H0: $\beta_1 = 0$ vs. H1: $\beta_1 \ne 0$
Test statistic is given by $t = b_1/s_{b_1}$ where $s_{b_1} = s_\varepsilon/\sqrt{(n-1)s_x^2}$
If the assumptions regarding the error variable hold: under H0, t follows the Student t distribution with $\nu = n-2$ degrees of freedom
At significance level $\alpha$, reject H0 if $|t| > t_{\alpha/2,n-2}$

$100(1-\alpha)\%$ C.I. for $\beta_1$ is given by $b_1 \pm t_{\alpha/2,n-2}\, s_{b_1}$

Coefficient of determination is given by
$R^2 = s_{xy}^2/(s_x^2 s_y^2) = 1 - SSE/\sum (y_i - \bar{y})^2$
= (explained variation in y)/(total variation in y)
= (Regression SS)/(Total SS)
The {higher, lower} value of $R^2$ means better fit of the linear model

Need to be able to extract SOME information from Minitab output

Typical disclaimer: correlation doesn't imply causality


17.5 (Optional) Applications in finance

17.6 Using the regression equation
E.g. Things we really care about: predicting the final score from the midterm score
Predicting the particular value of y for a given x:
$\hat{y} \pm t_{\alpha/2,n-2}\, s_\varepsilon \sqrt{1 + 1/n + (x_g - \bar{x})^2/((n-1)s_x^2)}$
Estimating the expected value of y for a given x:
$\hat{y} \pm t_{\alpha/2,n-2}\, s_\varepsilon \sqrt{1/n + (x_g - \bar{x})^2/((n-1)s_x^2)}$
These intervals get {wider, narrower} as $x_g$ moves away from $\bar{x}$

18 Multiple regression
Extension of the simple linear regression to multiple X variables
E.g. predicting final from midterm, quiz #1, quiz #2 scores
19 Logistics and things you/I really care about
Glossary: see the syllabus

19.1 Couple of words about quiz #2 and final

19.1.1 Quiz #2
n = 25; sample mean = 23.64; sample median = 24; sample SD = 4.26
Correlation between quiz #2 and midterm score = 0.572

19.1.2 Final
No need for a cheat-sheet for part I (you will be given a formula sheet)
Make your own cheat-sheet for part II (covers chapter 8~)
Need an Assist form for part I of the final!
A reminder that effective Winter Quarter 2006, Assist (“Scantron”) forms will no longer be provided by the Statistics Department for the STAT 1000 and STAT 2010 standardized tests.
Students can purchase Assist forms at the CSUEB Bookstore for 50 cents

INSTRUCTIONS FOR COMPLETING ASSIST FORMS:

1. Students should enter as much of their names as possible in the “Your Last Name” boxes.

2. The two letters of their Net ID are entered in the "First Initial" and "Middle Initial" boxes on that same line.

3. The four digits of their Net ID should be entered in the first four boxes of the "Social Security Number" section.

19.2 Practice midterm (50 minutes)
…
19.3 Practice final (50 minutes)
Is the anti-lock brake system (ABS) in cars really effective? If it were effective,
The number of accidents would decrease, and
The cost of accident repairs would be less

Data were collected on 500 cars with ABS and 500 cars without. The number of cars involved in accidents was recorded, as was the cost of repairs.
42 out of 500 cars without ABS had an accident and 38 out of 500 cars with ABS had an accident in a given year. What can we conclude?

For the repair cost for the two groups, we obtain:
$n_1 = 42$, $\bar{x}_1 = 2{,}075$ and $s_1 = 671$
$n_2 = 38$, $\bar{x}_2 = 1{,}714$ and $s_2 = 624$

For the two situations above, perform:
Compute the 95% CI for the parameter of interest
Set up the null and alternative hypotheses
Compute the test statistic and perform the test at 5% significance level
Compute the P-value for the test (if you can)


STAT/MATH 6401, Advanced Probability I, Fall 2005 Course Note

Dr. Jaimie Kwon
May 9, 2023

Table of Contents

1 What is statistics?
2 Graphical and tabular descriptive statistics
2.1 Types of data
2.2 Techniques for nominal data
2.3 Graphical techniques for interval data
2.4 Describing the relationship b/w two variables
2.5 Time series data
3 Art and science of graphical presentations
4 Numerical descriptive techniques
4.1 Measures of central location
4.2 Measures of variability
4.3 Measures of relative standing and box plots
4.4 Measures of linear relationship
4.5 Comparing graphical and numerical techniques
4.6 General guidelines for exploring data
5 Data collection and sampling
5.1 Methods of collecting data
5.2 Sampling
5.3 Sampling plans
5.4 Sampling and nonsampling errors
6 Probability
6.1 Assigning probability to events
6.2 Joint, marginal, and conditional probability
6.3 Probability rules and trees
6.4 Bayes’ Law
6.5 Identifying the correct method
7 Random variables and discrete probability distributions
7.1 Random variables and probability distributions
7.2 Bivariate distributions
7.3 Binomial distribution
7.4 Poisson Distribution
8 Continuous probability distributions
8.1 Probability density functions
8.2 Normal distribution
8.3 Exponential distribution
8.4 Other continuous distributions
9 Sampling distributions
9.1 Sampling distribution of the mean
9.2 Sampling distribution of a proportion
9.3 Sampling distribution of the difference between two means
10 Intro to estimation
10.1 Concept of estimation
10.2 Estimating the population mean when the population SD is known
10.3 Selecting the sample size
11 Introduction to hypothesis testing
11.1 Concepts of hypothesis testing
11.2 Testing the population mean when the population standard deviation is known
11.3 What about type II error probability?
11.4 The road ahead
12 Inference about a population
12.1 Inference about a population mean when the sd is unknown
12.2 Inference about a population variance
12.3 Inference about a population proportion
13 Inference about comparing two population
13.1 Inference about the difference between two means: independent samples
13.2 Observational and experimental (controlled study) data
13.3 Inference about the difference between two means: matched pairs experiment
13.4 Inference about the ratio of two variances
13.5 Inference about the difference between two population proportions
14 Statistical inference: review of chapters 12 and 13
15 Analysis of variance
16 Chi-squared tests
17 Simple linear regression and correlation
17.1 Model
17.2 Estimating the coefficients
17.3 Required conditions on the error variable
17.4 Assessing the model
17.5 (Optional) Applications in finance
17.6 Using the regression equation
18 Multiple regression
19 Logistics and things you/I really care
19.1 Couple of words about quiz #2 and final
19.1.1 Quiz #2
19.1.2 Final
19.2 Practice midterm (50 minutes)
19.3 Practice final (50 minutes)
