Excel Statistical Analysis

Conjoint Analysis

Conjoint Analysis is used by marketers to determine which attributes of a product are most important to a consumer, and to what degree each one matters.

Step 1 - Make a list of product attributes to be evaluated by the consumer.

Brand   Color   Price
A       Red     $50
B       Blue    $100
C               $150

Step 2 - Make a complete list of all possible attribute combinations.

Step 3 - Have the consumer rate each combination on a scale of 0 (worst) to 10 (best). For the worksheet, the non-numerical attributes are also given numeric codes.

Card   Brand   Color   Price   Brand code   Color code
1      A       Red     50      1            1
2      A       Red     100     1            1
3      A       Red     150     1            1
4      A       Blue    50      1            2
5      A       Blue    100     1            2
6      A       Blue    150     1            2
7      B       Red     50      2            1
8      B       Red     100     2            1
9      B       Red     150     2            1
10     B       Blue    50      2            2
11     B       Blue    100     2            2
12     B       Blue    150     2            2
13     C       Red     50      3            1
14     C       Red     100     3            1
15     C       Red     150     3            1
16     C       Blue    50      3            2
17     C       Blue    100     3            2
18     C       Blue    150     3            2

Step 4 - Final data preparation step prior to running regression - Give every attribute level its own 0/1 column, then remove 1 column from each attribute. Within an attribute, the removed column can always be predicted from the remaining ones, so leaving it in would make the independent variables predict each other.

All attribute levels coded as 0/1 columns:

Card   A   B   C   Red   Blue   $50   $100   $150
1      1   0   0   1     0      1     0      0
2      1   0   0   1     0      0     1      0
3      1   0   0   1     0      0     0      1
4      1   0   0   0     1      1     0      0
5      1   0   0   0     1      0     1      0
6      1   0   0   0     1      0     0      1
7      0   1   0   1     0      1     0      0
8      0   1   0   1     0      0     1      0
9      0   1   0   1     0      0     0      1
10     0   1   0   0     1      1     0      0
11     0   1   0   0     1      0     1      0
12     0   1   0   0     1      0     0      1
13     0   0   1   1     0      1     0      0
14     0   0   1   1     0      0     1      0
15     0   0   1   1     0      0     0      1
16     0   0   1   0     1      1     0      0
17     0   0   1   0     1      0     1      0
18     0   0   1   0     1      0     0      1

After removing the columns for Brand A, Red, and $50:

Card   B   C   Blue   $100   $150
1      0   0   0      0      0
2      0   0   0      1      0
3      0   0   0      0      1
4      0   0   1      0      0
5      0   0   1      1      0
6      0   0   1      0      1
7      1   0   0      0      0
8      1   0   0      1      0
9      1   0   0      0      1
10     1   0   1      0      0
11     1   0   1      1      0
12     1   0   1      0      1
13     0   1   0      0      0
14     0   1   0      1      0
15     0   1   0      0      1
16     0   1   1      0      0
17     0   1   1      1      0
18     0   1   1      0      1
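The card list and the reduced 0/1 coding above can also be generated rather than typed. The following is a minimal Python sketch, not part of the original workbook; the names ATTRIBUTES, build_cards, and dummy_code are hypothetical, and it assumes the first level of each attribute (Brand A, Red, $50) is the one that gets dropped.

```python
from itertools import product

# Attribute levels from Step 1. The first level of each attribute is treated
# as the reference level whose column is removed (Brand A, Red, $50).
ATTRIBUTES = {
    "Brand": ["A", "B", "C"],
    "Color": ["Red", "Blue"],
    "Price": [50, 100, 150],
}


def build_cards():
    """Return every attribute combination, in the same order as cards 1..18."""
    return list(product(*ATTRIBUTES.values()))


def dummy_code(card):
    """Encode one card as 0/1 columns, omitting each reference level."""
    row = []
    for value, levels in zip(card, ATTRIBUTES.values()):
        row.extend(1 if value == level else 0 for level in levels[1:])
    return row


cards = build_cards()
design = [dummy_code(card) for card in cards]

# Column order of each row: Brand B, Brand C, Blue, $100, $150
for number, (card, row) in enumerate(zip(cards, design), start=1):
    print(number, card, row)
```

Dropping a different level per attribute would only shift the intercept and the zero point of each set of utilities; the differences between levels stay the same.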


Conjoint analysis gives a marketer a method to predict how much more or less a consumer will value one combination of product attributes over another. The degree to which a consumer likes a product attribute is called the "utility" of that attribute. For example, a product might come in three brands, two colors, and three price levels. Each brand, color, and price level has its own utility calculated during the conjoint analysis.

Conjoint analysis is done using Multiple Regression. Each product attribute variation is assigned as one of the independent variable inputs to the Multiple Regression equation. For example, the color red is represented by one independent variable while the color blue is represented by another. The resulting regression equation assigns a coefficient to each independent variable. These coefficients can be interpreted as the utilities of the attributes: the more positive a coefficient is, the more highly valued is the associated product attribute.

In this conjoint exercise, we are going to determine the utilities of eight product attributes: Brands A, B, and C; the colors Red and Blue; and the prices $50, $100, and $150. There are 18 possible combinations of these attributes (3 brands x 2 colors x 3 prices). The consumer rates each combination on a scale of 0 to 10 (10 being the best). The test results are recoded for the regression and then run through it. The regression calculates a coefficient for each independent variable, and each coefficient is the measure of value that the consumer places on the associated product attribute.

The Step 2 chart lists the choices that the consumer had to evaluate. The consumer was provided with 18 separate cards, each containing one of the 18 possible combinations of product attributes, and rated the overall preference for each combination. In the Step 3 chart, non-numerical attributes were assigned numbers: Brand A and Red are shown as 1, Brand B and Blue as 2, and Brand C as 3. The consumer's stated preferences were:

Card         1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18
Preference   5   5   0   8   5   2   7   5   3   9   6   5   10  7   5   9   7   8

In Step 4 the chart is further prepared for Regression Analysis. Each individual product attribute is given its own column, with a value of either 1 or 0.

One problem must be corrected before this data can be submitted for Regression Analysis. Independent variables, or combinations of independent variables, should not be able to predict each other. Using independent variables that are highly correlated with each other (either positively or negatively) produces a regression error known as collinearity. For example, if the color is either red or blue, knowing the state of one color tells us the state of the other (if Blue = 1, Red must be 0). The same error condition occurs when there are 3 variables: if you know the states of 2, you know the state of the remaining one.

These error conditions are solved by removing one column of data from each attribute. The columns for Brand A, Red, and Price level $50 were removed. We will see below that this has no effect on the accuracy of the Regression output.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.933190
R Square             0.870844
Adjusted R Square    0.812137
Standard Error       1.141319
Observations         17

ANOVA
             df    SS           MS
Regression   5     96.612473    19.322495
Residual     11    14.328704    1.302609
Total        16    110.941176

             Coefficients   Standard Error   t Stat
Intercept    5.916667       0.807035         7.331368
Brand B      1.513889       0.698912         2.166064
Brand C      3.347222       0.698912         4.789187
Blue         1.231481       0.559992         2.199105
$100         -2.319444      0.698912         -3.318648
$150         -4.319444      0.698912         -6.180237

Regression Equation

Combination Preference = 5.916667 + (1.513889)*(Brand B) + (3.347222)*(Brand C) + (1.231481)*(Blue) + (-2.319444)*($100) + (-4.319444)*($150)

Removing information about Brand A, Red, and Price level $50 did not hurt the output accuracy. These product attributes can still be considered part of the Regression equation, but with coefficients of 0.

The coefficients attached to each of the product attributes simply show the consumer's utility for that attribute. The utilities within each attribute are relative to each other.

For example, Price level $50 is the most preferred price with a utility of 0, while Price level $150 has the lowest utility at -4.319. Blue has a utility of 1.231, which is that much higher than the utility of Red, which is 0. Brand C is the most liked brand with a utility of 3.347, while Brand A is liked the least with a utility of 0.

The resulting Regression Equation still does a good job of predicting overall preference. For example, the consumer rated the combination of attributes on card 13 (Brand C, Red, $50) with a 10.

The predicted Combination Preference for card 13 is: (5.9167) + (3.3472)(1) = 9.264, which is close to the consumer's rating of 10.
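The same prediction can be made for any card by adding the intercept and the utility of each attribute level on the card. This is a small illustrative Python sketch; the function name predict_preference and the dictionary layout are assumptions, and the coefficients are the regression output rounded to four decimals.

```python
# Coefficients copied from the regression output, rounded to 4 decimals.
# The removed reference levels (Brand A, Red, $50) carry a utility of 0.
INTERCEPT = 5.9167
UTILITIES = {
    "Brand A": 0.0, "Brand B": 1.5139, "Brand C": 3.3472,
    "Red": 0.0, "Blue": 1.2315,
    "$50": 0.0, "$100": -2.3194, "$150": -4.3194,
}


def predict_preference(brand, color, price):
    """Predicted rating = intercept + utility of each attribute level."""
    return INTERCEPT + UTILITIES[brand] + UTILITIES[color] + UTILITIES[price]


card_13 = predict_preference("Brand C", "Red", "$50")
card_6 = predict_preference("Brand A", "Blue", "$150")
print(round(card_13, 3), round(card_6, 3))
```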

The regression appears to be a good one because Adjusted R Square is high (close to 1). R Square is the share of the variance in the ratings that the model explains, and Adjusted R Square corrects that share for the number of independent variables. Here, Adjusted R Square is 0.812.

Most of the variables have a low p-Value and are therefore significant predictors. Brand B (p = 0.053) and Blue (p = 0.050) fall just above the 0.05 level.

The absolute value of each coefficient indicates the size of its effect on the consumer's overall liking of the product. For example, Brand C (coefficient = 3.347) produced the highest positive influence, while the $150 price (coefficient = -4.319) reduces consumer liking the most.

The low Significance F of the regression (0.000143) indicates that the regression, overall, is valid.


ANOVA (continued)
             F           Significance F
Regression   14.833682   0.000143

             P-value    Lower 95%    Upper 95%
Intercept    0.000015   4.140396     7.692938
Brand B      0.053140   -0.024407    3.052185
Brand C      0.000563   1.808926     4.885518
Blue         0.050164   -0.001053    2.464016
$100         0.006848   -3.857740    -0.781149
$150         0.000069   -5.857740    -2.781149




Regression

Regression is a statistical technique that is used to create predictive models. The models receive input (independent) variables and predict the outcome of the dependent variable.

When performing Multiple Regression, Correlation Analysis should be performed on all independent and dependent variables first, as below.

Monthly Rates of Return

Date        S&P       Viacom    AT&T      GM        Coke
1/30/1998   0.8799    0.7541    2.1407    -4.6296   -18.8406
2/27/1998   7.5187    14.9701   -2.5948   18.986    6.6964
3/31/1998   5.558     11.9792   7.7869    -1.7226   -3.3473
4/30/1998   1.3716    7.907     -8.5551   -0.5535   5.8442
5/29/1998   -1.6289   -5.1724   1.2474    6.679     1.9427
6/27/1998   2.4171    3.4091    0.8214    1.8261    2.1063


Correlation Analysis
Tools / Data Analysis / Correlation

         S&P        Viacom      AT&T
S&P      1
Viacom   0.938662   1
AT&T     0.128558   -0.098933   1
GM       0.470349   0.350438    -0.263711
Coke     0.255053   0.342337    -0.501490
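Outside Excel, the same correlation matrix can be computed directly from the monthly returns. The sketch below is an illustration in plain Python, not the workbook's method; the helper name pearson and the RETURNS dictionary are assumptions, and the data are the six monthly returns listed above.

```python
from math import sqrt

# Monthly rates of return, January to June 1998, from the table above.
RETURNS = {
    "S&P":    [0.8799, 7.5187, 5.558, 1.3716, -1.6289, 2.4171],
    "Viacom": [0.7541, 14.9701, 11.9792, 7.907, -5.1724, 3.4091],
    "AT&T":   [2.1407, -2.5948, 7.7869, -8.5551, 1.2474, 0.8214],
    "GM":     [-4.6296, 18.986, -1.7226, -0.5535, 6.679, 1.8261],
    "Coke":   [-18.8406, 6.6964, -3.3473, 5.8442, 1.9427, 2.1063],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Print the lower triangle, in the same layout as Excel's Correlation tool.
names = list(RETURNS)
for i, row in enumerate(names):
    cells = [f"{pearson(RETURNS[row], RETURNS[col]):9.4f}" for col in names[: i + 1]]
    print(f"{row:7s}" + "".join(cells))
```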

Coke has a low correlation with the S&P (0.26) and is therefore not a good predictor of the S&P. Also, if two of the independent variables above are highly correlated with each other, only one of them should be used in the Multiple Regression below. This is not the case here because none of the independent variables used has a high correlation with another. Using highly correlated variables as inputs to a Multiple Regression causes an error called Multicollinearity and should be avoided.

Multiple Regressions should be built up by adding one new independent variable at a time and evaluating the results after each addition. Good new independent variables noticeably raise R Square and lower the Standard Error without causing much change to the Coefficients. Poor new independent variables don't change R Square much but have unpredictable effects on the Coefficients.

Tools / Data Analysis / Regression

Coke was not used because of its low correlation with the S&P. Viacom, AT&T, and GM were used; Viacom (0.94) and GM (0.47) have relatively high correlations with the S&P, and all three have low correlations with each other. Note that AT&T's correlation with the S&P (0.13) is actually lower than Coke's.

Regressions are Predictive, not Forecasting. New values of the independent variables must be chosen from within the range of the previously sampled values.

Multiple Regression - Predicting S&P returns from returns of other investments

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.987732
R Square             0.975615
Adjusted R Square    0.939038
Standard Error       0.821001
Observations         6

ANOVA
             df    SS          MS
Regression   3     53.935601   17.978534
Residual     2     1.348086    0.674043
Total        5     55.283687

             Coefficients   Standard Error   t Stat
Intercept    0.125062       0.441698         0.283140
Viacom       0.394221       0.052563         7.499941
AT&T         0.170135       0.070142         2.425593
GM           0.091267       0.047498         1.921506

Regression Equation

S&P = (0.125062) + (0.394221)*(Viacom) + (0.170135)*(AT&T) + (0.091267)*(GM)

Notes on the output:

Adjusted R Square - states that 94% of the variance of the S&P return is explained by the model. This is good.

The high coefficient of Viacom indicates that it is the biggest predictor of the S&P. Its high correlation with the S&P indicates this as well.

The standard error of the regression is used to determine confidence intervals. Approximate 95% confidence interval = Predicted S&P Value +/- z(95%) * (Standard Error)

MS (Mean Square) shows a high ratio of explained (regression) over unexplained (residual) variance. F Ratio = Explained variance (17.98) / Unexplained variance (0.67) = 26.67. This is high and is good. The low p-Value (Significance F) shows that the regression model is statistically significant.
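As a cross-check on the regression equation, the coefficients can be recomputed from the six monthly returns by ordinary least squares. The sketch below solves the normal equations in plain Python; it is an illustration rather than what Excel does internally, and the helper names solve and ols are assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] from X'X b = X'y."""
    rows = [[1.0] + list(values) for values in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * target for r, target in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


sp = [0.8799, 7.5187, 5.558, 1.3716, -1.6289, 2.4171]
viacom = [0.7541, 14.9701, 11.9792, 7.907, -5.1724, 3.4091]
att = [2.1407, -2.5948, 7.7869, -8.5551, 1.2474, 0.8214]
gm = [-4.6296, 18.986, -1.7226, -0.5535, 6.679, 1.8261]

coefficients = ols([viacom, att, gm], sp)  # intercept, Viacom, AT&T, GM
print([round(c, 4) for c in coefficients])
```

Solving the normal equations directly is fine for a handful of well-scaled variables like these; for larger or badly scaled problems a QR-based least-squares routine is the safer choice.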

Interpreting the Regression:

Low Significance of the F statistic (0.036) - indicates that, overall, the regression output is statistically significant (valid) at the 0.05 level of significance.

p-Values for each variable - The lower the p-Value, the better a predictor the variable is. Viacom returns are a good predictor of the S&P. AT&T and GM returns are much less effective predictors of the S&P return (higher p-Values) and would not be valid predictors at a 0.05 level of significance. The smaller coefficients of these two company returns point the same way, since all of the returns are measured in the same units.

Adding a new independent variable to a regression equation never decreases R Square, and almost always increases it.

Adjusted R Square increases only when the newly added independent variable improves the fit enough to offset the degree of freedom it uses up.
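Both summary figures can be reproduced from the numbers in the output. The sketch below uses the standard formulas Adjusted R Square = 1 - (1 - R Square)(n - 1)/(n - k - 1) and F = (SS regression / df regression) / (SS residual / df residual); the function names are assumptions chosen for this illustration.

```python
def adjusted_r_square(r_square, n_obs, n_predictors):
    """Penalize R Square for the number of independent variables used."""
    return 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_predictors - 1)


def f_ratio(ss_regression, df_regression, ss_residual, df_residual):
    """Explained mean square divided by unexplained mean square."""
    return (ss_regression / df_regression) / (ss_residual / df_residual)


# Values taken from the S&P regression output: 6 observations, 3 predictors.
adj = adjusted_r_square(0.97561511850796, n_obs=6, n_predictors=3)
f = f_ratio(53.9356008960948, 3, 1.34808615723855, 2)
print(round(adj, 4), round(f, 2))
```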

Correlation Analysis (continued) - GM and Coke columns

         GM         Coke
GM       1
Coke     0.627514   1


ANOVA (continued)
             F           Significance F
Regression   26.672677   0.036353

             P-value    Lower 95%    Upper 95%
Intercept    0.803686   -1.775409    2.025533
Viacom       0.017318   0.168060     0.620382
AT&T         0.136110   -0.131660    0.471930
GM           0.194617   -0.113099    0.295634



Testing Two Population Means To Determine If Change Occurred

The Confidence Interval or the t-Test can be used to determine if a population mean has changed.

Testing to determine if a change has occurred, for example after an ad campaign is run, using the Confidence Interval

DEALER   BEFORE Average Daily Sales   AFTER Average Daily Sales   Difference
A        100                          110                         10
B        130                          135                         5
C        120                          122                         2
D        140                          157                         17
E        155                          160                         5
F        200                          206                         6
G        300                          309                         9
H        260                          283                         23
I        190                          202                         12
J        185                          192                         7
K        100                          110                         10
L        130                          135                         5
M        120                          122                         2
N        140                          157                         17
O        155                          160                         5
P        200                          206                         6
Q        300                          309                         9
R        260                          283                         23
S        190                          202                         12
T        185                          192                         7
U        100                          110                         10
V        130                          135                         5
W        120                          122                         2
X        140                          157                         17
Y        155                          160                         5
Z        200                          206                         6
A1       300                          309                         9
B1       260                          283                         23
C1       190                          202                         12
D1       185                          192                         7

Testing to determine if a change has occurred, using the t-Test

t-Test - Paired Means
Sampling the same thing before and after to determine if something has changed, that is, whether the "after" samples are statistically different from the "before" samples. At least 30 samples should be taken unless the population is known to be normally distributed. (Here only 6 samples are taken for brevity.)

In this case, we want to determine with 95% certainty whether or not there has been a change from before to after. The Null Hypothesis is that the mean difference is 0, and α = 0.05 (1 - 0.95).

t-Test: Paired Two Sample for Means

Before    After
0.7541    -4.6296
14.9701   18.986
11.9792   -1.7226
7.907     -0.5535
-5.1724   6.679
3.4091    1.8261

For each column, Excel reports the Mean, Variance, and Observations, followed by the Pearson Correlation, Hypothesized Mean Difference, df, t Stat, and the one-tail and two-tail P(T<=t) and t Critical values.

P(T<=t) one-tail (0.289) is greater than α (0.05), so there has been no significant increase.

P(T<=t) two-tail (0.579) is greater than α (0.05), so there has been no significant change at all.

Here, because α is less than both p-Values, we cannot reject the Null Hypothesis in either case. The Null Hypothesis states that there is no change in the mean.
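The t Stat of the paired test is the mean of the before-minus-after differences divided by its standard error. The following Python sketch reproduces it from the six pairs above; the function name paired_t is an assumption, and the two-tail critical value of 2.5706 (df = 5, α = 0.05) is hard-coded as a constant rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

before = [0.7541, 14.9701, 11.9792, 7.907, -5.1724, 3.4091]
after = [-4.6296, 18.986, -1.7226, -0.5535, 6.679, 1.8261]


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for a paired-means t-test."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


t_stat, df = paired_t(before, after)

# Assumed constant: two-tail critical t for df = 5 at alpha = 0.05.
T_CRITICAL_TWO_TAIL = 2.5706
reject_null = abs(t_stat) > T_CRITICAL_TWO_TAIL
print(round(t_stat, 4), df, reject_null)
```

Comparing |t Stat| with the critical value is equivalent to comparing the two-tail p-Value with α, so the conclusion matches the one reached from the p-Values.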

Problem: A tire manufacturer wants to determine if a new rubber formulation will improve tire wear. 12 sets of tires were created with the old rubber formula and 12 sets with the new rubber formulation. They were placed on the following cars and driven until they wore out. Determine at a 0.05 level of significance whether the new rubber produces longer tread life.

Car   Tire Location   Old Rubber   New Rubber
1     Front           37661        31902
1     Rear            42342        41203
2     Front           31108        38816
2     Rear            41239        43305
3     Front           32903        35375
3     Rear            42658        52353
4     Front           29829        30883
4     Rear            39616        49424
5     Front           34625        38724
5     Rear            42650        43234
6     Front           31923        34565
6     Rear            39990        43861

The NULL Hypothesis here is that the mean tread life of the old rubber equals the mean tread life of the new rubber. The p-Values for both the one-tailed test (0.0178) and the two-tailed test (0.0355) are less than the level of significance (0.05), so the NULL Hypothesis is rejected. Therefore, we conclude at the 0.05 level of significance (95% confidence) that the new rubber compound increases tread life.

Problem: Evaluate the returns of these two stocks to determine if there is a real difference. Use a 0.05 level of significance.

t-Test: Two-Sample Assuming Unequal Variances

Each stock has 30 observations: the six monthly returns below, each appearing five times.

Viacom    GM
0.7541    -4.6296
14.9701   18.986
11.9792   -1.7226
7.907     -0.5535
-5.1724   6.679
3.4091    1.8261

For each stock, Excel reports the Mean, Variance, and Observations, followed by the Hypothesized Mean Difference, df, t Stat, and the one-tail and two-tail P(T<=t) and t Critical values.

Problem: A company is testing light bulbs from 2 suppliers. Below is listed the hours of usage before each sample burned out. Determine using a 0.05 level of significance whether the new supplier's light bulbs really last longer than the old supplier's.

Light Bulb Suppliers
t-Test: Two-Sample Assuming Equal Variances

Old   New
42    55
46    45
64    58
53    52
38    54
44    47
61    51
44    61
50    49
60    56
39    52
51    49
42
37
45
65
54
46
42
44
26
52

For each supplier, Excel reports the Mean, Variance, and Observations, followed by the Pooled Variance, Hypothesized Mean Difference, df, t Stat, and the one-tail and two-tail P(T<=t) and t Critical values.


In this case, we want to determine with 95% certainty whether an advertising campaign increased average daily sales across our large dealer network. To determine this, we must take Before and After samples of average daily sales from at least 30 dealers. The keys to success of this sampling are the following:

1) At least 30 dealers must be sampled.
2) Before and After samples must be taken from the same dealers.
3) The samples must be AVERAGE sales, for example, average daily sales over a week or a month. A sample of a single day's sales is not enough.
4) The dealers sampled must be random and representative of the overall population.

The Difference for each dealer is its After value minus its Before value. We are trying to determine whether the Mean Difference falls inside or outside the 95% Confidence Interval built around a Mean Difference of 0. If the Mean Difference falls within this interval, we cannot conclude at the 95% confidence level that any change occurred. If the Mean Difference falls outside this interval, we conclude with 95% confidence that average daily sales for the whole network have changed.

To determine the 95% Confidence Interval for a Mean of 0, we need the following information:

Sample size (COUNT) = 30 (at least 30 samples of daily averages from the same dealers are needed)
Sample Standard Deviation = 6.11
Sample Standard Error = 1.11 (Sample Standard Deviation / Square Root of Sample Size)
Sample Mean (AVERAGE) = 9.60
α (1 - Confidence Level) = 0.05 (for a 95% Confidence Interval)

The 95% confidence interval contains 95% of the area under the Normal curve. The remaining 5% (α) is split equally between the two outer tails of the curve.

The Z Score represents the right outer edge of the confidence interval. The total area under the Normal curve to the left of this Z value for a 95% two-tailed confidence interval is 97.5%. The Z Score for this is 1.96, meaning that 97.5% of the total area under the Normal curve lies to the left of the point 1.96 standard deviations to the right of the mean.

Z Score (two-tailed) for 95% CI = 1.96 = NORMSINV(0.975)

The 95% Confidence Interval around a Sample Mean of 0 = 0 +/- (Z Score for 95% CI) * (Sample Standard Error)

= 0 +/- (1.96) x (1.11)

The 95% Confidence Interval for the Mean = 0 is from -2.18 to +2.18

If the Sample Mean (9.60) is outside of the 95% Confidence Interval for the Mean Difference being 0, we can say with 95% certainty that average daily sales throughout the entire population of dealers have increased.

This is the case here, because the Mean of 9.60 is outside of the confidence interval of -2.18 to +2.18.

We can now state with 95% certainty that the daily sales of the dealer network changed following the advertising campaign.
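The confidence-interval check above can be written as a few lines of Python. This is a sketch under the assumption that the 30 differences are the After-minus-Before values of dealers A through D1, in which the same ten differences occur three times; the variable names are illustrative, and NormalDist().inv_cdf plays the role of Excel's NORMSINV.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# After minus Before for dealers A..J; the same ten values recur for
# dealers K..T and U..D1, giving 30 differences in total.
pattern = [10, 5, 2, 17, 5, 6, 9, 23, 12, 7]
differences = pattern * 3

n = len(differences)
mean_difference = mean(differences)
standard_error = stdev(differences) / sqrt(n)

# Two-tailed z score for a 95% interval, the equivalent of NORMSINV(0.975).
z = NormalDist().inv_cdf(0.975)
half_width = z * standard_error

# The interval is centred on 0, the mean difference under "no change".
changed = abs(mean_difference) > half_width
print(round(mean_difference, 2), round(half_width, 2), changed)
```

The half-width comes out a shade above the 2.18 shown earlier because the standard error is not rounded to 1.11 before multiplying.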



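t-Test: Paired Two Sample for Means (output)

                               Before      After
Mean                           5.641183    3.4309
Variance                       55.626486   72.498468
Observations                   6           6
Pearson Correlation            0.350438
Hypothesized Mean Difference   0
df                             5
t Stat                         0.592078
P(T<=t) one-tail               0.289778
t Critical one-tail            2.015048
P(T<=t) two-tail               0.579556
t Critical two-tail            2.570582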




t-Test: Paired Two Sample for Means (output)

                               Old Rubber   New Rubber
Mean                           37212        40303.75
Variance                       23678506     43699518.39
Observations                   12           12
Pearson Correlation            0.736490
Hypothesized Mean Difference   0
df                             11
t Stat                         -2.395092
P(T<=t) one-tail               0.017770
t Critical one-tail            1.795885
P(T<=t) two-tail               0.035540
t Critical two-tail            2.200985

The NULL Hypothesis here is that the mean tread life of the old rubber equals the mean tread life of the new rubber. The p-Values for both the one-tailed test (0.0178) and the two-tailed test (0.0355) are less than the level of significance (0.05), so the NULL Hypothesis is rejected. Because the new rubber has the higher sample mean, we can say with 95% certainty that the new rubber compound increases tread life.
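The t Stat in the table above can be rebuilt from the summary rows alone. The following is a minimal sketch, assuming the test is a paired one (which the Pearson Correlation row and df = 11 suggest); the function name `paired_t_from_summary` is made up for illustration and is not part of Excel.

```python
import math

def paired_t_from_summary(mean1, mean2, var1, var2, corr, n):
    """t statistic of a paired test, rebuilt from summary statistics."""
    # Var(X - Y) = Var(X) + Var(Y) - 2 * Cov(X, Y), with Cov = r * sd_x * sd_y
    covariance = corr * math.sqrt(var1 * var2)
    var_of_differences = var1 + var2 - 2 * covariance
    std_error = math.sqrt(var_of_differences / n)
    return (mean1 - mean2) / std_error

# Values copied from the Old Rubber / New Rubber table
t_stat = paired_t_from_summary(37212, 40303.75, 23678506, 43699518.3864,
                               0.736490409117489, 12)
print(round(t_stat, 4))  # close to the t Stat row of the table
```

The negative sign only reflects the order of subtraction (old minus new); the new rubber has the larger mean.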

Evaluate the returns of these two stocks to determine if there is a real difference. Use a 0.05 level of significance.

t-Test: Two-Sample Assuming Unequal Variances

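                               Viacom            GM
Mean                           5.6411833333333   3.4309
Variance                       47.953867349713   62.498679055
Observations                   30                30
Hypothesized Mean Difference   0
df                             57
t Stat                         1.1519157328853
P(T<=t) one-tail               0.1270821737375
t Critical one-tail            1.6720288889437
P(T<=t) two-tail               0.254164347475
t Critical two-tail            2.0024654439045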

The p-Values for both the one-tailed test (0.127) and the two-tailed test (0.254) are greater than the stated level of significance (0.05), so it cannot be stated with 95% certainty that there is a difference in the returns of these companies.

The NULL Hypothesis that the means of both returns are equal cannot be rejected.
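As a cross-check of the unequal-variances output, here is a small sketch that recomputes the t statistic and the approximate degrees of freedom from the summary rows. It assumes the usual Welch-Satterthwaite formula for the degrees of freedom (Excel reports the rounded value); `welch_t` is a hypothetical helper name.

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic and Welch-Satterthwaite df (unequal variances)."""
    a = var1 / n1  # squared standard error contributed by sample 1
    b = var2 / n2  # squared standard error contributed by sample 2
    t_stat = (mean1 - mean2) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t_stat, df

# Values copied from the Viacom / GM table
t_stat, df = welch_t(5.6411833333333, 47.953867349713, 30,
                     3.4309, 62.498679055, 30)
print(round(t_stat, 4), round(df))
```

Since the t statistic is smaller than both critical values in the table, the sketch leads to the same decision as the p-Values: the NULL Hypothesis is not rejected.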

A company is testing light bulbs from 2 suppliers. Below is listed the hours of usage before each sample burned out. Determine using a 0.05 level of significance whether the new supplier's light bulbs really last longer than the

t-Test: Two-Sample Assuming Equal Variances

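                               Old               New
Mean                           47.5              52.416666667
Variance                       90.547619047619   21.537878788
Observations                   22                12
Pooled Variance                66.825520833333
Hypothesized Mean Difference   0
df                             32
t Stat                         -1.675954000014
P(T<=t) one-tail               0.051746313977
t Critical one-tail            1.6938887025919
P(T<=t) two-tail               0.1034926279541
t Critical two-tail            2.036933334407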

The one-tailed p-value (one-tailed because we are only testing if one is better) is 0.0517, which is very close to but still greater than the stated level of significance (0.05), so we cannot reject the NULL Hypothesis, which states that the mean light bulb life for both suppliers is the same.
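The Pooled Variance and t Stat rows of the equal-variances output can be reproduced from the means, variances, and observation counts. This is a sketch only; `pooled_t` is an illustrative name, not an Excel function.

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic assuming equal population variances."""
    df = n1 + n2 - 2
    # Weighted average of the two sample variances
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, pooled_var, df

# Values copied from the Old / New light bulb table
t_stat, pooled_var, df = pooled_t(47.5, 90.547619047619, 22,
                                  52.416666667, 21.537878788, 12)
print(round(t_stat, 4), round(pooled_var, 4), df)
```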

In this case, we want to determine with 95% certainty whether an advertising campaign increased average daily sales across our large dealer network. To determine this, we must take Before and After samples of average daily sales from at least 30 dealers.

2) Before and After samples must be taken from the same dealers
3) The samples must be AVERAGE sales, for example, average daily sales over a week or a month. It cannot just be one sample of one day's sales
4) The dealers sampled must be random and representative of the overall population.

We are trying to determine whether the Mean Difference falls inside or outside the 95% Confidence Interval built around a Mean Difference of 0. If the Mean Difference falls within this 95% Confidence Interval, we cannot conclude at the 95% level that any change occurred. If the Mean Difference falls outside this Confidence Interval, we can say with 95% certainty that average daily sales for the whole network have changed.

We can state with 95% certainty that there has been no significant change if the Average (Mean) Difference is within the 95% Confidence Interval of this mean being 0. To determine the 95% Confidence Interval for a 0 Mean, we need the following information:

Need at least 30 samples of daily averages from the same dealers

Sample Standard Error = (Sample Standard Deviation) / (Square Root of Sample Size)

(for 95% Confidence Interval, α = 0.05)

The 95% confidence interval will contain 95% of the area under the Normal curve. The remaining 5% (α) will be split evenly between the two outer tails of the Normal curve. The Z Score represents the right outer edge of the confidence interval. The total area under the Normal curve to the left of this Z value for a 95% two-tailed confidence interval is 97.5%. The Z Score for this is 1.96. This means that 97.5% of the total area under the Normal curve lies to the left of Z = 1.96.

NORMSINV(0.975)

The 95% Confidence Interval around a Sample Mean of 0 = 0 +/- (Z Score for 95% CI) * (Sample Standard Error)
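The interval formula above can be sketched in a few lines of Python. `NormalDist().inv_cdf` plays the role of NORMSINV here. The standard deviation and sample size passed in below are hypothetical values chosen only to make the arithmetic easy to follow; they are not taken from the worksheet.

```python
import math
from statistics import NormalDist

def zero_mean_interval(sample_std_dev, sample_size, confidence=0.95):
    """Confidence interval around a mean difference of 0."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # same role as NORMSINV(0.975)
    std_error = sample_std_dev / math.sqrt(sample_size)
    margin = z * std_error
    return -margin, margin

# Hypothetical inputs for illustration only
low, high = zero_mean_interval(sample_std_dev=6.0, sample_size=36)
print(round(low, 2), round(high, 2))
```

An observed mean difference that falls outside (low, high) would lead to the conclusion that a change occurred.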

If the Sample Mean (9.60) is outside of the 95% Confidence Interval for the Mean Difference being 0, we can say with 95% certainty that average daily sales throughout the entire population of dealers has increased.

This is the case because Mean of 9.60 is outside of the confidence interval of -2.18 to +2.18

We can now state with 95% certainty that the advertising campaign has caused a change in the daily sales of the dealer network.

Analysis of Variance - ANOVA

ANOVA is a technique for testing the equality of different population means. ANOVA is very useful because it can be extended to any number of populations. Every ANOVA tests the NULL Hypothesis, which states that all samples were drawn from populations with the same mean.

ANOVA is often used by marketers to test whether different marketing campaigns with multiple varying elements actually yielded different results.

The NULL Hypothesis is rejected (that is, there are real differences between the means) if the p-Value pertaining to the item being evaluated is less than the desired level of significance. For example, in the 1st ANOVA below, the p-Value pertaining to "Between Groups" (the Methods) is less than the desired level of significance, so there is a difference between the groups.

Anova: Single Factor - Single Factor Analysis Calculated by Excel
The Hand Calculation of this ANOVA is performed at the bottom of this worksheet

Problem: 3 different sales training methods are used. Three groups of four randomly chosen new salespeople are chosen. Each group is trained using one of the methods. After the course is completed, sales totals of each salesperson over the next two weeks are collected.

Determine at a 0.05 level of significance whether there is a difference in the effectiveness of the courses.

Anova: Single Factor

SUMMARY
Groups     Count   Sum
Method 1   4       68
Method 2   4       80
Method 3   4       92

ANOVA
Source of Variation   SS    df
Between Groups        72    2
Within Groups         46    9

Total                 118   11

The p-Value for Methods (Between Groups, which are the Methods) (0.014419201) is less than the level of significance (0.05), so there is a difference between the effectiveness of the training methods.

The p-Value calculated by Excel agrees with the hand-calculated p-Value, which is less than the level of significance. This indicates that there is a real difference in the effectiveness between the courses.

Anova: Two Factor - Two Factor Without Replication

Two factors are being evaluated and each test is performed only once.

Problem: Here are 3 different types of typing keyboards. 5 typists (Typist 1 to Typist 5) each got to use all three keyboards. Here are the typing speeds of each typist on each of the 3 keyboard types. Determine at a 0.01 level of significance (99% certainty) whether typing speed differs between the 3 keyboard types.

In this example, the two factors that influence the speed of typing are 1) the keyboard, and 2) the typing ability of each typist.

Anova: Two-Factor Without Replication

SUMMARY      Count   Sum
Typist 1     3       180
Typist 2     3       338
Typist 3     3       141
Typist 4     3       303
Typist 5     3       216

Keyboard A   5       375
Keyboard B   5       379
Keyboard C   5       424

ANOVA
Source of Variation   SS                 df
Rows                  9151.06666666667   4
Columns               296.133333333333   2
Error                 94.5333333333329   8

Total                 9541.73333333333   14

The p-Value for the Rows (5.42004E-08) is much less than the level of significance stated in the problem (0.01), so there is a difference between the speeds of the typists.

The p-Value for Columns (0.003428581) is less than the level of significance (0.01), so there is a difference between keyboards regarding typing speed.

Anova: Two Factor - Two Factor With Replication

Two factors are being evaluated and the tests are performed more than once (in this case, each test is performed in two markets).

Problem: A perfume company was testing a product using 3 different advertising focuses (Sophisticated, Athletic, Popular), 3 different package designs (Design 1, Design 2, Design 3), and testing in 2 separate markets. Using a 0.05 level of significance, determine whether 1) Advertising Focus, 2) Package Design, or 3) the Interaction between them had any effect on sales. The chart shows the sales with each combination in each of the two markets.

Anova: Two-Factor With Replication

SUMMARY    Sophisticated         Athletic

Design 1
Count      2                     2
Sum        5.53                  3.37
Average    2.765                 1.685
Variance   0.00244999999999999   0.25205

Design 2
Count      2                     2
Sum        5.97                  2.9
Average    2.985                 1.45
Variance   0.186049999999998     0.005

Design 3
Count      2                     2
Sum        5.13                  6.03
Average    2.565                 3.015
Variance   0.00124999999999999   0.03645

Total
Count      6                     6
Sum        16.63                 12.3
Average    2.77166666666667      2.05
Variance   0.0732566666666671    0.62848

ANOVA
Source of Variation   SS                 df
Sample                0.80721111111111   2
Columns               4.99107777777778   2
Interaction           2.27712222222222   4
Within                1.0447             9

Total                 9.12011111111111   17

The p-Value for Sample (0.076062) is more than the level of significance (0.05). We cannot reject the NULL Hypothesis that states that the package design does not affect sales.

The p-Value for Columns (0.00037339) is less than the level of significance (0.05). This indicates that the advertising strategies affect sales differently.

The p-Value for Interaction (0.022409) is less than the level of significance (0.05). This indicates that different combinations of package design and ad campaign have different effects on sales.

Anova: Single Factor - Single Factor Analysis Calculated by Hand
(Excel calculation of Single Factor ANOVA is shown at the top of this Worksheet)

Problem: 3 different sales training methods are used. Three groups of four randomly chosen new salespeople are chosen. Each group is trained using one of the methods. After the course is completed, sales totals of each salesperson over the next two weeks are collected.

Determine at a 0.05 level of significance whether there is a difference in the effectiveness of the courses.

Method 1: 16, 21, 18, 13

Column Total 68

Column Mean 17

Grand Mean = (17 + 20 + 23) / 3

Grand Mean = 20

Column Mean - Grand Mean -3

(Column Mean - Grand Mean)^2 9

# Rows x [ (Column Mean - Grand Mean)^2 ] 36

Sum of Squares Between Groups = 36 + 0 + 36 = 72

Degrees of Freedom

Between Groups DOF = # groups - 1 = c - 1 = 3 - 1 = 2

Within Groups DOF = C(r-1) = 3 (4 - 1) = 9

Total Degrees of Freedom = 11

Sum of Squares

Between Groups Sum of the Squares   72
Sum of Squares Within Groups        46

Total Sum of the Squares            118

Mean Squares

MS = Mean Square = Sum of Squares / degrees of freedom

                 SS   df
Between Groups   72   2
Within Groups    46   9

F Statistic

F Statistic = (MS Between Groups) / (MS Within Groups)
F Statistic = 36 / 5.111111 = 7.04347826087

p Value

p-Value = FDIST(F Statistic,DOF Between Groups,DOF Within Groups) =

p-Value = FDIST(7.043478,2,9) = 0.014419202927

The p-value of 0.014419 is less than the designated level of significance of 0.05. This indicates that there is less than a 5% chance that this result could have occurred if there was no difference in effectiveness between the courses. Therefore, there is at least 95% certainty that there is a real difference in effectiveness of the courses.
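The hand calculation above can be retraced in code. The sketch below uses the twelve sales totals from this problem and follows the same steps: Sum of Squares Between Groups, Sum of Squares Within Groups, Mean Squares, and the F Statistic. The p-value line uses a closed form for the right tail of the F distribution that is valid only when the Between Groups degrees of freedom equal 2 (as they do here); it is a shortcut for this sketch and not how FDIST works in general.

```python
def one_way_anova(samples):
    """Single factor ANOVA sums of squares, degrees of freedom, and F."""
    all_values = [x for s in samples for x in s]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(s) / len(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2
                     for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2
                    for s, m in zip(samples, means) for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

methods = [
    [16, 21, 18, 13],  # Method 1
    [19, 20, 21, 20],  # Method 2
    [24, 21, 22, 25],  # Method 3
]
ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(methods)

# Right-tail area of F(2, df_w). Valid ONLY because df_b == 2 here.
p_value = (1 + 2 * f_stat / df_w) ** (-df_w / 2)
print(ss_b, ss_w, round(f_stat, 6), round(p_value, 6))
```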


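Teaching Method
            Method 1   Method 2   Method 3
Student 1   16         19         24
Student 2   21         20         21
Student 3   18         21         22
Student 4   13         20         25

Groups     Average   Variance
Method 1   17        11.3333333
Method 2   20        0.66666667
Method 3   23        3.33333333

Source of Variation   MS            F            P-value       F crit
Between Groups        36            7.04347826   0.014419201   4.25649472914256
Within Groups         5.111111111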


           Keyboard A   Keyboard B   Keyboard C
Typist 1   51           57           72
Typist 2   109          112          117
Typist 3   47           43           51
Typist 4   98           98           107
Typist 5   70           69           77


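             Average       Variance
Typist 1     60            117
Typist 2     112.6666667   16.3333333
Typist 3     47            16
Typist 4     101           27
Typist 5     72            19

Keyboard A   75            767.5
Keyboard B   75.8          819.7
Keyboard C   84.8          724.2

Source of Variation   MS            F            P-value       F crit
Rows                  2287.766667   193.605078   5.42004E-08   7.00607662307967
Columns               148.0666667   12.5303244   0.003428581   8.64911064074453
Error                 11.81666667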
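The Rows, Columns, and Error lines of the Two-Factor Without Replication table can be recomputed from the typing speeds. The following is a sketch under the usual textbook decomposition (total variation = rows + columns + error); the function name is illustrative.

```python
def two_way_anova_no_replication(table):
    """Sums of squares and F statistics for a rows-by-columns table
    with exactly one observation per cell."""
    r, c = len(table), len(table[0])
    grand_mean = sum(map(sum, table)) / (r * c)
    row_means = [sum(row) / c for row in table]
    col_means = [sum(row[j] for row in table) / r for j in range(c)]
    ss_rows = c * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_error = ss_error / ((r - 1) * (c - 1))
    f_rows = (ss_rows / (r - 1)) / ms_error
    f_cols = (ss_cols / (c - 1)) / ms_error
    return ss_rows, ss_cols, ss_error, f_rows, f_cols

speeds = [            # columns: Keyboard A, Keyboard B, Keyboard C
    [51, 57, 72],     # Typist 1
    [109, 112, 117],  # Typist 2
    [47, 43, 51],     # Typist 3
    [98, 98, 107],    # Typist 4
    [70, 69, 77],     # Typist 5
]
ss_rows, ss_cols, ss_error, f_rows, f_cols = two_way_anova_no_replication(speeds)
print(round(ss_rows, 4), round(ss_cols, 4), round(ss_error, 4))
print(round(f_rows, 3), round(f_cols, 4))
```

Both F values exceed their F crit values in the table (7.006 for Rows and 8.649 for Columns), which agrees with the p-Value comparison.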


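Use "2 Rows Per Sample"

           Sophisticated   Athletic   Popular
Design 1   2.80            2.04       1.58
           2.73            1.33       1.26
Design 2   3.29            1.50       1.00
           2.68            1.40       1.82
Design 3   2.54            3.15       1.92
           2.59            2.88       1.33

SUMMARY    Popular   Total

Design 1
Count      2         6
Sum        2.84      11.74
Average    1.42      1.95666667
Variance   0.0512    0.46722667

Design 2
Count      2         6
Sum        2.82      11.69
Average    1.41      1.94833333
Variance   0.3362    0.75057667

Design 3
Count      2         6
Sum        3.25      14.41
Average    1.625     2.40166667
Variance   0.17405   0.44477667

Total
Count      6
Sum        8.91
Average    1.485
Variance   0.12407

Source of Variation   MS            F            P-value       F crit
Sample                0.403605556   3.4770269    0.076062669   4.25649472914256
Columns               2.495538889   21.4988513   0.00037339    4.25649472914256
Interaction           0.569280556   4.90430267   0.022409688   3.63308851150156
Within                0.116077778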
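The four sums of squares of the Two-Factor With Replication table (Sample, Columns, Interaction, Within) can be recomputed from the eighteen sales figures. This is a sketch of the standard decomposition for a balanced design with the same number of replicates in every cell; the function name and dictionary layout are illustrative choices.

```python
def two_way_anova_with_replication(cells):
    """cells[row][col] is the list of replicate observations in that cell."""
    rows = list(cells)
    cols = list(cells[rows[0]])
    n = len(cells[rows[0]][cols[0]])  # replicates per cell (2 markets here)

    def mean(xs):
        return sum(xs) / len(xs)

    values = [x for r in rows for c in cols for x in cells[r][c]]
    grand = mean(values)
    row_means = [mean([x for c in cols for x in cells[r][c]]) for r in rows]
    col_means = [mean([x for r in rows for x in cells[r][c]]) for c in cols]
    ss_sample = len(cols) * n * sum((m - grand) ** 2 for m in row_means)
    ss_columns = len(rows) * n * sum((m - grand) ** 2 for m in col_means)
    ss_within = sum((x - mean(cells[r][c])) ** 2
                    for r in rows for c in cols for x in cells[r][c])
    ss_total = sum((x - grand) ** 2 for x in values)
    ss_interaction = ss_total - ss_sample - ss_columns - ss_within
    return ss_sample, ss_columns, ss_interaction, ss_within

sales = {
    "Design 1": {"Sophisticated": [2.80, 2.73], "Athletic": [2.04, 1.33], "Popular": [1.58, 1.26]},
    "Design 2": {"Sophisticated": [3.29, 2.68], "Athletic": [1.50, 1.40], "Popular": [1.00, 1.82]},
    "Design 3": {"Sophisticated": [2.54, 2.59], "Athletic": [3.15, 2.88], "Popular": [1.92, 1.33]},
}
ss_sample, ss_columns, ss_interaction, ss_within = two_way_anova_with_replication(sales)
print(round(ss_sample, 4), round(ss_columns, 4),
      round(ss_interaction, 4), round(ss_within, 4))
```

Dividing each sum of squares by its degrees of freedom (2, 2, 4, and 9) gives the MS column, and dividing each MS by the Within MS gives the F column.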


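Anova: Single Factor - Single Factor Analysis Calculated by Hand

                                            Method 2   Method 3
                                            19         24
                                            20         21
                                            21         22
                                            20         25
Column Total                                80         92
Column Mean                                 20         23
Column Mean - Grand Mean                    0          3
(Column Mean - Grand Mean)^2                0          9
# Rows x [ (Column Mean - Grand Mean)^2 ]   0          36

Sum of Squares Within Treatments = 34 + 2 + 10 = 46

MS
Between Groups   36
Within Groups    5.111111111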

The p-Value represents the proportion of area under the F Distribution curve to the right of the given F value. If this p-Value is less than the stated level of significance, this demonstrates that there is a difference in the objects or processes being analyzed. In other words, there is a difference in the group means.


               Method 1   Method 2   Method 3
               16         19         24
               21         20         21
               18         21         22
               13         20         25
Column Total   68         80         92
Column Mean    17         20         23

Subtract the Column Mean from each value

Method 1   Method 2   Method 3
16 - 17    19 - 20    24 - 23
21 - 17    20 - 20    21 - 23
18 - 17    21 - 20    22 - 23
13 - 17    20 - 20    25 - 23

Method 1   Method 2   Method 3
-1         -1         1
4          0          -2
1          1          -1
-4         0          2

Square each

           Method 1   Method 2   Method 3
           1          1          1
           16         0          4
           1          1          1
           16         0          4
Column Sum 34         2          10

Sum of Squares Within Groups = 34 + 2 + 10 = 46


Determining if Population Variance Has Changed - Uses Chi Squared Distribution

Quality control people use the Chi Square test to determine if a process's variance levels are staying within given limits.

The Chi Square Distribution is used to determine if a population's variance has changed. The Chi Square Distribution is skewed to the right. Its mean falls at the point on the x axis that equals the number of degrees of freedom (n - 1, that is, Sample Size - 1), and the high point of the curve occurs slightly to the left of that point (at degrees of freedom - 2). The total area under the Chi Square curve is 1.0. The area under the curve to the left or right of the Chi Square Statistic determines whether it can be said with a certain degree of confidence that the population variance has changed. If the area outside the Chi Square Statistic (the p value) is less than the desired level of significance, then the population variance has changed.

If Sample Standard Deviation, s, is greater than Population Standard Deviation, σ, then the Chi Square Statistic will be to the right of (greater than) the degrees of freedom point, and the p value produced by CHIDIST(Chi Square Statistic, degrees of freedom) will be the p value of the right tail.

If Sample Standard Deviation, s, is less than Population Standard Deviation, σ, then the Chi Square Statistic will be to the left of (less than) the degrees of freedom point, and the value produced by CHIDIST(Chi Square Statistic, degrees of freedom) will still be the area under the Chi Square curve to the right of the Chi Square Statistic point. To get the area under the left tail (the area to the left of the Chi Square point), use p value = 1 - CHIDIST(Chi Square Statistic, degrees of freedom).

Test on Whether a Population Variance Has Increased Above a Given Value

Problem: A manufacturer wants to check if the variance on a process has changed. A machine drills a hole as part of the manufacturing process. The standard deviation of the hole diameter has historically been 1.6 ml. A random sample of 50 hole diameters was checked in one batch. The measured sample standard deviation was 1.9 ml. At a 0.05 level of significance, has the population standard deviation increased above 1.6 ml?

Givens:
n = 50
Degrees of Freedom = n - 1 = 49
Level of Significance, α = 0.05
Population Standard Deviation, σ = 1.6
Sample Standard Deviation, s = 1.9

Use the Chi Squared Test to determine if there has been a change in variance.

1) Calculate Chi Square Statistic, = [ (n-1)*(s*s) ] / (σ*σ) = 69.09766

2) Obtain p value from Chi Square Statistic

Upper p value = CHIDIST(69.09766,49) = 0.030749

This p value is the portion of the total area under the Chi Square distribution curve for 49 degrees of freedom that lies to the right of the Chi Square Statistic. The Chi Square Statistic is calculated from the degrees of freedom (n - 1), the population standard deviation, and the sample standard deviation. If the p value is greater than the level of significance we are evaluating (α = 0.05 on a one-tailed test), then we do not reject the NULL Hypothesis.

In this case the p value (0.030749) is less than the desired level of significance (α = 0.05), so we reject the NULL Hypothesis.

It appears that the population standard deviation has increased above 1.6 ml.

Test on Whether a Population Variance Has Decreased Below a Given Value

Problem: A manufacturer wants to check if the variance on a process has changed. A machine drills a hole as part of the manufacturing process. The standard deviation of the hole diameter has historically been 1.6 ml. The engineers believe that they have improved the process. A random sample of 50 hole diameters was checked in one batch. The measured sample standard deviation was 1.375 ml. At a 0.05 level of significance, has the population standard deviation decreased below 1.6 ml?

Givens:
n = 50
Degrees of Freedom = n - 1 = 49
Level of Significance, α = 0.05
Population Standard Deviation, σ = 1.6
Sample Standard Deviation, s = 1.375

Use the Chi Squared Test to determine if there has been a change in variance.

1) Calculate Chi Square Statistic, = [ (n-1)*(s*s) ] / (σ*σ) = 36.18774

2) Obtain p value from Chi Square Statistic

Area under curve to right = CHIDIST(36.18774,49) = 0.912951

p value = Area to the left of Chi Square point = 1 - CHIDIST(36.18774,49) = 0.087049

This p value is the portion of the total area under the Chi Square distribution curve for 49 degrees of freedom that lies to the left of the Chi Square Statistic. The Chi Square Statistic is calculated from the degrees of freedom (n - 1), the population standard deviation, and the sample standard deviation. If the p value is greater than the level of significance we are evaluating (α = 0.05 on a one-tailed test), then we do not reject the NULL Hypothesis.

In this case the p value (0.087049) is greater than the desired level of significance (α = 0.05), so we do not reject the NULL Hypothesis that there has been no change.

It appears that the population standard deviation has not decreased below 1.6 ml.
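Both Chi Square tests above can be retraced with a short script. The Python standard library has no equivalent of CHIDIST, so this sketch approximates the right-tail area by integrating the Chi Square density with Simpson's rule. That numerical shortcut is an assumption of this example (reasonable for degrees of freedom above 2), not a description of how Excel computes CHIDIST, and both function names are made up for illustration.

```python
import math

def chi_square_statistic(n, sample_sd, population_sd):
    """[(n - 1) * s^2] / sigma^2"""
    return (n - 1) * sample_sd ** 2 / population_sd ** 2

def chi_square_right_tail(x, df, steps=2000):
    """Approximate area to the right of x (the role CHIDIST plays),
    using Simpson's rule on the density from 0 to x. steps must be even."""
    log_norm = (df / 2) * math.log(2) + math.lgamma(df / 2)

    def pdf(t):
        if t <= 0:
            return 0.0
        return math.exp((df / 2 - 1) * math.log(t) - t / 2 - log_norm)

    h = x / steps
    total = pdf(0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - total * h / 3

# Test for an increase: s = 1.9 against sigma = 1.6, n = 50
stat_up = chi_square_statistic(50, 1.9, 1.6)
p_up = chi_square_right_tail(stat_up, 49)          # right tail

# Test for a decrease: s = 1.375 against sigma = 1.6, n = 50
stat_down = chi_square_statistic(50, 1.375, 1.6)
p_down = 1 - chi_square_right_tail(stat_down, 49)  # left tail

print(round(stat_up, 5), round(p_up, 4))
print(round(stat_down, 5), round(p_down, 4))
```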


Normal Distribution

The Normal distribution is a continuous distribution, as opposed to a discrete distribution such as the binomial distribution, which is a set of discrete points.

Any Normal distribution can be identified by two parameters: the mean µ and the standard deviation σ.

The area under the entire density function = 1.

Most problems involving the Normal distribution fall into two categories:

1) Determining the probability of a normally distributed random variable having a value within a given interval

2) Determining a Confidence Interval - that is - Determining an interval within which the value of a normally distributed random variable will fall with a given probability

To be able to apply the Normal distribution, it is extremely important that the underlying population can be shown to be normally distributed. This is often not the case.

For any population, whether Normally distributed or not, the distribution of x bar (the average of each sample) will be approximately Normally distributed if the sample size is large (30 or more). This is a basic tenet of the Central Limit Theorem, Statistics' most fundamental rule. It is important to note that the problems on this page do not deal with samples. These problems only use parameters of entire populations.

z = number of standard deviations that a point lies from the mean

Population Mean = µ = "mu"

Population Standard Deviation = σ = "sigma"

z = ( x - µ ) / σ = ( x - mean ) / ( Length of 1 Standard Deviation )

The z distribution, sometimes called the standard normal distribution, is a normal distribution with the mean, µ, = 0 and the standard deviation, σ, = 1.

Population parameters are generally described with Greek letters, such as µ (population mean) and σ (population standard deviation), while Sample statistics are generally described with Roman letters, such as x bar (sample mean) and s (sample standard deviation).

Statistical Function NORMSDIST(z) tells what percentage of the total area under the standardized normal curve (mean = 0 and standard deviation = 1) is to the left of a point z standard deviations from the mean, which is 0.

NORMSDIST(0) = 0.5 This means that half of the area under the standardized normal curve exists to the left of z when z = 0 (z is exactly on top of the mean, that is, 0 standard deviations away from the mean).

NORMSDIST(1.96) = 0.975 This means that 97.5% of the total area under the standardized normal curve is to the left of z when z is 1.96 standard deviations above the mean. This point of z = 1.96 is often used to calculate the 95% Confidence Interval. That is, the section under the normal curve that starts at 1.96 standard deviations to the left of the mean and extends to 1.96 standard deviations to the right of the mean will contain 95% of the total area under the bell-shaped Normal curve.

Statistical Function NORMSINV(probability) is the inverse of NORMSDIST. It returns the z value (the number of standard deviations from the mean) for which the area under the standardized normal curve to the left of z equals the probability given as the argument.

NORMSINV(0.975) = 1.96 This means that 97.5% of the total area under the normal curve is to the left of the point 1.96 standard deviations above the mean.

Statistical Function NORMDIST(x, mean, standard dev, TRUE) calculates the area under the curve to the left of point x on a normal curve with the given mean and standard deviation. (The TRUE argument requests the cumulative area. It is nearly always TRUE.)

NORMDIST(1.96,0,1,TRUE) = 0.975 Setting the mean to 0 and the standard deviation to 1 makes it a standardized Normal curve, like the problem above.
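For readers working outside Excel, the same three lookups can be sketched with Python's standard library. `statistics.NormalDist` is used here as a stand-in: `cdf` plays the role of NORMSDIST and NORMDIST(..., TRUE), and `inv_cdf` plays the role of NORMSINV.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)  # the standardized Normal curve

area_left_of_mean = standard.cdf(0)    # like NORMSDIST(0)
area_left_of_196 = standard.cdf(1.96)  # like NORMSDIST(1.96) or NORMDIST(1.96,0,1,TRUE)
z_for_975 = standard.inv_cdf(0.975)    # like NORMSINV(0.975)

print(area_left_of_mean, round(area_left_of_196, 3), round(z_for_975, 2))
```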

Problem: A store has normally distributed daily sales. The average daily sales = $2,000 and the daily sales standard deviation = $500. What is the probability that the sales of one random day will be below $1,000?

Population Mean = µ = "mu" = $2,000

Population Standard Deviation = σ = "sigma" = $500

x = $1,000

NORMDIST(1000,2000,500,TRUE) = 0.02275

= 2.28% This can be interpreted as saying that only 2.28% of the total area under this particular Normal curve falls to the left of x = 1,000.

Problem: A brand of car has a mean fuel consumption of 27 mpg with a standard deviation of 5 mpg. What percentage of the cars can be expected to have a fuel consumption of between 25 mpg and 30 mpg? Fuel consumption is normally distributed for this population.

Percentage of cars with fuel efficiency between 25 mpg and 30 mpg =

Percentage of cars with fuel efficiency less than 30 mpg - Percentage of cars with fuel efficiency less than 25 mpg = NORMDIST(30,27,5,TRUE) - NORMDIST(25,27,5,TRUE) = 0.725747 - 0.344578 = 0.381169, or about 38.1%
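The two probability problems above (daily sales below $1,000 and fuel consumption between 25 and 30 mpg) can be checked with a short sketch that uses `statistics.NormalDist.cdf` in place of NORMDIST(x, mean, standard dev, TRUE).

```python
from statistics import NormalDist

# Daily sales: mean $2,000, standard deviation $500
daily_sales = NormalDist(mu=2000, sigma=500)
p_below_1000 = daily_sales.cdf(1000)

# Fuel consumption: mean 27 mpg, standard deviation 5 mpg
fuel = NormalDist(mu=27, sigma=5)
share_25_to_30 = fuel.cdf(30) - fuel.cdf(25)  # area between the two points

print(round(p_below_1000, 5), round(share_25_to_30, 6))
```

An interval probability is always the area to the left of the upper point minus the area to the left of the lower point.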

For the regular Normal curve, x = µ + zσ

The standardized Normal curve has µ = 0 and σ = 1.

Statistical Function NORMINV(probability, mean, standard dev) returns the point x on a normal curve with the given mean and standard deviation for which the area under the curve to the left of x equals the probability given as the first argument. With mean = 0 and standard dev = 1 it gives the same result as NORMSINV().

NORMINV(0.975,0,1) = 1.96 This means that 97.5% of the total area under the normal curve is to the left of the point 1.96 standard deviations above the mean.

Problem: A company's package delivery time is normally distributed with a mean of 10 hours and a standard deviation of 3 hours. What delivery time will be beaten by only 2.5% of all deliveries?

µ = 10

σ = 3

NORMINV(0.025,10,3) = 4.12 Meaning that only 2.5% of all package delivery times will be quicker than 4.12 hours.

Problem: A tire company makes a tire with a normally distributed tread life that has a mean of 39,000 miles and standard deviation of 5,300 miles. What tread life would be exceeded by 98% of all tires?

µ = 39,000

σ = 5,300

NORMINV(0.02,39000,5300) = 28115 Meaning that only 2% of all tires will wear out before 28,115 miles.

Problem: A tire company makes a tire with a normally distributed tread life that has a mean of 39,000 miles and standard deviation of 5,300 miles. What would the range of tread life be that 95% of all tires would wear out in?

µ = 39,000

σ = 5,300

Calculation of the left boundary:

NORMINV(0.025,39000,5300) = 28612 Meaning that only 2.5% of all tires will wear out before 28,612 miles.

Calculation of the right boundary:

NORMINV(0.975,39000,5300) = 49388 Meaning that only 2.5% of all tires will wear out after 49,388 miles.

So, 95% of tires will wear out in the range of 28,612 miles to 49,388 miles.
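The inverse problems above (delivery time and tread life) follow one pattern: pick the cumulative probability, then ask for the x value that has that much area to its left. A sketch using `statistics.NormalDist.inv_cdf` in place of NORMINV:

```python
from statistics import NormalDist

# Delivery time: mean 10 hours, standard deviation 3 hours
delivery = NormalDist(mu=10, sigma=3)
beaten_by_2_5_percent = delivery.inv_cdf(0.025)

# Tread life: mean 39,000 miles, standard deviation 5,300 miles
tread = NormalDist(mu=39000, sigma=5300)
exceeded_by_98_percent = tread.inv_cdf(0.02)  # 2% wear out sooner
left_boundary = tread.inv_cdf(0.025)          # lower edge of the middle 95%
right_boundary = tread.inv_cdf(0.975)         # upper edge of the middle 95%

print(round(beaten_by_2_5_percent, 2))
print(round(exceeded_by_98_percent), round(left_boundary), round(right_boundary))
```

Note that "exceeded by 98%" translates to a left-tail probability of 0.02, not 0.98.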

The Normal distribution is a continuous distribution, as oppoed to a discrete distribution such as the binomial distribution, whish is a set of discrete points.

1) Determining the probability of a normally distributed random variable having a value within a given interval

2) Determining a Confidence Interval - that is - Determining an interval within which the value of a normally distributed random variable will fall with a given probability

To be able to apply the Normal distribution, it is extremely important that the underlying population can be shown to be normally distributed. This is often not the case.

For any population, whether Normally distributed or not, the distribution of x bar (the average of each sample) will be approximately Normally distributed, provided the samples are large enough.

This is a basic tenet of the Central Limit Theorem - Statistics' most fundamental rule. It is important to note that the problems on this page do not deal with samples. These problems only use parameters of entire populations.

The z distribution, sometimes called the standard normal distribution, is a normal distribution with the mean, µ, = 0 and the standard deviation, σ, = 1.

Population parameters are generally described with Greek letters, such as µ (population mean) and σ (population standard deviation), while sample statistics are generally described with Roman letters, such as x bar (sample mean) and s (sample standard deviation).

Statistical Function NORMSDIST(z) tells what percentage of the total area of the standardized normal curve (mean = 0 and standard deviation length = 1) lies to the left of z.

NORMSDIST(0) = 0.5. This means that half of the area under the standardized normal curve exists to the left of z when z = 0 (z is exactly on top of the mean, that is, 0 standard deviations away from the mean)

NORMSDIST(1.96) = 0.975. This means that 97.5% of the total area under the standardized normal curve is to the left of z when z is 1.96 standard deviations from the mean. This point of z = 1.96 is often used to calculate the 95% Confidence Interval. That is, the section under the normal curve that starts at 1.96 standard deviations to the left of the mean and extends to 1.96 standard deviations to the right of the mean will contain 95% of the total area under the bell shaped Normal curve.

Statistical Function NORMSINV(p) tells how many standard deviations from the mean a point on the standardized normal curve is, given the percentage p of the total area under the curve that lies to the left of that point.

NORMSINV(0.975) = 1.96. This means that 97.5% of the total area under the normal curve is to the left of the point 1.96 standard deviations from the mean

Statistical Function NORMDIST(x, mean, standard dev, TRUE) will calculate the area under the curve to the left of point x on a normal curve with the given mean and standard deviation.

Setting the mean to 0 and the standard deviation to 1 makes it a standardized Normal curve, like the above problem.

Problem: A store has normally distributed daily sales. The average daily sales = $2,000 and the daily sales standard deviation = $500. What is the probability that sales on a given day will be less than $1,000?

NORMDIST(1000,2000,500,TRUE) = 0.0228

This can be interpreted by saying that only 2.28% of the total area under this particular Normal curve falls to the left of x = 1,000

Problem: A brand of car has a mean fuel consumption of 27 mpg with a standard deviation of 5 mpg. What percentage of the cars can be expected to have a fuel consumption of between 25 mpg and 30 mpg?

Percentage of cars with fuel consumption less than 30 mpg - Percentage of cars with fuel consumption less than 25 mpg =

NORMDIST(30,27,5,TRUE) - NORMDIST(25,27,5,TRUE) = 0.381169

38.12%
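The two NORMDIST problems (daily sales below $1,000 and fuel consumption between 25 and 30 mpg) can be reproduced the same way. This is a sketch under the assumption that `statistics.NormalDist.cdf` stands in for NORMDIST with its last argument set to TRUE; the variable names are illustrative.

```python
from statistics import NormalDist

# Daily sales problem: mean $2,000, standard deviation $500
daily_sales = NormalDist(mu=2000, sigma=500)
# cdf(x) plays the role of NORMDIST(x, mean, standard_dev, TRUE)
p_below_1000 = daily_sales.cdf(1000)

# Fuel consumption problem: mean 27 mpg, standard deviation 5 mpg
fuel = NormalDist(mu=27, sigma=5)
# Area between two points = area left of the upper point - area left of the lower point
share_25_to_30 = fuel.cdf(30) - fuel.cdf(25)

print(f"{p_below_1000:.4f} {share_25_to_30:.6f}")
```

Subtracting two cumulative areas is the general pattern for any "between a and b" question on a Normal curve.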


Problem: A company's package delivery time is normally distributed with a mean of 10 hours and a standard deviation of 3 hours. What delivery time will only 2.5% of all deliveries beat?

NORMINV(0.025,10,3) = 4.12, meaning that only 2.5% of all package delivery times will be quicker than 4.12 hours.


Confidence Intervals

Collection of 40 individual test scores

210   220   230   240   270   300   300   320   320   320
340   370   370   380   400   410   410   450   470   470
490   500   500   510   510   540   540   580   580   610
610   640   640   640   650   660   660   750   750   790

Calculate with 95% certainty an interval in which the population mean must fall based upon a random sample of 40 test scores taken from that population.

In other words, calculate a 95% Confidence Interval for the population mean. Sample size must be at least 30 and must be random and representative of the population.

Sample size (COUNT) = 40
Sample Standard Deviation (STDEV) = 159.48
α (1 - Confidence Level) = 0.05 (for 95% Confidence Interval, α = 0.05)
Mean (AVERAGE) = 473.75

Excel calculates the half-width of the Confidence Interval to be 49.42 using the following statistical function: CONFIDENCE(alpha, standard_dev, size)

Inputs for this function are CONFIDENCE(0.05,159.48,40) = 49.42

Let's see how Excel's calculation holds up to the manual calculation of the Confidence Interval from this sample: (Excel hits it just about right on)

Confidence Interval = Sample Mean +/- (Z Score for 95% Confidence Interval) * (Sample Standard Error)

Z Score for 95% Confidence Interval (two sided) = Z(0.975) = NORMSINV(0.975) = 1.96

(Need a sample size of at least 30 to be able to use z score for Normal Distribution)

Sample Standard Error = (159.48) / (Square Root [40] ) = 25.21

Confidence Interval = 473.75 +/- (1.96) x (25.21) = 473.75 +/- 49.41 (Excel's answer of 49.42 is pretty close to the manual calculation of 49.41)

Confidence Interval = 473.75 +/- 49.41 = 424.34 to 523.16

This means that there is a 95% chance that the mean of the entire population is between the endpoints of this 95% Confidence Interval

Statistically this is written as:

Confidence Interval = Sample Mean +/- Zα/2 * (Sample Standard Deviation / Square root of Sample Size)

Getting Z Score for Two-Tailed 95% Confidence Interval

A two-tailed 95% confidence interval will have 2.5% of the total curve area in each tail. Therefore this Z Score corresponds to 97.5% of the total area to the left of Z.

Z Score for two-tailed 95% confidence interval = NORMSINV(0.975) = 1.96

(NORMSINV) - Input is the percentage (expressed as a decimal) of area under the standardized normal curve to the left of Z. Standardized normal curve --> Mean = 0, Standard Deviation Length = 1

Getting Z Score for One-Tailed 95% Confidence Interval

A one-tailed 95% confidence interval will have 5% of the total curve area in the right tail. Therefore this Z Score corresponds to 95% of the total area to the left of Z.

Z Score for one-tailed 95% confidence interval = NORMSINV(0.95) = 1.64

Determining Sample Size (n) for a Given Confidence Level and Bound (B)

n = number of samples needed to establish a specified confidence interval of width B on either side of the mean

e.g. How many samples must be taken to estimate the population diameter (of, for example, holes drilled by a machine) to within 0.05 mm of the mean sample diameter with 99% confidence? The standard deviation (determined from previous sampling) is 0.75 mm.

n = [ (Z score of two-tailed 99% confidence)**2 x (sample standard deviation)**2 ] / [Interval**2]

Z score of two-tailed 99% confidence = NORMSINV(0.995) = 2.576 (rounded to 2.575 in the calculation below)

n = [ (2.575)**2 x (0.75)**2 ] / [ (0.05)**2 ] = 1,492

Problem: A restaurant owner wants to estimate within $2.00 the average amount that customers spend during lunch. From experience, the standard deviation of the population is $5.00. How many samples need to be taken to get a sample mean expenditure during lunch that is 92% certain of being within $2.00 of the population mean?

Z score of two-tailed 92% confidence = NORMSINV(0.96) = 1.751

Population Standard Deviation = 5.00

Interval = 2.00

n = [ (1.751)**2 x (5.00)**2 ] / [ (2.00)**2 ] = 19.16, or 19 Samples (rounding up to 20 is the conservative choice)

Although 30 samples should be the minimum taken unless you know for certain that the underlying population is normally distributed.
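The confidence interval and sample size calculations in this section can be sketched in a few lines. This is a minimal sketch, assuming Python's standard library in place of COUNT, STDEV, AVERAGE, CONFIDENCE and NORMSINV; the function names `confidence_half_width` and `sample_size` are hypothetical helpers introduced for illustration, and the 40 scores are the ones listed in this section.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, stdev

# The 40 test scores from the worksheet
scores = [
    210, 220, 230, 240, 270, 300, 300, 320, 320, 320,
    340, 370, 370, 380, 400, 410, 410, 450, 470, 470,
    490, 500, 500, 510, 510, 540, 540, 580, 580, 610,
    610, 640, 640, 640, 650, 660, 660, 750, 750, 790,
]

def confidence_half_width(alpha, standard_dev, size):
    """Half-width of a two-tailed interval; same inputs as CONFIDENCE(alpha, standard_dev, size)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. NORMSINV(0.975) for alpha = 0.05
    return z * standard_dev / sqrt(size)      # z * (standard error)

def sample_size(confidence, standard_dev, bound):
    """Unrounded n = (z * standard_dev / bound)**2 for a two-tailed confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z * standard_dev / bound) ** 2

x_bar = mean(scores)                # AVERAGE
s = stdev(scores)                   # STDEV (sample standard deviation, n - 1 divisor)
half_width = confidence_half_width(0.05, s, len(scores))
interval = (x_bar - half_width, x_bar + half_width)

# Restaurant problem: 92% confidence, standard deviation $5.00, bound $2.00
lunch_n = sample_size(0.92, 5.00, 2.00)
lunch_n_rounded_up = ceil(lunch_n)

print(x_bar, round(s, 2), round(half_width, 2), round(lunch_n, 2), lunch_n_rounded_up)
```

Rounding the sample size up with `ceil` is a design choice of this sketch: any fractional result means one more whole sample is needed to stay within the bound.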

Binomial Distribution

Binomial distributions are collections of discrete values as opposed to, for example, the Normal distribution, which is continuous.

Any binomial distribution can be identified by the value of two of its variables - the number of trials (n) and the probability of success on a single trial (p)

Random Number Generator
Tools / Data Analysis / Random Number Generator

In this case, generate 5 random numbers, each with possible outcomes of 2 or 3. Each event has a 20% probability of a "2" outcome and an 80% probability of a "3" outcome. (You could easily do the same thing with outputs of 1 and 0 - measuring something occurring or not occurring)

Number of variables = 1 (The value of the 1 variable is 1 or 0)
Number of random variables = 5
Distribution type is Discrete
Value in input range - the Yellow highlighted range
Output range - Highlight the tan range

Outcome   Probability
2         0.2    (This = p - the probability that the outcome of the event will be "1" and not "0")
3         0.8    (This = q - the probability that the outcome of the event will be "0" and not "1")

Output (tan range): 3, 3, 3, 2, 2

Number of 2's = 2
Statistical function COUNTIF - Select the range of outputs to be counted and then select the cell that has the output to be counted (where outcome = 2). The count is the number of successes in 5 random trials, each having a 0.20 chance of a "2" outcome.

This count is a binomially distributed random variable.

Calculating the probability of a certain number of a given outcome to occur in a certain number of trials if the probability of that outcome on a single trial is known.

Problem: What is the probability of 3 successful outcomes in 5 trials if the probability of a successful outcome in 1 trial is 20%?

s = number of successes = 3

n = number of trials = 5

p = probability of successful outcome = 0.2 on 1 trial

Find Cumulative distribution (NO) - Use 0

Statistical Function / BINOMDIST (in this case, you don't want cumulative distribution - Use 0 as the last argument)

BINOMDIST(3,5,0.2,0) = 0.0512

Which is = 5.12% (Format / Cell / Percentage)

Problem - In 12 trials (n = 12), what is the probability that at least 10 of them will have the 1 of the 2 possible outcomes that has a probability of occurring of 65%? (Sum of the probabilities that s = 10, s = 11, and s = 12)

The probabilities of each outcome need to be added up.

s             10          11          12
n             12          12          12
p             0.65        0.65        0.65
Cumulative    0           0           0
Probability   0.108846    0.036753    0.005688

Statistical function BINOMDIST(s,n,p,FALSE)

BINOMDIST(10,12,0.65,0) + BINOMDIST(11,12,0.65,0) + BINOMDIST(12,12,0.65,0) = 0.151288

This represents a combined probability of 15.13%

Problem - What is the probability of getting between 4 and 6 heads on 10 flips of a fair coin?

Probability of getting between 4 and 6 heads = P(4) + P(5) + P(6)

Also equals [ P(0) + P(1) + P(2) + P(3) + P(4) + P(5) + P(6) ] - [ P(0) + P(1) + P(2) + P(3) ]

This equals [ Cumulative probability of P(6) ] - [ Cumulative probability of P(3) ]

s             6           3
p             0.5         0.5
n             10          10
Cumulative    1           1

BINOMDIST(6,10,0.5,1) - BINOMDIST(3,10,0.5,1) = 0.828125 - 0.171875 = 0.65625

65.63%

Problem - If 10% of products require servicing, what is the probability that less than 15 of 200 products will need servicing?

The problem actually asks what is the probability that up to 14 products will need servicing.

Therefore, you are solving for the cumulative probability that up to 14 products need servicing

s = 14
p = 0.10
n = 200
Cumulative = TRUE = 1

BINOMDIST(14,200,0.10,1) = 0.092946

9.29%
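The BINOMDIST problems in this section can be checked with a short sketch. It assumes only Python's `math.comb`; the function `binomdist` is a hypothetical helper that mirrors Excel's argument order BINOMDIST(number_s, trials, probability_s, cumulative), not a library call.

```python
from math import comb

def binomdist(s, n, p, cumulative):
    """Binomial probability of exactly s successes (or of 0..s successes if cumulative)."""
    def pmf(k):
        # Number of ways to place k successes in n trials, times the probability of each way
        return comb(n, k) * p**k * (1 - p) ** (n - k)
    if cumulative:
        return sum(pmf(k) for k in range(s + 1))
    return pmf(s)

# Exactly 3 successes in 5 trials with p = 0.2
exactly_3_of_5 = binomdist(3, 5, 0.2, False)

# At least 10 successes in 12 trials with p = 0.65: add the three individual probabilities
at_least_10_of_12 = sum(binomdist(s, 12, 0.65, False) for s in (10, 11, 12))

# Between 4 and 6 heads in 10 fair flips: difference of two cumulative probabilities
heads_4_to_6 = binomdist(6, 10, 0.5, True) - binomdist(3, 10, 0.5, True)

# Fewer than 15 of 200 products need servicing when p = 0.10 (cumulative up to 14)
up_to_14_of_200 = binomdist(14, 200, 0.10, True)

print(exactly_3_of_5, at_least_10_of_12, heads_4_to_6, up_to_14_of_200)
```

The cumulative form starts its sum at 0 successes, which is why the "between 4 and 6" difference subtracts the cumulative probability of 3, not 4.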

Population Proportions

When a sample of size n is used to estimate a population proportion, e.g. the proportion of a population who would vote for a certain candidate, it can be analyzed using the binomial distribution.

The population proportion of success will be the same as p, the probability of success of a single trial.

The following relationships hold true for population proportions:

The mean of sample proportions = µ = p

The standard deviation of sample proportions = σ = SQRT { [ p (1 - p) ] / n }

The confidence interval of a population proportion would be = µ ± zσ = p ± zSQRT { [ p (1 - p) ] / n }

Problem: A random sample of 350 people was chosen and each person was asked if they recognized a particular brand. 112 people recognized the brand. Calculate a 95% confidence interval of the proportion of the total population who recognize the brand.

Givens:
n = 350
p = 112 / 350 = 0.32

Confidence level 0.95 - This means that 2.5% of the area under the Normal curve exists in each tail above and below the confidence interval.

z = NORMSINV(0.975) = 1.96 - 97.5% of the total area under the normal curve is to the left of a point 1.96 standard deviations from the mean

The confidence interval = µ ± zσ = p ± zSQRT { [ p (1 - p) ] / n } = 0.32 ± 1.96 x SQRT { [ 0.32 (1 - 0.32) ] / 350 }

The confidence interval = 0.32 ± 0.04887

The confidence interval = 0.27113 to 0.36887, which means that there is a 95% chance that between 27.1% and 36.9% of the total population are aware of the brand.
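The brand recognition interval can be reproduced with the formula p ± z SQRT{[p(1 - p)]/n}. This is a minimal sketch, assuming `statistics.NormalDist().inv_cdf` in place of NORMSINV; `proportion_confidence_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence):
    """Two-tailed interval p +/- z * sqrt(p * (1 - p) / n) for a sample proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # NORMSINV(0.975) for 95%
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# 112 of 350 people recognized the brand
low, high = proportion_confidence_interval(112, 350, 0.95)
print(round(low, 5), round(high, 5))
```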

Determining Sample Size for a Desired Sampling Error

The minimum number of samples needed, n, to obtain a confidence interval of a certain half-width, e (the given sampling error), is:

n = p (1-p) (z/e)**2
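As an illustration of n = p (1-p) (z/e)**2, the sketch below uses the brand recognition proportion p = 0.32 with a sampling error of e = 0.03 at 95% confidence. The value e = 0.03 is a hypothetical input chosen for the example, and `proportion_sample_size` is an illustrative helper name; neither comes from the worksheet.

```python
from math import ceil
from statistics import NormalDist

def proportion_sample_size(p, confidence, e):
    """Unrounded n = p * (1 - p) * (z / e)**2 for a two-tailed confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p * (1 - p) * (z / e) ** 2

# Hypothetical inputs: p = 0.32, 95% confidence, sampling error of 0.03
n_raw = proportion_sample_size(0.32, 0.95, 0.03)
n_needed = ceil(n_raw)  # round up so the interval is no wider than requested
print(round(n_raw, 2), n_needed)
```

Because p (1-p) is largest at p = 0.5, using p = 0.5 gives the most conservative sample size when no prior estimate of p is available.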

It is better to use the binomial distribution to calculate the p value when dealing with a proportion. (The p value is the area under the Normal curve outside of x - NOT the probability of a successful trial)

Problem: A manufacturer of circuit boards wants to keep the proportion of defective boards at 0.098. The manufacturer tested 156 randomly chosen boards and found 20 to be defective. Determine with 95% certainty (0.05 level of significance) whether the defective proportion has increased above 0.098.

n = 156
p = 0.098
x = 19

The probability that 20 or more boards are defective =

1 - the probability that 19 or less are defective = 1 - Cumulative probability of 19 defective = 1 - BINOMDIST(19,156,0.098,1)

1 - 0.870142 = 0.129858

This p-value of 0.129858 is greater than α (0.05 - the level of significance - the proportion of area under the Normal curve to the right of the critical value)

We therefore conclude that the large x value could have happened by chance and we fail to reject the NULL Hypothesis.

To determine whether a known population has changed, take a sample of the population and use the binomial distribution to calculate the probability of that sampling event (the number of successes, x, per given sample size, n, given p - the previously known probability of success in a single trial) and compare that probability to the desired level of significance. If this probability is less than the level of significance you have established (α for a one-tailed test and α/2 for a two-tailed test), then the NULL Hypothesis is rejected.
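The circuit board test can be sketched as an upper-tail binomial calculation. This is a minimal sketch assuming only `math.comb`; `binom_cdf` is a hypothetical helper standing in for BINOMDIST with the cumulative argument set to 1.

```python
from math import comb

def binom_cdf(x, n, p):
    """Cumulative probability of 0..x successes in n trials (BINOMDIST(x, n, p, 1))."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

n = 156          # boards tested
p0 = 0.098       # previously known defective proportion
defective = 20   # defective boards found in the sample
alpha = 0.05     # level of significance for this one-tailed test

# P(20 or more defective) = 1 - P(19 or fewer defective)
p_value = 1 - binom_cdf(defective - 1, n, p0)
reject_null = p_value < alpha

print(round(p_value, 6), reject_null)
```

The cumulative probability is taken at 19, one less than the observed count, so that the observed 20 is included in the upper tail.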


Histograms, Charting, and Descriptive Statistics

Civilian Labor Force (1,000)

Year    Males     Females
1948    40,619    14,974
1949    40,803    15,580
1950    41,129    16,285
1951    40,831    17,000
1952    40,712    17,593
1953    41,334    17,957
1954    41,496    17,492
1955    41,749    18,266
1956    42,645    19,456
1957    42,625    19,591
1958    42,833    20,093
1959    43,053    20,455
1960    43,563    20,689
1961    43,907    21,608
1962    43,589    21,758
1963    44,025    22,134
1964    44,397    22,734
1965    44,837    23,351
1966    44,698    24,043
1967    45,086    25,003
1968    45,671    25,642
1969    46,081    26,770
1970    46,842    27,954
1971    47,627    28,810
1972    48,542    29,580
1973    49,389    30,148
1974    50,862    31,491
1975    51,213    32,972
1976    51,753    34,214
1977    52,784    35,399
1978    54,077    37,323
1979    55,349    38,959
1980    56,225    40,747
1981    56,860    41,866
1982    57,461    42,952
1983    58,105    44,255
1984    59,250    44,994
1985    59,949    46,740
1986    61,126    47,852
1987    61,899    49,085
1988    62,423    50,436
1989    63,375    51,996
1990    64,805    52,925
1991    65,149    53,328
1992    65,767    54,356
1993    66,329    54,982
1994    66,788    56,322
1995    67,516    56,871
1996    67,434    57,503
1997    68,884    58,788
1998    69,547    59,583
1999    70,295    60,718

1st - Highlight the Males and Females columns of data to create the chart. Do not highlight the Year column.

2nd - In the 2nd step of creating the chart, click the Series tab and highlight the Year column as the x-axis.

Descriptive Statistics - Tools / Data Analysis / Descriptive Statistics. For each column (Males, Females) the output reports: Mean, Standard Error, Median, Mode, Standard Deviation, Sample Variance, Kurtosis, Skewness, Range, Minimum, Maximum, Sum, Count.

[Chart: Males vs. Female Hires. x-axis: Year; y-axis: 0 to 80,000; series: Males, Females]

Measures of Dispersion - Standard Deviation and Variance

x      x bar (x mean)
20     118
30     118
42     118
40     118
55     118
521    118

Sum ( (x - x bar)**2 ) = 195586

# of points (n) = 6     Statistical Function COUNT
n-1 = 5
Sum = 708               Arithmetic Function SUM
Mean = 118              Statistical Function AVERAGE

Individual Function Calculation of Stan Dev & Var
Variance = 39117.2               Statistical Function VAR
Standard Deviation = 197.7807    Statistical Function STDEV
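The variance and standard deviation of these six points can be verified both directly and with library functions. This is a minimal sketch, assuming Python's `statistics.variance` and `statistics.stdev` as stand-ins for VAR and STDEV (both use the n - 1 divisor).

```python
from statistics import mean, stdev, variance

x = [20, 30, 42, 40, 55, 521]

# Direct calculation: sum of squared deviations from the mean, divided by n - 1
x_bar = mean(x)
sum_sq_dev = sum((value - x_bar) ** 2 for value in x)
var_direct = sum_sq_dev / (len(x) - 1)
sd_direct = var_direct ** 0.5

# Library calculation (sample variance and sample standard deviation)
var_lib = variance(x)
sd_lib = stdev(x)

print(x_bar, sum_sq_dev, var_direct, round(sd_direct, 4), round(sd_lib, 4))
```

The single large value (521) contributes 162,409 of the 195,586 total squared deviation, which is why the standard deviation exceeds the mean.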

Histogram and Descriptive Statistics

State                   Median Value Owner Occupied
Alabama                 $ 53,700
Alaska                  $ 94,400
Arizona                 $ 80,100
Arkansas                $ 46,300
California              $ 195,500
Colorado                $ 82,700
Connecticut             $ 177,800
Delaware                $ 100,100
District of Columbia    $ 123,900
Florida                 $ 77,100
Georgia                 $ 71,300
Hawaii                  $ 245,300
Idaho                   $ 58,200
Illinois                $ 80,900
Indiana                 $ 53,900
Iowa                    $ 45,900
Kansas                  $ 52,200
Kentucky                $ 50,500
Louisiana               $ 58,500
Maine                   $ 87,400
Maryland                $ 116,500
Massachusetts           $ 162,800
Michigan                $ 60,600
Minnesota               $ 74,000
Mississippi             $ 45,600
Missouri                $ 59,800
Montana                 $ 56,600
Nebraska                $ 50,400
Nevada                  $ 95,700
New Hampshire           $ 129,400
New Jersey              $ 162,300
New Mexico              $ 70,100
New York                $ 131,600
North Carolina          $ 65,800
North Dakota            $ 50,800
Ohio                    $ 63,500
Oklahoma                $ 48,100
Oregon                  $ 67,100
Pennsylvania            $ 69,700
Rhode Island            $ 133,500
South Carolina          $ 61,100
South Dakota            $ 45,200
Tennessee               $ 58,400
Texas                   $ 59,600
Utah                    $ 68,900
Vermont                 $ 95,500
Virginia                $ 91,000
Washington              $ 93,400
West Virginia           $ 47,900
Wisconsin               $ 62,500
Wyoming                 $ 61,600

Descriptive Statistics for Median Value Owner Occupied report: Mean, Standard Error, Median, Mode, Standard Deviation, Sample Variance, Kurtosis, Skewness, Range, Minimum, Maximum, Sum, Count.

Histogram - Tools / Data Analysis / Histogram

Bin Range Requested By Histogram (in Yellow), Intervals 1 to 8. Bin values entered: 45000, 70000, 95000, 120000, 145000, 170000, 195000, 220000, More

[Chart: Histogram - Median Income. x-axis: bins starting at 45000 in 25000 blocks (70000, 95000, 120000, 145000, 170000, 195000, 220000, More); y-axis: Frequency (0 to 30); bar values: 27, 11, 4, 4, 2, 1, 1, 1]

Sorting Data and Histogram To Find Patterns

Gross Domestic Product Per Capita using Purchasing Power Parity 1991

Original Data                                  Sorted Data
Country           per capita GDP (dollars)     Country           per capita GDP (dollars)
Australia         $ 16,085                     Turkey            $ 3,491
Austria           $ 17,280                     Greece            $ 7,775
Belgium           $ 17,454                     Portugal          $ 9,191
Canada            $ 19,178                     Ireland           $ 11,507
Denmark           $ 17,621                     Spain             $ 12,719
Finland           $ 15,997                     New Zealand       $ 13,883
France            $ 18,227                     United Kingdom    $ 15,720
Germany           $ 19,500                     Finland           $ 15,997
Greece            $ 7,775                      Australia         $ 16,085
Iceland           $ 17,237                     Netherlands       $ 16,530
Ireland           $ 11,507                     Sweden            $ 16,729
Italy             $ 16,896                     Italy             $ 16,896
Japan             $ 19,107                     Norway            $ 16,904
Luxembourg        $ 21,372                     Iceland           $ 17,237
Netherlands       $ 16,530                     Austria           $ 17,280
New Zealand       $ 13,883                     Belgium           $ 17,454
Norway            $ 16,904                     Denmark           $ 17,621
Portugal          $ 9,191                      France            $ 18,227
Spain             $ 12,719                     Japan             $ 19,107
Sweden            $ 16,729                     Canada            $ 19,178
Switzerland       $ 21,747                     Germany           $ 19,500
Turkey            $ 3,491                      Luxembourg        $ 21,372
United Kingdom    $ 15,720                     Switzerland       $ 21,747
United States     $ 22,204                     United States     $ 22,204


Descriptive Statistics - Tools / Data Analysis / Descriptive Statistics

Statistic             Males                Females
Mean                  52371.3076923077     34646.6
Standard Error        1362.9393673634      2051.745
Median                50125.5              30819.5
Mode                  #N/A                 #N/A
Standard Deviation    9828.29554875439     14795.34
Sample Variance       96595393.3936654     2.19E+08
Kurtosis              -1.3294344021872     -1.365154
Skewness              0.412167140456704    0.341443
Range                 29676                45744
Minimum               40619                14974
Maximum               70295                60718
Sum                   2723308              1801623
Count                 52                   52


Measures of Dispersion - Standard Deviation and Variance

x      x - x bar    (x - x bar)**2
20     -98          9604
30     -88          7744
42     -76          5776
40     -78          6084
55     -63          3969
521    403          162409

Sum ( (x - x bar)**2 ) = 195586
n-1 = 5

Direct Calculations of Standard Deviation and Variance
Variance = [Sum ( ( x - x bar)**2 )] / [n-1] = 39117.2
Standard Deviation = SQRT (Variance) = 197.7807     Arithmetic Function SQRT

Descriptive Statistics Calculations of Stand Dev & Variance
Descriptive Statistics - Tools / Data Analysis / Descriptive Statistics

Mean                  118
Standard Error        80.7436271995093
Median                41
Mode                  #N/A
Standard Deviation    197.780686620307
Sample Variance       39117.2
Kurtosis              5.92557031095312
Skewness              2.42991903196838
Range                 501
Minimum               20
Maximum               521
Sum                   708
Count                 6

Descriptive Statistics
Median Value Owner Occupied

Mean                  84209.8039215686
Standard Error        6018.54145245485
Median                68900
Mode                  #N/A
Standard Deviation    42980.9830269247
Sample Variance       1847364901.96079
Kurtosis              3.55620861706022
Skewness              1.84961606045097
Range                 200100
Minimum               45200
Maximum               245300
Sum                   4294700
Count                 51

Histogram - Tools / Data Analysis / Histogram

Bin Range Requested By Histogram (in Yellow)

Interval    More than ..    But not more than ..    Frequency
1           45000           70000                   27
2           70000           95000                   11
3           95000           120000                  4
4           120000          145000                  4
5           145000          170000                  2
6           170000          195000                  1
7           195000          220000                  1
8           220000          245000                  1

The data needs to be copied here and then sorted (Data / Sort).

Histogram

Bin         Frequency
3491        1
8169.25     1
12847.5     3
17525.75    11
More        8


Allowing Excel to pick bin size (leave bin range blank)

[Chart: Histogram. x-axis: per Cap GDP; y-axis: Frequency (0 to 15)]
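The Bin / Frequency output for the per capita GDP data (bins 3491, 8169.25, 12847.5, 17525.75, More) can be reproduced with a short sketch. It assumes the bins are equal-width steps of (maximum - minimum) / 4 starting at the minimum; the choice of 4 steps is an assumption read off the output shown in this section, not a documented rule. Each value is counted in the first bin whose upper bound is greater than or equal to it.

```python
from bisect import bisect_left

# Per capita GDP (dollars) for the 24 countries in the worksheet
gdp = [
    16085, 17280, 17454, 19178, 17621, 15997, 18227, 19500,
    7775, 17237, 11507, 16896, 19107, 21372, 16530, 13883,
    16904, 9191, 12719, 16729, 21747, 3491, 15720, 22204,
]

steps = 4  # assumption: four equal-width steps between the minimum and the maximum
lowest, highest = min(gdp), max(gdp)
width = (highest - lowest) / steps
bins = [lowest + i * width for i in range(steps)]  # upper bounds of the named bins

# counts[i] holds values <= bins[i] (and above the previous bin); the last slot is "More"
counts = [0] * (len(bins) + 1)
for value in gdp:
    counts[bisect_left(bins, value)] += 1

for label, count in zip([*bins, "More"], counts):
    print(label, count)
```

Sorting the data first, as the worksheet suggests, makes it easy to confirm these counts by eye.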

Hypothesis Testing of a Population Mean

Hypothesis testing is one of the types of statistical tests used to determine if a change has occurred to a population mean.

Overall, two hypotheses are being created and tested.

The first hypothesis, the NULL Hypothesis, is usually stated in terms such as "There has been no change in the population mean". This will normally involve an equal sign.

The second hypothesis, the Alternative Hypothesis, states that the population mean has changed in one of three ways:

1) The population mean has changed (increased OR decreased) - This involves a two-tailed test
2) The population mean has decreased - This involves a one-tailed test with the left tail
3) The population mean has increased - This involves a one-tailed test with the right tail.

In summary, hypothesis testing involves:

1) Determining the NULL hypothesis. This is normally that the original population mean has not changed.
2) Determining the level of certainty to which that NULL Hypothesis will be tested. If you want to establish a 95% certainty level, then α, "alpha", = 0.05
3) Take a sample of the population.
4) Calculate the sample mean. This value will be called x.
5) Graph this sample mean on the normal curve created from the original population mean
6) The NULL Hypothesis is rejected or not rejected based upon the results of either of the following tests (which are both equivalent to each other)

6a) The critical value test - The level of certainty, α, is converted to a "critical value." This "critical value" is the number of standard deviations from the mean that corresponds to the level of certainty. For example, an α of 0.05 translates to a 95% level of certainty. On a two-tailed test, this would result in 2.5% of the total area under the Normal curve being greater than the right critical value and 2.5% of the area under the Normal curve being less than the left critical value. Each critical value is 1.96 standard deviations from the mean on the normal curve - NORMSINV(0.975) = 1.96.

The z value of the sample mean is then calculated. The z-value is the number of standard deviations that the sample mean is from the population mean on a Normal curve derived from the population mean. If the z-value of the sample is farther away from the mean than the critical value (the z value of that level of certainty), then the NULL hypothesis is normally rejected.

6b) The p-value test - This is equivalent to the above test - A Normal curve is constructed based upon the population mean. The α is the significance level. The significance level represents the percentage of the area under the normal curve that is outside the required level of certainty. For example, on a two-tailed test with a 95% required level of certainty, α = 0.05. The test is two-tailed so 2.5% of the total area will be in one tail above the 95% certainty area and 2.5% of the area under the normal curve will be below the 95% certainty area.

The p value is equal to the percentage of area under the normal curve that is outside of x on the normal curve. If the p value is less than the percentage of the area under the normal curve corresponding to α, the NULL Hypothesis is normally rejected.

Two-tailed test - Testing whether a population mean changed in either direction

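Problem: A manufacturer claims that the average thickness of metal sheets is 15 mls. and that the population standard deviation is 0.1 mls. 50 sheets are sampled, having a sample mean of 14.982 mls. Determine at the 0.05 significance level (95% confidence level) whether the manufacturer's claim that the average thickness is 15 mls. is correct.

Givens:
n = 50
α = 0.05
σ = 0.1
x = 14.982
µ = 15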

The NULL Hypothesis is the population mean, µ, = 15 mls.

The ALTERNATE Hypothesis is that µ ≠ 15 mls. (Since we are testing whether a difference exists in either direction, this is a two tailed test)

1) Calculate Sample Standard Error Sample Standard Error = σ / SQRT(n) = 0.014142

2) Calculate z value for sample - Z value = (x - µ) / (Sample Standard Error)= -1.272792

3) Calculate p value - the area under the Normal curve outside the sample z value. NORMSDIST(1.272792) = This states that 10.154% of the total area under the Normal curve is lies outside a point 1.27 standard deviations from the mean on either side (tail) of the Normal curve.

THE P TEST CAN BE PERFORMED AT THIS POINT

The NULL Hypothesis is rejected if the p-value (the percentage of area under the Normal curve outside point x) is less than α/2 (in a two-tailed test) or α (in a one-tailed test).

The p-value = 0.101546 and is much larger than α/2 (0.025) so the NULL Hypothesis is not rejected - The manufacturer's claim appears to be valid.
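The two-tailed p-value calculation above can be reproduced outside of Excel. The following is a minimal sketch, assuming Python 3.8+ and its standard-library statistics.NormalDist as a stand-in for NORMSDIST; the variable names are illustrative and do not come from the workbook.

```python
from math import sqrt
from statistics import NormalDist

# Givens from the metal sheet problem
n, alpha, sigma = 50, 0.05, 0.1
x_bar, mu = 14.982, 15.0

std_error = sigma / sqrt(n)          # sigma / SQRT(n)
z = (x_bar - mu) / std_error         # z value of the sample mean

# Area in one tail beyond |z|, i.e. 1 - NORMSDIST(ABS(z))
tail_area = 1 - NormalDist().cdf(abs(z))

# Two-tailed rule used in the text: compare the one-tail area with alpha / 2
reject_null = tail_area < alpha / 2

print(f"standard error = {std_error:.6f}")
print(f"z value        = {z:.6f}")
print(f"tail area      = {tail_area:.6f}")
print(f"reject NULL?   = {reject_null}")
```

Comparing the one-tail area with α/2 is the same decision as comparing twice that area with α, which is how many statistics packages report a two-tailed p-value.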

TO PERFORM THE EQUIVALENT CRITICAL VALUE TEST, DO THE FOLLOWING:

1) Calculate the critical value of α - NORMSINV(0.975)= 1.96

This states that α of 0.05 on a two-tailed test produces a confidence interval that goes from 1.96 standard deviations above the mean to 1.96 standard deviations below the mean.

If x is outside of this range (the absolute z value of x is greater than 1.96), then the NULL Hypothesis is rejected.

In this case, the absolute z value of x (1.27279) is less than the critical value (1.96), so x is closer to the mean than the critical value and we do not reject the NULL Hypothesis.
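The critical value comparison can be sketched the same way. This example assumes Python's statistics.NormalDist().inv_cdf as the counterpart of NORMSINV; it is an illustration of the rule described above, not part of the original workbook.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
# z value of the sample mean from the metal sheet problem
z = (14.982 - 15.0) / (0.1 / sqrt(50))

# Two-tailed critical value, the equivalent of NORMSINV(0.975)
critical_value = NormalDist().inv_cdf(1 - alpha / 2)

# Reject only if the sample z value is farther from the mean than the critical value
reject_null = abs(z) > critical_value

print(f"critical value = {critical_value:.2f}")
print(f"|z|            = {abs(z):.5f}")
print(f"reject NULL?   = {reject_null}")
```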

One-tailed test - Testing whether a population mean changed in only one direction

Problem: A furniture company states that its average delivery time is 15 days with a (population) standard deviation of 4 days. A random sample of 50 deliveries showed an average delivery time of 17 days. Determine within 98% certainty (0.02 significance level) whether delivery time has increased.

Givens:
n = 50
α = 0.02
σ = 4
x = 17
µ = 15

This is a one-tailed test because we are checking whether delivery time increased.

Using the P-test, we will determine if the p value (area above x under the normal curve) is less than α (since this is a one-tailed test)

1) Calculate Sample Standard Error - Sample Standard Error = σ / SQRT(n) = 4 / SQRT(50) = 0.565685

2) Calculate z value for sample - Z value = (x - µ) / (Sample Standard Error) = (17 - 15) / 0.565685 = 3.535534

3) Calculate p value - the area under the Normal curve outside the sample z value = 1 - NORMSDIST(3.535534) = 1 - 0.999797 = 0.000203. This states that 0.000203 of the total area under the Normal curve lies above the point 3.535534 standard deviations above the mean.

This p-value (0.000203) is less than α (0.02) so the NULL Hypothesis is rejected - It appears likely that delivery time has increased.

NULL Hypothesis - µ = 15
ALTERNATE Hypothesis - µ > 15
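The one-tailed delivery time test can be wrapped in a small reusable function. This is a hedged sketch: the function name upper_tail_z_test and its parameters are hypothetical, and it covers only the "has the mean increased" case with a known population standard deviation.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_z_test(x_bar, mu, sigma, n, alpha):
    """Return (z, p_value, reject_null) for the alternative that the mean is greater than mu."""
    std_error = sigma / sqrt(n)
    z = (x_bar - mu) / std_error
    # Area above z under the standard Normal curve: 1 - NORMSDIST(z)
    p_value = 1 - NormalDist().cdf(z)
    # One-tailed rule from the text: compare the p value with alpha itself
    return z, p_value, p_value < alpha

# Givens from the furniture delivery problem
z, p_value, reject_null = upper_tail_z_test(x_bar=17, mu=15, sigma=4, n=50, alpha=0.02)

print(f"z value      = {z:.6f}")
print(f"p value      = {p_value:.6f}")
print(f"reject NULL? = {reject_null}")
```

Because the test is one-tailed, the p value is compared with α directly rather than with α/2.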

Hypothesis testing is one of the types of statistical tests to determine if a change has occurred to a population mean.

The first hypothesis, the NULL Hypothesis, is usually stated in terms such as "There has been no change in the population mean"

The second hypothesis, the Alternative Hypothesis, states that the population mean has changed in one of three ways:

2) Determining the level of certainty to which the NULL Hypothesis will be tested. If you want to establish a 95% certainty level, then α ("alpha") = 0.05

6) The NULL Hypothesis is accepted or rejected based upon the results of either of the two equivalent tests described above: the critical value test or the p-value test.


Discrete Variables

Calculating Means, Standard Deviations, and Variances of the distributions of Discrete Variables.

x (Grade)    P(x) (Probability)    x * P(x)
4            0.10                  0.40
3            0.20                  0.60
2            0.35                  0.70
1            0.25                  0.25
0            0.10                  0.00
Total        1.00                  1.95

Expected Value = mean = x bar = Sum [ x * P(x) ] = 1.95
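The expected value is just the probability-weighted sum of the grades. A minimal sketch in Python, using the grade table above (the list names are illustrative):

```python
# Grade distribution from the table: grade x and its probability P(x)
grades = [4, 3, 2, 1, 0]
probabilities = [0.10, 0.20, 0.35, 0.25, 0.10]

# A valid discrete distribution must have probabilities that sum to 1
total_probability = sum(probabilities)

# Expected Value = Sum [ x * P(x) ]
expected_value = sum(x * p for x, p in zip(grades, probabilities))

print(f"total probability = {total_probability:.2f}")
print(f"expected value    = {expected_value:.2f}")
```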

x (Grade)    Mean    (x - Mean)    Square of (x - Mean)    P(x) (Probability)    { Square of (x - Mean) } * P(x)
4            1.95     2.05         4.2025                  0.10                  0.420250
3            1.95     1.05         1.1025                  0.20                  0.220500
2            1.95     0.05         0.0025                  0.35                  0.000875
1            1.95    -0.95         0.9025                  0.25                  0.225625
0            1.95    -1.95         3.8025                  0.10                  0.380250

Variance = SUM [ { Square of (x - Mean) } * P(x) ] = 1.2475

Standard Deviation = SQRT (Variance) = 1.116915

These are the variance and standard deviation of the probability distribution of x (the distribution of the grades).
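The variance and standard deviation follow the same pattern: weight each squared deviation from the mean by its probability, sum, then take the square root. A short Python sketch under the same assumptions as the grade table (names are illustrative):

```python
from math import sqrt

grades = [4, 3, 2, 1, 0]
probabilities = [0.10, 0.20, 0.35, 0.25, 0.10]

# Mean of the distribution: Sum [ x * P(x) ]
mean = sum(x * p for x, p in zip(grades, probabilities))

# Variance = Sum [ (x - mean)^2 * P(x) ]
variance = sum((x - mean) ** 2 * p for x, p in zip(grades, probabilities))

# Standard Deviation = SQRT(Variance)
std_dev = sqrt(variance)

print(f"variance           = {variance:.4f}")
print(f"standard deviation = {std_dev:.6f}")
```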

Date post:	22-Dec-2015
Upload:	budi-santoso