Business and Economic Statistics
Tutorial 1: Describing Categorical Data (Ch 4)
Tutor: Sam Capurso
E-mail: ...
1st
1. Why Statistics?
Statistics
Initiates policy / decisions
Evaluates and informs policy / decisions
Accountants work in an economy (in fact, everyone does)
i P E.R. Confidence More...
Business Consumer
2. Prac set up
Task                                          Minutes
Attendance, hand back work                    5
Summary for this week                         5 - 10
Individual written work (4 in the semester)   10
Individual MCQ test                           10 - 15
Group MCQ scratchy test                       10 (or until finished)
Group WAQ                                     Approx 1 hour
Worked Example                                10 - 15
3. First prac (only)
* Introduction
* “House keeping”
* Arrange groups
* Work out team names and take attendance
* Prac work
* Need to attend lectures and read text BEFORE PRAC
* Assessment for pracs = Indiv MCQ (5%)^ + Team MCQ (5%)^ + Team WAQ (10%)^^
  ^ Hand in prac
  ^^ Hand in by due date: … in hand-in box: names, ID numbers, time, day, tutor.
4. Things to note
5. Add previous prac’s results
Building a House
Group activity
Roles
• Architect – design, framework, ideas
• Tradesperson – technical, 'expert' in field
• Superintendent – leader, knowledge of different areas
• Decorator – finer details, user-friendliness
• Real estate agent – communication, 'sells the product'
• General contractor – follows direction, able to learn how to perform different roles
Task
Questions:
1. Why did you choose this role?
2. What types of skills / experiences are related to this role?
3. What are the ways in which someone in your role can work with someone from (choose a different role)?
4. How can you relate this activity to working in your BES team?
Stratified and clustered sampling
Clip: http://www.youtube.com/watch?v=CvPPM2stuPg&feature=c4-overview&list=UUZFQ2rSVMR2ahKAzBto5P7w
Note 2nd
Population
Sampling frame (list)
Target sample
Actual sample (respondents)
Convenience sampling
Undercoverage
Non-response bias
Voluntary response bias
Response bias
Note:
n↑ ≠ ↓bias
n↑ ⇒ ↓sampling error
(error due to randomness)
Need to improve survey design to ↓ bias
If ↑ n, just asking more people the wrong question!
Sampling:
Simpson’s Paradox
E.g. 2nd
School Girls Boys Total
School A 273 77 350
School B 289 61 350
Total 562 138 700
Which school had higher proportion of girls?
School % girls
School A 78%
School B 83%
School B has the higher proportion of girls
School A Girls Boys Total
Yr 11 81 6 87
Yr 12 192 71 263
Total 273 77 350
School B Girls Boys Total
Yr 11 234 36 270
Yr 12 55 25 80
Total 289 61 350
School A has the higher proportion of girls in each year level
[Diagram: School, split into Year 11 (Girls, Boys) and Year 12 (Girls, Boys)]
Percentage of girls by school broken into year levels
School Yr 11 Yr 12
School A 93% 73%
School B 87% 69%
So, something must be going on with the year levels when we add them up to get the earlier result.
Percentage of girls in each year level
Year level % girls
Yr 11 88%
Yr 12 72%
% Yr 11 in each school
School % Yr 11
School A 25%
School B 77%
So, proportion of girls exaggerated in School B, because...
* Year 11 students are more likely to be girls, and
* School B has higher proportion of Year 11 students
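The reversal can be checked directly from the counts in the School A and School B tables. The sketch below is only an illustration; the variable and function names are made up for this example.

```python
# Counts of (girls, boys) copied from the School A / School B tables above.
counts = {
    "School A": {"Yr 11": (81, 6), "Yr 12": (192, 71)},
    "School B": {"Yr 11": (234, 36), "Yr 12": (55, 25)},
}

def share_girls(girls, boys):
    """Proportion of girls in a group."""
    return girls / (girls + boys)

overall = {}   # proportion of girls per school, year levels added together
by_year = {}   # proportion of girls per school within each year level
for school, years in counts.items():
    girls = sum(g for g, _ in years.values())
    boys = sum(b for _, b in years.values())
    overall[school] = share_girls(girls, boys)
    by_year[school] = {yr: share_girls(g, b) for yr, (g, b) in years.items()}

for school in counts:
    print(school, f"overall {overall[school]:.0%}",
          {yr: f"{p:.0%}" for yr, p in by_year[school].items()})
```

School B comes out ahead overall, yet School A is ahead within each year level, which is the paradox described above.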
[Diagram: Girls / Boys split within Year 11 and Year 12; Yr 11 / Yr 12 split within School A and School B. Labels: Characteristic, Group, Category summed]
Displaying and Describing Quantitative Data
3rd Note
• Construct a box-and-whisker plot for the following data: 3, 8, 1, 5, 3, -2, 3
• Solution:
• Ordered: -2, 1, 3, 3, 3, 5, 8
• Median: 3
• Q1: 2
• Q3: 4
• IQR: 4 – 2 = 2
• 1.5 * IQR = 3
• LF = Q1 – 3 = -1
• UF = Q3 + 3 = 7
• So, whiskers at 1 and 5, outliers are -2 and 8
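The five steps of the solution can be reproduced in a few lines. This is a sketch that assumes the "inclusive" quartile convention of Python's `statistics.quantiles`, which happens to match Q1 = 2 and Q3 = 4 here; other quartile conventions can give different values for a sample this small.

```python
from statistics import quantiles

data = [3, 8, 1, 5, 3, -2, 3]
ordered = sorted(data)

# Assumption: method="inclusive" is the convention that reproduces the
# quartiles in the worked solution (Q1 = 2, Q3 = 4).
q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers extend to the most extreme observations inside the fences.
inside = [x for x in ordered if lower_fence <= x <= upper_fence]
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
whiskers = (min(inside), max(inside))

print(q1, median, q3, iqr, lower_fence, upper_fence, whiskers, outliers)
```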
3rd E.g.
Interpretation of slope coefficient
Clip: http://www.youtube.com/watch?v=BgCoGYXwD4w&list=UUZFQ2rSVMR2ahKAzBto5P7w
Note 4th
Correlation and Linear Regression
• The difference between r (correlation coefficient) and R² (the coefficient of determination)…
• The difference between interpreting r and commenting on a scatter plot…
• Question – True or false? Two variables which are strongly related will always have a high correlation coefficient. Explain…
• Is this point unusual? What to do…
E.g. 4th
Probability and Expected Values
E.g.
Be aware of the following:
* V[X + c] ≠ V[X] + c
* SD[X + Y] ≠ SD[X] + SD[Y]; SD[X + Y] = √(Var[X] + Var[Y]) when X and Y are independent
* where X, Y are random variables, c is a constant.
* Note the two tests for independence…
* Interpretation of expected value: we expect ….(include units)… in the long run, on average.
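A small exact check of the first two warnings, using a fair six-sided die as a hypothetical example for X and a second, independent die for Y (the dice, the constant c = 10 and all helper names are assumptions made for illustration only):

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# Hypothetical example: X and Y are independent fair dice.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def mean(dist):
    return sum(x * p for x, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

def shift(dist, c):
    """Distribution of X + c for a constant c."""
    return {x + c: p for x, p in dist.items()}

def sum_independent(dx, dy):
    """Distribution of X + Y when X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(dx.items(), dy.items()):
        out[x + y] = out.get(x + y, 0) + px * py
    return out

c = 10
total = sum_independent(die, die)

print("Var[X] =", var(die), " Var[X + c] =", var(shift(die, c)))
print("SD[X + Y] =", sqrt(var(total)), " SD[X] + SD[Y] =", 2 * sqrt(var(die)))
```

Adding a constant leaves the variance unchanged, and the standard deviation of the sum is smaller than the sum of the standard deviations because variances add, not standard deviations.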
5th
Probability and Expected Values
E.g.
Questions:
1. Find the formula for P(A or B) if A and B are: independent; not independent.
2. Find the formula for P(A and B) if A and B are: disjoint; not disjoint.
3. Consider disjoint events A and B, which both have non-zero probabilities. Can A and B ever be independent? Explain in words or using formulae.
4. Complete the following: E[aX + bY + c]; Var[aX + bY + c], where a, b are constants, and X, Y are independent random variables
5th
Probability and Expected Values
E.g.
Consider a single trial with two outcomes, success (which we will represent by a 1) or failure (0).
Let the probability of success be p.
a) What is the probability of failure? Hint: you need to make sure the probability model is valid.
b) Write down the formula for calculating the expected value.
c) Use this to work out E(y) in terms of p.
d) Write down the formula for calculating variance.
e) Use this to show Var(y) = p(1-p).
y 0 1
Pr(y) ? p
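One way to check an answer to this exercise numerically is to apply the expected value and variance definitions to the two-outcome table. The value p = 3/10 below is purely hypothetical, and the function name is made up for this sketch.

```python
from fractions import Fraction

def bernoulli_summary(p):
    """Expected value and variance of a single 0/1 trial with P(success) = p."""
    model = {0: 1 - p, 1: p}
    # A valid probability model must have probabilities summing to 1.
    assert sum(model.values()) == 1
    expected = sum(y * pr for y, pr in model.items())
    variance = sum((y - expected) ** 2 * pr for y, pr in model.items())
    return expected, variance

p = Fraction(3, 10)  # hypothetical probability of success
expected, variance = bernoulli_summary(p)
print(expected, variance, p * (1 - p))
```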
5th
Solutions
Normal and sampling distributions
• The four types of normal probability questions:
  1. P(X < A)
  2. P(A < X < B) = P(X < B) – P(X < A)
  3. P(X > B) = 1 – P(X < B) (for the standard normal, P(Z > b) = P(Z < –b) by symmetry)
  4. Given the probability, what are the boundaries?
Proportions
Shape Model Normal
Centre Mean
Spread Variance
Assumptions Conditions
1.2.
1.2.3.
Means
Shape Model Normal
Centre Mean
Spread Variance
Assumptions Conditions
1.2.
1.2.3.
http://www.youtube.com/watch?v=ddBdqqtXiao&feature=c4-overview&list=UUZFQ2rSVMR2ahKAzBto5P7w
6th Note
Because Z tables only have < probs
Normal distribution
E.g. 6th
The length, X cm, of members of a certain species of fish is normally distributed with mean 40 and standard deviation 5.
a. Find the probability that a fish is longer than 45 cm.
b. Find the probability that a fish is between 35 cm and 50 cm long.
c. Describe the longest 10% of this species of fish.
Solutions
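The three parts can be checked without Z tables using the standard library's `statistics.NormalDist`. This is a sketch for checking hand working, not a replacement for the table-based method; the variable names are illustrative.

```python
from statistics import NormalDist

length = NormalDist(mu=40, sigma=5)  # fish length in cm

# a. P(X > 45) = 1 - P(X < 45)
p_longer_45 = 1 - length.cdf(45)

# b. P(35 < X < 50) = P(X < 50) - P(X < 35)
p_between = length.cdf(50) - length.cdf(35)

# c. The longest 10% are the fish above the 90th percentile.
cutoff_top10 = length.inv_cdf(0.90)

print(round(p_longer_45, 4), round(p_between, 4), round(cutoff_top10, 2))
```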
Confidence intervals and hypothesis tests
Proportions
• Confidence intervals for proportions: $\hat{p} \pm z^{*}\sqrt{\hat{p}(1-\hat{p})/n}$
• Remember to check conditions
• Interpretation: we are 95% confident the population proportion lies between [lower bound] and [upper bound]
• n = $(z^{*})^{2}\,\hat{p}(1-\hat{p}) / ME^{2}$, where ME is the desired margin of error
7th Note
CI 90% 95% 99%
z 1.645 1.96 2.576
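The critical values in the table can be recovered from the standard normal distribution. The sketch below also wraps them in a one-proportion interval, assuming the usual form p̂ ± z·√(p̂(1 − p̂)/n); the sample of 60 successes out of 200 is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_star(confidence):
    """Two-sided critical value, e.g. 0.95 -> about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def proportion_ci(successes, n, confidence=0.95):
    """Confidence interval for a proportion (conditions must be checked separately)."""
    p_hat = successes / n
    margin = z_star(confidence) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

for level in (0.90, 0.95, 0.99):
    print(level, round(z_star(level), 3))

# Hypothetical sample: 60 successes out of 200.
low, high = proportion_ci(successes=60, n=200)
print(round(low, 4), round(high, 4))
```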
Confidence intervals and hypothesis tests
Means
• CI: $\bar{x} \pm t^{*}\, s/\sqrt{n}$, where s = sample standard deviation and where t has df = n – 1
• Remember to check conditions
• Similar interpretation…
7th Note
Demo – finding t from tables
Confidence intervals and hypothesis tests
Hypothesis tests of one proportion
• Hypothesis test: one-tailed (< >) or two-tailed
• Conditions
• State model using (z or t)
• Standardised statistic
• P-value (or… learn other way this week, ‘critical value’ approach)
• Conclusion
7th Note
Hypothesis test: 1 proportion
Historically, 53% of the population supported the ruling political party. A recent survey, in which the 150 respondents were selected randomly, showed that 93 of them supported the party. A two-tailed z-test at the 0.05 level of significance is to be used to determine whether or not the population proportion has significantly changed.
a. State the null hypothesis and the alternative hypothesis.
b. Check the conditions that justify inference in this context.
c. Determine whether or not the null hypothesis should be rejected, and make a conclusion based on your finding.
E.g. 7th
Handwritten solution
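As a check on the hand working, the test can be sketched as follows. The success/failure threshold of 10 is an assumption (a common textbook convention); use whatever condition the course specifies.

```python
from math import sqrt
from statistics import NormalDist

# H0: p = 0.53   HA: p != 0.53 (two-tailed)
p0, n, supporters, alpha = 0.53, 150, 93, 0.05
p_hat = supporters / n

# Assumed success/failure condition: at least 10 expected in each category.
conditions_ok = n * p0 >= 10 and n * (1 - p0) >= 10

# The standard error uses the hypothesised proportion p0.
se = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_h0 = p_value < alpha

print(conditions_ok, round(z, 2), round(p_value, 4), reject_h0)
```

With these numbers the p-value falls below 0.05, so the sketch rejects the null hypothesis at the 0.05 level.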
Inference so far… reviewing the p-value
8th Note
Inference so far…
hypothesis tests for counts
8th Note
Hypothesis test: 1 mean
• Previous research has shown that the average IQ of Australians was 110. In 2012, a random sample of 40 Australians revealed an average IQ of 100 with standard deviation 15. The researcher wants to test, at a 1% level of significance, whether the average IQ of Australians has indeed decreased.
• (Fictional data)
E.g.
Handwritten solution
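The test statistic for this example can be checked as below. Python's standard library has no t distribution, so the critical value is an assumed table reading (about 2.426 for a one-tailed 1% test with df = 39); confirm it against the course t table.

```python
from math import sqrt

# H0: mu = 110   HA: mu < 110 (one-tailed, lower)
mu0, n, xbar, s = 110, 40, 100, 15

se = s / sqrt(n)              # standard error of the sample mean
t_stat = (xbar - mu0) / se
df = n - 1

# Assumption: critical value read from t tables (alpha = 0.01, df = 39).
t_crit = -2.426
reject_h0 = t_stat < t_crit

print(round(t_stat, 3), df, reject_h0)
```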
8th
Excel Output
9th Note
Inference in regression
9th Note
Inference in regression
We are estimating the relationship between bwght (birth weight of newborn baby in pounds) and cigs (packets of cigarettes smoked per week by mother prior to birth). Consider the Excel output below and answer the following questions.
Regression Statistics
Multiple R          -0.1507
R Square            0.0227
Adjusted R square   0.022
Standard Error      1.258
Observations        1388

ANOVA
             df     SS           MS           F       Significance F
Regression   1      51.0172632   51.0172632   32.24   0
Residual     1386   2193.55977   1.58265495
Total        1387   2244.57703   1.61829634

             Coefficients   S. Error    tstat    P-value   Lower 95%    Upper 95%
Intercept    7.485744       0.0357713   209.27   0         7.415572     7.555915
cigs         -0.0321108     0.0056557   -5.68    0         -0.0432054   -0.0210161
E.g. 9th
a. Which do you think is the explanatory variable and which is the response variable?
b. Write down and interpret the correlation coefficient.
c. Write down and interpret R² (the coefficient of determination).
d. Interpret the slope and the intercept.
e. Are the signs and sizes of the slope and intercepts reasonable? Explain.
f. Write down and interpret the 95% confidence interval for the slope.
g. Do the same for the 90% confidence interval. Explain how this differs from the 95% confidence interval.
h. Formulate a null and alternative hypothesis for the slope, using economic or general theory.
i. Conduct this hypothesis test using a 5% level of significance and make a conclusion.
j. Test whether the slope is significantly different from -0.05 at a 1% level of significance.
k. Suppose a hypothesis test for the slope had hypotheses H0: β1 = 0, and HA: β1 ≠ 0. Explain the purpose of conducting this test in terms of assessing whether the current regression model should be used.
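Several of these quantities can be recomputed from the Excel output as a check on hand calculations. The sketch below assumes the normal distribution is an adequate stand-in for the t distribution with 1386 degrees of freedom (the two are nearly identical at that size); it does not supply the interpretations the questions ask for.

```python
from statistics import NormalDist

# Values copied from the Excel output.
n_obs = 1388
ss_regression, ss_total = 51.0172632, 2244.57703
slope, se_slope = -0.0321108, 0.0056557

df = n_obs - 2                       # residual degrees of freedom
r_squared = ss_regression / ss_total
t_zero = slope / se_slope            # tests H0: beta1 = 0

def slope_ci(confidence):
    """Interval for the slope; assumes z approximates t for large df."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return slope - z * se_slope, slope + z * se_slope

ci95 = slope_ci(0.95)
ci90 = slope_ci(0.90)

# Two-tailed test of H0: beta1 = -0.05 at the 1% level.
t_vs_minus_005 = (slope - (-0.05)) / se_slope
reject_at_1pct = abs(t_vs_minus_005) > NormalDist().inv_cdf(0.995)

print(df, round(r_squared, 4), round(t_zero, 2))
print([round(b, 4) for b in ci95], [round(b, 4) for b in ci90])
print(round(t_vs_minus_005, 2), reject_at_1pct)
```

The 90% interval is narrower than the 95% interval because it uses a smaller critical value.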
E.g. 9th
Notation - recap:
• μ: Population mean
• x̄: Sample mean
• σ: Population standard deviation (variability of individual observations)
• s: Sample standard deviation
• SD(x̄) = σ/√n (or s/√n for estimate): Standard deviation of sample means
• n: Sample size
• N: Population size
• P: Population proportion
• p̂: Sample proportion
• p-value: See definition…
• b0,1: Sample coefficient on intercept/slope in regression
• β0,1: Population coefficient on intercept/slope in regression
10th Note
Multiple Linear Regression; Dummy Variables; Time Series – some things to note
Multiple linear regression
• Interpretation of slope coefficient: we estimate for every [one unit] increase in [explanatory variable], the [response variable] [increases/decreases] by [… units], on average, holding all other explanatory variables fixed.
• Inference on the whole equation
• H0: β1 = β2 = … = 0
  no linear relationship between Y and X1, X2, …
• HA: β1 ≠ 0 and/or β2 ≠ 0
  at least one of the slopes is significant; there is a significant relationship between the response variable and the explanatory variables as a group.
• Use p-value from Excel “Significance-F”
10th Note
Multiple Linear Regression; Dummy Variables; Time Series – some things to note
Dummy variables
• Interpretation of dummy variables… see example.
• The dummy variable trap…
• Testing the significance of a dummy variable is the same as testing whether there is a significant difference between the means of the two categories.
Time Series
• Interpretation of trend line, trend = a + bt
• Trend is [a units] at [origin] and [increases / decreases] by [b units] each [time period, t].
10th Note
Components of a classical time series model
Trend
Cyclical
Seasonal
Irregular
Dummy Variables
1. Consider the following equation:
• Income = β0 + β1experience + β2gender + ε
• where gender = 1 if male, 0 if female.
a. State what you expect the sign of β1 and β2 to be. Explain why.
b. Interpret the following:
i. The slope coefficient on gender.
ii. The slope coefficient on experience.
c. Redefine gender to be 1 if female, 0 if male. What happens to β2?
2. Suppose that we want to examine the level of crime in different regions of Adelaide: north, south, east and west. In other words, in our regression model, crime level is the response variable, and region is the explanatory variable. Create a dummy variable for the region.
Solutions – for 2
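One possible encoding for question 2 is sketched below. With four regions, three dummy variables are enough; including a fourth alongside an intercept would fall into the dummy variable trap. Choosing north as the base category is an arbitrary assumption for this sketch, as are the variable names.

```python
REGIONS = ["north", "south", "east", "west"]
BASE = "north"  # hypothetical base category; any one region could be omitted

def region_dummies(region):
    """Return the three 0/1 dummy variables for one observation's region."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    return {f"region_{r}": int(region == r) for r in REGIONS if r != BASE}

for region in REGIONS:
    print(region, region_dummies(region))
```

An observation from the base region has all three dummies equal to 0, so each slope on a dummy is read as a difference from the base region.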
10th E.g.
Time Series and Price Indices
• Price relative = $100 \times p_t / p_0$, where $p_t$ is the price in period t and $p_0$ is the price in the base period
• Be careful about the difference between a percentage increase and percentage point increase.
Assume a, b > 100
• Interpretation: a price index of a in Year A means prices are (a – 100)% higher in Year A than in the base year / there has been a (a – 100)% increase
• The increase in the index number from Year A to Year B is (b – a) percentage points or… $100 \times (b - a)/a$ %
• Note: you could do the same using prices, instead of price indices.
• Interpretation of average price relatives: on average, the price of the … goods increased by …% between … and … (*)
• Could do the same for expenditure … … but of little use.
• Same interpretation, but instead of “price” use “cost”.
11th Note
Year Base year A B
Price index 100 a b
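The difference between a percentage point change and a percentage change in an index is easy to see with numbers. The index values a = 120 and b = 150 below are hypothetical.

```python
def index_change(a, b):
    """Change in a price index from value a to value b.

    Returns (percentage points, percent).
    """
    points = b - a
    percent = 100 * (b - a) / a
    return points, percent

# Hypothetical index values for Year A and Year B (base year = 100).
points, percent = index_change(a=120, b=150)
print(f"{points} percentage points, {percent}% increase")
```

Here the index rises by 30 percentage points, but prices in Year B are only 25% higher than in Year A.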
Time Series and Price Indices
• Laspeyres Price Index = $100 \times \sum p_t q_0 / \sum p_0 q_0$. This is the increase in the cost of the time 0 basket of goods in time t relative to what they cost in time 0.
• Paasche Price Index = $100 \times \sum p_t q_t / \sum p_0 q_t$. This is the increase in the cost of the time t basket of goods in time t (e.g. 2010) relative to what they would have cost in time 0 (e.g. 2008).
• Same interpretation as (*)
• Note:
• Why the Laspeyres and Paasche Indices differ.
• How to shift the base, and chain series.
• Nominal = in current prices. Real = in constant (base year) prices
• Real prices = $100 \times \text{nominal price} / \text{price index}$ (if price index base = 100)
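A minimal sketch of the two indices, assuming the standard definitions (Laspeyres weights prices by base-period quantities, Paasche by current-period quantities). The two-good basket is hypothetical, as are the function names.

```python
def laspeyres(p0, pt, q0):
    """Cost of the time-0 basket at time-t prices relative to time-0 prices."""
    return 100 * sum(p * q for p, q in zip(pt, q0)) / sum(p * q for p, q in zip(p0, q0))

def paasche(p0, pt, qt):
    """Cost of the time-t basket at time-t prices relative to time-0 prices."""
    return 100 * sum(p * q for p, q in zip(pt, qt)) / sum(p * q for p, q in zip(p0, qt))

def real_value(nominal, price_index):
    """Convert a nominal amount to base-year prices (index base = 100)."""
    return nominal * 100 / price_index

# Hypothetical two-good basket: prices and quantities at time 0 and time t.
p0, q0 = [2, 5], [10, 4]
pt, qt = [3, 6], [8, 5]

lasp = laspeyres(p0, pt, q0)
paas = paasche(p0, pt, qt)
print(lasp, round(paas, 2), real_value(54, lasp))
```

In this example the Laspeyres index is the larger of the two, because consumers shifted their purchases away from the good whose price rose proportionally more.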
11th Note
Time Series and Price Indices
Discussion question – what are the limitations of the CPI?
• Overestimates price increases because it is a type of Laspeyres index
• What items are included in the goods basket? (Can’t include all of them!)
• Only surveys metropolitan households
• Data taken from survey – potential sources of sampling bias
• Does not account for change in quality in goods with same / lower price (e.g. computers)
• How do you include new technology that didn’t exist in the previous period?
• What prices do you take? CPI doesn’t take into account sales / specials
11th Note