Statistics:
Concepts of statistics for researchers
How to Use This Course Book
This course book accompanies the face-to-face session taught at IT Services. It contains a copy of the slideshow and the worksheets.
Software Used
We might use Excel to capture your data, but no other software is required. Since this is a Concepts course, we will concentrate on exploring ideas and underlying concepts that researchers will find helpful in undertaking data collection and interpretation.
Revision Information
Version Date Author Changes made
1.0 January 2014 John Fresen Course book version 1
2.0 October 2014 John Fresen Updates to slides
3.0 January 2015 John Fresen Updates to slides
4.0 February 2015 John Fresen Updates to slides and worksheets
…
…
14.0 December 2016 John Fresen Updates to slides and worksheets
15.0 February 2017 John Fresen Updates to slides and worksheets
16.0 May 2017 John Fresen Updates to slides and worksheets
17.0 October 2017 John Fresen Updates to slides and worksheets
Copyright
The copyright of this document lies with Oxford University IT Services.
Contents
1 Introduction
1.1. What You Should Already Know
1.2. What You Will Learn
2 Your Resources for These Exercises
2.1. Help and Support Resources
3 What Next?
3.1. Statistics Courses
3.2. IT Services Help Centre
Statistics Concepts
1 Introduction
Welcome to the course Statistics: Concepts.
This is a statistical concepts course, an ideas course, a think-in-pictures course. What are the basic notions and constructs of statistics? Why do we differentiate between a population and a sample? How do we summarize and describe sample information? Why, and how, do we compare data with expectations? How do hypotheses arise and how do we set about testing them? With inherent uncertainty in any sample, how can one extrapolate from a sample to the population? And then, how strong are our conclusions?
This course is designed to prepare you to get the most from the statistical applications that we teach. It involves discussion of real-life examples and interpretation of data. We strive to avoid mathematical symbols, notation and formulae.
1.1. What You Should Already Know
We assume that you are familiar with entering and editing text, rearranging and formatting text (drag and drop, copy and paste), printing and previewing, and managing files and folders.
The computer network in IT Services may differ slightly from that which you are used to in your College or Department; if you are confused by the differences, ask for help from the teacher.
1.2. What You Will Learn
In this course we will cover the following topics:
• Descriptive statistics and graphics
• Population and sample
• Probability and probability distributions
• Comparing conditional distributions
• Confidence intervals
• Linear regressions
• Hypothesis testing
• From problem – to data – to conclusions
• Where to get help
Topics covered in related Statistics courses, should you be interested, are given in Section 3.1.
2 Your Resources for These Exercises
The exercises in this handbook will introduce you to some of the tasks you will need to carry out when working with WebLearn. Some sample files and documents are provided for you; if you are on a course held at IT Services, they will be on your network drive H:\ (Find it under My Computer).
During a taught course at IT Services, there may not be time to complete all the exercises. You will need to be selective, and choose your own priorities among the variety of activities offered here. However, those exercises marked with a star * should not be skipped.
Please complete the remaining exercises later in your own time, or book for a Computer8 session at IT Services for classroom assistance (See section 8.2).
2.1. Help and Support Resources
You can find support information for the exercises on this course and your future use of WebLearn, as follows:
• WebLearn Guidance https://weblearn.ox.ac.uk/info (This should be your first port of call)
If at any time you are not clear about any aspect of this course, please make sure you ask John for help. If you are away from the class, you can get help and advice by emailing the central address [email protected].
The website for this course including reading material and other material can be found at https://weblearn.ox.ac.uk/x/Mvkigl
You are welcome to contact John about statistical issues and questions at [email protected]
3 What Next?
3.1. Statistics Courses
Now that you have a grasp of some basic concepts in Statistics, you may want to develop your skills further. IT Services offers further Statistics courses and details are available at http://courses.it.ox.ac.uk.
In particular, you might like to attend the course
Statistics: Introduction: this is a four-session module which covers the basics of statistics and aims to provide a platform for learning more advanced tools and techniques.
Courses on particular discipline areas or data analysis packages include:
R: An introduction
R: Multiple Regression using R
Statistics: Designing clinical research and biostatistics
SPSS: An introduction
SPSS: An introduction to using syntax
STATA: An introduction to data access and management
STATA: Data manipulation and analysis
STATA: Statistical, survey and graphical analyses
3.2. IT Services Help Centre
The IT Services Help Centre at 13 Banbury Road is open by appointment during working hours, and on a drop-in basis from 6:00 pm to 8:30 pm, Monday to Friday.
The Help Centre is also a good place to get advice about any aspect of using computer software or hardware. You can contact the Help Centre on (2)73200 or by email on [email protected]
Statistics Concepts
October 2017
Thanks to:
Dave Baker, IT Services, University of Oxford
Jill Fresen, IT Services, University of Oxford
Jim Hanley, McGill University, Montreal, Quebec, Canada
Michael Friendly, York University, Toronto, Ontario, Canada
Margaret Glendining, Rothamsted Experimental Station
Ian Sinclair, REES Group, Oxford
[email protected]@gmail.com
Session 1: Setting the scene
We are drowning in information but starving for knowledge
– Rutherford D. Rogers
Research question – particular problem – collect data – draw conclusions
Statistical models:
observe = truth + error
observe = model + error
observe = signal + noise
Fundamental assumption of statistics: noise/error is ubiquitous
Sir Francis Galton (16 February 1822 – 17 January 1911)
http://en.wikipedia.org/wiki/Francis_Galton
General: What do we inherit from our ancestors?
Particular: Do tall parents have tall children and short parents, short children?
Data: Famous 1885 study: 205 sets of parents 928 offspring
Peas: pea pods 9
4
Photo: first 12 families listed in Galton’s notebook.
Sir Ronald Fisher - The grandfather of statistics (17 February 1890 – 29 July 1962)
http://en.wikipedia.org/wiki/Ronald_Fisher
We’ll use his potato data
8
5
T. Eden and R. A. Fisher (1929) Studies in Crop Variation. VI. Experiments on the Response of the Potato to Potash and Nitrogen. J. Agricultural Science 19, 201–213.
9
H. V. Roberts (1979)
10
6
Data sets summary:
Galton: Do tall parents have tall children?
Do big peas produce big peas?
Fisher: Response of the Potatoes to Potash and Nitrogen
Roberts: Do women earn less than men?
Session 2: Descriptive statistics
Strongly recommend:
Speaking of Graphics by Paul J. Lewi: http://www.datascope.be/sog.htm
The Visual Display of Quantitative Information by Edward Tufte
The Golden Age of Statistical Graphics by Michael Friendly, in Statistical Science, 2008, Vol 23, No 4, p502-535
The Grammar of Graphics by Leland Wilkinson
Michael Friendly’s graphics page: http://www.datavis.ca/
Visualize
Model
Transform
Observation/perception is interpretive:
• describe your data
• tell the story of your data
• what is your data saying?
Narration depends on many things:
• extent of knowledge
• purpose of description
e.g. describe your research
Describe the source and location of data
• How was data obtained?
• Where is it stored?
• What processing has been done on the data?
• Who has access to data?
Numerical descriptors of a data set
(Usually most uninformative - difficult to interpret)
• Order statistics – smallest to biggest
• Mean/average
• Variance and standard deviation
• Quartiles, percentiles
• Prevalence of HIV/Aids
. . . many more
Graphical descriptors of a data set:
(A picture says a thousand words)
• Dot plot
• Box and whisker plot
• Histogram
• Pie chart
• Scatterplot
. . . many more
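degrees of freedom (df) = number of data points − 1
(total) variation = sum of squared deviations from the mean
variance = variation / df
standard deviation = square root of variance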
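The four quantities on this slide can be checked on a small data set. The sketch below is only an illustration: it uses the five values from Worksheet 1, Question 1, and the variable names are my own.

```r
# Data from Worksheet 1, Question 1
x <- c(2, 3, 5, 8, 12)

n         <- length(x)
df        <- n - 1                    # degrees of freedom
variation <- sum((x - mean(x))^2)     # sum of squared deviations from the mean
variance  <- variation / df           # same value as the built-in var(x)
std_dev   <- sqrt(variance)           # same value as the built-in sd(x)

cat("mean =", mean(x), " variation =", variation,
    " variance =", variance, " sd =", round(std_dev, 2), "\n")
```

The results agree with the worksheet answers (mean 6, variation 66, variance 16.5, sd 4.06), and `var(x)` and `sd(x)` give the same variance and standard deviation directly.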
Guess means and sd’s
[Figure: Histogram of sons' heights (481 sons); x-axis: height (in), y-axis: Frequency]
[Figure: Histogram of daughters' heights (453 girls); x-axis: height (in), y-axis: Frequency]
Probability density histograms
[Figure: Probability histogram of sons' heights (481 sons); x-axis: height (in), y-axis: Density]
[Figure: Probability histogram of daughters' heights (453 girls); x-axis: height (in), y-axis: Density]
Do worksheet 1
Preferably work in pairs or groups
Session 3: Probability and probability distributions
What is an experiment?
Classical probability: assumes equally likely outcomes (games of chance)
• toss a coin
• roll a die
Empirical probability: a percentage (an observed relative frequency)
• probability of smokers developing lung cancer (Richard Doll: 1950)
• probability of a motor insurance claim
Subjective probability: can vary from person to person
• probability of a business venture being successful
• probability of a successful heart replacement
• probability of Oxford winning the Boat Race
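The difference between classical and empirical probability can be seen by simulating a die. This is a minimal sketch under my own assumptions: the seed and the number of rolls are arbitrary choices, not part of the course material.

```r
set.seed(2017)                        # arbitrary seed, for reproducibility
n_rolls <- 60000                      # hypothetical number of rolls
rolls <- sample(1:6, size = n_rolls, replace = TRUE)

classical <- 1 / 6                    # six equally likely outcomes
empirical <- mean(rolls == 6)         # observed proportion of sixes

cat("classical:", round(classical, 4),
    " empirical:", round(empirical, 4), "\n")
```

The empirical value is a percentage computed from data, so it changes from run to run, but with many rolls it should sit close to the classical value of 1/6.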
Discrete probability distributions
Continuous probability distributions
[Figure: density curves of heights for Sons and Daughters; x-axis: height (in), y-axis: Density]
[Figure: three density curves of sons' heights with shaded areas; x-axis: height (in), y-axis: Density]
Probability of selecting a son between 65 and 72 inches: area = 0.78
Probability of selecting a son between 63 and 75 inches: area = 0.97
Probability of selecting a son between 69 and 74 inches: area = 0.51
Normal or Gaussian distribution
Affectionately called the bell curve
[Figure: standard normal density curve; x-axis from -4 to 4, y-axis from 0.0 to 0.4]
De Moivre (1733); Laplace (1783); Gauss (1809)
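Probabilities for a normal distribution are areas under the bell curve. The sketch below uses base R's `pnorm` and `dnorm` for the standard normal curve; the helper name `within_sd` is my own.

```r
# Area under the standard normal curve within k standard deviations of the mean
within_sd <- function(k) pnorm(k) - pnorm(-k)

areas <- sapply(1:3, within_sd)
names(areas) <- paste0("within ", 1:3, " sd")
print(round(areas, 3))

# The total area under any probability density curve is 1
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
```

Roughly 68%, 95% and 99.7% of the area lies within one, two and three standard deviations of the mean.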
47
Do worksheet 2
Session 4: Population and Sample
Thanks to Dilbert Cartoons
fundamental notions of statistics:
Population
Variation in population
Sample
Describe variation by a probability distribution
26
51
fundamental strategy of statistics: compare observations with expectations
do men and women earn similar salaries?
is yield under fertilizer A same as yield under fertilizer B?
is the generic alternative as good as the brand name drug?
do ART children compare with normal children?
52
fundamental method of statistics:
compare conditional distributions
e.g.
compare salary conditional on gender
compare yield conditional on fertilizer
27
53
Population and sample
Population
Sample
Pros and cons of these constructs?
Anything calculated from a sample is called a statistic
e.g.
average, maximum, range, proportion having HIV/Aids
or a combination of these
Statistical Inference
sample → population (extrapolate)
Going from particular to general – inductive inference
Controversial issue
Hume (1777); Stanford Encyclopedia of Philosophy; Wikipedia; many others
Statistical Inference
One can describe the statistical aspects of any sample, but can only reliably extrapolate from a random sample
Why a random sample?
What is wrong with a random sample?
How do we obtain a random sample?
60
Your notes:
31
Population
61
Each time we take another random sample we get a different answer
The distribution of these answers is called the sampling distribution
The standard deviation of the sampling distribution is called the standard error
Nearly all statistical theory assumes a random sample
If the sample is not random
- we can’t rely on statistical theory
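The idea of a sampling distribution can be explored by simulation. This sketch assumes a made-up population of 600 heights (it is not the course's button data) and repeatedly draws random samples of size 20, as in the lunch-time exercise.

```r
set.seed(1)
# Hypothetical population of 600 heights in inches (not the button data)
population <- rnorm(600, mean = 67, sd = 3)

n <- 20                                        # sample size
sample_means <- replicate(5000, mean(sample(population, n)))

# Standard error = standard deviation of the sampling distribution
se_simulated <- sd(sample_means)

# Theoretical value; the last factor is the finite population correction,
# needed because we sample without replacement from only 600 values
N <- length(population)
pop_sd <- sqrt(mean((population - mean(population))^2))
se_theory <- pop_sd / sqrt(n) * sqrt((N - n) / (N - 1))

cat("simulated SE:", round(se_simulated, 3),
    " theoretical SE:", round(se_theory, 3), "\n")
```

Each of the 5000 samples gives a different average; their spread is the standard error, and it should agree closely with the theoretical value.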
What happens as the sample gets large?
• the sample average converges to the population average (the law of large numbers, LLN)
• the distribution of the sample average converges to a normal distribution (the central limit theorem, CLT)
Two central pillars of statistics: LLN and CLT
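Both pillars can be watched in action on a population that is far from normal. The sketch below assumes an exponential population with mean 1; that choice, the seed and the sample sizes are mine, purely for illustration.

```r
set.seed(1)
# Hypothetical skewed population: exponential with mean 1 (clearly not normal)

# LLN: the running average settles down near the population average of 1
draws <- rexp(10000, rate = 1)
running_mean <- cumsum(draws) / seq_along(draws)

# CLT: averages of samples of size 50 are approximately normally distributed
sample_means <- replicate(2000, mean(rexp(50, rate = 1)))

hist(sample_means, freq = FALSE,
     main = "Sampling distribution of the average", xlab = "sample average")
# Normal curve with mean 1 and standard error 1/sqrt(50), for comparison
curve(dnorm(x, mean = 1, sd = 1 / sqrt(50)), add = TRUE)
```

Plotting `running_mean` against the number of draws shows the average homing in on 1, while the histogram of `sample_means` is close to the superimposed bell curve even though individual draws are strongly skewed.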
Same applies to most other statistics:
• the sample quantity converges to the population quantity
• the sampling distribution converges to the normal distribution
Example of a sampling distribution
At lunch time we’ll work in pairs. Each pair takes a random sample of size 20 from the population of 600 buttons.
Compute five statistics:
• average ht
• average wt
• average BMI
• proportion having hypertension (bp)
• proportion having diabetes (db)
For comparison:
Number of random samples of size 20 from 600 buttons: approx 1 × 10^37
Number of random samples of size 30 from 600 buttons: approx 4 × 10^50
approx 10^24 stars in the observable universe
approx 10^80 atoms in the observable universe
https://www.space.com/26078-how-many-stars-are-there.html
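The counts of possible samples are binomial coefficients ("600 choose 20"), which can be checked with base R's `choose`. A minimal sketch, with variable names of my own:

```r
# Number of distinct samples (order ignored, no repeats) from 600 buttons
n20 <- choose(600, 20)    # samples of size 20
n30 <- choose(600, 30)    # samples of size 30

cat("size 20:", format(n20, digits = 2), "\n")
cat("size 30:", format(n30, digits = 2), "\n")
```

Both numbers are far larger than the quoted number of stars in the observable universe, which is why no two pairs are ever likely to draw the same sample.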
My students at a Statistics conference in 2003
70
Your notes: What would make this group of students a population and what would make it a sample?
36
72
Your notes: How would one define the reference population for Moure open cast coal and how would one attempt to get a random sample from it?
37
73
estimate the density of stomata on undersides of loblolly pine needles
74
TABLE I: Number of stomata per centimetre on each of ten loblolly pine needles (four counts per needle).
needle   counts
1        149 143 138 131
2        136 139 129 143
3        143 142 124 134
4        121 133 126 130
5        148 121 124 128
6        129 134 127 113
7        127 130 123 125
8        134 137 119 130
9        117 128 119 118
10       129 132 131 137
Loblolly pine is one of several pines native to the southeastern United States, from central Texas east to Florida, and north to Delaware and southern New Jersey
How would one define the reference population for loblolly pine trees?
How would you attempt to obtain a random sample from loblolly pine trees?
Closing Remarks
Our conclusions will be no stronger than the degree to which the constructs, assumptions and mathematical models correlate with the real world:
• Population (Can we define this?)
• Random sample (How do we achieve this?)
• Limited information (What variables or factors are important?)
• Simplified models (Linear regression, normal distribution)
• Probability of an error
• Quasi-Modus Tollens Argument (to come later)
Session 5: Confidence intervals
40
79
When random sampling from a population to obtain an estimate of a population parameter (mean, prevalence of HIV) the sample estimate is a random quantity.
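By the CLT, the sampling distribution of the estimate will be approximately normal, so a confidence interval is: estimate ± multiplier × standard error
Multipliers: 90% CI: 1.64; 95% CI: 1.96; 99% CI: 2.58
(The more confidence you want, the wider the CI)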
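The three multipliers come from the standard normal distribution and can be recovered with base R's `qnorm`. A minimal sketch; the variable names are my own.

```r
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided interval: half of the excluded probability goes in each tail
multipliers <- qnorm(1 - (1 - conf_levels) / 2)

print(round(multipliers, 2))
```

Higher confidence levels push the cut-off further into the tail, which is why the interval gets wider.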
80
Your notes:
41
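For our sampling exercise of the buttons data, a 95% CI for the
mean (ht or wt or BMI) of the population is: sample average ± 1.96 × standard deviation / √(sample size)
prevalence (% diabetic or % hypertensive) is: sample prevalence ± 1.96 × √(prevalence × (1 − prevalence) / sample size)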
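The two 95% confidence interval formulas from Worksheet 3 (sample average ± 1.96 × standard deviation / √n, and sample prevalence ± 1.96 × √(prevalence × (1 − prevalence) / n)) can be wrapped in small helper functions. The function names and the sample values below are hypothetical, chosen only to show the arithmetic; substitute your own button sample.

```r
# 95% CI for a population mean: sample average +/- z * sd / sqrt(n)
ci_mean <- function(x, z = 1.96) {
  half_width <- z * sd(x) / sqrt(length(x))
  c(LCL = mean(x) - half_width, UCL = mean(x) + half_width)
}

# 95% CI for a prevalence: p +/- z * sqrt(p * (1 - p) / n)
ci_prevalence <- function(cases, n, z = 1.96) {
  p <- cases / n
  half_width <- z * sqrt(p * (1 - p) / n)
  c(LCL = p - half_width, UCL = p + half_width)
}

# Hypothetical sample of 20 heights (inches)
ht <- c(64, 66, 67, 63, 70, 68, 65, 69, 71, 66,
        62, 67, 68, 64, 66, 70, 72, 65, 67, 69)
print(round(ci_mean(ht), 2))

# Hypothetical count: 5 of the 20 sampled buttons are hypertensive
print(round(ci_prevalence(5, 20), 3))
```

With a sample as small as 20, many texts would replace 1.96 by the slightly larger t multiplier (about 2.09 for 19 degrees of freedom); the sketch keeps 1.96 to match the slides.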
Session 6: Regression
42
83
Do tall parents have tall children, short parents short children?
[Figure: Frequency scatterplot of Galton Data; x-axis: Midparent, y-axis: Child]
[Figure: Child vs Midparent; x-axis: Midparent, y-axis: Child height]
[Figure: Child vs Midparent, Child jittered; x-axis: Midparent, y-axis: Child height]
Regression = means of conditional distributions
[Figure: trace of actual means, regression of Child on Midparent; x-axis: Midparent, y-axis: Child]
[Figure: trace of linear regression means, assumes means lie on a straight line; x-axis: Midparent, y-axis: Child]
[Figure: superimposing actual and linear regressions; x-axis: Midparent, y-axis: Child]
Linear regression model assumes:
1. Conditional distributions are normal
2. Conditional means lie on a straight line
3. Conditional distributions all have the same spread
[Figure: Linear regression model; y-axis: Child]
[Figure: Linear regression fitted to data; y-axis: Child]
mid(ph)  64.0 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5 72.5
ave(ch)  65.3 65.6 66.3 66.9 67.6 68.2 68.9 69.5 70.2 70.8
What did Galton get from his linear regression?
He concluded that:
YES, tall parents do tend to have tall children, but their children regress down to the population average
YES, short parents tend to have short children, but their children tend to regress up to the population average
How fortunate. Imagine if this were not so.
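Regression towards the average shows up as a slope below 1. As an illustration only, the sketch below fits a straight line to the ten (midparent, average child) pairs in the table above; it does not use Galton's full data set of 928 children, and the variable names are my own.

```r
# Values copied from the mid(ph) / ave(ch) table
mid_parent <- c(64.0, 64.5, 65.5, 66.5, 67.5, 68.5, 69.5, 70.5, 71.5, 72.5)
ave_child  <- c(65.3, 65.6, 66.3, 66.9, 67.6, 68.2, 68.9, 69.5, 70.2, 70.8)

fit <- lm(ave_child ~ mid_parent)
slope <- unname(coef(fit)["mid_parent"])

print(round(coef(fit), 2))

# Parents span 8.5 inches, but the children's averages span only 5.5 inches
cat("range of parents:", diff(range(mid_parent)),
    " range of children:", diff(range(ave_child)), "\n")
```

A slope of about 0.65 means that each extra inch of midparent height goes with only about two thirds of an inch of extra child height, so the children of very tall or very short parents sit closer to the population average than their parents do.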
He drew similar conclusions about other hereditary factors
such as intelligence for example
Intelligence testing began to take a concrete form with Sir Francis Galton
considered to be the father of mental tests
90
46
What is the purpose of doing regression?
Prediction?
Credit scoring - predict the probability of bad debt from various characteristics of a client
Explanation?
which type of advertising (radio, TV, billboards) is most effective for improving TESCO sales
Other?
92
Your notes:
47
Does the average diameter of child peas depend on the average diameter of parent peas?
93
94
Your notes:
48
95
Do worksheet 4 on regression
Session 7: Hypothesis formulation and testing
H. V. Roberts (1979)
Why did Roberts collect these data? 98
50
99
Your notes: about the starting salary data
Research Hypothesis (often called the alternative hypothesis)
is what we are trying to prove, i.e.
female starting salaries < male starting salaries
(should have expressed this in terms of conditional distributions)
Null Hypothesis
is the negation of the research hypothesis i.e. the research hypothesis is null and void
female salaries = male salaries
(should have expressed this in terms of conditional distributions)
Neyman-Pearson: Based on court of law
intention of conviction; assumed innocent until “proved guilty”
Of course, we can always make an error in our decision
102
Court of law
                 Truth (Always unknown)
Our Decision     Innocent            Guilty
Acquit           Correct Decision    Type II Error
Convict          Type I Error        Correct Decision
Neyman-Pearson Statistical hypothesis test
                       Truth (Always unknown)
Our Decision           Null true           Alternative true
Accept null            Correct Decision    Type II Error
Accept alternative     Type I Error        Correct Decision
Diagnostic test
                 Truth (Gold standard)
Test             True positive       True negative
positive         True positives      False positives
negative         False negatives     True negatives
Sensitivity = proportion of true positives that are correctly identified as such
Specificity = proportion of true negatives that are correctly identified as such
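Sensitivity and specificity are simple proportions taken down the columns of the diagnostic table. The counts below are hypothetical, invented only to show the calculation.

```r
# Hypothetical counts, for illustration only
true_pos  <- 90     # truly positive, test positive
false_neg <- 10     # truly positive, test negative
false_pos <- 45     # truly negative, test positive
true_neg  <- 855    # truly negative, test negative

# Proportion of true positives that are correctly identified as such
sensitivity <- true_pos / (true_pos + false_neg)

# Proportion of true negatives that are correctly identified as such
specificity <- true_neg / (true_neg + false_pos)

cat("sensitivity =", sensitivity, " specificity =", specificity, "\n")
```

With these made-up counts the test finds 90% of the true positives and correctly clears 95% of the true negatives.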
54
107
Your notes:
Elements of a NP Hypothesis Test
Alternative hypothesis (Research hypothesis)
Null hypothesis
Test statistic
Rejection regions
Conclusions
55
109
Your notes:
starting salaries, conditional on gender
110
56
Fisherian Inference
Null hypothesis but no alternative hypothesis
compute the p-value of the test statistic under the null hypothesis
small p-values are thought to be evidence against the null hypothesis
Short: NHST (null hypothesis significance testing)
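A p-value is the probability, computed assuming the null hypothesis is true, of a test statistic at least as extreme as the one observed. The sketch below uses made-up starting salaries (these are not Roberts's data) and a pooled two-sample t-test as one possible test statistic; it computes the p-value with `t.test` and again by hand.

```r
# Hypothetical starting salaries (not Roberts's data)
female <- c(4620, 5040, 5100, 5220, 5400, 5700)
male   <- c(5400, 5700, 6000, 6000, 6300, 6840)

# One-sided test: are female salaries lower than male salaries?
tt <- t.test(female, male, alternative = "less", var.equal = TRUE)

# The same calculation by hand
n1 <- length(female)
n2 <- length(male)
pooled_var <- ((n1 - 1) * var(female) + (n2 - 1) * var(male)) / (n1 + n2 - 2)
t_stat  <- (mean(female) - mean(male)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
p_value <- pt(t_stat, df = n1 + n2 - 2)   # lower-tail area under the null

cat("t =", round(t_stat, 2), " p-value =", signif(p_value, 2), "\n")
```

A small p-value says that a difference this large would be unusual if the two groups really had the same average salary.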
57
Strength of a Statistical Argument
Modus Tollens Argument (very strong)
Have some hypothesis H
Has some implication I
-----------------------------
But evidence shows not I
-----------------------------
Therefore conclude not H
Real life implementation of Modus Tollens
Quasi Modus Tollens Argument (not so strong)
Have some hypothesis H together with assumptions A1, A2, ..., Ak
Have some implication I
-----------------------------
But evidence shows that I is unlikely
-----------------------------
Therefore conclude not H
59
The logic of Neyman‐Pearson statistics is to adopt
decision procedures with known long‐term error
rates (of false positives and false negatives) and
then control those errors at acceptable levels.
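The phrase "known long-term error rates" can be illustrated by simulation. In this sketch the null hypothesis is true by construction (both groups come from the same normal distribution), so every rejection is a false positive; the seed, group size and number of repetitions are arbitrary choices of mine.

```r
set.seed(1)
alpha  <- 0.05      # chosen false positive (Type I error) rate
n_sims <- 4000      # hypothetical number of repeated experiments

# Both groups are drawn from the same distribution, so the null is true
p_values <- replicate(n_sims, t.test(rnorm(20), rnorm(20))$p.value)

# Long-run proportion of experiments that wrongly reject the null
type1_rate <- mean(p_values < alpha)

cat("observed Type I error rate:", round(type1_rate, 3), "\n")
```

Over many repetitions the proportion of false rejections settles near the chosen level of 5%, which is the sense in which the procedure controls its error rate.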
118
Your notes:
60
Bayesian Inference
In the Bayesian approach
P(H) = probability that H is true (prior)
P(data | H) = probability of the data given that H is true (likelihood)
P(H | data) = probability that H is true given the data (posterior)
Bayes Theorem: posterior ∝ likelihood × prior
P(H | data) ∝ P(data | H) × P(H)
Bayes factor = P(data | H1) / P(data | H2), for two competing hypotheses H1 and H2
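Bayes' theorem (posterior ∝ likelihood × prior) is easiest to follow with just two competing hypotheses. The example below is entirely hypothetical: a coin is either fair or biased towards heads, each is given prior probability 0.5, and we observe 8 heads in 10 tosses.

```r
# Hypothetical example: is a coin fair or biased towards heads?
prob_heads <- c(fair = 0.5, biased = 0.8)   # P(heads) under each hypothesis
prior      <- c(fair = 0.5, biased = 0.5)   # prior probabilities

# Likelihood of the observed data (8 heads in 10 tosses) under each hypothesis
likelihood <- c(fair   = dbinom(8, size = 10, prob = 0.5),
                biased = dbinom(8, size = 10, prob = 0.8))

# posterior is proportional to likelihood * prior; rescale to sum to 1
unnormalised <- likelihood * prior
posterior    <- unnormalised / sum(unnormalised)

# Bayes factor: how much better "biased" predicts the data than "fair"
bayes_factor <- unname(likelihood["biased"] / likelihood["fair"])

print(round(posterior, 3))
cat("Bayes factor (biased vs fair):", round(bayes_factor, 2), "\n")
```

Because the two priors are equal here, the posterior odds equal the Bayes factor; with unequal priors the posterior odds would be the Bayes factor multiplied by the prior odds.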
Real problem → Mathematical translation → Mathematical solution → Interpret solution
Please do worksheet 5
63
Session 8: ANOVA
Variation = sum of squared deviations from the mean
Variance = average variation = variation/degrees of freedom
e.g. data: 1, 2, 3, 4, 5
mean = 3
df = 4
variation = 10 {= (1 − 3)² + (2 − 3)² + (3 − 3)² + (4 − 3)² + (5 − 3)²}
variance = 2.5 {= 10/4}
The Analysis of Variance Equations (ANOVA)
total variation = variation explained by the model + variation due to noise
total df = df for model + df due to noise
ANOVA in pictures - for a simple regression model
[Figure: four panels showing Original data; SSTotal = 67.209; SSModel = 19.881; SSError = 47.328; y-axis: y]
Analysis of Variance Table
Source   Df   Sum Sq   Mean Sq   F-ratio
model     1   19.881   19.881    1.6803
error     4   47.328   11.832
Total     5   67.209
R2 = percent variation explained by model = 19.881/67.209 = 29.58%
Variance ratio = F-value = 1.68
Residual standard error = sqrt(11.832)
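Every entry in this small ANOVA table follows from the two sums of squares and their degrees of freedom. The sketch below simply redoes the arithmetic with the numbers quoted on the slide; the variable names are my own.

```r
# Sums of squares and degrees of freedom quoted on the slide
ss_model <- 19.881; df_model <- 1
ss_error <- 47.328; df_error <- 4

ss_total <- ss_model + ss_error       # total variation = model + noise
df_total <- df_model + df_error       # total df = model df + noise df

ms_model <- ss_model / df_model       # mean square = variation / df
ms_error <- ss_error / df_error

f_ratio   <- ms_model / ms_error      # variance ratio
r_squared <- ss_model / ss_total      # share of variation explained by model
rse       <- sqrt(ms_error)           # residual standard error

cat("SSTotal =", ss_total, " F =", round(f_ratio, 4),
    " R2 =", round(100 * r_squared, 2), "%  RSE =", round(rse, 2), "\n")
```

The same steps can be applied to the Galton table that follows to fill in its R-squared, F-value and residual standard error.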
ANOVA Galton Data
Source   Df    Sum Sq   Mean Sq   F value   Pr(>F)
Model      1   1236.9   1236.93   246.84    <2e-16
Error    926   4640.3      5.01
Total    927   5877.2
R-squared:
F-value:
Residual standard error:
Estimated Coefficients:
            Est     Std. Err   t-value   Pr(>|t|)
Intercept   23.94   2.81        8.517    <2e-16
Midparent   0.646   0.041      15.711    <2e-16
Regression equation:
average height = 23.94 + 0.646 × Midparent height
Conclusions:
Although the model is crude (it is based on midparent height alone and does not include gender or any other factors that might explain variation in child height), it is nonetheless a highly significant model.
T. Eden and R. A. Fisher (1929) Studies in Crop Variation. VI. Experiments on the Response of the Potato to Potash and Nitrogen. J. Agricultural Science 19, 201–213.
139
140
71
141
72
143
ANOVA - Fisher’s potato data
Source   Df   Sum-Sq   Mean-Sq   F-ratio   Pr(>F)
nitrog    3   209646   69882     31.220    4e-12
potash    3    32926   10975      4.903    0.004
Resid    57   127589    2238
TOTAL    63   370161
Conclusions?
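As an aid to drawing conclusions, the F-ratios and p-values for the potato data can be recomputed from the degrees of freedom and sums of squares. The sketch below uses only the numbers in the ANOVA table for Fisher's potato data and base R's `pf`; the data frame layout and column names are my own.

```r
# Degrees of freedom and sums of squares from the potato ANOVA table
anova_tab <- data.frame(
  source = c("nitrog", "potash", "Resid"),
  df     = c(3, 3, 57),
  sum_sq = c(209646, 32926, 127589)
)

anova_tab$mean_sq <- anova_tab$sum_sq / anova_tab$df
ms_resid <- anova_tab$mean_sq[3]

# F-ratio: each factor's mean square relative to the residual mean square
f_ratio <- anova_tab$mean_sq[1:2] / ms_resid

# p-value: upper-tail area of the F distribution with (3, 57) df
p_value <- pf(f_ratio, df1 = anova_tab$df[1:2], df2 = anova_tab$df[3],
              lower.tail = FALSE)

print(data.frame(source = anova_tab$source[1:2],
                 F = round(f_ratio, 3), p = signif(p_value, 1)))
```

Both factors have small p-values, with nitrogen showing by far the larger F-ratio; how to word the conclusion is left to the discussion.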
73
145
Two‐way ANOVA without interaction
Two‐way ANOVA with interaction
146
Your notes:
74
147
Do worksheet 6
Bibliography
1. Cole, T. J. (2000) Galton’s Midparent Height Revisited. Annals of Human Biology, 27, 401–405.
2. Daly, C. (1964) Statistical Games. Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 13, No. 2, pp. 74–83.
3. Friendly, M. (2008) The Golden Age of Statistical Graphics. Statistical Science, Vol. 23, No. 4, pp. 502–535.
4. Friendly, M. and Denis, D. (2005) The early origins and development of the scatterplot. Journal of the History of the Behavioral Sciences, Vol. 41(2), 103–130.
5. Friendly, M. (2004) The Past, Present and Future of Statistical Graphics (An Ideo-Graphic and Idiosyncratic View). http://www.math.yorku.ca/SCS/friendly.html
6. Eden, T. and Fisher, R.A. (1929) Experiments in the response of potato to potash and nitrogen. Studies in Crop Variation, Vol. XIX, pp. 201–213.
7. Fisher, R.A. (1921) An examination of the yield of dressed grain. Studies in Crop Variation, Vol. XI, pp. 107–135.
8. Fisher, R.A. (1934) The Contributions of Rothamsted to the Development of Statistics. Rothamsted Experimental Station Report for 1933, pp. 43–50.
9. Galton, F. (1869) Hereditary Genius: An Inquiry into its Laws and Consequences. London: Macmillan.
10. Galton, F. (1877) Typical Laws of Heredity. Proceedings of the Royal Institution of Great Britain, 8, pp. 282–301.
11. Galton, F. (1886) Regression Towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
12. Galton, F. (1889) Natural Inheritance. London: Macmillan.
13. Galton, F. (1901) Biometry. Biometrika, 1, 7–10.
14. Galton, F. (1908) Memories of My Life (2nd ed.). London: Methuen.
15. Gunst, R. (2000) Classical Studies That Revolutionized the Practice of Regression Analysis. Technometrics, February 2000, 42, 1, 62–64.
16. Wickham, H., Cook, D., Hofmann, H., and Buja, A. (2010) Graphical Inference for Infovis.
17. Hanley, J. (2004) “Transmuting” Women into Men: Galton’s Family Data on Human Stature. The American Statistician, Vol. 58, No. 3.
18. Hanley, J. A. (2004) Digital photographs of data in Galton’s notebooks, and related material, available online at http://www.epi.mcgill.ca/hanley/galton.
19. Hanley, J. and Turner, E. (2010) Age in medieval plagues and pandemics: Dances of Death or Pearson’s bridge of life? Significance, June 2010, 85–87.
20. Hanley, J., Julien, M., and Moodie, E.E.M. (2008) Student’s z, t, and s: What if Gosset had R? The American Statistician, February 2008, Vol. 62, No. 1.
21. Jacques, J.A. and Jacques, G.M. (2002) Fisher’s randomization test and Darwin’s data – A footnote to the history of statistics. Mathematical Biosciences, Vol. 180, 23–28.
22. Jaggard, K.W., Qi, A. and Ober, E.S. Possible changes to arable crop yields by 2050. Phil. Trans. R. Soc. B, 365, 2835–2851.
23. Nievergelt, Y. (2000) A tutorial history of least squares with applications to astronomy and geodesy. Journal of Computational and Applied Mathematics, 121, 37–72.
24. Pagano, M. and Anoke, S. (2013) Mommy's Baby, Daddy's Maybe: A Closer Look at Regression to the Mean. CHANCE, 26:3, 4–9.
25. Pearson, K. (1896) Mathematical Contributions to the Theory of Evolution. III Regression, Heredity and Panmixia. Philosophical Transactions of the Royal Society of London, Series A, 187, 253–318.
26. Pearson, K. (1930) The Life, Letters and Labours of Francis Galton, Vol. III: Correlation, Personal Identification and Eugenics. Cambridge University Press.
27. Pearson, K. (1930) The Life, Letters and Labours of Francis Galton (Vol. IIIA). London: Cambridge University Press.
28. Pearson, K. and Lee, A. (1903) On the Laws of Inheritance in Man: I. Inheritance of Physical Characters. Biometrika, 2, 357–462.
29. Stigler, S. M. (1986) The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press.
30. Stigler, S. (1986) The English Breakthrough: Galton. In The History of Statistics: The Measurement of Uncertainty before 1900, Cambridge, MA: The Belknap Press of Harvard University Press, chap. 8.
31. Stigler, S. M. (1999) Statistics on the Table: The History of Statistical Concepts and Methods. Harvard University Press.
32. Sung Sug Yoon, R.N., Vicki Burt, R.N., Tatiana, L., Carroll, M.D. (2012) Hypertension Among Adults in the United States, 2009–2010. NCHS Data Brief, No. 107, October 2012.
33. Tredoux, G. (2004) Web site http://www.galton.org.
34. Wachsmuth, A. and Wilkinson, L. (2003) Galton’s Bend: An Undiscovered Nonlinearity in Galton’s Family Stature Regression Data and a Likely Explanation Based on Pearson and Lee’s Stature Data. Publication details unknown.
35. Wachsmuth, A., Wilkinson, L., and Dallal, G. E. (2003) Galton’s Bend: A Previously Undiscovered Nonlinearity in Galton’s Family Stature Regression Data. The American Statistician, 57, 190–192.
36. Wright, K. (2013) Revisiting Immer’s Barley Data. The American Statistician, 67:3, 129–133.
37. A review of basic statistical concepts. Author unknown.
38. Diabetes in the UK 2012.
39. Lewi, P. J. Speaking of Graphics. http://www.datascope.be/sog.htm
40. Aldor-Noiman, S., Brown, L. D., Buja, A., Rolke, W., Stine, R.A. (2013) The Power to See: A New Graphical Test of Normality. The American Statistician, 67 (4), 249–260.
41. Berk, R., Brown, L., Buja, A., Zhang, K., Zhao, L. (2013) Valid Post-Selection Inference. The Annals of Statistics, 41 (2), 802–837.
42. Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.-K., Swayne, D.F., and Wickham, H. (2009) Statistical Inference for Exploratory Data Analysis and Model Diagnostics. Philosophical Transactions of the Royal Society A, 367, 4361–4383.
43. Wickham, H., Lawrence, M., Cook, D., Buja, A., Hofmann, H., and Swayne, D.F. (2008) The Plumbing of Interactive Graphics. Computational Statistics (April 2008).
44. Gous, A. and Buja, A. (2004) Visual Comparison of Datasets Using Mixture Distributions. Journal of Computational and Graphical Statistics, 13 (1), 1–19.
45. Swayne, D.F., Buja, A., and Temple-Lang, D. (2003) Exploratory Visual Analysis of Graphs in GGobi. Refereed proceedings of the Third Annual Workshop on Distributed Statistical Computing (DSC 2003), Vienna.
46. Buja, A., Lang, D.T., and Swayne, D.F. (2003) GGobi: Evolving from XGobi into an Extensible Framework for Interactive Data Visualization. Journal of Computational Statistics and Data Analysis, 43 (4), 423–444.
47. Murrell, P. (2011) R graphics (2nd ed.). London, United Kingdom: Chapman & Hall.
48. Pashler, H. and Wagenmakers, E–J. (Eds.) (2012) Editors’ Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence? Perspectives on Psychological Science, 7(6): 528–530.
49. R Development Core Team (2015) R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org
50. Tay, L., Parrigon, S., Huang, Q., and LeBreton, J.M. (2016) Graphical Descriptives: A Way to Improve Data Transparency and Methodological Rigor in Psychology. Perspectives on Psychological
51. Swayne, D.F., Cook, D., and Buja, A. (1998) XGobi: Interactive Dynamic Data Visualization in the X Window System. Journal of Computational and Graphical Statistics, 7, 113–130.
52. https://rbertolusso.github.io/posts/LR03-residuals-RMSE#code-to-load-data-and-initial-calculations (Pearson Father – Son data – beautiful discussion)
Worksheet 1: Descriptive statistics
Question 1
Plot the data 2, 3, 5, 8, 12 on a hand drawn dot diagram below
Compute the mean (mean = sum/no of data points)
Compute the variation (sum of squares of data about their mean)
Compute the variance (variation/df)
Compute the sd (square root of variance)
Answers: sum=30; mean=30/5=6; variation=66; variance=variation/df=66/4=16.5; sd=sqrt(16.5)=4.06
Your answers:
Question 2
The number of people in the lunch queue at noon at the Mathematical Institute on 20 working days during January was:
15, 8, 10, 0, 17, 12, 18, 8, 13, 14, 17, 0, 10, 12, 3, 6, 0, 2, 6, 6
order statistics: 0, 0, 0, 2, 3, 6, 6, 6, 8, 8, 10, 10, 12, 12, 13, 14, 15, 17, 17, 18, plotted on the diagram below
Compute the three quartiles
Create a hand drawn Box-and-whisker plot above the dotplot
Visually estimate the mean and sd
Answers:
LQ = any number between 3 and 6. Some would use 4.5
Median = any number between 8 and 10. Some would use 9
UQ = any number between 13 and 14. Some would use 13.5
Mean and sd: about 9 and 5; actual values 8.85 and 5.91
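The answers allow a range for each quartile because there are several conventions for computing quartiles. The sketch below applies two of the conventions built into base R's `quantile` to the lunch queue data; the choice of `type = 2` and `type = 7` is mine.

```r
# Lunch queue data from Question 2
queue <- c(15, 8, 10, 0, 17, 12, 18, 8, 13, 14,
           17, 0, 10, 12, 3, 6, 0, 2, 6, 6)

sort(queue)                                # the order statistics

probs <- c(0.25, 0.50, 0.75)

# Convention that averages the two neighbouring order statistics
q_averaged <- quantile(queue, probs, type = 2)

# R's default convention (type 7), which interpolates between neighbours
q_default <- quantile(queue, probs)

print(q_averaged)
print(q_default)

cat("mean =", mean(queue), " sd =", round(sd(queue), 2), "\n")
```

The averaging convention gives 4.5, 9 and 13.5, matching the values "some would use"; R's default gives slightly different quartiles that still fall inside the ranges quoted in the answers.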
Your answers:
[Figure: Dotplot of data; x-axis from 0 to 20]
Question 3
The following scatterplot shows wt (the weight in lbs of an individual) plotted against ht (height in inches of an individual) for the button data (to come later). Estimate visually the means and sd’s for the conditional distributions of weights at heights of 65 and 72 inches respectively.
Your answers:
Actual values: at 65 mean = 139.6, sd=17.1; at 72 mean=160.3, sd=15.8
R code for dotplot
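# Lunch queue data, sorted from smallest to largest
x <- c(0, 0, 0, 2, 3, 6, 6, 6, 8, 8, 10, 10, 12, 12, 13, 14, 15, 17, 17, 18)
# Stacking height of each point within its column of tied values
y <- c(1, 2, 3, 1, 1, 1, 2, 3, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1)
ticks <- 0:20
# Suppress the default x-axis so that a tick can be drawn at every integer
plot(x, y, pch = 19, cex = 1.5, xlab = "Dotplot of data", xlim = c(0, 20), ylim = c(0, 10),
     xaxt = "n", cex.lab = 2)
axis(1, at = ticks)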
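R code for plot of weight vs height
# Read the button data; the code below uses its columns ht and wt
bd <- read.table("E:/buttondata.txt", header = TRUE)
plot(bd$ht, bd$wt, xlab = "height", ylab = "weight", cex.lab = 1.5,
     main = "Plot of weight vs height for buttons data")
# Mark the two heights whose conditional distributions are of interest
abline(v = c(65, 72), lty = 2)
# Weights of the individuals who are exactly 65 or 72 inches tall
wt.65 <- bd$wt[bd$ht == 65]
wt.72 <- bd$wt[bd$ht == 72]
# Conditional means and standard deviations
mean.65 <- mean(wt.65); mean.72 <- mean(wt.72)
sd.65 <- sd(wt.65); sd.72 <- sd(wt.72)
mean.65; mean.72; sd.65; sd.72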
Worksheet 2: Probability
Question 1:
The standard statistical model is: what you observe = truth + error/noise
What contributes to the noise in the reduced Galton data on midparent height and child height?
Your answers:
Question 2:
What meanings of probability are invoked in the following statements?
a) Smokers are 23 times more likely to get lung cancer than are non-smokers. How would one compute this? How might this statement be rephrased?
b) Insurance costs are usually based on risk. Women get a 40% discount on motor insurance in South Africa. Does this mean that women are better drivers? Discuss.
c) You bought four tickets in a lottery. What are your chances of winning?
Your answers:
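One way to approach "How would one compute this?" in Question 2a is as a ratio of two relative frequencies. The sketch below uses HYPOTHETICAL counts, invented purely so that the arithmetic yields a ratio of 23; they are not taken from any study. The lottery calculation likewise assumes a single prize and a made-up number of tickets sold.

```r
# HYPOTHETICAL cohort counts, invented only to illustrate the arithmetic
smokers_cases    <- 46;  smokers_total    <- 1000
nonsmokers_cases <- 2;   nonsmokers_total <- 1000

# Probability as relative frequency: proportion with lung cancer in each group
risk_smokers    <- smokers_cases / smokers_total
risk_nonsmokers <- nonsmokers_cases / nonsmokers_total

# "23 times more likely" is the ratio of the two risks (the relative risk)
relative_risk <- risk_smokers / risk_nonsmokers
relative_risk

# Question 2c under an ASSUMED lottery: one prize, every ticket equally likely
tickets_bought <- 4
tickets_sold   <- 1000   # hypothetical; the worksheet does not give this number
p_win <- tickets_bought / tickets_sold
p_win
```

The lottery answer depends entirely on how many tickets are sold and how many prizes there are, which is part of what the question invites you to discuss.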
Worksheet 3: Sampling and confidence intervals
Do this during the lunch break; work in pairs.
1. Take a random sample of 20 buttons. Record the ht and wt data in Excel. Don't record BMI, bp and db.
2. Use Excel to compute the means and sd's. Compute the quantiles by sorting the data from smallest to largest.
3. Compute 95% confidence intervals for each of the parameters measured.
case ht wt BMI bp db
1
2
3
•
•
•
20
mean
sd
min
LQ
MED
UQ
max
LCL
UCL
LCL = Lower Confidence Limit
UCL = Upper Confidence Limit
A 95% CI for the mean (ht or wt or BMI) of the population is:
sample average ± 1.96 × standard deviation / √(sample size)
A 95% CI for the prevalence (% diabetic or % hypertensive) is:
sample prevalence ± 1.96 × √(prevalence × (1 − prevalence) / sample size)
Use the following functions in Excel: AVERAGE, STDEV.S, MIN, MAX.
To compute the quantiles, order the data from smallest to largest.
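The two confidence interval formulas can be written as small R functions. This is a sketch under stated assumptions: the function names ci_mean and ci_prevalence are made up for illustration, the example sample is the lunch queue data from Worksheet 1 (used only because it has 20 values), and the prevalence example of 5 cases in 20 is hypothetical.

```r
# 95% CI for a population mean: average +/- 1.96 * sd / sqrt(n)
ci_mean <- function(x, z = 1.96) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)            # standard error of the sample average
  c(LCL = mean(x) - z * se, UCL = mean(x) + z * se)
}

# 95% CI for a prevalence: p +/- 1.96 * sqrt(p * (1 - p) / n)
ci_prevalence <- function(cases, n, z = 1.96) {
  p  <- cases / n                  # sample prevalence
  se <- sqrt(p * (1 - p) / n)      # standard error of the sample prevalence
  c(LCL = p - z * se, UCL = p + z * se)
}

# Example with 20 observations (lunch queue data from Worksheet 1)
queue <- c(15, 8, 10, 0, 17, 12, 18, 8, 13, 14, 17, 0, 10, 12, 3, 6, 0, 2, 6, 6)
ci_mean(queue)

# HYPOTHETICAL example: 5 of 20 sampled individuals have the condition
ci_prevalence(cases = 5, n = 20)
```

With only 20 observations, some would replace the multiplier 1.96 by the t quantile qt(0.975, df = 19), which gives a slightly wider interval for the mean.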
Worksheet 4: Regression
1 Assume the linear regression model holds for these data.
What are the assumptions of the linear regression model?
2 Locate visually the means of the conditional distributions at ht = 65 and ht = 72. Fit a linear regression line through these data by hand.
Actual values: at 65, mean = 139.6, sd = 17.1; at 72, mean = 160.3, sd = 15.8
3 Estimate the slope of your fitted line. NB: slope = rise/run.
Estimated from the two conditional distributions: slope = (160.3 - 139.6)/(72 - 65) = 20.7/7 = 2.96
Estimated from fitting the linear regression model to all the data: slope = 3.07
4 Estimate visually the standard deviation of the data about the regression line.
Estimated from fitting the linear regression model to the full data: residual standard error = 15.36
5 Describe the conditional distributions of weights at heights of 65 and 72 inches.
6 Can one predict weight from height? What can one predict? What are the uncertainties?
7 Can you describe or interpret what the regression model is telling us?
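R code for plot of weight vs height
# Read the button data; the code below uses its columns ht and wt
bd <- read.table("E:/buttondata.txt", header = TRUE)
plot(bd$ht, bd$wt, xlab = "height", ylab = "weight", cex.lab = 1.5,
     main = "Plot of weight vs height for buttons data")
# Mark the two heights whose conditional distributions are of interest
abline(v = c(65, 72), lty = 2)
# Weights of the individuals who are exactly 65 or 72 inches tall
wt.65 <- bd$wt[bd$ht == 65]
wt.72 <- bd$wt[bd$ht == 72]
# Conditional means and standard deviations
mean.65 <- mean(wt.65); mean.72 <- mean(wt.72)
sd.65 <- sd(wt.65); sd.72 <- sd(wt.72)
mean.65; mean.72; sd.65; sd.72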
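The "rise over run" slope from the two conditional means in Worksheet 4 can be reproduced with a few lines of arithmetic. The sketch below uses only the values quoted on the worksheet; the helper predict_wt and the example height of 68 inches are introduced here for illustration. The prediction range is deliberately rough: it assumes normally distributed errors and ignores the uncertainty in the fitted line itself.

```r
# Conditional means quoted on the worksheet
ht      <- c(65, 72)         # heights in inches
mean_wt <- c(139.6, 160.3)   # mean weights in lbs at those heights

# Slope = rise / run between the two conditional means
slope     <- diff(mean_wt) / diff(ht)
intercept <- mean_wt[1] - slope * ht[1]
c(slope = slope, intercept = intercept)

# Hypothetical helper: the line through the two conditional means
predict_wt <- function(h) intercept + slope * h

# Predicted MEAN weight at an illustrative height of 68 inches
predict_wt(68)

# Rough 95% range for an INDIVIDUAL at that height, using the residual
# standard error of 15.36 quoted on the worksheet
rough_range <- predict_wt(68) + c(-1.96, 1.96) * 15.36
rough_range
```

The width of the rough range relative to the change in the mean is a useful prompt for questions 6 and 7: the line predicts the mean of the conditional distribution well, but an individual's weight remains quite uncertain.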
Worksheet 5: Hypothesis formulation and testing
Question 1: Discuss the possible Type I and Type II errors in the following settings. One might also want to consider the problem from the perspective of different individuals or groups.
1 A new drug for treating HIV/AIDS is proposed and a clinical trial is designed to compare it with the existing drug.
Possible individuals or groups: The researchers who propose the drug. The person with HIV/AIDS and their family. The drug company that produces drugs for treating HIV/AIDS.
Research hypothesis:
Null hypothesis:
Type I Error:
Consequences of a Type I Error:
Type II Error:
Consequences of a Type II Error:
2 A teenager gets caught up in gang violence and is given a life sentence, at age 18, on conviction for the second-degree murder of an opposing gang member.
Possible individuals or groups: The teenager who is sentenced. Her family. The family of the person who was killed. The legal or justice community who carry the burden of justice. The community at large who expect justice from the legal system.
Research hypothesis:
Null hypothesis:
Type I Error:
Consequences of a Type I Error:
Type II Error:
Consequences of a Type II Error:
                Truth (always unknown)
Our decision    Innocent                               Guilty
Acquit          Correct decision                       Type II error; β = Pr(Type II error)
Convict         Type I error; α = Pr(Type I error)     Correct decision; 1 − β = Power = Pr(Convicting | Guilty)
                Truth (always unknown)
Our decision    H0 true                                HA true
Accept H0       Correct decision                       Type II error; β = Pr(Type II error)
Accept HA       Type I error; α = Pr(Type I error)     Correct decision; 1 − β = Power = Pr(Rejecting H0 | H0 false)
3 A student is accused of plagiarism, her dissertation is rejected and she is excluded from the university because of it.
Possible individuals or groups: The student. The university. The community.
Research hypothesis:
Null hypothesis:
Type I Error:
Consequences of a Type I Error:
Type II Error:
Consequences of a Type II Error:
4 For the starting salary data, the null hypothesis is that the average salaries paid to men and women are equal, and that the observed differences were due to other circumstantial factors.
Possible groups: Males. Females. The bank. Community.
Research hypothesis:
Null hypothesis:
Type I Error:
Consequences of a Type I Error:
Type II Error:
Consequences of a Type II Error:
Question 2: In a Galton regression setting, where one is comparing the conditional distributions of child height at given values of midparent height, what is the null hypothesis of interest for this research problem? What would be considered sufficient or convincing evidence for rejecting the null hypothesis?
Hint: Whatever is computed from a random sample is itself a random variable, whose sampling distribution we can compute.
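The Type I error probability α and the power 1 − β in the decision tables can be explored by simulation. The sketch below is an illustration only: the two-sample t-test, the group size of 20, the true difference of one standard deviation, and the function name reject_rate are all assumptions chosen here, not part of the worksheet.

```r
set.seed(1)  # make the simulation reproducible

# Proportion of simulated experiments in which H0 (no difference in means)
# is rejected at level alpha, for a given TRUE difference between the groups
reject_rate <- function(true_diff, n = 20, sigma = 1, alpha = 0.05, reps = 2000) {
  rejections <- replicate(reps, {
    a <- rnorm(n, mean = 0, sd = sigma)          # group A
    b <- rnorm(n, mean = true_diff, sd = sigma)  # group B
    t.test(a, b)$p.value < alpha                 # TRUE if H0 is rejected
  })
  mean(rejections)
}

# H0 true: every rejection is a Type I error, so this estimates alpha
type1_rate <- reject_rate(true_diff = 0)

# HA true (assumed difference of 1 sd): rejections are correct decisions,
# so this estimates the power; the remainder are Type II errors
power_est  <- reject_rate(true_diff = 1)
type2_rate <- 1 - power_est

c(type1 = type1_rate, power = power_est, type2 = type2_rate)
```

Re-running with a smaller true difference or a smaller sample shows the power falling while the Type I error rate stays near the chosen α, which is the trade-off the tables describe.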
Worksheet 6: ANOVA
Question 1: For the Fisher data there are a number of hypotheses one might wish to consider about the effect of potash, nitrogen and the interaction of potash and nitrogen.
1. Formulate conceptually and in words the mathematical model for these data.
2. What would be sensible null hypotheses for this experiment?
3. How convincing is the evidence against the null hypotheses judging from the graphs? Does one in fact need a statistical test against the null hypotheses? Or is the evidence against the null hypotheses simply overwhelming?
ANOVA for Potato Data including an interaction term
Source            Df   Sum Sq   Mean Sq   F value   p-value
nitrogen           3   209646     69882    32.299   2.72e-11   (main effect of nitrogen)
potash             3    32926     10975     5.073   0.00413    (main effect of potash)
nitrogen:potash    9    18925      2103     0.972   0.47556    (interaction term)
Residuals         45    97361      2164                        (noise term)
TOTAL             60   358858
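p-value = the probability, computed assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. Equivalently, it is the smallest Type I error probability at which the null hypothesis would be rejected. Small p-values provide strong evidence against the null hypothesis.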
Interpret the ANOVA for these data keeping in mind the assumptions of the model and the possible effect on the computed statistics of the model assumptions being violated.
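The Mean Sq, F value and p-value columns of the potato ANOVA table follow from the Df and Sum Sq columns alone. The sketch below recomputes them from the numbers printed in the table; the data frame and column names are chosen here for illustration, and the p-values rely on the F distribution and therefore on the model assumptions holding.

```r
# Df and Sum Sq as printed in the ANOVA table for the potato data
anova_tab <- data.frame(
  source = c("nitrogen", "potash", "nitrogen:potash"),
  df     = c(3, 3, 9),
  sum_sq = c(209646, 32926, 18925)
)
resid_df <- 45
resid_ss <- 97361

# Mean square = sum of squares / df
resid_ms          <- resid_ss / resid_df
anova_tab$mean_sq <- anova_tab$sum_sq / anova_tab$df

# F = mean square of the effect / residual mean square
anova_tab$F_value <- anova_tab$mean_sq / resid_ms

# p-value = upper tail area of the F distribution with (df, resid_df) df
anova_tab$p_value <- pf(anova_tab$F_value, anova_tab$df, resid_df,
                        lower.tail = FALSE)
anova_tab

# The df and sums of squares add up to the TOTAL row
c(total_df = sum(anova_tab$df) + resid_df,
  total_ss = sum(anova_tab$sum_sq) + resid_ss)
```

Seeing that each F value is simply a ratio of an effect's variance to the noise variance may help when interpreting which effects stand out from the noise.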
Question 2: Practical ANOVA computation for fictitious data to compare Apples, Pears and Oranges.
Construct the ANOVA for the data in the table below, chosen for ease of mental arithmetic.
ANOVA model:
Total variation = variation due to model + variation about model
Total df = df for model + df due to noise
Strategy: We'll compute the total variation and the variation about the model, and then obtain the variation explained by the model by subtraction. The df computations are done similarly. Of course, we could compute the variation and df due to the model directly.
Compute the following:
sum for each group
mean for each group
variation for each group = sum of squared deviations of group data points about the group mean
variation about model = sum of the variations for each group about their group means
df for each group = number of observations in group - 1
sum of all data
average of all data
total variation for all data = sum of squared deviations of data points about the overall mean
total df = total number of data points - 1
                             Apples   Pears   Oranges
                             1        2       3
                             2        3       4
                             3        4       5
                             4        5       6
                             5        6       7
sum for each group
mean for each group
variation about group mean
df for each group
ANOVA
Source        Variation (also called Sum of Squares, SS)   df   Variance (also called Mean Square, MS = SS/df)   F (variance ratio)
model
error/noise
Total
Your rough work
Remark: Clearly the assumption of normality does not apply for these data, because the observations in the groups are uniformly distributed; yet one can still apply the ANOVA methodology. Thus the distribution of the F-test, based on normality of the data, will not be correct. Hence our inferences will be wobbly. We should therefore always check the model assumptions, to evaluate the suitability of the F-test. A good question would be how to check the model assumptions. That would need an applied statistics course.
Source   SS   df   MS    F
model    10    2   5     2
noise    30   12   2.5
Total    40   14
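The subtraction strategy for the fruit ANOVA can be followed step by step in R. This sketch assumes the groups are Apples = 1 to 5, Pears = 2 to 6 and Oranges = 3 to 7, which is the reading of the data table that reproduces the answer table; the variable names are chosen here for illustration.

```r
# Fictitious data, ASSUMING five observations per group as described above
fruit <- data.frame(
  value = c(1:5, 2:6, 3:7),
  group = factor(rep(c("Apples", "Pears", "Oranges"), each = 5))
)

# Variation of each group about its own group mean
group_variation <- tapply(fruit$value, fruit$group,
                          function(v) sum((v - mean(v))^2))

# Variation about the model (noise) and its df
ss_noise <- sum(group_variation)
df_noise <- nrow(fruit) - nlevels(fruit$group)

# Total variation about the overall mean and its df
ss_total <- sum((fruit$value - mean(fruit$value))^2)
df_total <- nrow(fruit) - 1

# Variation explained by the model, obtained by subtraction
ss_model <- ss_total - ss_noise
df_model <- df_total - df_noise

# F = variance ratio = (SS_model / df_model) / (SS_noise / df_noise)
F_ratio <- (ss_model / df_model) / (ss_noise / df_noise)
c(ss_model = ss_model, ss_noise = ss_noise, ss_total = ss_total, F = F_ratio)

# Cross-check with R's built-in one-way ANOVA
summary(aov(value ~ group, data = fruit))
```

The built-in aov() also reports a p-value, but as the remark notes, that p-value assumes normal data, so it should be read with caution here.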