
Statistics:

Concepts of statistics for researchers

How to Use This Course Book

This course book accompanies the face-to-face session taught at IT Services. It contains a copy of the slideshow and the worksheets.

Software Used

We might use Excel to capture your data, but no other software is required. Since this is a Concepts course, we will concentrate on exploring ideas and underlying concepts that researchers will find helpful in undertaking data collection and interpretation.

Revision Information

Version  Date           Author       Changes made
1.0      January 2014   John Fresen  Course book version 1
2.0      October 2014   John Fresen  Updates to slides
3.0      January 2015   John Fresen  Updates to slides
4.0      February 2015  John Fresen  Updates to slides and worksheets
…
…
14.0     December 2016  John Fresen  Updates to slides and worksheets
15.0     February 2017  John Fresen  Updates to slides and worksheets
16.0     May 2017       John Fresen  Updates to slides and worksheets
17.0     October 2017   John Fresen  Updates to slides and worksheets

Copyright

The copyright of this document lies with Oxford University IT Services.

Contents

1 Introduction

1.1. What You Should Already Know

1.2. What You Will Learn

2 Your Resources for These Exercises

2.1. Help and Support Resources

3 What Next?

3.1. Statistics Courses

3.2. IT Services Help Centre

Statistics Concepts

1 Introduction

Welcome to the course Statistics: Concepts.

This is a statistical concepts course, an ideas course, a think-in-pictures course. What are the basic notions and constructs of statistics? Why do we differentiate between a population and a sample? How do we summarize and describe sample information? Why, and how, do we compare data with expectations? How do hypotheses arise and how do we set about testing them? With inherent uncertainty in any sample, how can one extrapolate from a sample to the population? And then, how strong are our conclusions?

This course is designed to prepare you to get the most from the statistical applications that we teach. It involves discussion of real-life examples and interpretation of data. We strive to avoid mathematical symbols, notation and formulae.

1.1. What You Should Already Know

We assume that you are familiar with entering and editing text, rearranging and formatting text (drag and drop, copy and paste), printing and previewing, and managing files and folders.

The computer network in IT Services may differ slightly from that which you are used to in your College or Department; if you are confused by the differences, ask for help from the teacher.

1.2. What You Will Learn

In this course we will cover the following topics:

• Descriptive statistics and graphics

• Population and sample

• Probability and probability distributions

• Comparing conditional distributions

• Confidence intervals

• Linear regressions

• Hypothesis testing

• From problem – to data – to conclusions

Where to get help….

Topics covered in related Statistics courses, should you be interested, are given in Section 3.1.

1 IT Learning Centre


2 Your Resources for These Exercises

The exercises in this handbook will introduce you to some of the tasks you will need to carry out when working with WebLearn. Some sample files and documents are provided for you; if you are on a course held at IT Services, they will be on your network drive H:\ (find it under My Computer).

During a taught course at IT Services, there may not be time to complete all the exercises. You will need to be selective, and choose your own priorities among the variety of activities offered here. However, those exercises marked with a star * should not be skipped.

Please complete the remaining exercises later in your own time, or book for a Computer8 session at IT Services for classroom assistance (See section 8.2).

2.1. Help and Support Resources

You can find support information for the exercises on this course and your future use of WebLearn, as follows:

• WebLearn Guidance https://weblearn.ox.ac.uk/info (This should be your first port of call)

If at any time you are not clear about any aspect of this course, please make sure you ask John for help. If you are away from the class, you can get help and advice by emailing the central address [email protected].

The website for this course including reading material and other material can be found at https://weblearn.ox.ac.uk/x/Mvkigl

You are welcome to contact John about statistical issues and questions at [email protected]




3 What Next?

3.1. Statistics Courses

Now that you have a grasp of some basic concepts in Statistics, you may want to develop your skills further. IT Services offers further Statistics courses and details are available at http://courses.it.ox.ac.uk.

In particular, you might like to attend the course

Statistics: Introduction: this is a four-session module which covers the basics of statistics and aims to provide a platform for learning more advanced tools and techniques.

Courses on particular discipline areas or data analysis packages include:

R: An introduction

R: Multiple Regression using R

Statistics: Designing clinical research and biostatistics

SPSS: An introduction

SPSS: An introduction to using syntax

STATA: An introduction to data access and management

STATA: Data manipulation and analysis

STATA: Statistical, survey and graphical analyses

3.2. IT Services Help Centre

The IT Services Help Centre at 13 Banbury Road is open by appointment during working hours, and on a drop-in basis from 6:00 pm to 8:30 pm, Monday to Friday.

The Help Centre is also a good place to get advice about any aspect of using computer software or hardware. You can contact the Help Centre on (2)73200 or by email on [email protected]



Statistics Concepts

October 2017

Thanks to:

Dave Baker, IT Services, University of Oxford
Jill Fresen, IT Services, University of Oxford
Jim Hanley, McGill University, Montreal, Quebec, Canada
Michael Friendly, York University, Toronto, Ontario, Canada
Margaret Glendining, Rothamsted Experimental Station
Ian Sinclair, REES Group, Oxford

[email protected]@gmail.com

2

Session 1: Setting the scene

We are drowning in information but starving for knowledge

– Rutherford D. Rogers

Research question – particular problem

- collect data - draw conclusions

5

Statistical models:
observe = truth + error
observe = model + error
observe = signal + noise

Fundamental assumption of statistics: noise/error is ubiquitous

Sir Francis Galton (16 February 1822 – 17 January 1911)
http://en.wikipedia.org/wiki/Francis_Galton

General: What do we inherit from our ancestors?

Particular: Do tall parents have tall children and short parents, short children?

Data: Famous 1885 study: 205 sets of parents, 928 offspring

Peas: pea pods

Photo: first 12 families listed in Galton’s notebook.

Sir Ronald Fisher - The grandfather of statistics (17 February 1890 – 29 July 1962)

http://en.wikipedia.org/wiki/Ronald_Fisher

We’ll use his potato data

8

5

T. Eden and R. A. Fisher (1929) Studies in Crop Variation. VI. Experiments on the Response of the Potato to Potash and Nitrogen. J. Agricultural Science 19, 201–213.

9

H. V. Roberts (1979)

10

6

Data sets summary:

Galton: Do tall parents have tall children?

Do big peas produce big peas?

Fisher: Response of the Potatoes to Potash and Nitrogen

Roberts: Do women earn less than men?

11

Session 2: Descriptive statistics

7

Speaking of Graphics by Paul J. Lewi

http://www.datascope.be/sog.htm

The Visual Display of Quantitative Information by Edward Tufte

The Golden Age of Statistical Graphics by Michael Friendly in

Statistical Science, 2008, Vol 23, No 4, p502-535

The Grammar of Graphics by Leland Wilkinson

Michael Friendly’s graphics page: http://www.datavis.ca/

Strongly recommend:

14

Visualize

Model

Transform

8

15

observation/perception is interpretive
• describe your data
• tell the story of your data
• what is your data saying?

narration depends on many things
• extent of knowledge
• purpose of description

e.g. describe your research

16

9

Describe the source and location of data

• How was data obtained?
• Where is it stored?
• What processing has been done on the data?
• Who has access to data?

17

Numerical descriptors of a data set (usually most uninformative - difficult to interpret)

• Order statistics – smallest to biggest
• Mean/average
• Variance and standard deviation
• Quartiles, percentiles
• Prevalence of HIV/Aids
. . . many more

10

Graphical descriptors of a data set (a picture says a thousand words):

• Dot plot
• Box and whisker plot
• Histogram
• Pie chart
• Scatterplot
. . . many more

19

20

11

21

degrees of freedom (df) = (total) variation = variance = standard deviation =

12

23

Your notes:

24

13

25

26

Your notes:

14

27

28

Guess means and sd’s

[Figure: two panels, "Histogram of sons' heights (481 sons)" and "Histogram of daughters' heights (453 girls)"; x-axis: height (in), 60 to 80; y-axis: Frequency]

15

Probability density histograms

[Figure: two panels, "Prob Hist sons (481 sons)" and "Prob Hist daughters (453 girls)"; x-axis: height (in), 60 to 80; y-axis: Density]

30

Your notes:

16

31

17

33

34

Do worksheet 1

Preferably work in pairs or groups

18

Session 3: Probability and probability distributions

What is an experiment?

36

19

37

Classical probability: assumes equally likely outcomes

(games of chance)

toss a coin

roll a die

Empirical probability: empirical probability is a percentage

probability of smokers developing lung cancer (Richard Doll: 1950)

probability of a motor insurance claim

38

20

Subjective probability: can vary from person to person

probability of a business venture being successful

probability of a successful heart replacement

probability of Oxford winning boat race

39

Discrete probability distributions

40

21

41

Your notes:

Continuous probability distributions

42

[Figure: probability density histograms of height (in), 60 to 80, for Sons and Daughters; y-axis: Density]

22

43

Your notes:

44

[Figure: probability density of height (in), 60 to 80; y-axis: Density]

Probability of selecting a son between 65 and 72 inches area = 0.78
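The idea that a probability is an area under a density curve can be sketched in R. The sketch below assumes sons' heights are roughly normal; the mean and sd are illustrative values chosen for this example, not figures from the course data, so the result will not reproduce 0.78 exactly.

```r
# Hypothetical parameters for sons' heights (inches); NOT taken from the data.
mu    <- 69
sigma <- 2.6

# P(65 < height < 72) is the area under the normal density between 65 and 72.
p_between <- pnorm(72, mean = mu, sd = sigma) - pnorm(65, mean = mu, sd = sigma)

# The same area obtained by numerically integrating the density curve.
p_area <- integrate(dnorm, lower = 65, upper = 72, mean = mu, sd = sigma)$value

print(round(c(pnorm = p_between, integrate = p_area), 3))
```

Both routes give the same number, which is the point: the cumulative distribution function is just accumulated area.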



23

45

Your notes:

Normal or Gaussian distribution
Affectionately called the bell-curve

[Figure: normal density curve; x-axis: -4 to 4; y-axis: 0.0 to 0.4]

De Moivre (1733); Laplace (1783); Gauss (1809)

24

47

Do worksheet 2

Session 4: Population and Sample

25

Thanks to Dilbert Cartoons

fundamental notions of statistics:

Population

Variation in population

Sample

Describe variation by a probability distribution

26

51

fundamental strategy of statistics: compare observations with expectations

do men and women earn similar salaries?

is yield under fertilizer A same as yield under fertilizer B?

is the generic alternative as good as the brand name drug?

do ART children compare with normal children?

52

fundamental method of statistics:

compare conditional distributions

e.g.

compare salary conditional on gender

compare yield conditional on fertilizer

27

53

Population and sample

54

Your notes:

28

Population

Sample

Pros and cons of these constructs?

Anything calculated from a sample is called a statistic

e.g.

average, maximum, range, proportion having HIV/Aids

or a combination of these

Sample

29

57

Your notes:

58

Statistical Inference

sample → (extrapolate) → population

Going from particular to general – inductive inference Controversial issue

Hume (1777) Stanford Encyclopedia of Philosophy Wikipedia, many others

30

59

Statistical Inference
One can describe the statistical aspects of any sample, but can only reliably extrapolate from a random sample

Why a random sample?

What is wrong with a random sample?

How do we obtain a random sample?

60

Your notes:

31

Population

61

Each time we take another random sample

different answer

sampling distribution

standard deviation of sampling distribution is called the standard error

Nearly all statistical theory assumes a random sample

If the sample is not random

- we can’t rely on statistical theory
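A sampling distribution can be simulated directly. The sketch below uses a made-up population of 600 heights as a stand-in for the buttons; the population values, sample size and number of repeats are all assumptions for illustration.

```r
set.seed(1)
# Hypothetical population of 600 heights (inches); stands in for the buttons.
N <- 600
population <- rnorm(N, mean = 67, sd = 3)

# Draw many random samples of size 20 (without replacement) and record
# the average of each one: together they form the sampling distribution.
n <- 20
sample_means <- replicate(5000, mean(sample(population, n)))

# The standard deviation of the sampling distribution is the standard error.
se_simulated <- sd(sample_means)

# Textbook value sd/sqrt(n), with a finite-population correction because
# we sample without replacement from only 600 values.
se_theory <- sd(population) / sqrt(n) * sqrt(1 - n / N)

print(round(c(simulated = se_simulated, theory = se_theory), 3))
```

Each repeat gives a different average, and the simulated spread of those averages should sit close to the textbook standard error.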

62

32

63

Your notes:

sample average converges to population average

What happens as sample gets large?

distrib of sample average converges to normal distrib

64

Two central pillars of statistics: LLN and CLT
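Both pillars can be seen in a small simulation. The sketch below assumes a deliberately skewed population (exponential with mean 1), chosen only for illustration, to show that the averages still behave as the LLN and CLT describe.

```r
set.seed(2)
# Hypothetical skewed population: exponential with mean 1 (clearly not normal).
draw_means <- function(n, reps = 4000) replicate(reps, mean(rexp(n, rate = 1)))

# LLN: sample averages cluster more tightly around the population mean (1)
# as the sample size grows.
spread <- sapply(c(5, 50, 500), function(n) sd(draw_means(n)))
print(round(spread, 3))

# CLT: for large n the averages are close to normally distributed, so about
# 95% of them fall within 1.96 standard errors of the population mean.
means_500 <- draw_means(500)
coverage  <- mean(abs(means_500 - 1) < 1.96 / sqrt(500))
print(round(coverage, 3))
```

The shrinking spread illustrates the law of large numbers; coverage near 0.95 illustrates the central limit theorem.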

33

65

Same applies to most other statistics:

sample quantity converges to population quantity

sampling distrib converges to the normal distribution

66

Your notes:

34

Example of a sampling distribution
At lunch time we’ll work in pairs.
Each pair takes a random sample of size 20 from the population of 600 buttons.

Compute five statistics:
• average ht
• average wt
• average BMI
• proportion having hypertension (bp)
• proportion having diabetes (db)

68

for comparison:

Number of random samples of size 20 from 600 buttons: approx $1 \times 10^{37}$

of size 30 from 600 buttons: approx $4 \times 10^{50}$

approx $10^{24}$ stars in observable universe
approx $10^{80}$ atoms in observable universe
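These counts are binomial coefficients: the number of ways of choosing 20 (or 30) buttons out of 600 when order does not matter. A quick check in R, using only the built-in `choose()` function:

```r
# Number of distinct samples (ignoring order) of size 20 and 30 from 600 items.
n_samples_20 <- choose(600, 20)
n_samples_30 <- choose(600, 30)

print(signif(c(size_20 = n_samples_20, size_30 = n_samples_30), 2))
```

Even a modest population therefore allows far more possible samples than could ever be enumerated, which is why sampling distributions are studied by theory or simulation.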

https://www.space.com/26078-how-many-stars-are-there.html

35

69

My students at a Statistics conference in 2003

70

Your notes: What would make this group of students a population and what would make it a sample?

36

72

Your notes: How would one define the reference population for Moure open cast coal and how would one attempt to get a random sample from it?

37

73

estimate the density of stomata on undersides of loblolly pine needles

74

TABLE I: Number of stomata per centimetre on each of ten loblolly pine needles.

needle   counts
1        149  143  138  131
2        136  139  129  143
3        143  142  124  134
4        121  133  126  130
5        148  121  124  128
6        129  134  127  113
7        127  130  123  125
8        134  137  119  130
9        117  128  119  118
10       129  132  131  137

38

75

Loblolly pine is one of several pines native to the southeastern United States, from central Texas east to Florida, and north to Delaware and southern New Jersey

How would one define the reference population for loblolly pine trees?

How would you attempt to obtain a random sample from loblolly pine trees?

Closing Remarks
Our conclusions will be no stronger than the degree to which the constructs, assumptions and mathematical models correlate with the real world:

• Population (Can we define this?)
• Random sample (How do we achieve this?)
• Limited information (What variables or factors are important?)
• Simplified models (Linear regression, normal distribution)
• Probability of an error
• Quasi-Modus Tollens Argument (to come later)

39

77

Your notes:

Session 5: Confidence intervals

40

79

When random sampling from a population to obtain an estimate of a population parameter (mean, prevalence of HIV), the sample estimate is a random quantity.

By the CLT, the sampling distribution will be approximately normal, so a confidence interval is: estimate ± multiplier × standard error

90% CI: multiplier 1.64    95% CI: multiplier 1.96    99% CI: multiplier 2.58

(The more confidence you want, the wider the CI)

80

Your notes:

41

For our sampling exercise of the buttons data, a 95% CI for the

mean (ht or wt or BMI) of the population is: $\bar{x} \pm 1.96\, s/\sqrt{n}$

prevalence (% diabetic or % hypertensive) is: $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$
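A 95% confidence interval of this kind is the estimate plus or minus 1.96 standard errors. The sketch below works through both cases on a simulated sample of 20; the data are invented for illustration and are not the buttons data.

```r
set.seed(3)
# Hypothetical sample of 20 individuals: height in inches and a diabetes flag.
ht <- rnorm(20, mean = 67, sd = 3)
db <- rbinom(20, size = 1, prob = 0.1)
n  <- length(ht)

# 95% CI for a mean: x-bar +/- 1.96 * s / sqrt(n).
ci_mean <- mean(ht) + c(-1, 1) * 1.96 * sd(ht) / sqrt(n)

# 95% CI for a prevalence: p-hat +/- 1.96 * sqrt(p-hat * (1 - p-hat) / n).
p_hat   <- mean(db)
ci_prev <- p_hat + c(-1, 1) * 1.96 * sqrt(p_hat * (1 - p_hat) / n)

print(round(ci_mean, 2))
print(round(ci_prev, 3))
```

With only 20 observations the normal approximation is rough, especially for a prevalence near 0, where the lower limit can fall below zero.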

81

Session 6: Regression

42

83

Do tall parents have tall children, short parents short children?

[Figure: Frequency scatterplot of Galton Data; x-axis: Midparent height (in), y-axis: Child height (in); each cell shows the number of children, with marginal totals]

[Figure: two panels, "Child vs Midparent" and "Child vs Midparent, Child jittered"; x-axis: Midparent, y-axis: Child height]

Regression = means of conditional distributions

[Figure: three panels of Child against Midparent: "trace of actual means: regression of Child on Midparent"; "trace of linear regression means: assumes means lie on a straight line"; "superimposing actual and linear regressions"]

Linear regression model assumes:
1. Conditional distributions are normal
2. Conditional means lie on a straight line
3. Conditional distributions all have same spread

[Figure: two panels, "Linear regression model" and "Linear regression fitted to data"; y-axis: Child]

88

Your notes:

45

mid(ph)  64.0  64.5  65.5  66.5  67.5  68.5  69.5  70.5  71.5  72.5
ave(ch)  65.3  65.6  66.3  66.9  67.6  68.2  68.9  69.5  70.2  70.8

What did Galton get from his linear regression?

He concluded that:

YES, tall parents do tend to have tall children, but their children regress down to the population average

YES, short parents tend to have short children, but their children tend to regress up to the population average

How fortunate. Imagine if this were not so.

He drew similar conclusions about other hereditary factors

such as intelligence for example

Intelligence testing began to take a concrete form with Sir Francis Galton

considered to be the father of mental tests

90

46

What is the purpose of doing regression?

Prediction?

Credit scoring - predict the probability of bad debt from various characteristics of a client

Explanation?

which type of advertising, radio, TV, bill boards, is most effective for improving TESCO sales

Other?

92

Your notes:

47

Does the average diameter of child peas depend on the average diameter of parent peas?

93

94

Your notes:

48

95

Do worksheet 4 on regression

Session 7: Hypothesis formulation and testing

49

H. V. Roberts (1979)

Why did Roberts collect these data?

50

99

Your notes: about the starting salary data

Research Hypothesis (often called the alternative hypothesis)

is what we are trying to prove, i.e.

$H_1$: female starting salaries < male starting salaries

(should have expressed this in terms of conditional distributions)

Null Hypothesis

is the negation of the research hypothesis, i.e. the research hypothesis is null and void

$H_0$: female salaries = male salaries

(should have expressed this in terms of conditional distributions)

Neyman-Pearson: based on a court of law

intention of conviction; assumed innocent until “proved guilty”

Of course, we can always make an error in our decision

102

52

103

Your notes:

Court of law

                Truth (always unknown)
Our decision    Innocent           Guilty
Acquit          Correct decision   Type II error
Convict         Type I error       Correct decision

53

Neyman-Pearson statistical hypothesis test

                Truth (always unknown)
Our decision    $H_0$ true         $H_1$ true
Accept $H_0$    Correct decision   Type II error
Accept $H_1$    Type I error       Correct decision

Diagnostic test

                Truth (gold standard)
Test result     True positive      True negative
Positive        True positives     False positives
Negative        False negatives    True negatives

Sensitivity = proportion of true positives that are correctly identified as such

Specificity = proportion of true negatives that are correctly identified as such
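The two definitions translate directly into arithmetic on the four cells of the table. The counts below are invented for illustration only.

```r
# Hypothetical counts for a diagnostic test compared against a gold standard.
tp <- 90    # truly positive, test positive  (true positives)
fn <- 10    # truly positive, test negative  (false negatives)
fp <- 30    # truly negative, test positive  (false positives)
tn <- 870   # truly negative, test negative  (true negatives)

# Sensitivity: of those who truly have the condition, what fraction test positive?
sensitivity <- tp / (tp + fn)

# Specificity: of those who truly do not, what fraction test negative?
specificity <- tn / (tn + fp)

print(round(c(sensitivity = sensitivity, specificity = specificity), 3))
```

Note that each measure uses one column of the table only, so neither depends on how common the condition is.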

54

107

Your notes:

Elements of a NP Hypothesis Test

Alternative hypothesis (Research hypothesis)

Null hypothesis

Test statistic

Rejection regions

Conclusions

55

109

Your notes:

starting salaries, conditional on gender

110

56

Fisherian Inference

Null hypothesis but no Alternative Hypothesis

compute the p-value of the test statistic under the null hypothesis

small p-values are thought to be evidence against the null hypothesis

Short: NHST

57

Strength of a Statistical Argument

$H \Rightarrow I$; not $I$

Modus Tollens Argument (very strong)
Have some hypothesis H
Has some implication I
-----------------------------
But evidence shows not I
-----------------------------
Therefore conclude not H

58

116

Real life implementation of Modus Tollens
Quasi Modus Tollens Argument (not so strong)
Have some hypothesis H together with assumptions $A_1, A_2, \ldots, A_k$
Have some implication I
-----------------------------
But evidence shows that I is unlikely
-----------------------------
Therefore conclude not H

59

The logic of Neyman-Pearson statistics is to adopt decision procedures with known long-term error rates (of false positives and false negatives) and then control those errors at acceptable levels.
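A long-term error rate can be checked by simulation. The sketch below assumes a setting in which the null hypothesis is true by construction (two groups drawn from the same normal population, with group size and number of repeats chosen arbitrarily) and counts how often a 5% test rejects.

```r
set.seed(4)
# Simulate many studies in which the null hypothesis is TRUE:
# two groups of 30 drawn from the same (hypothetical) normal population.
p_values <- replicate(2000, t.test(rnorm(30), rnorm(30))$p.value)

# Rejecting whenever p < 0.05 should produce a false positive (Type I error)
# in roughly 5% of studies over the long run.
type1_rate <- mean(p_values < 0.05)
print(type1_rate)
```

The observed proportion should land near 0.05, which is what "controlling the error at an acceptable level" means in practice.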

118

Your notes:

60

Bayesian Inference

In the Bayesian approach

$P(H)$ = probability that $H$ is true (prior)

$P(D \mid H)$ = probability of the data given that $H$ is true (likelihood)

$P(H \mid D)$ = probability that $H$ is true given the data (posterior)

Bayesian Inference

Bayes Theorem: posterior ∝ likelihood × prior

$P(H \mid D) \propto P(D \mid H)\, P(H)$
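A small discrete example makes the proportionality concrete. The two hypotheses, the prior and the data below are all invented for illustration.

```r
# Two hypothetical hypotheses about a coin: fair, or biased towards heads.
prior   <- c(fair = 0.5, biased = 0.5)
p_heads <- c(fair = 0.5, biased = 0.8)

# Data: 8 heads in 10 tosses. Likelihood = P(data | hypothesis).
likelihood <- dbinom(8, size = 10, prob = p_heads)

# Posterior is proportional to likelihood x prior;
# dividing by the total makes the posterior probabilities sum to 1.
unnormalised <- likelihood * prior
posterior    <- unnormalised / sum(unnormalised)

print(round(posterior, 3))
```

The "∝" sign hides only the final normalising step; all the evidence enters through the likelihood.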

61

Bayesian Inference

Bayes factor = ratio of two likelihoods: $P(D \mid H)$ for one hypothesis divided by $P(D \mid H)$ for the other

62

Real problem

Mathematical translation

Mathematical solution

Interpret solution

123

124

Please do worksheet 5

63

Session 8: ANOVA

Variation = sum of squared deviations from the mean
Variance = average variation = variation/degrees of freedom

e.g. data: 1, 2, 3, 4, 5; mean = 3; df = 4

variation = $(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2 = 10$

variance = 10/4 = 2.5
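The same calculation can be followed step by step in R, using the five data values from the example above:

```r
x <- c(1, 2, 3, 4, 5)

# Variation: sum of squared deviations from the mean.
variation <- sum((x - mean(x))^2)

# Degrees of freedom: number of data points minus one.
df <- length(x) - 1

# Variance is the variation per degree of freedom; sd is its square root.
variance <- variation / df
std_dev  <- sqrt(variance)

print(c(variation = variation, df = df, variance = variance))
print(round(std_dev, 3))
```

R's built-in `var()` and `sd()` use the same n - 1 divisor, so `var(x)` agrees with the hand calculation.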

126

64

The Analysis of Variance Equations (ANOVA)

total variation = variation explained by the model + variation due to noise

total df = df for model + df due to noise

65

ANOVA in pictures - for a simple regression model

129

[Figure: four panels with y-axis y: "Original data"; "SSTotal = 67.209"; "SSModel = 19.881"; "SSError = 47.328"]

66

Analysis of Variance Table

Source  Df  Sum Sq  Mean Sq  F-ratio
model    1  19.881   19.881   1.6803
error    4  47.328   11.832
Total    5  67.209

R2 = percent variation explained by model = 19.881/67.209 = 29.58%

131

Variance ratio = F-value = 1.68

Residual standard error = sqrt(11.832)
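Every derived quantity in this table follows from the two sums of squares and their degrees of freedom. A short sketch that rebuilds them from the slide's numbers (the variable names are this example's own):

```r
# Sums of squares and degrees of freedom as given in the table.
ss_model <- 19.881; df_model <- 1
ss_error <- 47.328; df_error <- 4

# ANOVA equation: total variation = model variation + noise variation.
ss_total <- ss_model + ss_error

# Mean squares are sums of squares per degree of freedom.
ms_model <- ss_model / df_model
ms_error <- ss_error / df_error

f_ratio   <- ms_model / ms_error   # variance ratio
r_squared <- ss_model / ss_total   # share of variation explained by the model
resid_se  <- sqrt(ms_error)        # residual standard error

# p-value for the F-ratio: upper tail of the F distribution.
p_value <- pf(f_ratio, df_model, df_error, lower.tail = FALSE)

print(round(c(F = f_ratio, R2 = r_squared, resid_se = resid_se, p = p_value), 4))
```

The p-value is an addition not shown on the slide; with an F-ratio this small on 1 and 4 degrees of freedom it is far from the conventional 0.05 threshold.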

67

133

ANOVA Galton Data

Source   Df  Sum Sq  Mean Sq  F value  Pr(>F)
Model     1  1236.9  1236.93   246.84  <2e-16
Error   926  4640.3     5.01
Total   927  5877.2

R-squared:

F-value:

Residual standard error:

68

135

Estimated Coefficients:

            Est     Std. Err  t-value  Pr(>|t|)
Intercept   23.94   2.81        8.517  <2e-16
Midparent    0.646  0.041      15.711  <2e-16

Regression equation:

average height = 23.94 + 0.646 × Midparent height
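The regression equation can be checked against the table of midparent heights and average child heights shown earlier in this session. The sketch below plugs the rounded coefficients from the slide into the equation; because the coefficients are rounded, agreement is only expected to within about 0.1 inch.

```r
# Coefficients as reported on the slide (rounded).
intercept <- 23.94
slope     <- 0.646

# Midparent heights and average child heights (inches) from the earlier table.
midparent       <- c(64.0, 64.5, 65.5, 66.5, 67.5, 68.5, 69.5, 70.5, 71.5, 72.5)
slide_ave_child <- c(65.3, 65.6, 66.3, 66.9, 67.6, 68.2, 68.9, 69.5, 70.2, 70.8)

# Predicted average child height for each midparent height.
predicted <- intercept + slope * midparent
print(round(predicted, 1))

# Largest disagreement with the slide's values.
print(max(abs(predicted - slide_ave_child)))
```

A slope below 1 is the regression effect itself: each extra inch of midparent height adds only about 0.65 inch to the average child height, so children of tall or short parents sit closer to the population average than their parents do.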

69

137

Conclusions:
Although the model is crude, based on midparent height alone, and does not include gender or any other factors that might explain variation in child height, it is nonetheless a highly significant model

138

Your notes:

70

T. Eden and R. A. Fisher (1929) Studies in Crop Variation. VI. Experiments on the Response of the Potato to Potash and Nitrogen. J. Agricultural Science 19, 201–213.

139

140

71

141

72

143

ANOVA - Fisher’s potato data

Source  Df  Sum-Sq  Mean-Sq  F-ratio  Pr(>F)
nitrog   3  209646    69882   31.220  4e-12
potash   3   32926    10975    4.903  0.004
Resid   57  127589     2238
TOTAL   63  370161
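The F-ratios and p-values in this table can be rebuilt from the sums of squares and degrees of freedom alone. A sketch using the table's numbers (variable names are this example's own):

```r
# Sums of squares and degrees of freedom from the table.
ss <- c(nitrog = 209646, potash = 32926, resid = 127589)
df <- c(nitrog = 3,      potash = 3,     resid = 57)

# Mean square = sum of squares per degree of freedom.
ms <- ss / df

# Each factor's F-ratio compares its mean square with the residual mean square.
factors <- c("nitrog", "potash")
f_ratio <- ms[factors] / ms["resid"]

# p-value: upper tail of the F distribution with (factor df, residual df).
p_value <- pf(f_ratio, df[factors], df["resid"], lower.tail = FALSE)

print(round(f_ratio, 3))
print(signif(p_value, 1))
```

Both factors have the same degrees of freedom, so the difference in their F-ratios reflects only how much variation in yield each one accounts for.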

Conclusions?

73

145

Two‐way ANOVA without interaction

Two‐way ANOVA with interaction

146

Your notes:

74

147

Do worksheet 6

Bibliography

1. Cole, T. J. (2000), “Galton’s Midparent Height Revisited,” Annals of Human Biology, 27, 401–405.
2. Daly, C. (1964) Statistical Games. Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 13, No. 2, pp. 74-83.
3. Friendly, M. (2008) The Golden Age of Statistical Graphics. Statistical Science, Vol. 23, No. 4, pp. 502-535.
4. Friendly, M. and Denis, D. (2005) The early origins and development of the scatterplot. Journal of the History of the Behavioral Sciences, Vol. 41(2), 103–130.
5. Friendly, M. (2004) The Past, Present and Future of Statistical Graphics. (An Ideo-Graphic and Idiosyncratic View). http://www.math.yorku.ca/SCS/friendly.html
6. Eden, T. and Fisher, R.A. (1929) Experiments in the response of potato to potash and nitrogen. Studies in Crop Variation, Vol XIX, pp 201–213.
7. Fisher, R.A. (1921) An examination of the yield of dressed grain. Studies in Crop Variation, Vol. XI, pp 107–135.
8. Fisher, R.A. (1934) The Contributions of Rothamsted to the Development of Statistics. Rothamsted Experimental Station Report for 1933, pp 43–50.
9. Galton, F. (1869). Hereditary Genius: An Inquiry into its Laws and Consequences. London: Macmillan.
10. Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
11. Galton, F. (1877), “Typical Laws of Heredity,” in Proceedings of the Royal Institution of Great Britain, 8, pp. 282–301.
12. Galton, F. (1889), Natural Inheritance, London: Macmillan.
13. Galton, F. (1901), “Biometry,” Biometrika, 1, 7–10.
14. Galton, F. (1908), Memories of My Life (2nd ed.), London: Methuen.
15. Gunst, R. (2000) Classical Studies That Revolutionized the Practice of Regression Analysis. Technometrics, February 2000, 42, 1, 62-64.
16. Wickham, H., Cook, D., Hofmann, H., and Buja, A. (2010) Graphical Inference for Infovis.
17. Hanley, J. (2004) “Transmuting” Women into Men: Galton’s Family Data on Human Stature. The American Statistician, Vol. 58, No. 3.
18. Hanley, J. A. (2004), Digital photographs of data in Galton’s notebooks, and related material, available online at http://www.epi.mcgill.ca/hanley/galton.
19. Hanley, J. and Turner, E. (2010) Age in medieval plagues and pandemics: Dances of Death or Pearson’s bridge of life? Significance, June 2010, 85-87.
20. Hanley, J., Julien, M., and Moodie, E.E.M. (2008) Student’s z, t, and s: What if Gosset had R? The American Statistician, February 2008, Vol. 62, No. 1.
21. Jacques, J.A. and Jacques, G.M. (2002) Fisher’s randomization test and Darwin’s data – A footnote to the history of statistics. Mathematical Biosciences, Vol 180, 23–28.
22. Jaggard, K.W., Qi, A. and Ober, E.S. Possible changes to arable crop yields by 2050. Phil. Trans. R. Soc. B, 365, 2835–2851.
23. Nievergelt, Y. (2000). A tutorial history of least squares with applications to astronomy and geodesy. Journal of Computational and Applied Mathematics, 121, 37-72.
24. Pagano, M. and Anoke, S. (2013) Mommy's Baby, Daddy's Maybe: A Closer Look at Regression to the Mean. CHANCE, 26:3, 4-9.
25. Pearson, K. (1930). The Life, Letters and Labours of Francis Galton, Vol. III: Correlation, Personal Identification and Eugenics. Cambridge University Press.
26. Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press.
27. Stigler, S. M. (1999). Statistics on the Table: The History of Statistical Concepts and Methods. Harvard University Press.
28. Sung Sug Yoon, R.N., Vicki Burt, R.N., Tatiana, L., Carroll, M.D. (2012). Hypertension Among Adults in the United States, 2009–2010. NCHS Data Brief, No. 107, October 2012.
29. Wachsmuth, A. and Wilkinson, L. (2003) Galton’s Bend: An Undiscovered Nonlinearity in Galton’s Family Stature Regression Data and a Likely Explanation Based on Pearson and Lee’s Stature Data. Publication details unknown.
30. Wright, K. (2013) Revisiting Immer’s Barley Data. The American Statistician, 67:3, 129-133.
31. A review of basic statistical concepts. Author unknown.
32. Diabetes in the UK 2012.
33. Pearson, K. (1896), “Mathematical Contributions to the Theory of Evolution. III Regression, Heredity and Panmixia,” Philosophical Transactions of the Royal Society of London, Series A, 187, 253–318.
34. Pearson, K. (1930), The Life, Letters and Labours of Francis Galton, (Vol. IIIA), London: Cambridge University Press.
35. Pearson, K., and Lee, A. (1903), “On the Laws of Inheritance in Man: I. Inheritance of Physical Characters,” Biometrika, 2, 357–462.
36. Stigler, S. (1986), “The English Breakthrough: Galton,” in The History of Statistics: The Measurement of Uncertainty before 1900, Cambridge, MA: The Belknap Press of Harvard University Press, chap. 8.
37. Tredoux, G. (2004), Web site http://www.galton.org.
38. Wachsmuth, A., Wilkinson, L., and Dallal, G. E. (2003), “Galton’s Bend: A Previously Undiscovered Nonlinearity in Galton’s Family Stature Regression Data,” The American Statistician, 57, 190–192.
39. Lewi, P. J. Speaking of Graphics. http://www.datascope.be/sog.htm
40. Aldor-Noiman, S., Brown, L. D., Buja, A., Rolke, W., Stine, R.A. (2013) “The Power to See: A New Graphical Test of Normality,” The American Statistician, 67 (4), 249–260.
41. Berk, R., Brown, L., Buja, A., Zhang, K., Zhao, L. (2013) “Valid Post-Selection Inference,” The Annals of Statistics, 41 (2), 802–837.
42. Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.-K., Swayne, D.F., and Wickham, H. (2009) “Statistical Inference for Exploratory Data Analysis and Model Diagnostics,” Philosophical Transactions of the Royal Society A, 367, 4361–4383.
43. Wickham, H., Lawrence, M., Cook, D., Buja, A., Hofmann, H., and Swayne, D.F. (2008) “The Plumbing of Interactive Graphics,” Computational Statistics, April 2008.
44. Gous, A., and Buja, A. (2004) “Visual Comparison of Datasets Using Mixture Distributions,” Journal of Computational and Graphical Statistics, 13 (1), 1–19.
45. Swayne, D.F., Buja, A., and Temple-Lang, D. (2003) “Exploratory Visual Analysis of Graphs in GGobi,” refereed proceedings of the Third Annual Workshop on Distributed Statistical Computing (DSC 2003), Vienna.
46. Buja, A., Lang, D.T., and Swayne, D.F. (2003) “GGobi: Evolving from XGobi into an Extensible Framework for Interactive Data Visualization,” Journal of Computational Statistics and Data Analysis, 43 (4), 423-444.
47. Murrell, P. (2011). R graphics (2nd ed.). London, United Kingdom: Chapman & Hall.
48. Pashler, H. and Wagenmakers, E–J. (Eds.) (2012). Editors’ Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence? Perspectives on Psychological Science, 7(6): 528–530.
49. R Development Core Team. (2015). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org
50. Tay, L., Parrigon, S., Huang, Q., and LeBreton, J.M. (2016) Graphical Descriptives: A Way to Improve Data Transparency and Methodological Rigor in Psychology. Perspectives on Psychological Science.
51. Swayne, D.F., Cook, D., and Buja, A. (1998) “XGobi: Interactive Dynamic Data Visualization in the X Window System,” Journal of Computational and Graphical Statistics, 7, 113–130.
52. Pearson Father – Son data – beautiful discussion: https://rbertolusso.github.io/posts/LR03-residuals-RMSE#code-to-load-data-and-initial-calculations

Worksheet 1: Descriptive statistics

Question 1
Plot the data 2, 3, 5, 8, 12 on a hand drawn dot diagram below
Compute the mean (mean = sum/no of data points)
Compute the variation (sum of squares of data about their mean)
Compute the variance (variation/df)
Compute the sd (square root of variance)

Answers: sum = 30; mean = 30/5 = 6; variation = 66; variance = variation/df = 66/4 = 16.5; sd = sqrt(16.5) = 4.06

Your answers:

Question 2

The number of people in the lunch queue at noon at the Mathematical Institute on 20 working days during January was:

15, 8, 10, 0, 17, 12, 18, 8, 13, 14, 17, 0, 10, 12, 3, 6, 0, 2, 6, 6

The order statistics, plotted on the diagram below, are:

0, 0, 0, 2, 3, 6, 6, 6, 8, 8, 10, 10, 12, 12, 13, 14, 15, 17, 17, 18

Compute the three quartiles.

Create a hand drawn box-and-whisker plot above the dotplot.

Visually estimate the mean and sd.

Answers:

LQ = any number between 3 and 6. Some would use 4.5.

Median = any number between 8 and 10. Some would use 9.

UQ = any number between 13 and 14. Some would use 13.5.

Mean and sd: about 9 and 5 by eye; actual values 8.85 and 5.91.
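The quartiles, mean and sd for the lunch queue data can also be computed in R. This is a sketch under one assumption worth stating: R offers several quartile conventions through the type argument of quantile(), and which one matches a hand calculation depends on the rule used by hand.

```r
# Lunch queue counts from Worksheet 1, Question 2
queue <- c(15, 8, 10, 0, 17, 12, 18, 8, 13, 14, 17, 0, 10, 12, 3, 6, 0, 2, 6, 6)

sort(queue)   # the order statistics

# type = 2 averages the two neighbouring order statistics at a quartile boundary
q_avg <- quantile(queue, probs = c(0.25, 0.5, 0.75), type = 2)

# R's default (type = 7) interpolates linearly between order statistics instead
q_default <- quantile(queue, probs = c(0.25, 0.5, 0.75))

queue_mean <- mean(queue)
queue_sd   <- sd(queue)

q_avg
q_default
c(mean = queue_mean, sd = queue_sd)
```

With type = 2 the quartiles come out as 4.5, 9 and 13.5; the default gives 5.25, 9 and 13.25. Both sets lie inside the ranges accepted above.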

Your answers:

[Figure: dotplot of the lunch queue data. Horizontal axis: "Dotplot of data", 0 to 20; vertical axis: y.]

Question 3

The following scatterplot shows wt (the weight in lbs of an individual) plotted against ht (the height in inches of an individual) for the button data (to come later). Estimate visually the means and sd's for the conditional distributions of weights at heights of 65 and 72 inches respectively.

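Your answers:

Actual values: at 65, mean = 139.6, sd = 17.1; at 72, mean = 160.3, sd = 15.8

R code for dotplot

# Sorted lunch queue counts (the order statistics from Question 2)
x <- c(0, 0, 0, 2, 3, 6, 6, 6, 8, 8, 10, 10, 12, 12, 13, 14, 15, 17, 17, 18)
# Stacking height of each dot: 1 for the first occurrence of a value, 2 for the second, and so on
y <- c(1, 2, 3, 1, 1, 1, 2, 3, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1)
points <- 0:20
# Suppress the default x axis, then draw one with a tick at every integer from 0 to 20
plot(x, y, pch=19, cex=1.5, xlab="Dotplot of data", xlim=c(0,20), ylim=c(0,10), xaxt="n", cex.lab=2)
axis(1, at=points)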

R code for plot of weight vs height

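# Read the button data; the file must contain the columns ht (height) and wt (weight)
bd <- read.table("E:/buttondata.txt", header=TRUE)
plot(bd$ht, bd$wt, xlab="height", ylab="weight", cex.lab=1.5, main="Plot of weight vs height for buttons data")
# Mark the two heights whose conditional distributions are of interest
abline(v=65, lty=2); abline(v=72, lty=2)
# Weights of the individuals whose height is exactly 65 or 72 inches
wt.65 <- bd$wt[bd$ht==65]; wt.72 <- bd$wt[bd$ht==72]
mean.65 <- mean(wt.65); mean.72 <- mean(wt.72)
sd.65 <- sd(wt.65); sd.72 <- sd(wt.72)
# Print the conditional means and sd's
mean.65; mean.72; sd.65; sd.72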

Worksheet 2: Probability

Question 1: The standard statistical model is:

what you observe = truth + error/noise

What contributes to the noise in the reduced Galton data on midparent height and child height?

Your answers:

Question 2: What meanings of probability are invoked in the following statements?

a. Smokers are 23 times more likely to get lung cancer than are non-smokers. How would one compute this? How might this statement be re-phrased?

b. Insurance costs are usually based on risk. Women get a 40% discount on motor insurance in South Africa. Does this mean that women are better drivers? Discuss.

c. You bought four tickets in a lottery. What are your chances of winning?
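For statement (a), one common reading of "times more likely" is a relative risk: the proportion of smokers who get lung cancer divided by the proportion of non-smokers who do. The sketch below uses made-up counts, chosen only so that the ratio comes out at 23; they are not real data.

```r
# Hypothetical cohort counts (illustrative only, not real data)
smokers_n    <- 1000; smokers_cases    <- 46
nonsmokers_n <- 1000; nonsmokers_cases <- 2

risk_smokers    <- smokers_cases / smokers_n         # relative frequency among smokers
risk_nonsmokers <- nonsmokers_cases / nonsmokers_n   # relative frequency among non-smokers

# Relative risk: how many times larger the risk is among smokers
relative_risk <- risk_smokers / risk_nonsmokers
relative_risk
```

Under this reading, probability is being treated as a relative frequency in a population, which is only one of the meanings the question invites you to consider.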

Your answers:

Worksheet 3: Sampling and confidence intervals

Do this during lunch break; work in pairs.

1. Take a random sample of 20 buttons. Record ht and wt data in Excel. Don't record BMI, bp and db.

2. Use Excel to compute means and sd's. Compute the quantiles by sorting the data from smallest to largest.

3. Compute 95% confidence intervals for each of the parameters measured.

case    ht    wt    BMI    bp    db
1
2
3
...
20
mean
sd
min
LQ
MED
UQ
max
LCL
UCL

LCL = Lower Confidence Limit; UCL = Upper Confidence Limit

A 95% CI for the mean (ht or wt or BMI) of the population is:

$$\text{sample average} \pm 1.96 \times \frac{\text{standard deviation}}{\sqrt{\text{sample size}}}$$

A 95% CI for the prevalence (% diabetic or % hypertensive) is:

$$\text{sample prevalence} \pm 1.96 \times \sqrt{\frac{\text{prevalence} \times (1 - \text{prevalence})}{\text{sample size}}}$$

Use the following functions in Excel: average, stdev.s, min, max.

To compute the quantiles, order the data from smallest to largest.
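The two confidence interval formulas above can be sketched in R as follows. The sample values are hypothetical, invented for illustration; they are not the button data, and the variable names are assumptions.

```r
# Hypothetical sample of 20 heights in inches (illustrative only)
ht <- c(64, 66, 67, 65, 70, 72, 68, 69, 63, 71,
        66, 67, 68, 70, 65, 69, 72, 64, 67, 68)
# Hypothetical 0/1 diabetes indicator for the same 20 cases (1 = diabetic)
db <- c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 1, 0, 0, 1)

n <- length(ht)

# 95% CI for the mean: sample average +/- 1.96 * sd / sqrt(sample size)
ci_mean <- mean(ht) + c(-1, 1) * 1.96 * sd(ht) / sqrt(n)

# 95% CI for the prevalence: p +/- 1.96 * sqrt(p * (1 - p) / sample size)
p       <- mean(db)
ci_prev <- p + c(-1, 1) * 1.96 * sqrt(p * (1 - p) / n)

ci_mean   # LCL and UCL for the mean height
ci_prev   # LCL and UCL for the prevalence
```

The multiplier 1.96 comes from the normal distribution. With a sample as small as 20, some would replace it by the t quantile qt(0.975, n - 1), which gives a slightly wider interval.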

Worksheet 4: Regression

1 Assume the linear regression model holds for these data.

What are the assumptions of the linear regression model?

2 Locate visually the means of the conditional distributions at ht = 65 and 72. Fit a linear regression line through these data by hand.

Actual values: at 65, mean = 139.6, sd = 17.1; at 72, mean = 160.3, sd = 15.8

3 Estimate the slope of your fitted line. NB: slope = rise/run.

Estimated from the two conditional distributions: slope = (160.3 - 139.6)/7 = 2.96

Estimated from fitting the linear regression model to all the data: slope = 3.07

4 Estimate visually the standard deviation of the data about the regression line.

Estimated from fitting the linear regression model to the full data: residual standard error = 15.36

5 Describe the conditional distributions of weights at heights of 65 and 72 inches.

6 Can one predict weight from height? What can one predict? What are the uncertainties?

7 Can you describe or interpret what the regression model is telling us?
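The rise/run slope estimate in item 3 can be reproduced from the two conditional means quoted on this worksheet. This is a minimal sketch; the prediction at 68 inches is an added illustration, not a value from the worksheet.

```r
# Conditional means quoted on the worksheet (weight in lbs at heights of 65 and 72 inches)
mean.65 <- 139.6
mean.72 <- 160.3

# Slope = rise / run between the two conditional means
slope <- (mean.72 - mean.65) / (72 - 65)

# Line through the two conditional means, and its predicted mean weight at 68 inches
intercept <- mean.65 - slope * 65
pred.68   <- intercept + slope * 68

round(c(slope = slope, intercept = intercept, pred.68 = pred.68), 2)
```

The full-data slope of 3.07 would instead come from a least squares fit, for example lm(wt ~ ht, data = bd), assuming the button data have been read into bd as in the code below.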

R code for plot of weight vs height

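# Read the button data; the file must contain the columns ht (height) and wt (weight)
bd <- read.table("E:/buttondata.txt", header=TRUE)
plot(bd$ht, bd$wt, xlab="height", ylab="weight", cex.lab=1.5, main="Plot of weight vs height for buttons data")
# Mark the two heights whose conditional distributions are of interest
abline(v=65, lty=2); abline(v=72, lty=2)
# Weights of the individuals whose height is exactly 65 or 72 inches
wt.65 <- bd$wt[bd$ht==65]; wt.72 <- bd$wt[bd$ht==72]
mean.65 <- mean(wt.65); mean.72 <- mean(wt.72)
sd.65 <- sd(wt.65); sd.72 <- sd(wt.72)
# Print the conditional means and sd's
mean.65; mean.72; sd.65; sd.72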

Worksheet 5: Hypothesis formulation and testing

Question 1: Discuss the possible Type I and Type II errors in the following settings. One might also want to consider the problem from the perspective of different individuals or groups.

1 A new drug for treating HIV/AIDS is proposed and a clinical trial is designed to compare it with the existing drug.

Possible individuals or groups: the researchers who propose the drug; the person with HIV/AIDS and their family; the drug company that produces the drug for treating HIV/AIDS.

Research hypothesis:

Null hypothesis:

Type I Error:

Consequences of a Type I Error:

Type II Error:

Consequences of a Type II Error:

2 A teenager gets caught up in gang violence and is given a life sentence, at age 18, on conviction for the second-degree murder of an opposing gang member.

Possible individuals or groups: the teenager who is sentenced; her family; the family of the person who was killed; the legal or justice community who carry the burden of justice; the community at large who expect justice from the legal system.

Research hypothesis:

Null hypothesis:

Type I Error:

Consequences of a Type I Error:

Type II Error:

Consequences of a Type II Error:


                 Innocent                     Guilty
Our decision
Acquit           Correct decision             Type II Error
                                              β = Pr(Type II Error)
Convict          Type I Error                 Correct decision
                 α = Pr(Type I Error)         1 − β = Power = Pr(Convicting | Guilty)

                 H0 true                      HA true
Our decision
Accept H0        Correct decision             Type II Error
                                              β = Pr(Type II Error)
Accept HA        Type I Error                 Correct decision
                 α = Pr(Type I Error)         1 − β = Power = Pr(Rejecting H0 | H0 false)

3 A student is accused of plagiarism, her dissertation is rejected and she is excluded from the university because of it.

Possible individuals or groups: the student; the university; the community.

Research hypothesis:

Null hypothesis:

Type I Error:

Consequences of a Type I Error:

Type II Error:

Consequences of a Type II Error:

4 For the starting salary data, the null hypothesis is that the average salaries paid to men and women are equal, and that the observed differences were due to other circumstantial factors.

Possible groups: males; females; the bank; the community.

Research hypothesis:

Null hypothesis:

Type I Error:

Consequences of a Type I Error:

Type II Error:

Consequences of a Type II Error:
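To see how α, β and power in the decision tables above relate numerically, the sketch below works through a deliberately simple case: a one-sided z-test with known standard deviation. The effect size and sample size are assumptions made up for illustration and do not come from any of the four settings.

```r
# Hypothetical one-sided z-test of H0: mean = 0 against HA: mean > 0, with known sd = 1
alpha  <- 0.05   # chosen Type I error probability
effect <- 0.5    # assumed true mean under HA, in sd units (illustrative)
n      <- 25     # assumed sample size (illustrative)

# Reject H0 when the z statistic exceeds this critical value
crit <- qnorm(1 - alpha)

# Under HA the z statistic is normal with mean effect * sqrt(n) and sd 1
power <- pnorm(crit, mean = effect * sqrt(n), sd = 1, lower.tail = FALSE)
beta  <- 1 - power   # Type II error probability

round(c(alpha = alpha, beta = beta, power = power), 3)
```

Lowering alpha raises the critical value and therefore raises beta for the same effect and sample size, which is the trade-off the worksheet asks you to weigh from each group's perspective.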

Question 2: In a Galton regression setting, where one is comparing the conditional distributions of child height at given values of midparent height, what is the null hypothesis of interest for this research problem? What would be considered sufficient or convincing evidence for rejecting the null hypothesis?

Hint: Whatever is computed from a random sample is itself a random variable, whose sampling distribution we can compute.
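The hint can be illustrated by simulation: draw many random samples from a population, compute the same statistic from each, and look at how that statistic varies. The population below is hypothetical (its mean, sd and the sample size are invented), so this is a sketch of the idea rather than an analysis of the Galton data.

```r
set.seed(1)   # for reproducibility

# Hypothetical population of child heights: normal, mean 68 in, sd 2.5 in (illustrative)
pop_mean <- 68
pop_sd   <- 2.5
n        <- 25      # size of each random sample
reps     <- 10000   # number of repeated samples

# Each repetition draws a fresh sample and records its mean
sample_means <- replicate(reps, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))

# The spread of the sample means should be close to pop_sd / sqrt(n)
c(simulated_se = sd(sample_means), theoretical_se = pop_sd / sqrt(n))
```

A histogram of the simulated means, hist(sample_means), shows the sampling distribution against which an observed mean could be judged.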

Worksheet 6: ANOVA

Question 1: For the Fisher data there are a number of hypotheses one might wish to consider about the effect of potash, nitrogen and the interaction of potash and nitrogen.

1. Formulate conceptually and in words the mathematical model for these data.

2. What would be sensible null hypotheses for this experiment?

3. How convincing is the evidence against the null hypotheses judging from the graphs? Does one in fact need a statistical test against the null hypotheses? Or is the evidence against the null hypotheses simply overwhelming?

ANOVA for Potato Data including an interaction term

Source            Df   Sum Sq   Mean Sq   F value   p-value
nitrogen           3   209646     69882    32.299   2.72e-11   (main effect of nitrogen)
potash             3    32926     10975     5.073   0.00413    (main effect of potash)
nitrogen:potash    9    18925      2103     0.972   0.47556    (interaction term)
Residuals         45    97361      2164                        (noise term)
TOTAL             60   358858

p-value = the probability, computed assuming the null hypothesis is true, of obtaining an F value at least as large as the one observed. Small p-values provide strong evidence against the null hypothesis.

Interpret the ANOVA for these data keeping in mind the assumptions of the model and the possible effect on the computed statistics of the model assumptions being violated.
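The Mean Sq, F value and p-value columns of the potato ANOVA follow from the Df and Sum Sq columns. The sketch below recomputes them from the figures printed in the table; it does not refit the model, since the raw data are not reproduced here, and the variable names are illustrative.

```r
# Degrees of freedom and sums of squares copied from the ANOVA table for the potato data
df <- c(nitrogen = 3, potash = 3, interaction = 9)
ss <- c(nitrogen = 209646, potash = 32926, interaction = 18925)
df_resid <- 45
ss_resid <- 97361

ms       <- ss / df               # Mean Sq = Sum Sq / Df
ms_resid <- ss_resid / df_resid   # residual (noise) mean square
f_value  <- ms / ms_resid         # F = effect mean square / residual mean square

# Upper tail area of the F distribution; valid only if the model assumptions hold
p_value <- pf(f_value, df, df_resid, lower.tail = FALSE)

data.frame(Df = df, SumSq = ss, MeanSq = round(ms),
           F = round(f_value, 3), p = signif(p_value, 3))
```

The p-values rest on the F distribution, so they are only as trustworthy as the normality and equal-variance assumptions behind it.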

Question 2: Practical ANOVA computation for fictitious data to compare Apples, Pears and Oranges.

Construct the ANOVA for the data in the table below, chosen for ease of mental arithmetic.

ANOVA model:

Total variation = variation due to model + variation about model

Total df = df for model + df due to noise

Strategy: We'll compute the total variation and the variation about the model, and then obtain the variation explained by the model by subtraction. The df computations are done similarly. Of course, we could compute the variation and df due to the model directly.

Compute the following:

sum for each group

mean for each group

variation for each group = sum of squared deviations of group data points about group means

variation about model = sum of variations for each group about their group means

df for each group = number of observations in group - 1

sum of all data

average of all data

total variation for all data = sum of squared deviations of data points about the overall mean

total df = total number of data points - 1

                              Apples   Pears   Oranges
data                          1        2       3
                              2        3       4
                              3        4       5
                              4        5       6
                              5        6       7
sum for each group
mean for each group
variation about group mean
df for each group

ANOVA

Source         Variation (also called Sum of Squares, SS)   df   Variance (also called Mean Square, MS = SS/df)   F (variance ratio)
model
error/noise
Total

Your rough work

Remark: Clearly the assumption of normality does not hold for these data, because the observations in the groups are uniformly distributed; yet one can still apply the ANOVA methodology. The distribution of the F statistic, which is derived assuming normal data, will therefore not be correct, and our inferences will be wobbly. We should always check the model assumptions to evaluate the suitability of the F-test. A good question would be how to check the model assumptions; that would need an applied statistics course.

Source   SS   df   MS    F
model    10    2   5     2
noise    30   12   2.5
Total    40   14
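The subtraction strategy for the Apples, Pears and Oranges data can be checked in R. This is a minimal sketch that follows the worksheet steps by hand rather than calling a model-fitting function; the object names are illustrative.

```r
# Fictitious data from Worksheet 6, Question 2
fruit    <- list(Apples = 1:5, Pears = 2:6, Oranges = 3:7)
all_data <- unlist(fruit)

# Variation about the model: squared deviations about each group's own mean
ss_noise <- sum(sapply(fruit, function(g) sum((g - mean(g))^2)))
df_noise <- sum(sapply(fruit, length) - 1)

# Total variation: squared deviations about the overall mean
ss_total <- sum((all_data - mean(all_data))^2)
df_total <- length(all_data) - 1

# Variation explained by the model, obtained by subtraction
ss_model <- ss_total - ss_noise
df_model <- df_total - df_noise

# F is the ratio of the model mean square to the noise mean square
f_ratio <- (ss_model / df_model) / (ss_noise / df_noise)
c(ss_model = ss_model, ss_noise = ss_noise, ss_total = ss_total, F = f_ratio)
```

R's built-in aov() function, applied to the same values with a three-level grouping factor, should reproduce this table.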
