1
Faculty of Social Sciences Induction Block: Maths & Statistics Lecture 1
Overview; Variables, Constants, Tables & Graphs
Dr Gwilym Pryce
2
Aims and Objectives of the Maths & Stats Induction
Aim: to revise basic maths relevant to the course.
Objectives: by the end of the Induction Programme students should be able to:
• Understand the meaning and types of variables and constants
• Understand how to graph scale and categorical variables
• Be familiar with basic algebraic notation
• Understand the simple mathematical representation of relationships, both algebraically and graphically
• Understand the basic principles and laws of probability
• Outline the main issues surrounding sampling.
3
Why do social scientists need to learn about statistics?
Theories have to be verified empirically, otherwise they remain conjectures
Need for evidence-based practice & policy:
– medicine
– public health
– economics
Informed decisions are better than uninformed decisions
Information is complex and needs summarising in a way that reflects the underlying data in a meaningful way
4
Why do we need mathematics?
Statistics can be represented in a non-mathematical way, but some understanding and application of maths will help us:
– spoken language can be ambiguous and varies across countries and cultures
5
Different cultures find different things funny
Different cultures and languages express ideas differently
But mathematical notation is:
– unambiguous and concise
– common notation is understood across cultures and languages
Research & ideas expressed mathematically can easily reach an international audience
6
Plan of Maths & Stats Induction
Lecture 1: Variables, Constants, Tables & Graphs
Lecture 2: Algebra and Notation
Lecture 3: Precise and Approximate Relationships between variables
Lecture 4: Probability
Lecture 5: Inference
Lecture 6: Hypothesis tests
Tutorial: Samples and populations; Validity and Reliability
7
Plan of Maths & Stats Lecture 1: Variables and Constants
1. What is a variable?
2. What is a constant?
3. Types of variables
4. Graphs of single variables
– Why summarise?
– Tables & graphs of categorical data
– Tables & graphs of continuous / quantitative / scale variables
8
1. What is a variable?
– A measurement or quantity that can take on more than one value:
  • E.g. size of planet: varies from planet to planet
  • E.g. weight: varies from person to person
  • E.g. gender: varies from person to person
  • E.g. fear of crime: varies from person to person
  • E.g. income: varies from household (HH) to household
– I.e. values vary across ‘individuals’ = the objects described by our data
9
Individuals = basic units of a data set whom we observe or experiment on in a controlled way
– not necessarily persons
  • (could be schools, organisations, countries, groups, policies, or objects such as cars or safety pins)
Variables = information that can vary across the individuals we observe
– e.g. age, height, gender, income, exam scores, whether signed Nuclear Test Ban Treaty
10
2. What is a constant?
– A measurement or quantity that has only one value for all the objects described in our data
– Also called a ‘scalar’ or ‘intercept’ or ‘parameter’
  • E.g. speed of light in a vacuum: constant for all light transmissions
  • E.g. ratio of diameter to circumference: constant for all circles
  • E.g. average increase in life expectancy: constant at 1 year per annum since 1900
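The difference between a variable and a constant can be checked numerically. The sketch below is only an illustration: the radii are made-up values, and the names `radii`, `diameters` and `ratios` are not from the lecture. The diameter varies from circle to circle, while the ratio of diameter to circumference is the same for every circle.

```python
import math

# Hypothetical circles (radii are illustrative values, not lecture data)
radii = [0.5, 1.0, 2.0, 10.0]

diameters = [2 * r for r in radii]                  # a variable: differs across circles
circumferences = [2 * math.pi * r for r in radii]   # also a variable

# Ratio of diameter to circumference: the same value (1/pi) for every circle
ratios = [d / c for d, c in zip(diameters, circumferences)]

for r, ratio in zip(radii, ratios):
    print(f"radius {r}: diameter/circumference = {ratio:.6f}")
```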
11
Often it is a constant that we want to estimate:
– we employ statistical techniques to estimate ‘parameters’ or ‘constants’ that summarise or link variables.
  • e.g. mean = ‘typical’ value of a variable = measure of central tendency
  • e.g. standard deviation = measure of the variability of a variable = measure of spread
  • e.g. correlation coefficient = measures the strength of the linear association between two variables
  • e.g. slope coefficient = how much y increases when x increases by one unit
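A minimal sketch of how these four summary constants might be computed, assuming a small made-up data set (the `x` and `y` values below are hypothetical, not lecture data). The correlation and slope are written out from their textbook formulas rather than taken from a library, so each step is visible.

```python
import statistics

# Hypothetical data: x = hours studied, y = exam score (illustrative only)
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 68]

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)      # 'typical' value of y
sd_y = statistics.stdev(y)       # sample standard deviation: spread of y

# Sums of squares and cross-products around the means
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

r = sxy / (sxx * syy) ** 0.5     # correlation coefficient
slope = sxy / sxx                # least-squares slope: change in y per unit of x

print(f"mean of y = {mean_y}, sd of y = {sd_y:.2f}")
print(f"correlation = {r:.3f}, slope = {slope:.1f}")
```

Each result is a single number that summarises the whole data set, which is the sense in which the slides call these 'constants' or 'parameters'.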
12
3. Types of variables:
Numeric = values are numbers that can be used in calculations.
String = values are not numeric, and hence not used in calculations.
– But can often be coded, i.e. transformed into a numerical variable:
  • e.g. If (country = ‘Argentina’) X = 1.
         If (country = ‘Brazil’) X = 2.
         etc.
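The coding rule above can be sketched in Python as a lookup table. The dictionary name `country_codes` and the sample list `countries` are illustrative assumptions; only the two codes (Argentina = 1, Brazil = 2) come from the slide.

```python
# Code book: the numbers are labels only, taken from the example on the slide
country_codes = {"Argentina": 1, "Brazil": 2}

# Hypothetical string variable observed for three individuals
countries = ["Brazil", "Argentina", "Brazil"]

# Coded numeric variable X
x = [country_codes[c] for c in countries]
print(x)  # [2, 1, 2]
```

The codes make the variable numeric in storage, but arithmetic on them (for example, an average of 1.67) has no meaning, because the numbers only stand for country names.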
13
Scale or quantitative variables = data values are numeric values on an interval or ratio scale
– (e.g., age, income). Scale variables must be numeric.
– E.g. dimmer switch: brightness of light can be measured along a continuum from dark to full brightness
Categorical variables = variables that have values which fall into two or more discrete categories
– E.g. conventional light switch: either total darkness or full brightness, on or off.
– Male or female, employment category, country of origin
14
Two types of categorical variables: ordinal and nominal
Ordinal variables = data values represent categories with some intrinsic order
– (e.g., low, medium, high; strongly agree, agree, disagree, strongly disagree).
– Ordinal variables can be either string (alphanumeric) or numeric values that represent distinct categories (e.g., 1=low, 2=medium, 3=high).
15
Ordinal variables:
Values fall within discrete but ordered categories
– I.e. the sequence of categories has meaning
  • e.g. education categories:
    – 1 = primary
    – 2 = secondary
    – 3 = college
    – 4 = university undergraduate
    – 5 = university postgraduate masters
    – 6 = university postgraduate PhD
  • e.g. 1 = very poor, 2 = poor, 3 = good, 4 = very good
16
Nominal variables
Nominal variables = data values represent categories with no intrinsic order
– sequence of categories is arbitrary: ordering has no meaning in and of itself:
  • e.g. country of origin: Wales, Scotland, Germany…
  • e.g. make of car: Ford, Vauxhall
  • e.g. job category
  • e.g. company division
– Nominal variables can be either string (alphanumeric) or numeric values that represent distinct categories (e.g., 1=Male, 2=Female).
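One practical consequence of the ordinal/nominal distinction is which summaries make sense. The sketch below assumes made-up responses (`respondents_education` and `car_makes` are hypothetical); the education code book is the one listed in the slides. For an ordinal variable the middle category is meaningful because the codes are ordered; for a nominal variable only the most frequent category is.

```python
import statistics

# Ordinal: education code book from the slides (order has meaning)
education_codes = {
    1: "primary",
    2: "secondary",
    3: "college",
    4: "university undergraduate",
    5: "university postgraduate masters",
    6: "university postgraduate PhD",
}

# Hypothetical responses from seven individuals
respondents_education = [2, 4, 1, 2, 5, 3, 2]

# The median category is meaningful because the codes are ordered
median_code = statistics.median_low(respondents_education)
print("Median education:", education_codes[median_code])

# Nominal: make of car (order is arbitrary), hypothetical observations
car_makes = ["Ford", "Vauxhall", "Ford"]

# Only the most common category (the mode) is meaningful here
most_common_make = statistics.mode(car_makes)
print("Most common make:", most_common_make)
```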
17
4. Graphs of Variables:
Why summarise?
Tables & graphs of categorical data
Tables & graphs of continuous / quantitative / scale variables
18
Why Summarise?
Small data sets can be presented in their entirety
  • e.g. if we only have 10 observations and 3 variables, we can list all the data
  • but even then we might want to know what the typical value of a variable is
Large data sets require summary
Lots of information can be confusing, particularly if numerical
  • most of us need headline figures or stylised facts to be able to absorb information.
19
Graphical summaries:
– allow us to visualise the distribution of data across different values or categories
  • how many (or what proportion) of cases fall within certain categories or ranges of values?
Summary statistics:
– describe the distribution of a single variable
20
Tables of Categorical Data
Categories are listed either in columns or rows (respecting order if ordinal)
  • Count or % of cases in each category listed
If number of categories is large, may be useful to group categories together:
  • e.g. Country of origin ---> collapse to continents
Good tables:
  • give clear messages: tell a story
  • too much info in a table defeats its purpose
  • Source always given
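As a rough illustration of building such a table, the sketch below collapses country of origin into continents and lists the count and % of cases in each group. The observations in `origins` and the lookup `continent_of` are hypothetical, chosen only to show the mechanics.

```python
from collections import Counter

# Hypothetical grouping of countries into continents
continent_of = {
    "Wales": "Europe",
    "Scotland": "Europe",
    "Germany": "Europe",
    "Brazil": "South America",
    "Argentina": "South America",
}

# Hypothetical country of origin for eight individuals
origins = ["Wales", "Brazil", "Germany", "Scotland",
           "Argentina", "Wales", "Germany", "Wales"]

# Count cases per continent, then convert counts to percentages
counts = Counter(continent_of[c] for c in origins)
total = sum(counts.values())
table = {k: (n, round(100 * n / total)) for k, n in counts.items()}

print(f"{'Continent':<15}{'Count':>6}{'%':>5}")
for continent, (n, pct) in table.items():
    print(f"{continent:<15}{n:>6}{pct:>5}")
```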
21
Income Support claimants with housing costs by statistical group: May 1999

Year   Total   Aged 60 or over   Lone Parents   Disabled      Other
       000s    000s     %        000s     %     000s    %     000s    %
1999   263     93       35       67       25    88      33    15      6

Total = All Claimants with Mortgage Interest
Source: DSS Quarterly Statistical Enquiry
22
Income Support claimants with housing costs by statistical group: May 1993 to May 1999

Year   Total   Aged 60 or over   Lone Parents   Disabled      Other
       000s    000s     %        000s     %     000s    %     000s    %
1993   284     88       31       107      38    63      22    27      10
1994   310     96       31       115      37    74      24    25      8
1995   329     102      31       115      35    87      26    25      8
1996   322     103      32       104      32    89      28    25      8
1997   301     98       33       92       31    89      30    22      7
1998   281     96       34       78       28    90      32    17      6
1999   263     93       35       67       25    88      33    15      6

Total = All Claimants with Mortgage Interest
Source: DSS Quarterly Statistical Enquiry
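The % columns in this table can be reproduced from the counts: each percentage is the group's count divided by that year's total, rounded to a whole number. The sketch below uses the counts (in 000s) from the table; the variable names are illustrative.

```python
groups = ["Aged 60 or over", "Lone Parents", "Disabled", "Other"]

# year: (total 000s, [counts in 000s for each group, in the order above])
counts_by_year = {
    1993: (284, [88, 107, 63, 27]),
    1994: (310, [96, 115, 74, 25]),
    1995: (329, [102, 115, 87, 25]),
    1996: (322, [103, 104, 89, 25]),
    1997: (301, [98, 92, 89, 22]),
    1998: (281, [96, 78, 90, 17]),
    1999: (263, [93, 67, 88, 15]),
}

# Percentage share of each group in each year's total
shares = {
    year: [round(100 * n / total) for n in counts]
    for year, (total, counts) in counts_by_year.items()
}

for year, pcts in shares.items():
    row = ", ".join(f"{g} {p}%" for g, p in zip(groups, pcts))
    print(year, row)
```

Reading down the Lone Parents column shows the kind of story a table can tell: the share falls from 38% in 1993 to 25% in 1999.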
23
Graphs of Categorical Data
Pie Charts
– If all the categories sum to a meaningful total, then you can use a pie chart
– Pie charts emphasise the differences in proportions between categories
– OK for a single snapshot, but not very good for showing trends
  • would need to have a separate pie chart for each year
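A pie chart is only sensible when the categories add up to a meaningful total, because each slice is that category's share of the full 360 degrees. The sketch below checks this for the May 1999 counts (in 000s) from the Income Support table and works out the slice angles; the variable names are illustrative.

```python
# May 1999 counts in 000s, from the Income Support table in these slides
claimants_1999 = {
    "Aged 60 or over": 93,
    "Lone Parents": 67,
    "Disabled": 88,
    "Other": 15,
}

# The categories sum to the table's total of 263 (000s)
total = sum(claimants_1999.values())

# Each slice's angle is its share of the total, times 360 degrees
angles = {group: 360 * n / total for group, n in claimants_1999.items()}

for group, angle in angles.items():
    print(f"{group}: {angle:.1f} degrees")
```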
24
[Pie chart: Income Support claimants with housing costs by statistical group in May 1999. Slices: Aged 60 or over 36%, Lone Parents 25%, Disabled 33%, Other 6%]
What’s missing?
25
Bar Charts
– can show either % or count
– not very good for showing trends in more than one category
26
[Bar chart: Income Support claimants with housing costs by statistical group in May 1999. X-axis: Category of Claimant (Aged 60 or over, Lone Parents, Disabled, Other); Y-axis: 000's (0 to 100). Source: DSS Quarterly Statistical Enquiry]
28
[Chart: Income Support claimants with housing costs by statistical group: May 1993 to May 1999. Axes: Year (1993 to 1999); Category (Aged 60 or over, Lone Parents, Disabled, Other); 000's (0 to 120). Source: DSS Quarterly Statistical Enquiry]
29
[Chart: Income Support claimants with housing costs by statistical group: May 1993 to May 1999. X-axis: Year (1993 to 1999); Y-axis: 000's (0 to 140); series: Other, Disabled, Lone Parents, Aged 60 or over]
31
[Chart: Income Support claimants with housing costs by statistical group: May 1993 to May 1999. X-axis: Year (1993 to 1999); Y-axis: 000's (60 to 120); series: Lone Parents]
32
[Chart: Income Support claimants with housing costs by statistical group: May 1993 to May 1999. X-axis: Year (1993 to 1999); Y-axis: 000's (0 to 200); series: Lone Parents]
33
Beware of small print...
Notes:
1. Numbers are based on a 1 per cent sample up to May 1994, and a 5 per cent sample thereafter. Numbers are subject to a degree of sampling error.
2. Children are defined as aged 0 to 15.
3. Single with children includes all single people with a child aged between 0 and 15 years of age, and will include claimants regardless of client group, i.e. may include claimants normally classified as "disabled" or "pensioners". Numbers will exclude some single people who only have dependants aged 16 to 19.
4. Income Support for the unemployed was replaced by Income-based Jobseeker's Allowance in October 1996. To provide a consistent time-series, the unemployed have been removed from the figures quoted in these tables from 1993 to August 1996.
Source: Income Support Statistics Quarterly Enquiries, May 1993 to May 2000.
34
Tabulating and Graphing Scale Data
Scale or quantitative data: usually a measurement of size or quantity
– not meaningful to report % or count unless we break the values into categories (& then it becomes categorical data!)
  • e.g. income
Tables of raw data are not much use unless there are only a few values...
35
How do we tabulate 129,000 observations?

CM SML 1988
Borrower  Total Income   Borrower  Total Income   Borrower  Total Income   Borrower  Total Income
1         .              21        10800          41        .              61        .
2         .              22        .              42        7216           62        .
3         .              23        19072          43        .              63        .
4         .              24        .              44        12000          64        .
5         .              25        .              45        9758           65        .
6         .              26        .              46        6084           66        .
7         .              27        .              47        .              67        .
8         .              28        .              48        .              68        .
9         .              29        .              49        .              69        18336
10        .              30        .              50        9345           70        15096
11        .              31        .              51        9810           71        .
12        .              32        .              52        14406          72        12597
13        .              33        .              53        9190           73        9700
14        .              34        .              54        .              74        .
15        18720          35        .              55        .              75        .
16        16000          36        .              56        .              76        .
17        16455          37        .              57        .              77        .
18        .              38        11500          58        .              78        5295
19        7020           39        2912           59        .              79        4539
20        4576           40        11745          60        .              80        .
36
What are we interested in when describing the income data?
– Is income evenly spread?
– Or are most people rich?
– Or are most people poor?
– Or are most reasonably well off?
These are all questions about the variable’s distribution
– We can represent the whole data set with one picture...
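A frequency distribution of this kind can be sketched by grouping the incomes into equal-width bands and counting the cases in each. The example below uses only the 24 non-missing Total Income values from the first 80 borrowers listed in the raw-data table; the band width of 5,000 is an assumption chosen for illustration, not the width used in the lecture's charts.

```python
from collections import Counter

# Non-missing Total Income values for borrowers 1 to 80 (from the raw-data table)
incomes = [
    18720, 16000, 16455, 7020, 4576,
    10800, 19072, 11500, 2912, 11745,
    7216, 12000, 9758, 6084, 9345, 9810, 14406, 9190,
    18336, 15096, 12597, 9700, 5295, 4539,
]

bin_width = 5000  # assumed band width, for illustration only

# Map each income to the lower edge of its band, then count cases per band
counts = Counter((value // bin_width) * bin_width for value in incomes)

# Print a simple text histogram: one '#' per borrower in the band
for lower in sorted(counts):
    upper = lower + bin_width - 1
    print(f"{lower:>6} - {upper:>6}: {'#' * counts[lower]}")
```

Even with so few cases, the picture answers the questions above at a glance: the largest band is 5,000 to 9,999, with fewer borrowers at the very bottom.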
37
[Histogram: TOTAL INCOME OF BORROWER(S). X-axis: income bands from 0.0 - 1500.0 to 57000.0 - 58500.0; Y-axis: 0 to 12000. Std. Dev = 12830.02, Mean = 17993.3, N = 125541.00]
38
[Histogram: LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data). X-axis: LTV bands from -.25 - .25 to 16.75 - 17.25; Y-axis: Frequency (0 to 60000). Std. Dev = .25, Mean = .80, N = 74736.00]
39
[Histogram: LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data). X-axis: LTV bands from 0.00 - .05 to 1.45 - 1.50; Y-axis: Frequency (0 to 30000). Std. Dev = .22, Mean = .80, N = 74552.00]
40
[Histogram: LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data). X-axis: LTV bands from 0.00 - .05 to .95 - 1.00; Y-axis: Frequency (0 to 30000). Std. Dev = .22, Mean = .78, N = 70545.00]