Exploring Data: Frequencies, Central Tendency, Dispersion and Standard Deviation
SIT094 The Collection and Analysis of Quantitative Data
Week 3
Luke Sloan
About Me
• Name: Dr Luke Sloan
• Office: 0.56 Glamorgan
• Email: [email protected]
• To see me: please email first
Introduction
• Collecting Quantitative Data
• Levels of Measurement
• Frequencies & Fidelity
• Central Tendency
• Dispersion
• Summary
Collecting Quantitative Data I
“Research involving the collection of data in numerical form… the defining factor is that
numbers result from the process, whether the initial data collection produced numerical values,
or whether non-numerical values were subsequently converted to numbers as part of
the analysis process…”
Source: Jupp 2006:250
Collecting Quantitative Data II
• Operationalising of social concepts
• Quantifying ‘fuzzy’ data into VARIABLES
• How to measure feelings, attitudes, behaviours, beliefs and attributes?
• Numbers allow statistical tests
• Statistical tests allow generalisations to be made
• Characterisation from samples to populations
Collecting Quantitative Data III
• Capture data using instruments
• Surveys (paper, online, telephone, in person)
• Secondary data analysis
• Experiments – difficult outside of the natural sciences
• But social scientists try to emulate the natural science model (remember Popper’s Falsification Principle?)
• But not all data is equal (some are more equal than others!)
Levels of Measurement I
• Nominal (categorical)
– Description: response categories cannot be placed in a specific order – impossible to judge ‘distance’ between categories
– Examples: Sex (Male/Female), Ethnicity (White/Black…), Party (Lab/Con/LD…)
• Ordinal (categorical)
– Description: response categories can be placed in rank order – distance between categories cannot be measured mathematically
– Examples: Likert (Agree/Neutral/Disagree), Rank Preference (Coke/Pepsi…), Education (GCSE/A-Level…)
• Interval (or continuous)*
– Description: responses measured on a continuous scale with rank order – uniform distance between responses allows mathematical measurement
– Examples: Age (in years), Income (in £)
*NOTE: Interval = no true zero point (e.g. temperature in °C), Ratio = true zero point (e.g. height, income)
Source: David & Sutton (2004)
Levels of Measurement II
• Level of measurement for certain variables is not pre-defined:
– AGE (in years e.g. 22, 34, 54)
– AGE (pre-set bands e.g. 18-30, 31-50)
– AGE (group membership e.g. mature student)
• There is a hierarchy of data – always try to collect the highest level possible to maximise usefulness!
– Are you bored? (Yes/No)
– On a scale of 1-10, how bored are you? [where 1=‘practically in tears of boredom’ and 10=‘riveted’]
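The hierarchy can be sketched in a few lines of Python: ages collected in years can always be collapsed into bands afterwards, but banded answers can never be turned back into exact ages. The function name `age_band` and the "under 18" and "51+" bands are assumptions added for illustration; only the 18-30 and 31-50 bands and the three example ages come from the slide.

```python
def age_band(age_in_years):
    """Collapse an exact age (higher level) into a pre-set band (lower level)."""
    if age_in_years < 18:
        return "under 18"   # hypothetical band, not on the slide
    if age_in_years <= 30:
        return "18-30"
    if age_in_years <= 50:
        return "31-50"
    return "51+"            # hypothetical band, not on the slide

ages = [22, 34, 54]                    # ages in years, as on the slide
bands = [age_band(a) for a in ages]    # one-way conversion: years -> bands
print(bands)
```

There is no function that goes the other way: a respondent recorded only as "31-50" could be any age in that band, which is why collecting the highest level first keeps the most options open.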
Frequencies & Fidelity I
• Not as interesting as it sounds – sorry!
• Frequency tables display the number of times that a value appears in your dataset (per variable across all cases)
• Running them should always be the first thing you do once your data is in electronic form
• They highlight data errors
• They indicate what analysis may be possible
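As a rough illustration of what a frequency table contains, the sketch below counts how often each value appears and derives the percent, valid percent and cumulative percent columns. The helper name `frequency_table` and the ten made-up `responses` are assumptions for illustration only; rows are ordered from most to least frequent, which is a choice made here rather than a rule.

```python
from collections import Counter

def frequency_table(values, missing=None):
    """Return (rows, n_missing); each row is
    (value, count, percent, valid percent, cumulative percent)."""
    counts = Counter(values)
    n_missing = counts.pop(missing, 0)   # cases with no recorded answer
    n_total = len(values)                # all cases, including missing
    n_valid = n_total - n_missing        # cases with an answer
    rows = []
    cumulative = 0
    for value, count in counts.most_common():
        cumulative += count
        rows.append((
            value,
            count,
            round(100 * count / n_total, 1),       # percent of all cases
            round(100 * count / n_valid, 1),       # percent of valid cases
            round(100 * cumulative / n_valid, 1),  # running total of valid percent
        ))
    return rows, n_missing

# Hypothetical party codes for ten respondents; None marks a missing answer
responses = ["Con", "Lab", "Con", "LD", "Con", "Lab", None, "Green", "Lab", "Con"]
rows, n_missing = frequency_table(responses)
for row in rows:
    print(row)
print("Missing:", n_missing)
```

An unexpected value in the first column (a code that is not in the codebook, for example) shows up immediately as its own row, which is how the table helps with error screening.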
Frequencies & Fidelity II
Parties coded
                        Frequency  Percent  Valid Percent  Cumulative Percent
Valid    -9                     1       .0             .0                  .0
         Conservative        1331     29.9           29.9                30.0
         Labour              1103     24.8           24.8                54.8
         Lib Dem             1044     23.5           23.5                78.2
         Green                368      8.3            8.3                86.5
         UKIP                 171      3.8            3.8                90.4
         BNP                   78      1.8            1.8                92.1
         Independent          216      4.9            4.9                97.0
         Others               135      3.0            3.0               100.0
         Total               4447    100.0          100.0
Missing  System                 1       .0
Total                        4448    100.0
What can we say about this table?
A simple frequency table can tell you quite a bit!
Error?
What we would expect?
Look at %s
More than UKIP
Really? Only 1?
What’s this?
Central Tendency I
You have all done quantitative research and you all use measures of central tendency in your normal lives – the average, middle and most common values
• Average (MEAN): maintenance grant allowance per week – divide total grant by number of weeks at uni
• Middle (MEDIAN): how long do you cook a chicken? – cookbook says 2 hours but internet says 3
• Most Common (MODE): what to watch on TV with housemates – decide based on the most popular choice
Central Tendency II
MODE: the value that occurs most frequently in the data
Date      High Temperature
2-Jan     59
3-Jan     60
4-Jan     43
5-Jan     42
6-Jan     35
7-Jan     32  <=== Mode
8-Jan     32  <=== Mode
9-Jan     46
10-Jan    41
11-Jan    52
MODE = 32
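The same result can be checked with Python's standard library. This is only a sketch: the list name `high_temps` is an assumption, and the values are the ten high temperatures from 2-Jan to 11-Jan.

```python
from statistics import multimode

# High temperatures for 2-Jan to 11-Jan, in date order
high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

# multimode returns every value tied for most frequent, as a list
modes = multimode(high_temps)
print(modes)
```

`multimode` is used rather than `mode` because it returns all tied values, so a dataset with two equally common values is not silently reduced to one.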
Central Tendency III
Main reason for going to gym
                        Frequency  Percent  Valid Percent  Cumulative Percent
Valid    Relaxation             9     10.0           10.0                10.0
         Fitness               31     34.4           34.4                44.4
         Lose weight           33     36.7           36.7                81.1
         Build strength        17     18.9           18.9               100.0
         Total                 90    100.0          100.0
What is the most frequent (MODAL) response?
The mode is useful for thinking about NOMINAL data
Central Tendency IV
[Bar chart: Main reason for going to gym. Categories: Relaxation, Fitness, Lose weight, Build strength. Vertical axis: Count, 0 to 40]
NOMINAL data can be displayed using a bar chart
Central Tendency V
MEDIAN: the middle value of the ordered sample data
Date      High Temperature
7-Jan     32
8-Jan     32
6-Jan     35
10-Jan    41
5-Jan     42  <=== Middle values
4-Jan     43  <=== Middle values
9-Jan     46
11-Jan    52
2-Jan     59
3-Jan     60
When the sample size is odd, the median is the middle value
When the sample size is even, the median is the midpoint (mean) of the two middle values
MEDIAN = 42.5
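The odd/even rule can be written out as a short Python sketch. The function name `median` and the list name `high_temps` are assumptions for illustration; the data are the ten high temperatures from the slide.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd sample size: a single middle value exists
        return ordered[middle]
    # Even sample size: midpoint of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]
print(median(high_temps))
```

With ten values the two middle values are 42 and 43, so the function returns their midpoint.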
Central Tendency VI
There is a general lack of public knowledge about local government
                             Frequency  Percent  Valid Percent  Cumulative Percent
Valid    Strongly Agree           1911     41.1           41.8                41.8
         Agree                    2281     49.1           49.9                91.6
         Neutral                   255      5.5            5.6                97.2
         Disagree                  111      2.4            2.4                99.6
         Strongly Disagree          17       .4             .4               100.0
         Total                    4575     98.5          100.0
Missing  System                     71      1.5
Total                             4646    100.0
The mode and median are useful for thinking about ORDINAL data
What is the most frequent (MODAL) response?
What is the middle (MEDIAN) response?
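Both questions can be answered directly from the frequency counts. The sketch below is one way to do it, assuming the categories are listed in rank order; the names `frequencies`, `modal_category` and `median_category` are made up for illustration. The median category is taken to be the first one whose running count reaches half of the valid cases, which is the same as reading down the cumulative percent column until it passes 50.

```python
# Valid responses to the local government question, in rank order
frequencies = [
    ("Strongly Agree", 1911),
    ("Agree", 2281),
    ("Neutral", 255),
    ("Disagree", 111),
    ("Strongly Disagree", 17),
]

def modal_category(freqs):
    """Category with the highest frequency."""
    return max(freqs, key=lambda pair: pair[1])[0]

def median_category(freqs):
    """First category whose cumulative count reaches half of all valid cases."""
    total = sum(count for _, count in freqs)
    running = 0
    for label, count in freqs:
        running += count
        if running * 2 >= total:
            return label

print(modal_category(frequencies))
print(median_category(frequencies))
```

A mean is deliberately not computed here: the distance between "Agree" and "Neutral" cannot be measured, so averaging the categories would not be meaningful.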
Central Tendency VII
ORDINAL data can also be displayed using a bar chart
Central Tendency VIII
MEAN: the sum of the values divided by the number of cases
Date      High Temperature
2-Jan     59
3-Jan     60
4-Jan     43
5-Jan     42
6-Jan     35
7-Jan     32
8-Jan     32
9-Jan     46
10-Jan    41
11-Jan    52
Sum       442
MEAN = 44.2
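The arithmetic can be reproduced in Python as a quick check. The variable names are assumptions for illustration; the values are the ten high temperatures from the slide.

```python
high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

total = sum(high_temps)            # sum of the values
n_cases = len(high_temps)          # number of cases
mean = total / n_cases             # sum divided by number of cases
print(total, n_cases, mean)
```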
Central Tendency IX
The mean, mode and median are useful for thinking about INTERVAL data
What is the average (MEAN) age?
What is the middle (MEDIAN) age?
What is the most common (MODAL) age?
Statistics
What was your age last birthday
N        Valid      4290
         Missing     158
Mean               54.74
Median             57.00
Mode                  62
Central Tendency X
INTERVAL data can be displayed using a histogram
Dispersion I
• Measures of central tendency are heuristics
• They can hide important details in the data
Dataset 1: 1 2 3 4 5 6 7 8 9 (MEAN = 5, MEDIAN = 5)
Dataset 2: 1 2 3 4 5 6 7 8 90 (MEAN = 14, MEDIAN = 5)
Need to consider RANGE and STANDARD DEVIATION
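A short Python check shows how a single extreme value moves the mean but leaves the median untouched. The variable names are assumptions for illustration; the two datasets are the ones on the slide.

```python
from statistics import mean, median

dataset_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
dataset_2 = [1, 2, 3, 4, 5, 6, 7, 8, 90]   # identical except for the last value

summary = {
    "Dataset 1": (mean(dataset_1), median(dataset_1)),
    "Dataset 2": (mean(dataset_2), median(dataset_2)),
}
for name, (m, med) in summary.items():
    print(name, "mean =", m, "median =", med)
```

Looking at the median alone, the two datasets appear identical; looking at the mean alone, nothing reveals that one value is responsible for the difference. This is why a measure of spread is needed alongside either one.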
Dispersion II
• RANGE measures the difference between the lowest and highest values
– Large range may reveal outliers (dataset 2!)
– Small range suggests tight grouping of data
• STANDARD DEVIATION (SD) summarises the typical distance (deviation) of the values from the mean
– Large SDs occur when data points are a long way from the mean (wide range of different values)
– Small SDs occur when data points are close to the mean (values do not differ very much)
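Both measures can be computed by hand, as in the sketch below. The slide does not give a formula, so this assumes the sample standard deviation, which divides the sum of squared deviations by n - 1; the function names and the two small datasets `tight` and `spread` are hypothetical and chosen only to share the same mean.

```python
import math

def value_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)

def sample_sd(values):
    """Sample standard deviation (assumes the n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

tight = [4, 5, 5, 6]    # hypothetical data close to the mean of 5
spread = [1, 2, 8, 9]   # hypothetical data far from the mean of 5

for name, data in (("tight", tight), ("spread", spread)):
    print(name, "range =", value_range(data), "SD =", round(sample_sd(data), 3))
```

Both datasets have a mean of 5, yet the second has the larger range and the larger SD because its values sit further from the mean.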
Dispersion III
• For example:
Age (Sample 1)    Age (Sample 2)
18                8
30                55
23                53
31                13
21                12
19                52
20                7
19                9
28                11
21                10

Descriptive Statistics (Sample 1)
                      N    Range   Minimum   Maximum      Mean   Std. Deviation
Age                  10    13.00     18.00     31.00   23.0000          4.85341
Valid N (listwise)   10

Descriptive Statistics (Sample 2)
                      N    Range   Minimum   Maximum      Mean   Std. Deviation
Age                  10    48.00      7.00     55.00   23.0000         21.01851
Valid N (listwise)   10
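The two descriptive statistics tables can be reproduced with Python's standard library. This is a sketch under two assumptions: the ages are read from the two columns of the slide (ten values each), and the standard deviation is the sample version with the n - 1 divisor, which is what `statistics.stdev` computes. The names `sample_1`, `sample_2` and `describe` are made up for illustration.

```python
from statistics import mean, stdev

sample_1 = [18, 30, 23, 31, 21, 19, 20, 19, 28, 21]
sample_2 = [8, 55, 53, 13, 12, 52, 7, 9, 11, 10]

def describe(ages):
    """Return the same summary columns as the descriptive statistics tables."""
    return {
        "n": len(ages),
        "range": max(ages) - min(ages),
        "minimum": min(ages),
        "maximum": max(ages),
        "mean": mean(ages),
        "sd": round(stdev(ages), 5),   # sample SD (n - 1 divisor)
    }

for name, ages in (("Sample 1", sample_1), ("Sample 2", sample_2)):
    print(name, describe(ages))
```

The two samples share a mean of 23, so the mean alone cannot tell them apart; the range and SD show that the ages in Sample 2 are far more spread out.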
Summary
• Levels of measurement determine how data can be analysed
• Vital to understand what your data represents and into which level of measurement it falls
• Frequency tables help us to screen data for errors
• Frequency tables also help us to identify the median and mode
• Central tendency is a heuristic, which is why it is so commonly used
• Dispersion plays a vital role in critically evaluating central tendency
• These modes of analysis are often referred to as DESCRIPTIVE STATISTICS or UNIVARIATE ANALYSIS (literally ‘one variable’!)
Lies, Damn Lies and Statistics?
90% of Sun readers want a cap on immigration
The average Yale graduate earns $30,000 within six months of graduating
The Green Party is not well supported as it received less than 5% of the national vote in the 2010 General Election
House prices drop by 10% in the UK
90% of students at Cardiff University are binge drinkers