Beyond 2011
The future for population statistics?
IMA Mathematics 2012
Pete Benton
Beyond 2011 Programme Director
Office for National Statistics
Outline
• Background to the Census
• The Beyond 2011 Programme
• Statistical options for the future
• Key mathematical challenges
• Timeframes
• Next steps
The purpose of the census
• The basis for national decision making:
  - Service planning
    • where to locate schools, hospitals, etc.
    • housing plans
    • transport
  - Resource allocation
    • health and local govt
    • £100bn each per year
  - Policy making and monitoring
    • Equality – age, sex, ethnicity, disability
    • Ageing population – pensions etc
  - Academic and social research
Key Census outputs
• Benchmark statistics on:
  - Population units:
    • people and housing
    • with key demographics (age, sex, ethnicity)
  - Population structures:
    • households, families
  - Population and housing attributes
• For small areas and small population groups
• With multivariate analysis
• Consistent and comparable
The 2011 Census
• Very successful
  - 94% response overall
  - Over 90% across London overall
  - Over 80% response in every Local Authority
• Significant improvement in key Local Authorities
• The result of extensive mathematical modelling
  - Response targets to achieve required output quality
  - Predicted initial response from key groups / areas
  - Numbers of field staff required to reach final targets
  - Daily live response rate modelling to support operational decisions
The Beyond 2011 Programme
• Why change? – Why look beyond 2011?
  - Rapidly changing society
  - Evolving user requirements
  - New opportunities – data sharing
  - Traditional census – costly and infrequent?
• UK Statistics Authority to Minister for Cabinet Office
“As a Board we have been concerned about the increasing costs and difficulties of traditional Census-taking. We have therefore already instructed the ONS to work urgently on the alternatives, with the intention that the 2011 Census will be the last of its kind.”
Beyond 2011: Statistical options
• Census options
  - Traditional Census (long form to everyone)
  - Rolling Census (over 5/10 year period)
  - Short Form (everyone), Long Form (sample)
  - Short Form + Annual Survey (US model)
• Survey option(s)
  - Address register + Survey
• Administrative data options
  - Aggregate analysis
  - (Intermediate) Sample linkage, e.g. 1% of postcodes
  - 100% linkage to create ‘statistical population spine’
Beyond 2011 – statistical options
[Diagram: flow from SOURCES through FRAME, DATA and ESTIMATION to OUTPUTS; all outputs national to small area]
• Sources: Census; admin sources; commercial sources?; socio-demographic survey(s); surveys to fill gaps (increasing later?)
• Frame: address register (household and communal). Maintained national address gazetteer – provides frame for population data & surveys
• Data: population data; socio-demographic attribute data
• Estimation:
  - Coverage assessment, incl. under & over-coverage - by survey and admin data?
  - Adjusting for missing data and error
  - Adjusting for non-response bias in survey (or sources)
  - Quality measurement
  - Population distribution provides weighting for attributes
• Outputs: population estimates; attribute estimates; household structure etc; longitudinal data; interactional analysis, e.g. TTWA
Potential data sources
• Population data
  - NHS Patient Register
  - DWP/HMRC Customer Information System
  - Electoral roll (> 17 yrs)
  - School Census (5-16 yrs)
  - Higher Education Statistics Agency data (Students)
  - Birth and Death registrations
• Socio-demographic sources
  - Surveys
  - DVLA?
  - Commercial sources?
  - Utilities?
  - TV licensing?
DWP CIS population counts compared with ONS Mid Year population estimates
Patient Register population counts compared with ONS Mid Year population estimates
Electoral Roll population counts compared with ONS Mid Year population estimates
Coverage of main administrative sources
[Diagram: overlap of each administrative source with the resident population. Sources shown: Customer Information System (CIS), Patient Register Data (PRD), Electoral Roll (ER), School Census (SC), Higher Education Students (HESA), UK Driving Licence (DVLA)]
• Missing includes:
  - Customer Information System: some migrant worker dependants; some international students; undocumented asylum seekers
  - Patient Register: migrants not (yet) registered; newborn babies; some private only patients
  - Electoral Roll: under 17s; ineligible voters; non responders
  - School Census: non school aged people; independent school children; home schooled children
  - Higher Education Students: non higher education students; independent university students
  - UK Driving Licence: non-drivers; under 17s; some foreign-licence holders
• Extras includes:
  - Higher Education Students: some duplicates; international students on short-term courses; students ceased studying, not formally deregistered
  - School Census: short-term migrant children
  - Remaining callouts (CIS, Patient Register, Electoral Roll, UK Driving Licence; one list per source):
    • some duplicates; some ex-pats; some deceased; short-term migrants
    • multiple registrations; some ex-pats; some deceased; short-term migrants
    • some ex-pats; some deceased; short-term migrants
    • some ex-pats; some deceased
Key risks of non census alternatives
• Public opinion
• Technical challenge
• Changes in administrative datasets
• UK harmonisation
• Getting a decision
Key mathematical challenges
• Methods for production of statistics
  - Coverage assessment and adjustment
  - Data matching
  - Correcting for missing data
  - Small area population attribute modelling
• Methods for protection of confidentiality
  - Data pre-processing and encryption
  - Statistical Disclosure Control
• Evaluation
  - Quantifying financial benefits
  - Defining what is an ‘acceptable’ level of quality
Coverage assessment
• How many fish in your pond?
  - Day 1, catch 100, tag them, put them back
  - Day 2, catch 50, find 25 already tagged
  - How many fish in your pond?
• Answer: 200 (ish)
  - According to day 2, half in the pond are marked
  - We marked 100, so there must be about 200 altogether
• “Dual System Estimation”
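The pond arithmetic can be written as a one-line estimator. This is a minimal sketch only; the function and argument names are illustrative and do not come from the slides.

```python
def dual_system_estimate(first_catch: int, second_catch: int, recaptured: int) -> float:
    """Estimate a population total from two independent 'catches'.

    first_catch:  number tagged on day 1
    second_catch: number caught on day 2
    recaptured:   number in the day 2 catch that already carried a tag
    """
    if recaptured <= 0:
        raise ValueError("at least one recaptured individual is needed")
    # The tagged share of the day 2 catch estimates the tagged share of the pond.
    return first_catch * second_catch / recaptured


# The pond example: 100 tagged, 50 caught on day 2, 25 of them tagged.
print(dual_system_estimate(first_catch=100, second_catch=50, recaptured=25))  # 200.0
```

The estimate relies on the two catches being independent and on tags not being lost, which is why the answer is "200 (ish)" rather than exactly 200.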
Application to the census
• We ‘fish’ twice, in 1% of postcodes
  - Census
  - Then census coverage survey (CCS) 6 weeks later
• No need for tags
  - They have names, addresses, dates of birth
  - We match the two separate lists of people (500k) to work out
    • What percentage of people in the CCS had first been ‘caught’ in the census
    • Thus, the total population in each postcode
Coverage adjustment
• Apply the adjustment factor to the other 99% of postcodes where we did no CCS
  - With appropriate stratification
• Add ‘synthetic’ records
  - Extra households
  - Extra people
  - With the right key characteristics
  - In roughly the right locations
  - Using ‘Donor imputation’ to complete each record
  - So that all the final tables add up to the right number
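The first step, applying a stratum-level adjustment factor outside the sampled postcodes, might be sketched as follows. This is a simplified illustration and not the ONS production method: the stratum names and all counts are invented, and the factor is taken to be the dual system estimate divided by the census count within the sampled postcodes. The donor imputation step is not covered.

```python
from typing import Dict


def adjustment_factors(census_sampled: Dict[str, int],
                       ccs_sampled: Dict[str, int],
                       matched: Dict[str, int]) -> Dict[str, float]:
    """Per-stratum factor = dual system estimate / census count, in sampled postcodes."""
    factors = {}
    for stratum, census_count in census_sampled.items():
        dse_total = census_count * ccs_sampled[stratum] / matched[stratum]
        factors[stratum] = dse_total / census_count
    return factors


def adjusted_totals(census_elsewhere: Dict[str, int],
                    factors: Dict[str, float]) -> Dict[str, float]:
    """Scale census counts from the unsampled postcodes by their stratum factor."""
    return {stratum: count * factors[stratum]
            for stratum, count in census_elsewhere.items()}


# Hypothetical counts for two strata within the sampled postcodes.
census_sampled = {"stratum_a": 900, "stratum_b": 980}
ccs_sampled = {"stratum_a": 800, "stratum_b": 900}
matched = {"stratum_a": 720, "stratum_b": 882}

factors = adjustment_factors(census_sampled, ccs_sampled, matched)
estimates = adjusted_totals({"stratum_a": 90_000, "stratum_b": 98_000}, factors)
for stratum, total in estimates.items():
    print(stratum, round(factors[stratum], 4), round(total))
```

Stratifying matters because the two made-up strata have different census response (90% and 98%), so a single pooled factor would under-adjust one and over-adjust the other.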
Dual system estimation - formulae
                          Counted by CCS?
                          Yes     No      TOTAL
Counted by      Yes       n11     n10     n1+
Census?         No        n01     n00     n0+
                TOTAL     n+1     n+0     n++

Total population: $n_{++} = \dfrac{n_{1+} \times n_{+1}}{n_{11}}$
• We can make life very complicated for people who aren’t mathematicians!
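The total population formula can be derived in two lines, assuming that being counted by the census and being counted by the CCS are independent. The slides do not state this assumption, but it is the standard one for dual system estimation. The symbols are those of the table above; $n_{00}$ is the cell nobody observes.

```latex
% Independence: the census coverage rate among CCS respondents
% equals the census coverage rate in the whole population.
\[
  \frac{n_{11}}{n_{+1}} \approx \frac{n_{1+}}{n_{++}}
  \quad\Longrightarrow\quad
  n_{++} \approx \frac{n_{1+}\, n_{+1}}{n_{11}},
  \qquad
  n_{00} \approx \frac{n_{10}\, n_{01}}{n_{11}}
\]
% Pond example: n_{1+} = 100, n_{+1} = 50, n_{11} = 25,
% so n_{10} = 75, n_{01} = 25, n_{00} = 75 and n_{++} = 200.
```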
Application to administrative data
• Administrative data sources also have undercount
• But the bigger problems are due to time lags
  - Emigration; deaths
    • Results in overcount in administrative sources
  - Internal migration
    • Results in people recorded in the wrong location - overcount in one area, undercount in another
• Just applying Dual System Estimation would result in significant over-estimation
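A small numerical illustration of why overcount inflates the estimate, reusing the pond numbers. All figures are invented: an administrative list holds 100 genuine residents plus 20 stale records (people who have emigrated or died), and a coverage survey finds 50 residents, 25 of whom match the list. Stale records can never be matched, so they enlarge the numerator without enlarging the match count.

```python
def dse(list_count: int, survey_count: int, matched: int) -> float:
    """Plain dual system estimate from two lists and their matched count."""
    return list_count * survey_count / matched


genuine_on_list = 100   # residents correctly on the admin list
stale_on_list = 20      # emigrated or deceased, still on the list
survey_count = 50       # residents found by the coverage survey
matched = 25            # survey respondents also found on the admin list

without_overcount = dse(genuine_on_list, survey_count, matched)
with_overcount = dse(genuine_on_list + stale_on_list, survey_count, matched)

print(without_overcount)  # 200.0
print(with_overcount)     # 240.0
```

In this made-up case a 20% overcount on the list carries straight through to a 20% over-estimate of the population.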
Potential overcount estimation approaches (1)
• Redesigned coverage survey asking:
  - who usually lives here?
  - when did you move in?
  - where are you registered to vote?
  - where are you registered with a GP?
  - who lived here before you?
  - where do they live now?
  - does John Smith still live here?
• Increasing sensitivity
• Reducing appropriateness / legality
Potential overcount estimation approaches (2)
• Match new coverage survey to admin data
• Measure coverage patterns, develop models
• Intermediate model
  - Match records only in CS postcodes
• Full linkage model
  - Match records in all sources across all postcodes
  - Keep records if same location on all datasets => more likely to be correct
    • Particularly if recently recorded ‘activity’
  - Develop intelligent rules to resolve residual records
  - Reduces scale of overcount - but increases undercount
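The "keep records if same location on all datasets" rule could be sketched as below. This is an illustration under strong assumptions: records are taken to be already linked by a shared person identifier (in practice that linkage is the hard data matching problem), and the source names, identifiers and postcodes are invented.

```python
from collections import defaultdict
from typing import Dict, Tuple

Source = Dict[str, str]  # person identifier -> recorded postcode


def split_by_agreement(sources: Dict[str, Source]
                       ) -> Tuple[Dict[str, str], Dict[str, Dict[str, str]]]:
    """Keep people seen at one location on every source; return the rest as residual."""
    seen = defaultdict(dict)
    for source_name, records in sources.items():
        for person_id, postcode in records.items():
            seen[person_id][source_name] = postcode

    kept, residual = {}, {}
    for person_id, locations in seen.items():
        on_every_source = len(locations) == len(sources)
        single_location = len(set(locations.values())) == 1
        if on_every_source and single_location:
            kept[person_id] = next(iter(locations.values()))
        else:
            residual[person_id] = dict(locations)  # left for further rules
    return kept, residual


sources = {
    "patient_register": {"p1": "AB1", "p2": "CD2", "p3": "EF3"},
    "cis":              {"p1": "AB1", "p2": "XY9"},
    "electoral_roll":   {"p1": "AB1", "p2": "CD2", "p3": "EF3"},
}
kept, residual = split_by_agreement(sources)
print(kept)
print(sorted(residual))
```

Person p3 shows the trade-off named on the slide: they may well be a genuine resident, but because one source lacks them the strict rule sets them aside, so overcount falls while undercount rises.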
Small Area Estimation
• Surveys only give sufficient precision at relatively high levels of geography
• Users require information at lower levels
  - Census ‘output area’ ~ 125 households / 300 people
• SAE - family of methods to increase precision of survey estimates at lower geographies by “borrowing strength” from other, more detailed data sources, or neighbouring areas
• Widely used by National Statistical Institutes
  - e.g. unemployment, income, households in poverty
  - but generally univariate, estimating means
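One common way to "borrow strength" is a composite estimator, which shrinks a noisy direct survey estimate towards a model-based (synthetic) estimate built from auxiliary data. The sketch below shows that general idea only; it is not a method attributed to the slides, and the variances are assumed to be known, whereas in practice they must themselves be estimated.

```python
def composite_estimate(direct: float, direct_variance: float,
                       synthetic: float, model_variance: float) -> float:
    """Weighted average of a direct and a synthetic small area estimate.

    The weight on the direct estimate grows as the survey estimate becomes
    more precise relative to the between-area model variance.
    """
    total = model_variance + direct_variance
    if total <= 0:
        raise ValueError("variances must be non-negative and not both zero")
    weight = model_variance / total
    return weight * direct + (1 - weight) * synthetic


# Hypothetical small area: direct estimate 12% with standard error 3 points,
# synthetic estimate 8%, model variance 0.0001.
estimate = composite_estimate(direct=0.12, direct_variance=0.0009,
                              synthetic=0.08, model_variance=0.0001)
print(round(estimate, 4))  # 0.084
```

With these invented numbers the direct estimate receives a weight of only 0.1, so the result sits close to the synthetic value; a larger sample in the area would pull it back towards the direct estimate.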
Precision of direct survey outputs
Coefficients of variation (CVs) by prevalence; sample size = 1,000,000 people

                                  Prevalence
Area           Population size    0.5%     1%       5%       10%      15%      20%      50%
National       50,000,000         1.4%     1.0%     0.4%     0.3%     0.2%     0.2%     0.1%
Region         5,500,000          4.3%     3.0%     1.3%     0.9%     0.7%     0.6%     0.3%
LA             150,000            25.8%    18.2%    8.0%     5.5%     4.3%     3.7%     1.8%
LA (small)     50,000             44.6%    31.5%    13.8%    9.5%     7.5%     6.3%     3.2%
MSOA (avg)     7,200              117.6%   82.9%    36.3%    25.0%    19.8%    16.7%    8.3%
MSOA (min)     5,000              141.1%   99.5%    43.6%    30.0%    23.8%    20.0%    10.0%
LSOA (avg)     1,600              249.4%   175.9%   77.1%    53.0%    42.1%    35.4%    17.7%
LSOA (min)     1,000              315.4%   222.5%   97.5%    67.1%    53.2%    44.7%    22.4%
OA             300                575.9%   406.2%   178.0%   122.5%   97.2%    81.6%    40.8%
Ward (Eng)     7,000              119.2%   84.1%    36.8%    25.4%    20.1%    16.9%    8.5%
Ward (Wales)   3,500              168.6%   118.9%   52.1%    35.9%    28.5%    23.9%    12.0%
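The slides do not give the formula behind these CVs. The sketch below assumes a simple random sample spread in proportion to population (1,000,000 out of 50,000,000, i.e. a 2% sampling fraction), a binomial variance for the estimated proportion, and no finite population correction. Under those assumptions it reproduces the cells checked here, but it should be read as an inference about the table, not as the documented method.

```python
import math

# Assumed: the 1,000,000 sample is spread evenly over a population of 50,000,000.
SAMPLING_FRACTION = 1_000_000 / 50_000_000


def direct_cv(population_size: int, prevalence: float,
              sampling_fraction: float = SAMPLING_FRACTION) -> float:
    """Coefficient of variation of a directly estimated proportion.

    CV = standard error / estimate = sqrt((1 - p) / (n * p)),
    where n is the expected sample size falling in the area.
    """
    expected_sample = population_size * sampling_fraction
    return math.sqrt((1 - prevalence) / (expected_sample * prevalence))


for area, size in [("National", 50_000_000), ("MSOA (min)", 5_000), ("OA", 300)]:
    cvs = [f"{100 * direct_cv(size, p):.1f}%" for p in (0.005, 0.2, 0.5)]
    print(area, cvs)
```

An output area of 300 people contributes about six sampled people, which is why a characteristic held by 0.5% of the population has a CV of several hundred percent at that level.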
Potential components
• (Very?) Large survey
• Administrative sources
  - aggregate (area based) or unit record
  - available for lower geographic levels than survey outputs
• Possible models
  - Generalised Linear Models (GLM):
    • multi-level models
    • spatial / temporal extensions can add power
    • Bayesian or frequentist estimation frameworks
  - Micro-simulation
Small area modelling - issues
• Quality of ancillary data is absolutely critical
• Most existing applications use census covariates
• More powerful models incorporate time and space effects, but are more complex
• Every variable is different, and requires different models
• There’s often no substitute for geography as a predictor
  - ‘similar people gather in similar areas’
• BUT clear academic view – the methods exist, it just depends on data
Beyond 2011 - Timeline - the key decision
[Timeline diagram, 2011 to 2023]
• Beyond 2011 ‘Phase 1’ (2011 to 2014): initiation; research / definition
• Sept 2014: recommendation & decision point
• Admin data solution (2015 to 2023): detailed design; procure / develop; develop / test; outputs (population estimates, population characteristics)
• Traditional census solution (2015 to 2023): detailed design; procure / develop; develop / test; rehearse; run; outputs
Beyond 2011 - Timeline (non census solution)
[Timeline diagram, 2011 to 2024]
• Phases: initiation; research / definition; detailed design; procure / develop; develop / test; outputs (population estimates, population characteristics)
• Address register: required on an ongoing basis – ideally the National Address Gazetteer – subject to confirmation of quality
• Admin sources: public sector & commercial? Developing over time
• Coverage surveys: testing, then continuous assessment
• Attribute surveys: info from existing surveys – e.g. labour force survey, integrated household survey etc; supplemented by new targeted surveys as required
• Modelling: increasing modelling over time
• Linkage: test, then increasing linkage over time
Beyond 2011 - and into the future
[Timeline diagram, 2024 to 2038]
• Regular production of population and attribute estimates; ongoing methodology refinement
• Address register required on an ongoing basis
• Administrative sources will change and disappear and be added & develop over time
• Continuous coverage survey
• Existing surveys; need for attribute surveys declines over time?
• Increasing linkage over time
• Increasing modelling over time
Improving quality & quantity
[Chart: quality and quantity of outputs improving over time; time axis marked 2013, 2021, 2031]
• accuracy of population estimates
• accuracy of characteristics estimates
• range of topics
• small area detail
• multivariate small area detail
• experimental statistics develop to become national statistics
Statistical benefit profile
[Chart: benefit over time, 2011 to 2041, comparing Census with Alternative method; regions between the two curves marked ‘loss’ and ‘gain’]
Cost profile (real terms)
[Chart: cost over time, 2011 to 2041, comparing Census with Alternative method; the alternative method's cost is marked ‘???’]
Next steps
• Research potential methods and models
• Using census data
  - To understand coverage patterns in admin data
  - To simulate new survey designs
  - As a gold standard – how well can we replicate census results?
• Assess quality, costs, benefits, risks
• Discuss with stakeholders (!)
• Public acceptability research
• Report progress every six months
• Make recommendations in 2014
Advice and assistance very welcome!