Post on 25-Jun-2020
transcript
Approach to using alternative data sources to support the 2021 Census in England and Wales
Cal Ghee, Office for National Statistics, 26 September 2018
Content

• Introduction
• What are ‘alternative data’?
• How we could use them
  o Prepare and collect
  o Process and analyse
  o Outputs
• Vision
• Aims
• Criteria for inclusion in design
• Next steps
Introduction: the 2021 Census for England and Wales
National Statistician’s 2014 recommendation for the future provision of population statistics and the next census included:
Increased use of administrative data and surveys in order to enhance the statistics from the 2021 Census ...
… make the best use of all available data
Introduction: meeting specific objectives of the census
1. to produce census statistics of the right quality and timeliness to meet user needs
2. to produce integrated outputs from census, administrative and survey data
What are ‘alternative data’?

Administrative data
• collected primarily for administrative reasons
• statistical use usually secondary

Survey data
• gathered from statistical surveys, including earlier censuses

Big data
• large, often unstructured
• potentially available in real time
• difficult to process efficiently using traditional methods and technologies
• many formats, including audio, video, computer logs, purchase transactions, sensors, social networking sites
• freely available on the web or held by the private sector

Paradata
• data that describe the process by which the data were collected, eg:
  o the times of day responses were submitted
  o time taken to complete the questionnaire
  o number of attempts to complete the questionnaire
  o mode of communication/response
  o how many times field officers called, the day of week and time of day, how many times they made contact and whether a response was subsequently received
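The list above can be pictured as one paradata record per questionnaire return. A minimal sketch in Python, assuming illustrative field names (this is not an actual ONS schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, time

# Hypothetical paradata record for one questionnaire return.
# Field names are illustrative, not an actual ONS data model.
@dataclass
class ParadataRecord:
    submitted_at: datetime          # time of day the response was submitted
    completion_minutes: float      # time taken to complete the questionnaire
    attempts: int                  # number of attempts to complete it
    mode: str                      # mode of communication/response
    # one tuple per field-officer call: (day of week, time of day, contact made?)
    field_visits: list = field(default_factory=list)
    response_received: bool = False  # whether a response was subsequently received

record = ParadataRecord(
    submitted_at=datetime(2021, 3, 22, 19, 40),
    completion_minutes=18.5,
    attempts=2,
    mode="online",
    field_visits=[("Tue", time(18, 0), False), ("Sat", time(10, 30), True)],
    response_received=True,
)
print(record.mode, record.attempts, len(record.field_visits))
```

Structuring paradata this way makes the later uses (eg ‘dummy’ forms in record imputation, or response-chasing) straightforward queries over records.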
How could we use them? 2021 Census Operation

Prepare and collect
Hard-to-count index; field workload allocation***
Collection Target Populations; predicted response profiles***
Helping to create address frame in advance***
Validation to reduce field visits*

Process and analyse
Coding: adding to indexes/classifications***
Coverage bias adjustments**
QA detailed matching in areas of greatest uncertainty; CQS triangulation**
Cleaning and editing*
Edit & Imputation of single year of age*
Placeholders in record imputation*
Adjust for collected data in communal establishments**
Maintenance of Output Areas**
QA collected data, population estimates and characteristics***

Outputs
Replace previously collected variables or create new variables**
Extended output categories, eg qualifications***
Journalistic topic analysis***

Key: aggregated data; record-level data
***Confirmed use
**Likely use (have demonstrated)
*Possible use (research still to do)
Prepare and collect
Hard to count index; field workload allocation***
Collection Target Populations; predicted response profiles***
Helping to create address frame in advance***
Validation to reduce field visits*
Example: validating field outcomes and making field visits more efficient

An extract of the address list is taken before Census Day and initial contact letters are sent. Each address then has one of three outcomes: a response; a non-response (follow up); or the letter is undelivered as addressed (returned to sender).

Undelivered letters trigger a desk check against alternative sources, followed by a field check where still unresolved:
o Address doesn’t exist / non-residential - remove from list
o Address is current / has signs of residence - re-send letter, and field visit to encourage response
Example: validating field outcomes and making field visits more efficient

For non-responses, a desk check against alternative sources can also replace some field visits:
o Desk check confirms the address doesn’t exist / is non-residential - remove from further visits
o Address is current / has signs of residence - maintain field visits
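The desk-check decision in the two flows above can be sketched as a small function. This is a minimal illustration assuming three hypothetical admin-data signals per address; it is not an actual ONS rule set:

```python
# Sketch of the desk-check triage for non-responding / undelivered addresses.
# The three boolean inputs stand in for signals derived from alternative
# (admin) data sources; real checks would be richer and probabilistic.
def desk_check(address_exists: bool, residential: bool,
               signs_of_residence: bool) -> str:
    """Return the follow-up action for one address."""
    if not address_exists or not residential:
        # Confirmed non-existent or non-residential: drop it from the
        # workload, saving a field visit.
        return "remove from list"
    if signs_of_residence:
        # Current address with signs of residence: keep chasing a response.
        return "re-send letter and field visit"
    # Inconclusive desk check: only a field check can resolve it.
    return "field check"

for signals in [(False, False, False), (True, True, True), (True, True, False)]:
    print(signals, "->", desk_check(*signals))
```

The value of the approach is in the first branch: every address the desk check can confidently remove is a field visit that never has to happen.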
Example: hard-to-count index, target groups and live response-chasing

Hard-to-Count index and response profiles - predict the relative likelihood of self-response for each LSOA, and predict responses over time for groups of LSOAs sharing similar characteristics
Field Operations Simulation - models the field staff hours, the number of paper questionnaires and reminders needed, and the impact of interventions
Response Chasing Algorithm (RCA) tool - from Census Day, identifies gaps between predicted and actual returns and suggests interventions

Worked example, LSOA ‘X’: HtC willingness group 5 (low self-response), digital group 2 (high digital take-up), 62% self-response.
Response profile: this is the response profile expected for this type of LSOA, characterised by HtC group and age profile.
Simulation: 248 field staff hours are required to reach target response rates.
Live response-chasing: our actual returns are falling short of our predictions; to meet our targets, we need to move additional field staff to this area.
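The core of the response-chasing step - comparing predicted against actual returns per LSOA group and flagging where to move field staff - can be sketched as below. The group labels, rates and tolerance are invented for illustration; this is not the actual RCA tool:

```python
# Illustrative sketch of the response-chasing idea: flag LSOA groups whose
# actual cumulative response rate falls short of the predicted profile by
# more than a tolerance, suggesting a field-staff intervention there.
def flag_interventions(predicted, actual, tolerance=0.05):
    """predicted/actual: dicts mapping LSOA group -> cumulative response rate."""
    flags = {}
    for group, expected_rate in predicted.items():
        shortfall = expected_rate - actual.get(group, 0.0)
        if shortfall > tolerance:
            # This group is under-performing its profile: candidate for
            # additional field staff / reminders.
            flags[group] = round(shortfall, 3)
    return flags

# Hypothetical groups, labelled by HtC willingness and digital take-up:
predicted = {"HtC5-digital2": 0.62, "HtC1-digital1": 0.85}
actual = {"HtC5-digital2": 0.54, "HtC1-digital1": 0.84}
print(flag_interventions(predicted, actual))  # only the under-performing group
```

A real implementation would work with time series of returns rather than a single snapshot, but the gap-then-intervene logic is the same.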
Process and analyse
Coding: Adding to indexes/classifications***
Coverage bias adjustments**
QA detailed matching in areas of greatest uncertainty; CQS triangulation**
Cleaning and editing*
Edit & Imputation of single year of age*
Placeholders in record imputation*
Adjust for collected data in communal establishments**
Maintenance of Output Areas**
QA collected data, population estimates and characteristics***
Example: placeholders in record imputation

Variables from admin records can help populate imputed non-response records alongside the collected responses:

• Direct use of admin data - pull data across from admin data linked to a non-responding address
• Indirect uses:
  o ‘dummy’ forms (operational paradata - eg type of accommodation, likely number of residents)
  o intelligence from the admin data (eg 4 people usually live here)
• Modelling - eg postcode-level data on where HE students are most likely to live: model imputed records to fit the proportion of residents likely to be students

How?
1. Estimate basic information from the coverage survey (dual-system estimation) and create skeleton records to populate
2. Use a donor system to populate the rest of the record
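Step 1 rests on dual-system estimation. A minimal sketch of the underlying capture-recapture arithmetic (the simple Lincoln-Petersen form; census DSE in practice is stratified and considerably more involved, and the numbers here are invented):

```python
# Dual-system (capture-recapture) estimate of the population:
#   n1 = count from the census,
#   n2 = count from the independent coverage survey,
#   m  = records matched in both.
# Estimated total = n1 * n2 / m (Lincoln-Petersen estimator).
def dual_system_estimate(n1: int, n2: int, m: int) -> float:
    if m == 0:
        raise ValueError("no matched records; estimate undefined")
    return n1 * n2 / m

# eg 900 census responses in an area, 100 coverage-survey records, 90 matched:
est = dual_system_estimate(900, 100, 90)
# The gap between the estimate and the census count is the number of
# skeleton records to create and then populate via the donor system (step 2).
skeletons_needed = round(est) - 900
print(est, skeletons_needed)
```

The skeleton records start with only the basic information the estimate supports; the donor system then fills in the remaining variables from similar responding households.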
Outputs

Replace previously collected variables or create new variables**
Extended output categories, eg qualifications***
Journalistic topic analysis***
Example: replacing the ‘number of rooms’ question with admin data

User need still exists, but historically the quality of the collected data has been poor.
Link data to address list
• Valuation Office Agency data
• Linking by UPRN in advance of census
• Number of rooms
• Potential for more: property type, size of property
Process
• Clean data
• Edit rules*
• Impute missing values
Outputs
• Additional data available for outputs
• Reduced respondent burden
• Improved quality
* Census collects number of bedrooms: need to ensure consistency across variables
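The link-then-impute pipeline above can be sketched in a few lines. The data, field names and the modal-value imputation rule are illustrative only; the actual processing rules and VOA extract layout will differ:

```python
# Sketch: link VOA-style admin data to the address list by UPRN, then
# impute any missing number-of-rooms values. Data are invented.
address_list = [{"uprn": 1}, {"uprn": 2}, {"uprn": 3}]
voa = {  # keyed by UPRN, linked in advance of the census
    1: {"rooms": 5, "property_type": "terraced"},
    3: {"rooms": None, "property_type": "flat"},
}

# 1. Link: pull admin variables across onto each address.
linked = []
for addr in address_list:
    rec = dict(addr)
    rec.update(voa.get(addr["uprn"], {"rooms": None, "property_type": None}))
    linked.append(rec)

# 2. Impute: fill missing rooms with the mode of observed values
#    (a placeholder rule; real imputation would be donor-based).
observed = [r["rooms"] for r in linked if r["rooms"] is not None]
fallback = max(set(observed), key=observed.count) if observed else None
for r in linked:
    if r["rooms"] is None:
        r["rooms"] = fallback

print([r["rooms"] for r in linked])
```

Edit rules would then run across the linked record, eg checking that the admin-sourced number of rooms is consistent with the census-collected number of bedrooms.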
VISION AND AIMS

Vision: integrate alternative data to provide the best-quality 2021 Census and leave a positive legacy.

Aims - improve and enhance:
• Collection - value for money and optimise response rates
• Processing - improve quality and trust through assurance
• Outputs - improve timeliness, meet more user needs
• Continue to add value post-census

DESIGN PRINCIPLES

• Believe the collected census data
• Add value to the census
• Assure users of its quality
• Weigh up gains v costs and knock-on effects
• Think: Quality - Value - Trust
• Just because we can, doesn’t mean we should
INCLUSION CRITERIA

Balance - improved quality, trust and value balanced against risk to accuracy, timeliness and interpretability

Detail - improved quality for a population sub-set or for core variables - quality v effort

Quality
• Relevance - still meeting user needs? Includes granularity/low-level geography.
• Accuracy - improving? To what extent are we adding more uncertainty? Issues cluster in the same areas and the same groups of people - are we risking the whole without actually improving sufficiently where needed?
• Timeliness - does adding more processes risk our timetable?
• Accessibility - does increased use of data available in the public domain risk what we can publish?
• Interpretability - can we explain the use clearly, especially regarding the quality of the sources used? Circularity of use.
• Coherence - does the alternative data cover the same definitions? UK harmonisation, coherence over time.

Value - from integrating data appropriately. Use of, and input into, corporate transformation programmes; improving the efficiency of collection and processing

Trust - ethical considerations, reduce respondent burden, build user assurance
Potential uses – current thinking
Validation to reduce field visits*
Cleaning and editing*
Edit & Imputation of single year of age*
Placeholders in record imputation*
Next steps

• Still in the research phase
• Large-scale rehearsal in 2019/20
• Gather and use as many alternative data sources as possible for the rehearsal
• Finalise the design for the 2021 Census

Dependencies
• Getting the data
• Quality of the available data
• Development of methods
Thank you