Post on 25-Jun-2020
transcript
Approach to using alternative data sources to support the 2021 Census in England and Wales
Cal Ghee, Office for National Statistics, 26 September 2018
Content

• Introduction
• What are ‘alternative data’?
• How we could use them
  o Prepare and collect
  o Process and analyse
  o Outputs
• Vision
• Aims
• Criteria for inclusion in design
• Next steps
Introduction: the 2021 Census for England and Wales
National Statistician’s 2014 recommendation for the future provision of population statistics and the next census included:
Increased use of administrative data and surveys in order to enhance the statistics from the 2021 Census ...
… make the best use of all available data
Introduction: meeting specific objectives of the census
1. to produce census statistics of the right quality and timeliness to meet user needs
2. to produce integrated outputs from census, administrative and survey data
What are ‘alternative data’?

Administrative data
• collected primarily for administrative reasons
• statistical use usually secondary

Survey data
• gathered from statistical surveys, including earlier censuses

Big data
• large, often unstructured
• potentially available in real time
• difficult to process efficiently using traditional methods and technologies
• many formats, including audio, video, computer logs, purchase transactions, sensors, social networking sites
• freely available on the web or held by the private sector

Paradata
• data that describe the process by which the data were collected, eg:
  o the times of day responses were submitted
  o time taken to complete the questionnaire
  o number of attempts to complete the questionnaire
  o mode of communication/response
  o how many times field officers called, the day of week and time of day, how many times they made contact and whether a response was subsequently received
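The list above can be pictured as one paradata record per questionnaire return. A minimal sketch in Python, assuming illustrative field names (this is not an actual ONS schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, time

# Hypothetical paradata record for one questionnaire return.
# Field names are illustrative, not an actual ONS data model.
@dataclass
class ParadataRecord:
    submitted_at: datetime          # time of day the response was submitted
    completion_minutes: float      # time taken to complete the questionnaire
    attempts: int                  # number of attempts to complete it
    mode: str                      # mode of communication/response
    # one tuple per field-officer call: (day of week, time of day, contact made?)
    field_visits: list = field(default_factory=list)
    response_received: bool = False  # whether a response was subsequently received

record = ParadataRecord(
    submitted_at=datetime(2021, 3, 22, 19, 40),
    completion_minutes=18.5,
    attempts=2,
    mode="online",
    field_visits=[("Tue", time(18, 0), False), ("Sat", time(10, 30), True)],
    response_received=True,
)
print(record.mode, record.attempts, len(record.field_visits))
```

Structuring paradata this way makes the later uses (eg ‘dummy’ forms in record imputation, or response-chasing) straightforward queries over records.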
How could we use them? 2021 Census Operation

Prepare and collect
Hard-to-count index; field workload allocation***
Collection Target Populations; predicted response profiles***
Helping to create address frame in advance***
Validation to reduce field visits*

Process and analyse
Coding: adding to indexes/classifications***
Coverage bias adjustments**
QA detailed matching in areas of greatest uncertainty; CQS triangulation**
Cleaning and editing*
Edit & Imputation of single year of age*
Placeholders in record imputation*
Adjust for collected data in communal establishments**
Maintenance of Output Areas**
QA collected data, population estimates and characteristics***

Outputs
Replace previously collected variables or create new variables**
Extended output categories, eg qualifications***
Journalistic topic analysis***

Key: aggregated data; record-level data
***Confirmed use
**Likely use (have demonstrated)
*Possible use (research still to do)
Prepare and collect
Hard to count index; field workload allocation***
Collection Target Populations; predicted response profiles***
Helping to create address frame in advance***
Validation to reduce field visits*
Example: validating field outcomes and making field visits more efficient

An extract of the address list is taken before Census Day and initial contact letters are sent. Each address then has one of three outcomes: a response; a non-response (follow up); or the letter is undelivered as addressed (returned to sender).

Undelivered letters trigger a desk check against alternative sources, followed by a field check where still unresolved:
o Address doesn’t exist / non-residential - remove from list
o Address is current / has signs of residence - re-send letter, and field visit to encourage response
Example: validating field outcomes and making field visits more efficient

For non-responses, a desk check against alternative sources can also replace some field visits:
o Desk check confirms the address doesn’t exist / is non-residential - remove from further visits
o Address is current / has signs of residence - maintain field visits
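The desk-check decision in the two flows above can be sketched as a small function. This is a minimal illustration assuming three hypothetical admin-data signals per address; it is not an actual ONS rule set:

```python
# Sketch of the desk-check triage for non-responding / undelivered addresses.
# The three boolean inputs stand in for signals derived from alternative
# (admin) data sources; real checks would be richer and probabilistic.
def desk_check(address_exists: bool, residential: bool,
               signs_of_residence: bool) -> str:
    """Return the follow-up action for one address."""
    if not address_exists or not residential:
        # Confirmed non-existent or non-residential: drop it from the
        # workload, saving a field visit.
        return "remove from list"
    if signs_of_residence:
        # Current address with signs of residence: keep chasing a response.
        return "re-send letter and field visit"
    # Inconclusive desk check: only a field check can resolve it.
    return "field check"

for signals in [(False, False, False), (True, True, True), (True, True, False)]:
    print(signals, "->", desk_check(*signals))
```

The value of the approach is in the first branch: every address the desk check can confidently remove is a field visit that never has to happen.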
Example: hard-to-count index, target groups and live response-chasing

Hard-to-Count index and response profiles - predict the relative likelihood of self-response for each LSOA, and predict responses over time for groups of LSOAs sharing similar characteristics
Field Operations Simulation - models the field staff hours, the number of paper questionnaires and reminders needed, and the impact of interventions
Response Chasing Algorithm (RCA) tool - from Census Day, identifies gaps between predicted and actual returns and suggests interventions

Worked example, LSOA ‘X’: HtC willingness group 5 (low self-response), digital group 2 (high digital take-up), 62% self-response.
Response profile: this is the response profile expected for this type of LSOA, characterised by HtC group and age profile.
Simulation: 248 field staff hours are required to reach target response rates.
Live response-chasing: our actual returns are falling short of our predictions; to meet our targets, we need to move additional field staff to this area.
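The core of the response-chasing step - comparing predicted against actual returns per LSOA group and flagging where to move field staff - can be sketched as below. The group labels, rates and tolerance are invented for illustration; this is not the actual RCA tool:

```python
# Illustrative sketch of the response-chasing idea: flag LSOA groups whose
# actual cumulative response rate falls short of the predicted profile by
# more than a tolerance, suggesting a field-staff intervention there.
def flag_interventions(predicted, actual, tolerance=0.05):
    """predicted/actual: dicts mapping LSOA group -> cumulative response rate."""
    flags = {}
    for group, expected_rate in predicted.items():
        shortfall = expected_rate - actual.get(group, 0.0)
        if shortfall > tolerance:
            # This group is under-performing its profile: candidate for
            # additional field staff / reminders.
            flags[group] = round(shortfall, 3)
    return flags

# Hypothetical groups, labelled by HtC willingness and digital take-up:
predicted = {"HtC5-digital2": 0.62, "HtC1-digital1": 0.85}
actual = {"HtC5-digital2": 0.54, "HtC1-digital1": 0.84}
print(flag_interventions(predicted, actual))  # only the under-performing group
```

A real implementation would work with time series of returns rather than a single snapshot, but the gap-then-intervene logic is the same.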
Process and analyse
Coding: Adding to indexes/classifications***
Coverage bias adjustments**
QA detailed matching in areas of greatest uncertainty; CQS triangulation**
Cleaning and editing*
Edit & Imputation of single year of age*
Placeholders in record imputation*
Adjust for collected data in communal establishments**
Maintenance of Output Areas**
QA collected data, population estimates and characteristics***
Example: placeholders in record imputation

Variables from admin records can help populate imputed non-response records alongside the collected responses:

• Direct use of admin data - pull data across from admin data linked to a non-responding address
• Indirect uses:
  o ‘dummy’ forms (operational paradata - eg type of accommodation, likely number of residents)
  o intelligence from the admin data (eg 4 people usually live here)
• Modelling - eg postcode-level data on where HE students are most likely to live: model imputed records to fit the proportion of residents likely to be students

How?
1. Estimate basic information from the coverage survey (dual-system estimation) and create skeleton records to populate
2. Use a donor system to populate the rest of the record
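Step 1 rests on dual-system estimation. A minimal sketch of the underlying capture-recapture arithmetic (the simple Lincoln-Petersen form; census DSE in practice is stratified and considerably more involved, and the numbers here are invented):

```python
# Dual-system (capture-recapture) estimate of the population:
#   n1 = count from the census,
#   n2 = count from the independent coverage survey,
#   m  = records matched in both.
# Estimated total = n1 * n2 / m (Lincoln-Petersen estimator).
def dual_system_estimate(n1: int, n2: int, m: int) -> float:
    if m == 0:
        raise ValueError("no matched records; estimate undefined")
    return n1 * n2 / m

# eg 900 census responses in an area, 100 coverage-survey records, 90 matched:
est = dual_system_estimate(900, 100, 90)
# The gap between the estimate and the census count is the number of
# skeleton records to create and then populate via the donor system (step 2).
skeletons_needed = round(est) - 900
print(est, skeletons_needed)
```

The skeleton records start with only the basic information the estimate supports; the donor system then fills in the remaining variables from similar responding households.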
Outputs

Replace previously collected variables or create new variables**
Extended output categories, eg qualifications***
Journalistic topic analysis***
Example: replacing the ‘number of rooms’ question with admin data

User need still exists, but historically the quality of the collected data has been poor.
Link data to address list
• Valuation Office Agency data
• Linking by UPRN in advance of census
• Number of rooms
• Potential for more: property type, size of property
Process
• Clean data
• Edit rules*
• Impute missing values
Outputs
• Additional data available for outputs
• Reduced respondent burden
• Improved quality
* Census collects number of bedrooms: need to ensure consistency across variables
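The link-then-impute pipeline above can be sketched in a few lines. The data, field names and the modal-value imputation rule are illustrative only; the actual processing rules and VOA extract layout will differ:

```python
# Sketch: link VOA-style admin data to the address list by UPRN, then
# impute any missing number-of-rooms values. Data are invented.
address_list = [{"uprn": 1}, {"uprn": 2}, {"uprn": 3}]
voa = {  # keyed by UPRN, linked in advance of the census
    1: {"rooms": 5, "property_type": "terraced"},
    3: {"rooms": None, "property_type": "flat"},
}

# 1. Link: pull admin variables across onto each address.
linked = []
for addr in address_list:
    rec = dict(addr)
    rec.update(voa.get(addr["uprn"], {"rooms": None, "property_type": None}))
    linked.append(rec)

# 2. Impute: fill missing rooms with the mode of observed values
#    (a placeholder rule; real imputation would be donor-based).
observed = [r["rooms"] for r in linked if r["rooms"] is not None]
fallback = max(set(observed), key=observed.count) if observed else None
for r in linked:
    if r["rooms"] is None:
        r["rooms"] = fallback

print([r["rooms"] for r in linked])
```

Edit rules would then run across the linked record, eg checking that the admin-sourced number of rooms is consistent with the census-collected number of bedrooms.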
VISION AND AIMS

Vision: integrate alternative data to provide the best-quality 2021 Census and leave a positive legacy.

Aims - improve and enhance:
• Collection - value for money and optimise response rates
• Processing - improve quality and trust through assurance
• Outputs - improve timeliness, meet more user needs
• Continue to add value post-census

DESIGN PRINCIPLES

• Believe the collected census data
• Add value to the census
• Assure users of its quality
• Weigh up gains v costs and knock-on effects
• Think: Quality - Value - Trust
• Just because we can, doesn’t mean we should
INCLUSION CRITERIA

Balance - improved quality, trust and value balanced against risk to accuracy, timeliness and interpretability

Detail - improved quality for a population sub-set or for core variables - quality v effort

Quality
• Relevance - still meeting user needs? Includes granularity/low-level geography.
• Accuracy - improving? To what extent are we adding more uncertainty? Issues cluster in the same areas and the same groups of people - are we risking the whole without actually improving sufficiently where needed?
• Timeliness - does adding more processes risk our timetable?
• Accessibility - does increased use of data available in the public domain risk what we can publish?
• Interpretability - can we explain the use clearly, especially regarding the quality of the sources used? Circularity of use.
• Coherence - does the alternative data cover the same definitions? UK harmonisation, coherence over time.

Value - from integrating data appropriately. Use of, and input into, corporate transformation programmes; improving the efficiency of collection and processing

Trust - ethical considerations, reduce respondent burden, build user assurance
Potential uses – current thinking
Validation to reduce field visits*
Cleaning and editing*
Edit & Imputation of single year of age*
Placeholders in record imputation*
Next steps

• Still in the research phase
• Large-scale rehearsal in 2019/20
• Gather and use as many alternative data sources as possible for the rehearsal
• Finalise the design for the 2021 Census

Dependencies
• Getting the data
• Quality of the available data
• Development of methods
Thank you