1
The new Bank of Italy Remote access to micro Data (BIRD)
G. Bruno, L. D’Aurizio, R. Tartaglia-Polcini
Q2008 – Rome, July 10, 2008
2
Motivation
• Information release and data protection as competing goals
• The risk-utility tradeoff:
  • risk of data disclosure
  • utility of widespread availability of data for research
3
Motivation
GOALS (UTILITY):
• satisfy growing demand from external researchers for business data
• improve the accountability of the Central Bank as economic research centre
• provide a service to the scientific community
CONSTRAINTS (RISK):
• Data confidentiality must be guaranteed:
  • as a prerequisite for respondents’ collaboration
  • to foster quality of the data provided
  • as required by law
• Public Use File (PUF) with individual data judged unfeasible: anonymisation very problematic with business data
4
Motivation
SYNTHETIC DATA LIMITATIONS:
• Identity disclosure impossible in principle, but, particularly with extreme values, it may be possible to re-identify a source record
• Attribute disclosure may happen
• Ample literature on data confounding and synthetic data (Duncan & Lambert 1989; Rubin 1993; Little 1993; Fuller 1993; Fienberg et al. 1996; Kennickell 1997; Abowd & Woodcock 2001; Reiter 2002; Raghunathan et al. 2003; etc.)
5
Choices
• Data confounding: create a PUF containing perturbed data to prevent identification of individual information. Downside: results (esp. regressions) may heavily depend on the confounding technique adopted - controversial literature
• Data lab (à la Istat: ADELE) – the researcher has to go to the lab in person.
• Remote processing, using internet, without direct access to individual data (à la Luxembourg Income Study: LISSY)
6
Other remote processing systems
• Luxembourg Income Study (LISSY, 1987)
• Statistics Canada (2001)
• Statistics Denmark (2001)
• Statistics Netherlands (2002)
• Australian Bureau of Statistics (2003)
• Statistics Sweden (2003)
• US Federal Agencies: NCHS (1997), NCES (1998), Census Bureau (2003)
7
The solution adopted at the Bank of Italy
BIRD
• Modeled on LISSY
• Low setup cost
• Easily customisable
• Supports multiple packages
• Maximum accessibility for users
• Multi-level control (user/group, dataset, keyword)
• Automatic and manual checks & review
8
How BIRD works
USER ELIGIBILITY CRITERIA
• Researcher status (not necessarily academic) proved by a presentation letter
• Identification via valid personal id
• Detailed information via form to be filled in
9
How BIRD works
USER PROFILE CREATION
• The researcher indicates an e-mail address which will be recognised by the system.
• The researcher chooses her own user name and password
• User-chosen parameters are input in the user database
• Access profile is created
10
How BIRD works
SUBMISSION PROCEDURE
• Communication with the processing environment via e-mail
• Send a message containing user authentication info + statements to be submitted
• Input message is parsed and checks are performed
• If no error/security violation → submit statements
• Output is parsed (automatically / manually)
• If no security violation → forward to the user via e-mail
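The submission steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BIRD implementation: the header layout (four `key=value` lines, suggested by the Stata example where authentication is in the first four lines), the field names `user`, `password`, `package` and `dataset`, and the callbacks `input_ok`, `run` and `output_ok` are all hypothetical.

```python
from dataclasses import dataclass

# Assumption: the authentication header occupies the first four lines,
# written as key=value pairs. The real BIRD header format is not documented here.
AUTH_LINES = 4


@dataclass
class Submission:
    auth: dict
    statements: str


def parse_submission(body: str) -> Submission:
    """Split a message body into authentication fields and statements."""
    lines = body.strip().splitlines()
    if len(lines) <= AUTH_LINES:
        raise ValueError("no statements after the authentication header")
    auth = {}
    for line in lines[:AUTH_LINES]:
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed authentication line: {line!r}")
        auth[key.strip().lower()] = value.strip()
    return Submission(auth=auth, statements="\n".join(lines[AUTH_LINES:]))


def process(body, users, input_ok, run, output_ok):
    """Walk through the steps on the slide; return (status, payload)."""
    try:
        sub = parse_submission(body)
    except ValueError as err:
        return ("rejected", str(err))
    user = sub.auth.get("user")
    password = sub.auth.get("password")
    # Plain comparison keeps the sketch short; a real system would not
    # store or compare passwords in clear text.
    if user is None or password is None or users.get(user) != password:
        return ("rejected", "authentication failed")
    if not input_ok(sub.statements):      # checks on the input message
        return ("rejected", "input check failed")
    output = run(sub.statements)          # submit statements to the package
    if not output_ok(output):             # checks on the output
        return ("withheld", "output check failed")
    return ("forwarded", output)          # would be e-mailed back to the user
```

Passing the checks and the job runner as callbacks keeps the control flow visible and separate from any particular statistical package.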
11
Confidentiality safeguards
• User level
• Data level
• Processing level
12
Confidentiality safeguards
User level:
• Users are identified, qualified and registered
• Registered mailboxes are whitelisted; ordinarily only one mailbox per user
• Outputs are monitored and archived
• Deontological code, privacy law, specific penalties
Sanctions
• Forbidden submissions or outputs are deleted
• Grant of access for users trying to perform forbidden commands may be revoked
• Any other sanctions or penalties required by the law where applicable
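Mailbox whitelisting could look roughly like the sketch below. The class `MailboxRegistry` and its methods are hypothetical names, and the sketch enforces "one mailbox per user" strictly, whereas the slide says only "ordinarily". Since a sender address can be forged, a check of this kind would complement the user name and password, not replace them.

```python
from email.utils import parseaddr


class MailboxRegistry:
    """Hypothetical whitelist mapping registered mailboxes to users."""

    def __init__(self):
        self._user_by_address = {}

    def register(self, user, address):
        address = address.strip().lower()
        if address in self._user_by_address:
            raise ValueError("mailbox already registered")
        if user in self._user_by_address.values():
            # Strict reading of "one mailbox per user" (an assumption).
            raise ValueError("user already has a registered mailbox")
        self._user_by_address[address] = user

    def user_for(self, from_header):
        """Return the registered user for a From header, or None."""
        _, address = parseaddr(from_header)
        return self._user_by_address.get(address.lower())
```

A message whose sender is not in the registry would simply be dropped before any parsing takes place.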
13
Confidentiality safeguards
Data level:
• Extreme data are censored (Winsorized)
• Identifying variables (ids, names, addresses) are expunged from the datasets used for remote processing
• Stratification variables are collapsed (geographical areas and not regions; Ateco aggregations and not codes)
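As an illustration of what Winsorizing means, the sketch below clamps values outside two empirical percentiles to the percentile values themselves. The 5th and 95th percentile cutoffs and the simple rank-based percentile rule are assumptions made for the example; the slides do not state which thresholds BIRD applies.

```python
def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values below/above the given empirical percentiles.

    Percentiles are taken by rank (integer arithmetic, no interpolation);
    the default cutoffs are illustrative assumptions.
    """
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[(lower_pct * (n - 1)) // 100]
    high = ordered[(upper_pct * (n - 1)) // 100]
    # Extreme observations are replaced by the cutoff, not dropped.
    return [min(max(v, low), high) for v in values]


# One very large firm among small ones: its value is pulled back
# to the upper cutoff, while the other records are unchanged.
workforce = [10, 12, 11, 13, 5000]
print(winsorize(workforce))  # [10, 12, 11, 13, 13]
```

Unlike trimming, this keeps every record in the dataset, so sample sizes are unchanged while the most identifiable values are masked.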
14
Confidentiality safeguards
Processing level:
• Formally forbidden to display individual data
• Keyword parser implemented with ceiling, blacklist and graylist
• Particularly long and/or complex programmes are always reviewed manually
• In the learning stage, all submissions are reviewed manually
15
How the parser works
check type | check performed | action if failed on INPUT | action if failed on OUTPUT
authentication | checking user authentication data | job cancelled | n/a
blacklist | parsing text for specific words and sequences | job cancelled | n/a
length | checking the length of text | n/a | soft ceiling: manual review; hard ceiling: job cancelled
graylist (*) | parsing text for specific words and sequences | manual review | manual review
(*) This feature will be available in the next release of the system.
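The table can be read as a small decision procedure, sketched below. The keyword lists and the two ceilings are invented placeholders (the actual BIRD lists and thresholds are not given in the slides), and the sketch matches single words only, whereas the table also mentions sequences.

```python
import re
from enum import Enum


class Action(Enum):
    PASS = "pass"
    MANUAL_REVIEW = "manual review"
    CANCEL = "job cancelled"


# Hypothetical values, for illustration only.
BLACKLIST = ("list", "print", "browse")   # e.g. commands that display records
GRAYLIST = ("tabulate", "export")
SOFT_CEILING = 2_000                      # characters of output
HARD_CEILING = 20_000


def contains_any(text, words):
    """True if any whole word from `words` occurs in `text`."""
    tokens = set(re.findall(r"[a-z_]+", text.lower()))
    return any(word in tokens for word in words)


def check_input(statements, authenticated):
    """Input side: authentication, blacklist, graylist."""
    if not authenticated:
        return Action.CANCEL
    if contains_any(statements, BLACKLIST):
        return Action.CANCEL
    if contains_any(statements, GRAYLIST):
        return Action.MANUAL_REVIEW
    return Action.PASS


def check_output(output):
    """Output side: length ceilings, graylist."""
    if len(output) > HARD_CEILING:
        return Action.CANCEL
    if len(output) > SOFT_CEILING or contains_any(output, GRAYLIST):
        return Action.MANUAL_REVIEW
    return Action.PASS
```

Matching whole tokens instead of substrings avoids cancelling a job merely because a variable name happens to contain a blacklisted word.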
16
Datasets available
STANDARD DATASET: quantitative data for the biggest firms (in terms of workforce) are censored (Winsorised)
COMPLETE DATASET: no data censoring
Id variables are expunged from both datasets, obviously
17
Datasets available
Aggravated procedure for accessing the complete dataset:
• Access must be explicitly requested – a special profile is created
• Review is exclusively manual
• Wait times are longer than average as time allocated to manual review on complete dataset is reduced
18
Documentation on the website
• Application form
• Instruction manual
• Dataset description
• Examples of submissions in the supported packages (SAS, Stata)
• Methodological notes on the survey
19
Support
1. Documentation available on the Bank of Italy website (manuals, variables description, questionnaires) http://www.bancaditalia.it/statistiche/indcamp/indimpser/bird
2. Mailbox for queries and assistance:
20
An example
Program submitted by the user in Stata. Authentication is in the first four lines.
21
An example
Output forwarded after review
22
Usage of the system in the first weeks
• System started officially on Mar 13, 2008
• Beta users from Feb 1, 2008
• 8 registered users
• 172 submissions in 21 weeks
23
Usage of the system in the first weeks
[Figure: BIRD: # of weekly submissions, from Feb 1, 2008. Horizontal axis: weeks w 1 to w 21; vertical axis: 0 to 35.]
24
Future developments
• Web submission available alongside e-mail submission
• Other datasets will be made available in the future (e.g. data from the Business Outlook Survey)
• Open source packages processing (e.g. R)
• Merging with external datasets provided by the user, for special projects, on a discretionary basis, under an aggravated procedure and higher security levels.
• Creation of closed groups with special authorisation levels for specific projects
25
Thank you for your attention