Big Data
The Eurostat Perspective
Big Data for the Analysis of the Digital Economy and Society
Seville, 22 Sep 2015, JRC-IPTS
Big Data and Official Statistics
What will be the impact of ubiquitous data collection and networking
• Mobile Communication
• Internet of [every]Things,
• Cloud services,
• Wearables,
• Autonomous traffic,
• Smart systems,
• …
on official statistics?
ESS Scheveningen Memorandum on Big Data – September 2013
• Examine the potential of Big Data sources for official statistics
• Official Statistics Big Data strategy as part of wider government strategy
• Address privacy and data protection
• Collaboration at European and global level
• Address need for skills
• Partnerships between different stakeholders (government, academics, private sector)
• Developments in Methodology, quality assessment and IT
• Adopt action plan and roadmap for the European Statistical System
• National Initiatives
  • CBS Netherlands
  • ISTAT Italy
  • ONS UK
  • CSO Ireland
  • Statistics Finland
  • SURS Slovenia
Big Data and Statistics
Activities at Eurostat
• Within European Commission
  • Strategy on Big Data for evidence-based policy
  • Interservice coordination group
  • Collaboration with DGs on specific aspects of Big Data
• Within ESS
  • Analysis of legal situation
  • Ethical guidelines
  • Communication guidelines
  • Training
  • Execution of pilots as ESSnets exploring potential of big data sources for European Statistics
  • Big Data contributing to goals of ESS Vision 2020
• With UN
  • HLG on modernizing statistics and Big Data project of UNECE
  • Contribution to Global WG Big Data
Big Data Projects at Eurostat
• Internet as a data source: Information Society Statistics
• Mobile phone data for tourism and population statistics
• Flight reservation data for transport and tourism statistics
• Google searches to improve (un)employment statistics
• Wikipedia page hits for cultural statistics
Mobile phone data for tourism statistics: Feasibility study 2012-2014
Five main project tasks
• Stock-taking (31 cases relevant for official statistics)
• Feasibility of access
• Feasibility of use - methodological issues
• Feasibility of use - coherence
• Opportunities and benefits
All reports are on the Eurostat website:
http://ec.europa.eu/eurostat/web/tourism/methodology/projects-and-studies
An extension project is in the pipeline
Mobile phone data: promising, but some quality issues, e.g. coverage (representativeness?)
Source: DGE, SDP3E, bureau des études sur le tourisme et les catégories d’entreprise (France, Sept 2015)
Feasibility of use - coherence
[Figure: time-series charts comparing mobile phone based estimates with reference statistics, 2009 to 2012]
• Inbound overnight trips (vs. accommodation statistics): monthly, Jan-09 to Nov-12; series MOB_IN(EU-27)_OVERNIGHT and SUPPLY_EE(EU-27)_ARR
• Inbound, outbound overnight trips (vs. ferry passengers data)
• Outbound overnight trips (vs. household survey data): quarterly, Q1-09 to Q4-12; series MOB_OUT(EU-27)_OVERNIGHT and DEMAND_EE(EU-27)_OVERNIGHT
• Inbound overnight trips (vs. border control data): monthly, Jan-09 to Nov-12; series MOB_EE(RU) and BORDCONT_EE(RU)
Better coverage
Less recall bias
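A coherence check of this kind can be sketched in a few lines. The example below is only an illustration, not the method used in the feasibility study: it assumes two aligned series (a mobile phone based estimate and a reference statistic) and reports their Pearson correlation and the ratio of their totals. The function name `coherence` and all numbers are hypothetical.

```python
from math import sqrt


def coherence(mobile, reference):
    """Return (Pearson correlation, ratio of totals) for two aligned series."""
    if len(mobile) != len(reference) or len(mobile) < 2:
        raise ValueError("series must be aligned and contain at least two points")
    n = len(mobile)
    mean_m = sum(mobile) / n
    mean_r = sum(reference) / n
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(mobile, reference))
    var_m = sum((m - mean_m) ** 2 for m in mobile)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    corr = cov / sqrt(var_m * var_r)
    # A ratio above 1 means the mobile source counts more trips than the reference
    level_ratio = sum(mobile) / sum(reference)
    return corr, level_ratio


# Made-up illustrative values, NOT taken from the study
mobile_trips = [120_000, 150_000, 310_000, 140_000]
reference_trips = [100_000, 125_000, 260_000, 115_000]

corr, level_ratio = coherence(mobile_trips, reference_trips)
print(f"correlation = {corr:.3f}, level ratio = {level_ratio:.2f}")
```

A high correlation with a level ratio different from 1 would point to a source that tracks the seasonal pattern well but differs in coverage, which is the kind of question the coherence task addresses.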
The Internet as a Data Source (IaD) for Information Society Statistics
• Surveys of Households/Individuals and Enterprises
• Analysis of website functions of enterprises
• Statistics on use of Internet by observing use of computers
• Final Report on CROS-Portal
Nowcasting Unemployment
• Source
  • Google Trends (others to be explored)
  • High timeliness, geo info available, low transparency
• Processing
  • Low computing power required
  • Time-series modelling (machine learning to be explored)
  • Tools: R
• Output
  • Nowcasting of unemployment from 1 month lag to current time
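The idea of closing the one-month lag can be sketched as follows. This is a minimal illustration under stated assumptions, not the model used in the project (which was built in R): it assumes a simple bridge regression in which the monthly change in unemployment is explained by the monthly change in a search index. The function names and all numbers are hypothetical.

```python
def fit_bridge(unemp, search):
    """Least-squares fit of: change in unemployment = intercept + slope * change in search index."""
    du = [b - a for a, b in zip(unemp, unemp[1:])]
    dg = [b - a for a, b in zip(search, search[1:])]
    n = len(du)
    mean_g = sum(dg) / n
    mean_u = sum(du) / n
    sxx = sum((x - mean_g) ** 2 for x in dg)
    sxy = sum((x - mean_g) * (y - mean_u) for x, y in zip(dg, du))
    slope = sxy / sxx
    intercept = mean_u - slope * mean_g
    return intercept, slope


def nowcast(unemp, search, current_search):
    """Estimate the not-yet-published month from the latest search index value."""
    intercept, slope = fit_bridge(unemp, search)
    return unemp[-1] + intercept + slope * (current_search - search[-1])


# Made-up illustrative series: published unemployment rates and the
# search index for the same months (aligned, oldest first)
published_unemp = [9.0, 9.3, 9.25, 9.7, 9.9]
search_index = [50, 54, 51, 58, 60]

# The search index for the current month is already available
estimate = nowcast(published_unemp, search_index, current_search=62)
print(f"nowcast for current month: {estimate:.2f}")
```

A production model would add seasonality, lagged terms, and out-of-sample validation; the sketch only shows how a timely auxiliary series stands in for the missing month.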
Insights for world heritage sites from Wikipedia use
• Source
  • Hourly page views for each Wikipedia article
  • Content of Wikipedia articles
  • High timeliness, temporal detail and transparency, no geographical information
• Processing
  • Big Data Sandbox: computer cluster with 4 nodes
  • Tools: Pig, Map-Reduce, Python, R
  • Association of Wikipedia articles to specific WHS
• Output
  • Exposure of world heritage via Wikipedia
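The processing step can be pictured as a map and a reduce over hourly page-view records. The sketch below is a single-machine illustration in plain Python, not the Pig/Map-Reduce jobs run on the sandbox: the record layout `(hour, article, views)`, the function names, the article-to-site table, and the counts are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical association of Wikipedia articles to World Heritage Sites (WHS)
ARTICLE_TO_SITE = {
    "Alhambra": "Alhambra, Generalife and Albayzin, Granada",
    "Generalife": "Alhambra, Generalife and Albayzin, Granada",
    "Parthenon": "Acropolis, Athens",
}


def map_record(record, article_to_site):
    """Map step: emit (site, views) for records whose article belongs to a WHS."""
    _hour, article, views = record
    site = article_to_site.get(article)
    if site is not None:
        yield site, views


def reduce_counts(pairs):
    """Reduce step: sum the views per site."""
    totals = defaultdict(int)
    for site, views in pairs:
        totals[site] += views
    return dict(totals)


def site_exposure(records, article_to_site):
    """Run map then reduce over an iterable of hourly page-view records."""
    pairs = (p for r in records for p in map_record(r, article_to_site))
    return reduce_counts(pairs)


# Made-up hourly records: (hour, article title, page views)
records = [
    ("2015-01-01T00", "Alhambra", 40),
    ("2015-01-01T00", "Parthenon", 25),
    ("2015-01-01T01", "Generalife", 5),
    ("2015-01-01T01", "Main_Page", 9000),  # not related to any WHS, dropped
]

exposure = site_exposure(records, ARTICLE_TO_SITE)
print(exposure)
```

On a cluster the same two steps would be distributed across nodes, with the article-to-site table shipped to every mapper.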
[Figure: Page views of English Wikipedia articles related to World Heritage Sites]
Big Data Roadmap and Action Plan 1.0
• Principles
  • Stepwise development with regular review
  • Focus on strengths of ESS: Partnerships, Legislation, Ethics, Harmonisation in EU context
  • Use national and international experiences
• Definition of Goals
  • Short term (2016): Analysis, Strategy, Communication
  • Medium term (2020): Pilots, Partnerships, Architecture
  • Long term (> 2020): Integration into official statistics
• Pilots
  • Data driven approach
  • Hands-on experience
• Long term Goal:
  • Full integration of big data sources into statistical information infrastructure
Roadmap
"As is" versus "To be"
Short term objectives (by 2016)
• Commission big data strategy
• Identification and analysis of output portfolio
• Pilot projects launched
• Skills and training requirements identified
• Horizon 2020 Research Framework Programme
• Communication strategy
• Exchange with stakeholders
Medium term aims (by 2020)
• Topics developed for integration of Big Data into official statistics
  • Government Strategy
  • Pilots finalised
  • IT infrastructure, Methods, Quality Framework
  • Skills
  • Partnerships
  • Ethics
Long term Vision (> 2020)
• Big data sources integrated in ESS official statistics production
Blending of Sources and multipurpose Statistics
• One source, several statistics: mobile phone data feeding tourism, population, migration, traffic and commuting statistics
• Several sources, one statistic: population statistics drawing on mobile phone data, smart meters, VGI websites and satellite images
Sources of Big Data for Pilots
• Communication
  • Mobile phone data
  • Social Media
• WWW
  • Web Searches
  • Businesses' Websites
  • E-commerce websites
  • Job advertisements
  • Real estate websites
• Sensors
  • Traffic loops
  • Smart meters
  • Vessel Identification
  • Satellite Images
• Process generated data
  • Flight Booking transactions
  • Supermarket Cashier Data
  • Financial transactions
• Crowd sourcing
  • VGI websites (OpenStreetMap)
  • Community pictures collection
UNECE Big Data Project Experiments
• Social Media
• Mobile Phones
• Prices
• Smart Meters
• Job Vacancies Ads
• Web Scraping
• Traffic Loops
Each experiment team produced a detailed report on its activity, available in draft format on the UNECE wiki
The statistical office of the future
• Data flows instead of surveys and censuses
• Product designers instead of data collection designers
• Statistical modelling will be a main activity
• From descriptive indicators to nowcasting and forecasting
• New answers related to
  • Quality and transparency
  • Privacy and confidentiality
  • Access to third party data sources / data sharing
  • Scientific standards and methodology
  • Professional ethics
  • Skills
• Trust and Quality are key!
• New role in teaching digital numeracy
• Accreditation and certification instead of production
• Embedded in data flow – statistics 'everywhere'
Thank you for your Attention!
Topics: Policy, Quality, Skills, Experience sharing, Legislation, IT Infrastructures, Methods, Ethics / Communication, Pilots
ESS Vision 2020: Strategic Aims
USERS
• agile and responsive attitude to users’ needs
• response to user groups
• partner and a leader for innovation
• strategic alliances public and private
QUALITY
• CoP and ESS Quality Assurance Framework
• quality assurance tools fit for purpose
• usability and quality of source data
• sound methodology and effective quality assurance mechanism
NEW DATA SOURCES
• exploiting the potential of new data sources
• establishing alliances and partnership with data owners
• new IT tools and methodological development
• organisational changes
• improving existing data collection methods
STATISTICAL PROCESSES
• partnership of the ESS
• enterprise architecture
• standards
• common methods and tools
• sharing IT services and infrastructures
• (micro) data exchange and statistical confidentiality
• experts working together
DISSEMINATION
• dissemination and communication strategy
• pool of European statistics
• portfolio of dissemination products and services
• European statistics brand