Billion Prices Project
and
The Big Risks of Big Data
Roberto Rigobon
MIT, NBER
Evolution in Data Sources
               Surveys              Administrative           Big Data
Relevance      Representative       Somewhat Representative  Not Representative
Cost           Extremely Costly     Costly                   Cheap
Intrusiveness  Extremely intrusive  Intrusive                Non-intrusive
Design Organic
Different Types of Data
Data
Designed
Survey Aggregate
Administrative
Public
Private
Organic
Aspirational
Transactional
Pillars of modern measures
1. Continuous measurement of process
   Timely measures
2. Non-intrusive
   Can’t rely on surveys; needs electronic forms of data collection
3. Open source
   Many could adopt the methodologies
4. Privacy protecting
   Violations of privacy can be significantly harmful, especially when estimating hidden behavior that is morally questionable
5. Imperfect measurement
   To guarantee the previous 4 characteristics, the measures need to be noisy.
BPP: Countries covered
Billion Prices Project
Online Information and Indexes
Our Approach to Daily Inflation Statistics
1. Use scraping technology
2. Connect to thousands of online retailers every day
3. Find individual items
4. Store and process key item information in a database
   • Date
   • Item
   • Price
   • Description
5. Develop daily inflation statistics for ~20 countries
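The storage and index steps can be illustrated with a small sketch. The slides do not describe how BPP actually builds its indexes, so everything here is an assumption made for illustration: the `PriceRecord` fields mirror the four stored attributes, and `daily_index` chains a geometric mean of day-over-day price ratios for items observed on both days.

```python
from dataclasses import dataclass
from datetime import date
from math import exp, log


@dataclass(frozen=True)
class PriceRecord:
    """One scraped observation (hypothetical schema: date, item, price, description)."""
    day: date
    item: str
    price: float
    description: str


def daily_index(records):
    """Chain a daily index (100 on the first day) from items matched across days.

    Assumption: an unweighted geometric mean of price ratios. This is an
    illustrative choice, not the project's documented method.
    """
    by_day = {}
    for r in records:
        by_day.setdefault(r.day, {})[r.item] = r.price
    days = sorted(by_day)
    if not days:
        return {}
    index = {days[0]: 100.0}
    for prev, cur in zip(days, days[1:]):
        matched = by_day[prev].keys() & by_day[cur].keys()
        if not matched:
            # No overlapping items: carry the index forward unchanged.
            index[cur] = index[prev]
            continue
        mean_log_ratio = sum(
            log(by_day[cur][i] / by_day[prev][i]) for i in matched
        ) / len(matched)
        index[cur] = index[prev] * exp(mean_log_ratio)
    return index


records = [
    PriceRecord(date(2024, 1, 1), "A", 10.0, "milk 1L"),
    PriceRecord(date(2024, 1, 1), "B", 20.0, "rice 1kg"),
    PriceRecord(date(2024, 1, 2), "A", 11.0, "milk 1L"),
    PriceRecord(date(2024, 1, 2), "B", 20.0, "rice 1kg"),
]
index = daily_index(records)
```

Matching only items present on both days keeps the comparison like-for-like, which is the same concern as the "make sure tomorrow and today are comparable" principle.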
Argentina
USA
Germany
France
Thousands of Big Macs Project
Compare prices for a bottle of Coke across countries
• Online prices are an effective tool to measure PPP fluctuations
  – Identical items sold around the world
  – Detailed descriptions to achieve a nearly perfect matching
  – Daily prices
• PPP indices:
  – More than 300 narrow product categories
  – With thousands of individually matched items
  – In food, fuel, and electronics; we are missing clothing, personal care, and household products
  – Cars we will never match
Apply a similar approach to hundreds of products on a daily basis
PPP Indices
Relative Prices
Price: 49.90   Product: 4081762
Price: 29.99   Product: 4081762
49.90 / 29.99 = 1.664
Relative Prices
Price: 49.99   Product: 70136
Price: 29.99   Product: 70136
49.99 / 29.99 = 1.667
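The two ratios above can be reproduced with a short sketch. It assumes the two prices for each product ID come from two different countries; the dictionary names and the extra unmatched item are hypothetical.

```python
# Prices keyed by product ID; the country labels are hypothetical.
prices_country_a = {"4081762": 49.90, "70136": 49.99}
prices_country_b = {"4081762": 29.99, "70136": 29.99, "99999": 5.00}


def relative_prices(prices_a, prices_b):
    """Return price_a / price_b for every product ID present in both dictionaries."""
    matched = prices_a.keys() & prices_b.keys()
    return {pid: prices_a[pid] / prices_b[pid] for pid in matched}


ratios = {
    pid: round(r, 3)
    for pid, r in relative_prices(prices_country_a, prices_country_b).items()
}
# ratios == {"4081762": 1.664, "70136": 1.667}
```

Only identically matched products enter the comparison, so the unmatched item is ignored rather than approximated.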
UK
UK
Lessons: Principles from 13 years of experience
Purpose:
Go to the data with a pre-established purpose
Privacy is a first-order concern:
Use cloud computing to design what you need, and try to collect only public information
Representativeness is a first-order concern:
If something exists on the web, that does not imply it is meaningful.
Time series:
Make sure tomorrow and today are comparable
Possible Problems
Pitfalls of Organic Data
Representativeness
Model Uncertainty
Source Reliability
Estimation versus Bias Error
Privacy and Regulation
Sample Selection
TS Shock
Teenage Firms
Correlation is not Causality
Violation of norms is not productivity
Sample Selection and Representativeness
Parenting through Facebook
Large quantities of data on a tiny population
People who participate are inherently different from the average person
Google Flu and the TS Shock
The brilliant idea behind Google Flu
TS Shock: a new song from Taylor Swift
“I Got the Flu”
Some Behavior is inherently unstable
Social Media
Searches
Patterns of consumption
Data Collection and Characteristics
Companies collect information for their purposes
How can we be sure of
Reliability
Consistency
No errors-in-variables
Aspirational versus Transactional
Causality and Correlation
Nicolas Cage’s Movies and Drowning in a Swimming Pool
Spurious correlations can be extremely significant. They do not imply causality
Mistake of thinking that clustering customers teaches something about the world other than differences in opinion and preferences
Hard to tell if we are making a mistake by looking only at the statistical significance
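A small simulation shows how easily strong correlations arise between unrelated series. It is a sketch under stated assumptions: the series are simulated, and the seed, sample sizes, and the 0.5 threshold are arbitrary choices. It compares pairs of independent noise series with pairs of independent random walks (cumulative sums of that same noise).

```python
import random
from itertools import accumulate
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def share_strongly_correlated(pairs, threshold=0.5):
    """Fraction of pairs whose absolute correlation exceeds the threshold."""
    return sum(abs(pearson(x, y)) > threshold for x, y in pairs) / len(pairs)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_pairs, length = 500, 200

# Independent noise: the two series in each pair are unrelated by construction.
noise_pairs = [
    ([rng.gauss(0, 1) for _ in range(length)],
     [rng.gauss(0, 1) for _ in range(length)])
    for _ in range(n_pairs)
]
# Independent random walks: still unrelated, but each series now trends.
walk_pairs = [(list(accumulate(x)), list(accumulate(y))) for x, y in noise_pairs]

share_noise = share_strongly_correlated(noise_pairs)
share_walks = share_strongly_correlated(walk_pairs)
```

With 200 independent observations, a correlation above 0.5 in absolute value would look overwhelmingly significant under the usual test, yet a sizable share of the unrelated random-walk pairs clears that bar. Trending series such as searches or consumption patterns are exposed to the same problem.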
These problems imply biases
Problems
Sample Selection               Learn only about a particular group
Model instability              Unstable Coefficients
Reliability and Consistency    Unstable Coefficients
Aspirational (Transactional)   Biased Estimates
Data Collection errors         Biased Estimates
The size of the data reduces the estimation error but not the bias error
You estimate perfectly the incorrect thing
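The distinction between estimation error and bias error can be illustrated with a simulation. All numbers are hypothetical: the population of interest is assumed to have mean 0, while the people who show up in the data are assumed to have mean 1.

```python
import random
from math import sqrt

rng = random.Random(0)  # fixed seed so the run is repeatable

TRUE_POPULATION_MEAN = 0.0  # hypothetical quantity we want to measure
PARTICIPANT_MEAN = 1.0      # hypothetical mean of the self-selected participants


def estimate_from_participants(n):
    """Sample mean and standard error computed from participants only."""
    sample = [rng.gauss(PARTICIPANT_MEAN, 1.0) for _ in range(n)]
    mean = sum(sample) / n
    variance = sum((v - mean) ** 2 for v in sample) / (n - 1)
    return mean, sqrt(variance / n)


results = {n: estimate_from_participants(n) for n in (100, 10_000, 100_000)}

for n, (mean, std_error) in results.items():
    bias = mean - TRUE_POPULATION_MEAN
    print(f"n={n:>7}  estimate={mean:.3f}  std_error={std_error:.4f}  bias={bias:.3f}")
```

The standard error shrinks roughly with the square root of the sample size, while the gap to the true population mean stays near 1 at every sample size. More data makes the estimate of the participants' mean more precise; it does not move it toward the population's mean.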
The Father and Target
Uber Trips
These problems imply biases
Problems
Sample Selection               Learn only about a particular group
Model instability              Unstable Coefficients
Reliability and Consistency    Unstable Coefficients
Aspirational (Transactional)   Biased Estimates
Data Collection errors         Biased Estimates
The size of the data reduces the estimation error but not the bias error
You estimate perfectly the incorrect thing
Causality is not Correlation
Privacy violations will be the new scandals
Regulation is around the corner
Privacy Protection
Right to be forgotten
We need to prove that eliminating the observation does not change the coefficients of the model.
Organizations that rely exclusively on the marketing revenue model will need to change
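One possible reading of the right-to-be-forgotten requirement is a leave-one-out check: refit the model without the observation and compare coefficients. The sketch below assumes a simple one-regressor least-squares model and made-up data; what counts as "does not change" (the tolerance) is left open, since the slides do not specify it.

```python
def ols_line(points):
    """Least-squares intercept and slope for (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return my - slope * mx, slope


def max_coefficient_change(points, index):
    """Largest absolute change in any coefficient when one observation is deleted."""
    full = ols_line(points)
    reduced = ols_line(points[:index] + points[index + 1:])
    return max(abs(a - b) for a, b in zip(full, reduced))


# Hypothetical data: five points near y = 1 + 2x plus one unusual observation.
data = [(0, 1.0), (1, 3.1), (2, 4.9), (3, 7.2), (4, 9.0), (5, 30.0)]

change_typical = max_coefficient_change(data, 2)  # delete an ordinary point
change_unusual = max_coefficient_change(data, 5)  # delete the unusual point
```

In this example, deleting the unusual observation moves the coefficients far more than deleting an ordinary one. Individuals who are atypical are therefore the ones whose removal is hardest to reconcile with an unchanged model.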
New Licensing Models
Why do we have licenses for taxis?
Uber, by circumventing the licenses, is causing massive congestion
But licenses give monopoly power, and this has been abused by the taxi companies
Thanks