Sources of Big Data for the Social Sciences
Micah Altman
Director of Research
MIT Libraries
Prepared for
Program on Information Science Brown Bag Series
MIT
August 2015
Roadmap
- What the @#%&! is “big data”?
- Two examples of big data in social & health sciences
- Open questions
- Potential roles for libraries
Big Data Challenges
Acquisition
Retention
Analysis
Access
DISCLAIMER
These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx,
Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Collaborators & Co-Conspirators
Workshop Series Co-Organizers
– U.S. Census Bureau: Cavan Capps, Ron Prevost
Research Support
– Supported by the U.S. Census Bureau
Related Work
Main Project: Census-MIT Big Data Workshop Series
projects.informatics.mit.edu/bigdataworkshops
Related publications (reprints available from informatics.mit.edu):
- Altman, M., D. O’Brien, S. Vadhan, A. Wood. 2014. “Big Data Study: Request for Information.”
- Altman M, Wood A, O'Brien D, Gasser U, Vadhan S. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Journal of Technology Law. Forthcoming.
- Altman M, McDonald MP. 2014. Public Participation GIS: The Case of Redistricting. Proceedings of the 47th Annual Hawaii International Conference on Systems Science.
Workshops Series: Big Data and Official Statistics
Acquisition Challenges: Using New Forms of Information for Official Economic Statistics [August 3-4]
Privacy Challenges: Location Confidentiality and Official Surveys [October 5-6]
Inference Challenges: Transparency and Inference [December 7-8]
Expected outcomes:
- Workshop reports (September, October, December)
- Integrated white paper (February)
- Identify new opportunities for statistical agencies
- Inform the Census Big Data Research Program
projects.informatics.mit.edu/bigdataworkshops
Small, Big, Massive & Ginormous Data
Characteristics: the k “V’s” of big data
Volume, Velocity, Variety + Veracity + Variability + …
“Big” is in the use, not just the data
When do challenges of “big” exceed limits of well-selected traditional methods and practices?
- Data Management – Workflow & Governance Challenges
- Implementation – Performance Challenges
- Analysis methods – Inferential Challenges
Trends and Challenges
Trends
- Increasingly data-driven economy
- Individuals are increasingly mobile
- Technology changes data uses
- Stakeholder expectations are changing
- Agency budgets and staffing remain flat
The next generation of official statistics
- Utilize broad sources of information
- Increase granularity, detail, and timeliness
- Reduce cost & burden
- Maintain confidentiality and security
Multi-disciplinary challenges: Computation, Statistics, Informatics, Social Science, Policy
Using Weibo to Discover Chinese Censorship Strategies (and U.S. Debate Strategies)
More Information
• Grimmer, Justin, and Gary King. "General purpose computer-assisted clustering and conceptualization." Proceedings of the National Academy of Sciences 108.7 (2011): 2643-2650.
• King, Gary, Jennifer Pan, and Margaret E. Roberts. 2013. “How Censorship in China Allows Government Criticism but Silences Collective Expression.” American Political Science Review 107 (2, May): 1-18. Copy at http://j.mp/LdVXqN
“Posts with negative, even vitriolic, criticism of the state, its leaders, and its policies are not more likely to be censored… the censorship program is aimed at curtailing collective action by silencing comments that represent, reinforce, or spur social mobilization, regardless of content.”
Data Source - Social Media Messages
Data: Structure - Network, Unstructured Text, Structured metadata
Unit of Observation - Individuals; Interactions
Collection Design - Pure observational
Desired Inferences
- Causal inference – what censorship strategies cause observed reaction
- Inference to Population Frame
Performance challenges
- High volume
- Complex network structure
- Scaling bespoke algorithms
- Sparsity
- Systematic and sparse metadata
Management Challenges
- License
- Replication
- Revision Control
Inferential Challenges
- Measurement error – extracting topics from text
Using Google Searches to Forecast Disease Outbreaks
More Information
• Ginsberg, Jeremy, et al. "Detecting influenza epidemics using search engine query data." Nature 457.7232 (2009): 1012-1014.
• Lazer, David, et al. "The parable of Google Flu: traps in big data analysis." Science 343 (14 March 2014).
“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.
Data Source - Google search queries
Data: Structure - Quasi-tabular, structured metadata and unstructured text
Unit of Observation - Interactions with a system
Collection Design - Pure observational
Desired Inferences
- Predictive inference
-- where will flu clusters appear next
-- Short-term (nearcasting)
-- small-area (fine-spatial granularity)
- Inference to general population
Performance challenges
- Streaming algorithms
Management Challenges
- Replication
- Transparency
- Variability
Inferential Challenges
- External Validity
- Measurement error – extracting topics from text
- Overfitting
- Sampling
Comparing Cases

| | Chinese Censorship | Flu Prediction |
|---|---|---|
| Data Source | Social media messages | Google search queries |
| Data: Structure | Network, unstructured text, structured metadata | Quasi-tabular, structured metadata and unstructured text |
| Unit of Observation | Individuals; interactions | Interactions with a system |
| Collection Design | Pure observational | Pure observational |
| Desired Inferences | Causal inference (what censorship strategies cause observed reaction); inference to population frame | Predictive inference (where will flu clusters appear next; short-term nearcasting; small-area, fine spatial granularity); inference to general population |
| Performance Challenges | High volume; complex network structure; scaling bespoke algorithms; sparsity; systematic and sparse metadata | Streaming algorithms |
| Management Challenges | License; replication; revision control | Replication; transparency; variability |
| Inferential Challenges | Measurement error (extracting topics from text) | External validity; measurement error (extracting topics from text); overfitting; sampling |
Challenges of Big Data
Big Data Challenges
Acquisition: Sources, Incentives, Quality, Provenance
Retention: Change Management, Integration, Security, Storage
Analysis: Bias, Causation, Computation, Visualization
Access: Transparency, Reproducibility, Durable Access (Preservation), Confidentiality
Some Sources of Economic Information
- Smartphone sensors – GPS + Vehicle systems
- IoT – smart thermostats, fire alarms
- Transactions – online, internal
- Search behavior – search engine queries
- Social media – Twitter, Facebook, LinkedIn
- Imagery – satellite, thermal, video
- …
Source Characteristics
- Unit of Observation: Location, virtual service, communication network, individual
- Context: Behavior, transaction, environment, statement
- Measure characteristics: Measure scale; Measure structure; Accuracy, precision
- Frame & Sample characteristics
Some Potential Sources of Analysis Error
[Figure: Laws (structures), parameters λ, β, (generates), Super Population, Target Population, Frame, Selection]
• Selection bias
• Frame uncertainty
• Measurement error
• Unknown measurement semantics
• Non-independence of measures
• Non-independence of samples
• Model uncertainty
• Unknown causal structure
• Shift in measurements, samples, frames
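Selection bias, the first source of error listed above, can be illustrated with a small simulation. This is a minimal sketch and not part of the talk: the population, the inclusion rule, and the function name `simulate_selection_bias` are all hypothetical. It assumes each unit's chance of appearing in the "found" data grows with the quantity being measured.

```python
import random
import statistics


def simulate_selection_bias(n=20000, seed=1):
    """Compare a simple random sample with a self-selected "found" sample.

    Hypothetical setup: each unit has an activity level drawn from
    Uniform(0, 1), and its chance of appearing in the found data equals
    that level, so highly active units are over-represented.
    """
    rng = random.Random(seed)
    population = [rng.random() for _ in range(n)]

    # Designed sample: every unit has the same inclusion probability.
    designed = rng.sample(population, 1000)

    # Found sample: inclusion probability grows with the measured quantity.
    found = [x for x in population if rng.random() < x]

    return (
        statistics.mean(population),
        statistics.mean(designed),
        statistics.mean(found),
    )


pop_mean, designed_mean, found_mean = simulate_selection_bias()
print(f"population {pop_mean:.3f}  designed {designed_mean:.3f}  found {found_mean:.3f}")
```

Under these assumptions the found sample is much larger than the designed one, yet its mean sits well above the population mean. Collecting more data of the same kind does not remove the gap.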
Many Initiatives to Improve Scientific Reliability
- Retraction monitoring
- Data citation
- Clinical trial preregistration
- Registered replication
- Open data
- Badges
Some Types of Reproducibility Issues
• Fraud
• Misconduct
• Negligence
• Bit Rot
• Versioning problem
• Replication
• Reproduction
• Extension
• Result Validation
• Fact Checking
• Calibration, Extension, Reuse
• Underreporting
• Data Dredging
• Multiple Comparisons, P-Hacking
• Sensitivity, Robustness
• Reliability
• Generalizability
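The multiple-comparisons and data-dredging items in this list can be made concrete with a short simulation. The sketch below is illustrative only and uses hypothetical names (`null_p_value`, `count_false_positives`). It runs many tests on pure noise, where every "significant" result is a false positive, and compares an uncorrected threshold with a Bonferroni-corrected one.

```python
import random
import statistics
from statistics import NormalDist


def null_p_value(rng, n=50):
    """Two-sided z-test p-value for mean == 0 on pure N(0, 1) noise."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = statistics.mean(sample) * n ** 0.5  # sigma is known to be 1 here
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_false_positives(n_tests=1000, alpha=0.05, seed=7):
    """Count rejections of a true null, with and without correction."""
    rng = random.Random(seed)
    p_values = [null_p_value(rng) for _ in range(n_tests)]
    naive = sum(p < alpha for p in p_values)
    # Bonferroni: divide the threshold by the number of tests performed.
    bonferroni = sum(p < alpha / n_tests for p in p_values)
    return naive, bonferroni


naive_hits, bonferroni_hits = count_false_positives()
print(f"uncorrected: {naive_hits} 'findings'; Bonferroni: {bonferroni_hits}")
```

With 1,000 tests at the 0.05 level, roughly 50 spurious findings are expected even though no real effect exists. Large datasets with many candidate variables invite this problem.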
Ensuring Repeatability & Transparency
[Figure: layers from theory to execution]
- Theory (Rules, Entities, Concepts)
- Algorithms (Protocol, Operationalization)
- Implementations (Software, Coding Rules, Instrumentation Design)
- Executions (Deployment, House Survey Style, Equipment Setting, Operating System, Hardware, Starting Values, PRNG seeds)
- Structure
- Formats
- Versions/Revisions
- Selections
- Integrations
- Instantiations (copies)
- Execution Context (weather, compiler, operating system, system load)
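Starting values and PRNG seeds appear above as execution-level details that affect repeatability. As a minimal sketch, not drawn from the talk, the example below fixes the seed of a bootstrap estimate and records part of the execution context alongside the result. The function name `bootstrap_mean_se`, the data values, and the layout of the `provenance` record are all assumptions for illustration.

```python
import platform
import random
import statistics


def bootstrap_mean_se(data, n_resamples=200, seed=None):
    """Bootstrap standard error of the mean, driven by an explicit seed."""
    rng = random.Random(seed)  # private generator; no hidden global state
    means = [
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(means)


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]

run_a = bootstrap_mean_se(data, seed=2015)
run_b = bootstrap_mean_se(data, seed=2015)  # same seed: repeatable
run_c = bootstrap_mean_se(data, seed=2016)  # different seed: result shifts

# Hypothetical provenance record stored next to the estimate.
provenance = {
    "seed": 2015,
    "n_resamples": 200,
    "python": platform.python_version(),
    "platform": platform.platform(),
}
print(run_a == run_b, run_a == run_c, provenance["python"])
```

Recording the seed alone is not always enough. The same seed can yield different streams across library versions, which is one reason to capture the interpreter and platform as well.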
Durable, Long-Term Access
• Why durable access?
  • The rule of law requires maintaining authentic public records
  • Scientific advances rely on a cumulative, traceable evidence base
  • Art, history, culture require durable access to national heritage information
  • Our nation needs durable access to a strategic information reserve
  • Humanity needs durable long-term access to information in order to communicate to future generations
• Big data challenges to durability
  • Velocity – information is updated, sometimes overwritten
  • Many sources are commercial/private – not routinely archived, preserved
  • Modeling future value of information
  • Maintaining privacy and confidentiality
Big data challenges…
Anonymization can completely destroy utility
- The “Netflix Problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]
Observable Behavior Leaves Unique “Fingerprints”
- The “GIS” Problem: fine geo-spatial-temporal data are impossible to mask when correlated with external data [Zimmerman 2008; ]
Big Data can be Rich, Messy & Surprising
- The “Facebook Problem”: Possible to identify masked network data, if only a few nodes are controlled [Backstrom et al. 2007]
- The “Blog Problem”: Pseudonymous communication can be linked through textual analysis [Novak et al. 2004]
Source: [Calabrese 2008; Real Time Rome Project 2007]
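The linkage risk described for large, sparse, overlapping datasets can be sketched with toy data. The code below is a simplified illustration and is not the algorithm from the cited work. The function `link_records`, the rarity weighting, the 1.5 margin rule, and every record in it are hypothetical. It assumes at least two candidate records.

```python
import math
from collections import Counter


def link_records(anonymized, auxiliary, margin=1.5):
    """Match each auxiliary profile to its best-scoring anonymized record.

    Both arguments map an identifier to a set of items (e.g. films rated).
    Shared items are weighted by rarity, so overlap on an obscure item
    counts for more than overlap on a popular one.
    """
    popularity = Counter(i for items in anonymized.values() for i in items)

    def score(aux_items, anon_items):
        return sum(1.0 / math.log(1 + popularity[i])
                   for i in aux_items & anon_items)

    matches = {}
    for name, aux_items in auxiliary.items():
        scored = sorted(
            ((score(aux_items, items), pid) for pid, items in anonymized.items()),
            reverse=True,
        )
        best, runner_up = scored[0], scored[1]
        # Claim a link only when the best candidate clearly beats the next.
        if best[0] > 0 and best[0] >= margin * runner_up[0]:
            matches[name] = best[1]
    return matches


anonymized = {
    "u1": {"Blockbuster A", "Blockbuster B", "Obscure Documentary", "Cult Film"},
    "u2": {"Blockbuster A", "Blockbuster B", "Romcom"},
    "u3": {"Blockbuster A", "Romcom", "Indie Drama"},
}
auxiliary = {
    "alice": {"Blockbuster A", "Obscure Documentary", "Cult Film"},
    "bob": {"Blockbuster A"},
}

links = link_records(anonymized, auxiliary)
print(links)
```

In this toy setting the profile that shares two rare items is linked to a single pseudonymous record, while the profile that shares only a popular item is left unmatched. Sparsity is what makes a handful of known items so identifying.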
Little Data in a Big World
The “Favorite Ice Cream” problem
-- public information that is not risky can help us learn information that is risky
The “Doesn’t Stay in Vegas” problem
-- information shared locally can be found anywhere
The “Unintended Algorithmic Discrimination” problem
-- algorithms are often not transparent, and can amplify human biases
Categorizing Challenges
Implementation – Performance Challenges
  Systems challenges
  - Exceed capacity of locally managed storage
  - Location and migration of data becomes critical for performance
  - Standard backup, recovery and data integrity mechanisms ineffective
  - Communication bandwidth
  Algorithmic challenges
  - “in core” vs. “out-of-core” implementations
  - O(n^2) vs. O(log n) complexity
  - Static vs. streaming algorithms
  - Serial vs. massively parallel
  - Distributed – shared-nothing algorithms
Analysis methods – Inferential Challenges
  - Sources: Designed vs. “found” data
  - Model-based vs. data-based analysis
  - Causal inference vs. descriptive/predictive (forecasting) inference
Data Management & Workflow
  - Provenance
  - Data quality
  - Change management
  - Continuous integration
  - Accommodating variety – semantics, quality
  - Transparency and reproducibility
  - Privacy
  - Security
Data Governance and Policy
  - Standards
  - Incentives
  - Certifications
  - Regulation
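The contrast between static and streaming algorithms listed under algorithmic challenges can be shown with a running mean and variance. The sketch below uses Welford's online update, a standard technique chosen here for illustration rather than one named in the talk. The class name `RunningStats` and the generator `readings` are hypothetical.

```python
import math
import statistics


class RunningStats:
    """Single-pass mean and sample variance using Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def readings(count=10000):
    """Stand-in for a data stream too large to hold in memory."""
    for i in range(count):
        yield float(i % 7)


stats = RunningStats()
for value in readings():
    stats.update(value)  # constant memory: each value is seen once

# Static ("in core") reference: needs the whole dataset in memory.
in_core = list(readings())
print(stats.mean, statistics.mean(in_core))
print(stats.variance, statistics.variance(in_core))
```

The streaming version keeps three numbers regardless of how long the stream runs, whereas the static version must materialize every observation before computing anything.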
Preliminary Observations from First Workshop
Topic: Sources of Economic Big Data
Use Case: Commodity Flow Survey
Observations:
- Different classes of decisions require different sources of data:
  - E.g. much designed survey data contributes baseline data for decisions about infrastructure and strategic planning
  - Transaction-based big data could contribute frequency and granularity of estimates
- In big data, data sources are stakeholders
  - Businesses need to react quickly and predict the future – and need frequently updated detailed data
  - Critical to provide a value proposition to business
  - Critical to develop a trust relationship
Some potential sources
- ERP and DRP operations data
- EDI
- Mobile Phone Traffic Data
Some Non-Technical Questions About Sources
● Who are the key stakeholders in a big data source, and what are the key stakeholder incentives?
  ○ What key decisions does this information support for stakeholders?
  ○ What are the gaps in data from the stakeholder perspective?
● What are barriers associated with new sources of information?
  ○ Legal barriers
  ○ Economic barriers
  ○ Social/trust barriers
Potential Roles -- Infrastructure
Dissemination
- Catalog range of new statistics/indicators, sources
- Selection based on quality
- Guide proper use
Durability
- Ensure long-term accessibility of big data
- Manage provenance, versioning
- Provide transparency of new indicators/statistics
Security & Confidentiality
- Libraries could be a trusted and accountable 3rd party
- Store and integrate data from multiple sources
- Could develop expert implementation of privacy best practices
Potential Roles -- Leadership
Advocacy
- Advocate for quality, transparency, replication, durable access
Standardization
- Develop new methods for big data management
- Identify “best practices” for replication, transparency, long-term access
- Standardize licenses for reuse, preservation
Additional References
● Einav, Liran, and Jonathan Levin. "Economics in the age of big data." Science 346.6210 (2014): 1243089. http://www.sciencemag.org/content/346/6210/1243089.short
● Varian, Hal R. "Big data: New tricks for econometrics." The Journal of Economic Perspectives 28.2 (2014): 3-27. http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf
● Reimsbach-Kounatze, C. (2015), “The Proliferation of “Big Data” and Implications for Official Statistics and Statistical Agencies: A Preliminary Analysis”, OECD Digital Economy Papers, No. 245, OECD Publishing. http://dx.doi.org/10.1787/5js7t9wqzvg8-en
● Kriger, David S., et al. Freight Transportation Surveys. Vol. 410. Transportation Research Board, 2011. http://www.nap.edu/catalog/13627/nchrp-synthesis-410-freight-transportation-surveys
Questions?
E-mail: [email protected]
Web: informatics.mit.edu
Creative Commons License
This work. Managing Confidential information in research, by Micah Altman (http://redistricting.info) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.