Big Data Curation Webinar 19/12/2013
BIG Big Data Public Private Forum
Big Data Curation
Edward Curry (Insight @ NUI Galway)
Project co-funded by the European Commission within the 7th Framework Program (Grant Agreement No. 257943)
BIG DATA INSIGHTS
The Data Landscape
▶ Coping with data variety and verifiability is a central challenge and opportunity for Big Data
▶ The long tail of data variety is a major shift in the data landscape
▶ Need for scalable approaches to cope with data under different format and semantic assumptions
The Solution Space
▶ Lowering the usability barrier for data tools is a major requirement across all sectors. Users should be able to directly manipulate the data
▶ Blended human and algorithmic data processing approaches are a trend for coping with data acquisition, transformation, curation, access, and analysis challenges for Big Data
▶ Solutions based on large communities (crowd-based approaches) are emerging as a trend to cope with Big Data challenges
THE DATA VALUE CHAIN
Value Chain: Data Acquisition, Data Analysis, Data Curation, Data Storage, Data Usage
Technical Working Groups
Data Acquisition
• Structured data
• Unstructured data
• Event processing
• Sensor networks
• Streams
• Multimodality
Data Analysis
• Data preprocessing
• Semantic analysis
• Sentiment analysis
• Data correlation
• Pattern recognition
• Realtime analysis
• Machine learning
Data Curation
• Trust
• Provenance
• Data augmentation
• Annotation
• Data validation
• Redundancy elimination
• Keep up-to-date
• Consistency
Data Storage
• In-Memory Technology
• HANA
• Column DB
• NoSQL
• Cloud storage
• Compression
Data Usage
• Decision support
• Predictions
• Simulation
• Exploration
• Modelling
• Control
• Domain-specific usage
DATA CURATION
Value Chain
Data Acquisition
Data Analysis
Data Curation
Data Storage
Data Usage
THE PROBLEM: DATA QUALITY
Source A
ID    PNAME      PCOLOR  PRICE
APNR  iPod Nano  Red     150
APNS  iPod Nano  Silver  160
Source B
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
Schema Difference?
APNR  iPod Nano  Red     150
APNR  iPod Nano  Silver  160
iPod Nano  IPN890  150  5
Value Conflicts? Entity Duplication?
Roles: Data Developer, Data Steward, Business Users
Expertise: Technical, Domain (Technical), Domain
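The three questions on this slide (schema difference, value conflicts, entity duplication) can be illustrated with a minimal Python sketch. It assumes a hypothetical common schema (code, name, color, price, generation) that is not part of the slides, and it uses a deliberately naive rule: records that share a product name are flagged for a data steward to review. The function names and the flagging rule are illustrative assumptions, not the method described in the webinar.

```python
import xml.etree.ElementTree as ET

# Source A: tabular rows, using the column names shown on the slide.
SOURCE_A = [
    {"ID": "APNR", "PNAME": "iPod Nano", "PCOLOR": "Red", "PRICE": "150"},
    {"ID": "APNS", "PNAME": "iPod Nano", "PCOLOR": "Silver", "PRICE": "160"},
]

# Source B: the XML fragment shown on the slide.
SOURCE_B = """
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
"""


def from_source_a(row):
    """Map a Source A row onto the (hypothetical) common schema."""
    return {"source": "A", "code": row["ID"], "name": row["PNAME"],
            "color": row["PCOLOR"], "price": int(row["PRICE"]),
            "generation": None}


def from_source_b(xml_text):
    """Map every <Item> of a Source B document onto the common schema."""
    product = ET.fromstring(xml_text.strip())
    return [{"source": "B", "code": item.get("code"),
             "name": product.get("name"), "color": None,
             "price": int(item.findtext("price")),
             "generation": int(item.findtext("generation"))}
            for item in product.iter("Item")]


def curation_questions(records):
    """Flag pairs of records with the same name for human review."""
    questions = []
    for i, left in enumerate(records):
        for right in records[i + 1:]:
            if left["name"] != right["name"]:
                continue
            if left["price"] == right["price"]:
                kind = "possible duplicate"
            else:
                kind = "possible value conflict"
            questions.append((left["code"], right["code"], kind))
    return questions


records = [from_source_a(row) for row in SOURCE_A] + from_source_b(SOURCE_B)
for question in curation_questions(records):
    print(question)
```

The sketch only raises questions; deciding whether APNR and IPN890 are really the same item still needs someone with domain knowledge, which is the gap between the technical and domain roles that the slide points at.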
DATA CURATION OVERVIEW
Definition
▶ Digital Curation: “Selection, preservation, maintenance, collection, and archiving of digital assets”
▶ Data Curation: “Active management of data over its life-cycle”
Who?
▶ Individual Curators
▶ Curation Departments
▶ Community-based (Emerging trend)
How?
▶ Manual Curation
▶ (Semi-)Automated
▶ Sheer Curation
▶ Collaborative Data Management (Crowdsourcing)
Why?
▶ Accessibility
▶ Authenticity
▶ Collaboration
▶ Discoverability
▶ Fitness for Use
▶ Integrity
▶ Reusability
▶ Security
▶ Sustainability
▶ Trustworthiness
ALGORITHM + CROWD
Diagram labels: Data Sources, Data Quality Algorithms, Human Computation, Internal Community, External Crowd, Developers, Data Governance, Clean Data
MIXED HUMAN-COMPUTER INTELLIGENCE
Key Points
▶ Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user can’t)
▶ A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals
Related Areas
▶ Collective Intelligence
▶ Social Computing
▶ Human Computation
▶ Data Mining & Machine learning
▶ Natural Language Processing
▶ Speech recognition & Computer vision
HUMAN VS MACHINE AFFORDANCES
Human
✓ Visual perception
✓ Visuospatial thinking
✓ Audiolinguistic ability
✓ Sociocultural awareness
✓ Creativity
✓ Domain knowledge
Machine
✓ Large-scale data manipulation
✓ Collecting and storing large amounts of data
✓ Efficient data movement
✓ Bias-free analysis
WHEN COMPUTERS WERE HUMAN
Maskelyne 1760
▶ Used human computers to create an almanac of moon positions
▶ Used for shipping/navigation
▶ Quality assurance
▶ Do calculations twice
▶ Compare to third verifier
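The quality-assurance scheme on this slide (compute twice, then compare) can be sketched in a few lines of Python. This is an illustration under assumptions: the two "computers" are stand-in functions, one of which makes a deliberate mistake, and the verifier is modelled as a simple equality check. The exact historical procedure is not given in the slides.

```python
def careful_computer(x):
    """Stand-in for a human computer who gets every calculation right."""
    return x * x


def careless_computer(x):
    """Stand-in for a human computer who slips on one input."""
    return x * x + (1 if x == 3 else 0)


def verifier(first, second):
    """Third party: accepts a result only if both computations agree."""
    return first == second


def compute_with_check(task, computer_a, computer_b, check):
    """Do the calculation twice and return it only if the check passes."""
    first = computer_a(task)
    second = computer_b(task)
    return first if check(first, second) else None


accepted = {}   # task -> agreed result
rework = []     # tasks whose two results disagreed

for task in [1, 2, 3]:
    result = compute_with_check(task, careful_computer,
                                careless_computer, verifier)
    if result is None:
        rework.append(task)
    else:
        accepted[task] = result

print("accepted:", accepted)
print("rework:", rework)
```

Redundancy does not reveal which of the two results is wrong; it only detects disagreement, so flagged tasks still have to be recomputed.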
BIG DATA CURATION EXEMPLARS
TAG A TUNE
PEEKABOOM
FOLDIT
RECAPTCHA
▶ OCR
▶ ~1% error rate
▶ 20%-30% for 18th and 19th century books
▶ “40 million ReCAPTCHAs every day” (2008)
▶ Fixing 40,000 books a day
BIG DATA CURATION IN ENTERPRISES
Product Categorization
▶ Categorize millions of products with accurate and complete attributes
▶ Combine the crowd with machine learning to create an affordable and flexible catalog quality system
Sentiment Analysis
▶ Understanding customer sentiment for worldwide launch of new product
▶ Implemented 24/7 sentiment analysis system using workers from around the world
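One common way to combine the crowd with machine learning, as in the product categorization case, is to let the algorithm keep the items it is confident about and send the rest to human workers. The Python sketch below shows that routing idea only. The keyword "classifier", the categories, and the 0.8 threshold are all hypothetical stand-ins, not details of the systems mentioned on the slide.

```python
from collections import Counter

# Toy keyword rules standing in for a trained model (hypothetical).
RULES = {"ipod": "Audio Players", "laptop": "Computers"}


def classify(title):
    """Return (category, confidence) from keyword votes in the title."""
    hits = [RULES[word] for word in title.lower().split() if word in RULES]
    if not hits:
        return None, 0.0
    category, votes = Counter(hits).most_common(1)[0]
    return category, votes / len(hits)


def route(titles, threshold=0.8):
    """Split items into auto-labelled ones and a queue for crowd workers."""
    auto_labelled = {}
    crowd_queue = []
    for title in titles:
        category, confidence = classify(title)
        if confidence >= threshold:
            auto_labelled[title] = category
        else:
            crowd_queue.append(title)
    return auto_labelled, crowd_queue


TITLES = ["iPod Nano Red", "Laptop sleeve for iPod", "Garden hose"]
auto_labelled, crowd_queue = route(TITLES)
print("algorithm:", auto_labelled)
print("crowd:", crowd_queue)
```

Raising the threshold sends more items to the crowd, which costs more but reduces algorithmic mistakes; lowering it does the opposite.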
BIG DATA CURATION USE CASES
Telco, Media, & Entertainment
Manufacturing, Retail, Energy & Transport
Public Sector
Life Sciences
COMMUNITY, CROWDS, & OPEN DATA
Community & Crowds
▶ Leverage online communities to curate large datasets
▶ Natural Language Processing, Computer Vision, Classification, Verification, Enrichment, Judgments, etc.
Emerging Economic Model for Open Data
▶ Pre-competitive collaboration efforts
▶ Share costs, risks, & technical challenges
▶ Benefit from collective wisdom and network effect for curated dataset
▶ Pistoia Alliance (pharmaceutical data)
FUTURE REQUIREMENTS OF BIG DATA CURATION
Curation at Scale
▶ Increase in the need for automation
▶ Trust and provenance capture/management
Access Management
▶ Interfaces which can cope with different levels of expertise and responsibility
▶ Discoverability of data items
▶ Fine-grained control over accessibility of various data items
Variety of Expertise
▶ Enable contribution from a wide range of human resources such as programmers, domain experts, non-expert contributors, and crowds
▶ Distribute curation tasks while considering abilities of persons and complexities of tasks
Multimedia & Text
▶ Data curation infrastructure focused on multimedia and unstructured resources
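The requirement to distribute curation tasks according to people's abilities and task complexity can be sketched as a simple greedy assignment in Python. Everything here is an illustrative assumption: the 1 to 5 skill and complexity scales, the capacity limits, and the greedy rule (hardest tasks first, each given to the least-skilled person who still qualifies, so that experts stay free for hard tasks).

```python
from dataclasses import dataclass


@dataclass
class Curator:
    name: str
    skill: int      # hypothetical expertise scale, 1 (novice) to 5 (expert)
    capacity: int   # how many tasks this contributor can take


@dataclass
class Task:
    name: str
    complexity: int  # hypothetical difficulty scale, 1 (easy) to 5 (hard)


def assign(tasks, curators):
    """Greedy assignment: hardest tasks first, least-skilled qualified curator."""
    load = {curator.name: 0 for curator in curators}
    assignment = {}
    for task in sorted(tasks, key=lambda t: t.complexity, reverse=True):
        qualified = [c for c in curators
                     if c.skill >= task.complexity and load[c.name] < c.capacity]
        if not qualified:
            assignment[task.name] = None  # nobody available: needs escalation
            continue
        chosen = min(qualified, key=lambda c: c.skill)
        load[chosen.name] += 1
        assignment[task.name] = chosen.name
    return assignment


CURATORS = [
    Curator("domain expert", skill=5, capacity=1),
    Curator("programmer", skill=3, capacity=2),
    Curator("crowd", skill=1, capacity=100),
]
TASKS = [
    Task("resolve schema conflict", 5),
    Task("validate units", 3),
    Task("tag image", 1),
    Task("merge ontologies", 5),
]

assignment = assign(TASKS, CURATORS)
for task_name, curator_name in assignment.items():
    print(task_name, "->", curator_name)
```

In this run the second expert-level task is left unassigned because the only qualified contributor is at capacity, which is the kind of bottleneck the slide's call for contributions from a wider range of people is meant to ease.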
SUMMARY
The Data Landscape
▶ Coping with data variety and verifiability is a central challenge and opportunity for Big Data
▶ The long tail of data variety is a major shift in the data landscape
▶ Need for scalable approaches to cope with data under different format and semantic assumptions
The Solution Space
▶ Lowering the usability barrier for data tools is a major requirement across all sectors. Users should be able to directly manipulate the data
▶ Blended human and algorithmic data processing approaches are a trend for coping with data acquisition, transformation, curation, access, and analysis challenges for Big Data
▶ Solutions based on large communities (crowd-based approaches) are emerging as a trend to cope with Big Data challenges
▶ Principled semantic and standardized data representation models are central to coping with data heterogeneity
BIG DATA CURATION INTERVIEW SERIES
http://big-project.eu/text-interviews
Future Interviews
More to come in 2014…