TOWARDS A BIG DATA COMMUNITY CHALLENGE
Tilmann Rabl, Florian Stegmaier, Michael Granitzer and Hans-Arno Jacobsen
3rd Workshop on Big Data Benchmarking
July 16-17
Xi’an, China
BIG DATA – WHY COMMUNITY CHALLENGES MATTER
• Big Data is a major buzzword in the scientific world
– Conferences, workshops, tutorials, panels
– Component benchmarks, end-to-end systems, etc.
• Variety leads to incomparability of results
• Research communities run challenges to
– enable comparability of results
– foster the evolution of a research field
“Kites rise highest against the wind, not with it.” (W. Churchill)
WHAT SHOULD BE IN THE FOCUS?
DATA!
“[...] other communities, like information retrieval, natural language processing, or Web research, have a much richer and agile culture in creating, disseminating, and re-using interesting new data resources for scientific experimentation [...]” – G. Weikum, SIGMOD Blog
HOW SHOULD IT BE?
INTERESTING!
HOW ARE “THE OTHERS” DOING?
• Information retrieval community:
– TREC, TRECVid (task-based, measurable scientific impact)
– CLEF Initiative (task-based, benchmarking initiatives)
• Multimedia community:
– Multimedia Grand Challenge (tasks defined by “global players”, e.g., Yahoo! and Microsoft)
– Open Source Software Competition (foster community activities)
• Semantic Web community:
– Linked Data Cup (data generation)
– Semantic Web in-Use (mashup creation)
SUCCESSFUL COMMUNITY CHALLENGES: TAKE-HOME MESSAGE
• Challenges are not a single event
• On-going process, running through different stages:
– Data generation
– Solving restricted, high-impact issues
– Fostering open source frameworks
– Assembling mashups
• Accepted by the community
BRAINSTORMING AREA: STRUCTURE OF THE CHALLENGE
• Challenge needs to be focused on specific tasks:
– Tasks together form a “Big Data pipeline”
– Specified by academia and industry
• Hybrid approach to engage participants:
– Utilize benchmark activities
– Computing tasks on “Open Data”
TIME TO BREAKOUT!
• Discussions should focus on:
– Where to find large-scale, interesting “open” data sets?
– Which tasks could form a sophisticated Big Data pipeline ensuring a broad range of implementations?
BREAKOUT HOW-TO:
• Breakout and student groups as yesterday
• Prepare one slide for each question