
Big Data Processing Systems Study

Vasiliki Kalavri, EMJD-DC

3 Dec 2012

MapReduce: Simplified Data Processing on Large Clusters

OSDI 2004

MapReduce

● Specify a map and a reduce function (see the sketch below)

● The system takes care of:

○ parallelization

○ partitioning

○ scheduling

○ communication

○ fault-tolerance
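
To make the model concrete, here is a minimal word-count sketch in plain Scala (illustrating the model, not the Hadoop API): the user supplies only map and reduce, and the groupBy stands in for the shuffle the system performs.

    // Word count as user-supplied map and reduce functions.
    // groupBy stands in for the framework's shuffle/sort phase.
    object WordCount {
      def map(line: String): Seq[(String, Int)] =
        line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

      def reduce(word: String, counts: Seq[Int]): (String, Int) =
        (word, counts.sum)

      def main(args: Array[String]): Unit = {
        val input   = Seq("the quick brown fox", "the lazy dog")
        val mapped  = input.flatMap(map)                   // map phase
        val grouped = mapped.groupBy(_._1)                 // "shuffle"
        grouped.map { case (w, kvs) => reduce(w, kvs.map(_._2)) }
               .foreach(println)                           // reduce phase
      }
    }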


Hadoop MapReduce 1.0


MapReduce Limitations

● Static Pipeline

● No support for common operations

● Data materialization after every job

● Slow - not fit for interactive analysis

● Complex configuration


YARN (MapReduce v.2)


What is Hadoop/MR NOT good for?

● All the things it wasn't built for:

○ Iterative computations

○ Stream processing

○ Incremental computations

○ Interactive Analysis

○ [insert research paper here]


Improving Hadoop performance

● Reduce Network & Disk I/O

● Handle skewed datasets

● DB-like optimizations

○ column-oriented storage

○ indexes


Map-Reduce Inspired Systems

Extending the Programming Model

Map-Reduce Inspired Systems

● Extend the programming model to support:

○ Iterative applications

○ Stream applications


Iterative Processing

● Characteristics

○ Datasets already stored

○ Need to reuse a dataset more than once, possibly multiple times

○ Iterative jobs, e.g. estimates, convergence

● Problems with iterative MR applications

○ manual orchestration of several MR jobs

○ re-loading & re-processing of invariant data

○ no explicit way to define a termination condition


HaLoop: Efficient Iterative Data Processing on Large Clusters

VLDB 2010

System Overview


Programming Model

● Iterative Programming Model

R_{i+1} = R_0 ∪ (R_i ⋈ L)

● Extensions to MR (sketched below)

○ loop body

○ termination condition

○ loop-invariant data
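
A minimal plain-Scala sketch of this recurrence (collections only, not HaLoop's API), computing reachability where the loop-invariant input L is an edge set: the loop body, the invariant data, and the fixpoint termination condition are exactly the three extensions listed above.

    // R_{i+1} = R_0 ∪ (R_i ⋈ L): reachability over a loop-invariant edge set L.
    object IterativeReachability {
      def main(args: Array[String]): Unit = {
        val L  = Set((1, 2), (2, 3), (3, 4))   // loop-invariant data
        val r0 = Set(1)                        // initial result R_0
        var r  = r0
        var done = false
        while (!done) {                        // termination: fixpoint reached
          val joined = L.collect { case (from, to) if r(from) => to } // R_i ⋈ L
          val next   = r0 ++ joined            // loop body
          done = next == r
          r = next
        }
        println(r)                             // Set(1, 2, 3, 4)
      }
    }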


Loop-Aware Scheduling

● Inter-Iteration Locality

○ schedule tasks of different iterations which access the same data on the same machines


Caching and Indexing

● Reducer Input Cache

○ caches and indexes reducer inputs

○ reduces M->R I/O

● Reducer Output Cache

○ stores and indexes the most recent local reducer outputs

○ reduces the cost of computing the termination condition

● Mapper Input Cache

○ avoids non-local data reads in mappers


Stream Processing

● Characteristics

○ Data continuously comes into the system

○ Usually needs to be processed as it arrives

○ Frequent updates

● Problems with stream MR applications

○ MR runs on a static snapshot of a dataset

○ computations need to finish


Muppet: MapReduce-Style Processing of Fast Data

VLDB 2012

Programming Model

● MapUpdate

○ operates on streams, i.e. sequences of events with the same id in increasing timestamp order

● Slates

○ in-memory data structures which "summarize" all events with key k that an Update function has seen so far


Example Applications

● An application that monitors the Foursquare check-in stream to count the number of check-ins per retailer and displays the count on a Web page (sketched below)

● Detect "hot" topics in Twitter
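
A sketch of how the first application could look in the MapUpdate style (hypothetical signatures, with a local map standing in for Muppet's slate store, not its actual API): map keys each check-in event by retailer, and update folds the event into that retailer's slate.

    // Hypothetical MapUpdate-style sketch: count check-ins per retailer.
    object CheckinCounter {
      case class Event(retailer: String, userId: String)

      // Stand-in for Muppet's in-memory slate store (one slate per key).
      val slates = scala.collection.mutable.Map[String, Int]()

      def map(e: Event): (String, Event) = (e.retailer, e)  // key the event

      def update(key: String, e: Event): Unit =             // fold into slate
        slates(key) = slates.getOrElse(key, 0) + 1

      def main(args: Array[String]): Unit = {
        val stream = Seq(Event("cafe", "u1"), Event("cafe", "u2"), Event("shop", "u1"))
        stream.map(map).foreach { case (k, e) => update(k, e) }
        println(slates)                                     // cafe -> 2, shop -> 1
      }
    }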


System Overview

● Uses Cassandra to persist slate states


Map-Reduce Inspired Systems

Improving Performance

Map-Reduce Inspired Systems

● Improve performance by:

○ reusing data

○ building caches / indexes

○ DBMS-like optimizations

○ reducing I/O


Incoop: MapReduce for Incremental Computations

SOCC 2011

System Overview


Inc-HDFS

● Content-based chunking (see the sketch below)

● Fingerprint calculation
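
A minimal sketch of content-based chunking (a generic rolling-fingerprint scheme assumed for illustration; Inc-HDFS's exact fingerprint and parameters may differ): because boundaries are chosen by content rather than byte offsets, an insertion near the start of a file does not shift every later chunk.

    // Cut a chunk whenever a fingerprint of the last W bytes hits a marker
    // pattern, so chunk boundaries survive inserts and deletes elsewhere.
    object Chunker {
      val W    = 8                 // rolling window size (illustrative)
      val MASK = (1 << 6) - 1      // ~one boundary per 64 windows on average

      def chunks(data: Array[Byte]): Seq[Array[Byte]] = {
        val out = scala.collection.mutable.ArrayBuffer[Array[Byte]]()
        var start = 0
        for (i <- W to data.length) {
          val fp = java.util.Arrays.hashCode(data.slice(i - W, i)) // stand-in
          if ((fp & MASK) == 0 || i == data.length) {
            out += data.slice(start, i)
            start = i
          }
        }
        if (start < data.length) out += data.slice(start, data.length)
        out.toSeq
      }
    }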


Incremental MapReduce

● Incremental Map (sketched below)

○ persistently store intermediate results

○ insert a reference into the memoization server

○ query the memoization server and fetch the result if already computed

● Incremental Reduce

○ persistently store entire task computations

○ store and map sub-computations used in the Contraction phase
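
A toy sketch of the memoization idea (an in-process map standing in for Incoop's memoization server): a map task is keyed by the fingerprint of its input chunk, so only chunks whose content actually changed are re-executed.

    // Memoized map tasks: recompute only chunks with an unseen fingerprint.
    object IncrementalMap {
      val memo = scala.collection.mutable.Map[Int, Seq[(String, Int)]]()

      def fingerprint(chunk: String): Int = chunk.hashCode   // stand-in

      def mapTask(chunk: String): Seq[(String, Int)] =
        memo.getOrElseUpdate(fingerprint(chunk), {
          println(s"computing: $chunk")                      // only on a miss
          chunk.split("\\s+").map(w => (w, 1)).toSeq
        })

      def main(args: Array[String]): Unit = {
        mapTask("a b")   // computed
        mapTask("c d")   // computed
        mapTask("a b")   // fetched from the memo store, not recomputed
      }
    }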


Contraction Phase

● Break up large Reduce tasks into many applications of the Combine function

● Only a subset of Combiners needs to be re-executed


HAIL: Only Aggressive Elephants are Fast Elephants

VLDB 2012

System Overview


Upload Pipeline

● HDFS upload pipeline is changed so that:

○ the Client creates PAX blocks

○ DataNodes do not flush data or checksums to disk

○ After all chunks of a block have been received, the block is sorted in memory and flushed

○ Each DataNode computes its own checksums


Query Pipeline

Transparency is achieved using UDFs:

● HailInputFormat

○ elaborate splitting policy

○ scheduling takes into account relevant indexes

● HailRecordReader

○ uses user annotations / configuration info to select records for the map phase

○ transforms records from PAX to row format


Themis: An I/O-Efficient MapReduce

SOCC 2012

How to limit Disk I/O?

● Process records in memory and spill to disk as rarely as possible

● Relax fault-tolerance guarantees

○ job-level recovery

● Dynamic memory management

○ pluggable policies

● Per-node I/O management

○ organize data in large batches


Memory policies

● Pool-based

○ fixed-size, pre-allocated buffers

● Quota-based (sketched below)

○ controls dataflow between computational stages using queues

● Constraint-based

○ dynamically adjusts memory allocation based on requests and available memory
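
A rough sketch of the quota-based idea (generic bounded queues, not Themis code): the capacity of the queue between two stages is the quota, so a fast producer blocks instead of growing memory without bound.

    // Quota-based flow control between two stages via a bounded queue.
    import java.util.concurrent.ArrayBlockingQueue

    object QuotaDemo {
      def main(args: Array[String]): Unit = {
        val quota = new ArrayBlockingQueue[Int](4)  // stage-to-stage quota
        val producer = new Thread(() => (1 to 20).foreach(quota.put)) // blocks when full
        val consumer = new Thread(() => (1 to 20).foreach { _ =>
          println(s"consumed ${quota.take()}")      // take() frees quota
        })
        producer.start(); consumer.start()
        producer.join(); consumer.join()
      }
    }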


System Overview

Data-flow graph consisting of stages:

● Phase Zero extracts information about distribution of records and keys

● Phase One implements mapping and shuffling

● Phase Two implements sorting and reduce, always keeping results in memory


ReStore: Reusing Results of MapReduce Jobs

VLDB 2012

System Overview

● Built as an extension to Pig

● When a workflow is submitted, ReStore:

○ re-writes the query to reuse stored results (sketched below)

○ stores the outputs of the workflow

○ stores the results of sub-jobs

○ decides which outputs to keep in HDFS and which to delete
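
A toy sketch of the rewrite step (hypothetical plan and repository types; ReStore actually operates on Pig's compiled plans): stored outputs are keyed by a canonical signature of the sub-plan that produced them, and any matching sub-plan is replaced by a read of the stored result.

    // Replace a sub-plan with a stored result when its signature matches.
    object PlanReuse {
      sealed trait Plan
      case class Load(path: String)             extends Plan
      case class Filter(pred: String, in: Plan) extends Plan
      case class Stored(path: String)           extends Plan // materialized result

      // Toy stand-in for ReStore's repository of stored job outputs.
      val repository = Map("Filter(x>5,Load(/logs))" -> "/restore/out1")

      def signature(p: Plan): String = p match {
        case Load(path)    => s"Load($path)"
        case Filter(f, in) => s"Filter($f,${signature(in)})"
        case Stored(path)  => s"Stored($path)"
      }

      def rewrite(p: Plan): Plan =
        repository.get(signature(p)) match {
          case Some(path) => Stored(path)                  // reuse stored output
          case None => p match {
            case Filter(f, in) => Filter(f, rewrite(in))   // recurse into children
            case other         => other
          }
        }

      def main(args: Array[String]): Unit =
        println(rewrite(Filter("x>5", Load("/logs"))))     // Stored(/restore/out1)
    }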


System Architecture


Example


MANIMAL: Automatic Optimization for MapReduce Programs

VLDB 2011

Idea

● Apply well-known query optimization techniques to Map-Reduce jobs

● Static analysis of compiled code

● Apply optimizations only when "safe"


System Architecture


Example Optimizations

● Selection (sketched below)

○ if the map function is a filter, use a B+Tree to only scan the relevant portion of the input

● Projection

○ eliminate unnecessary fields from input records
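
A sketch of the selection rewrite with a sorted map standing in for the B+Tree (illustrative only, not MANIMAL's implementation): once static analysis proves the map function keeps only records in a key range, an index range scan can replace the full scan.

    // If map() is provably a filter on a key range, scan only that range.
    import scala.collection.immutable.TreeMap

    object IndexedSelection {
      def main(args: Array[String]): Unit = {
        val index = TreeMap(1 -> "a", 5 -> "b", 9 -> "c", 42 -> "d") // B+Tree stand-in

        val full   = index.filter { case (k, _) => k >= 5 && k < 40 } // full scan
        val ranged = index.range(5, 40)     // index scan: touches only [5, 40)

        println(full == ranged)             // true, but skips keys 1 and 42
      }
    }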


SkewTune: Mitigating Skew in MapReduce Applications

SIGMOD 2012

Common Types of Skew

● Uneven distribution of input data

○ partitioning which does not guarantee even distribution

○ popular key groups

● Expensive records

○ some portions of the input take longer to process than others


System Overview

● Per-task progress estimation

● Per-task statistics

● Late skew detection

○ skew mitigation is delayed until a slot is available

● Only re-partition one task at a time

○ only when half the remaining time is greater than the re-partitioning overhead


Implementation

Re-partition a map task

● mitigators execute as mappers within a new MapReduce job

● output is written to HDFS

Re-partition a reduce task

● a mitigator job with an identity map reads input from the task tracker


Starfish: A Self-Tuning System for Big Data Analytics

CIDR 2011

System Overview


Job-Level Tuning

● Just-in-Time Optimizer

○ chooses efficient execution techniques, e.g. for joins

● Profiler

○ learns performance models and job profiles

● Sampler

○ collects statistics about input, intermediate, and output data

○ helps the Profiler build approximate models


Workflow-Level Tuning

● Workflow-aware Scheduler

○ explores data locality at the workflow level instead of making locally optimal decisions

● What-If Engine

○ answers questions based on simulations of job executions


Workload-Level Tuning

● Workload Optimizer

○ Data-flow sharing

○ Materialization of intermediate results for reuse

○ Reorganization

● Elastisizer

○ node and network configuration automation


Big-Data Processing Beyond MapReduce

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

EuroSys 2007

System Overview


Graph Description


Communication


Graph Optimizations

● Schedule vertices close to the input data

● If a computation is associative and commutative, use an aggregation tree (see the sketch below)

● Dynamically refine the graph based on output data sizes

○ vary the number of vertices in each stage and their connectivity
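
A small plain-Scala sketch of the aggregation-tree idea: because the operation is associative and commutative, partial results can be combined pairwise in O(log n) levels instead of funneling every input into a single vertex.

    // Combine partial results pairwise: O(log n) levels, no single hot vertex.
    object TreeAggregate {
      def treeReduce(parts: Seq[Long], op: (Long, Long) => Long): Long =
        if (parts.size == 1) parts.head
        else treeReduce(parts.grouped(2).map(_.reduce(op)).toSeq, op)

      def main(args: Array[String]): Unit = {
        val partialSums = Seq(3L, 1L, 4L, 1L, 5L, 9L, 2L)
        println(treeReduce(partialSums, _ + _))   // 25
      }
    }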


SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets

VLDB 2008

System Overview


SCOPE scripting language

● resembles SQL with C# expressions

● commands are data transformation operators

● extensible MapReduce-like commands


SCOPE Execution

● The Compiler creates an internal parse tree

● The Optimizer creates a parallel execution plan, i.e. a Cosmos job

● The Job Manager constructs the graph and schedules execution


Spark: Cluster Computing with Working Sets

HotCloud 2010

RDDs

Read-only collection of objects

● partitioned across machines

● store their "lineage"

● can be re-constructed

● users can control persistence and partitioning


Programming Model

● Scala API

● driver program

○ defines RDDs and actions on them

● workers

○ long-lived processes

○ store and process RDD partitions in memory (word-count example below)
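
The canonical word-count example in Spark's Scala API (master URL and input path are placeholders): the driver defines RDD transformations lazily, and cache() keeps the working set in worker memory across actions, which is what iterative and interactive workloads need.

    // Word count on Spark; cache() pins the working set in worker memory.
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkWordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("wc").setMaster("local[*]"))
        val counts = sc.textFile("hdfs:///path/to/input")  // placeholder path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .cache()                                         // reuse across actions
        println(counts.count())                            // 1st action: computes
        counts.take(5).foreach(println)                    // 2nd action: hits cache
        sc.stop()
      }
    }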


Job Stages


Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing

SoCC 2010

The Stratosphere Stack


System Overview

● Execution plan in the form of a DAG

● Abstracts parallelization and communication

● Optimizer chooses the best execution strategy


Programming Model

● Input Contracts:

○ give guarantees on how data is organized into independent subsets

○ Map, Reduce, Match, Cross, CoGroup (Match sketched below)

● Output Contracts:

○ define properties of the output data

○ Same-Key, Super-Key, Unique-Key
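
A plain-Scala sketch of what the Match contract guarantees (semantics only, not the PACT API): the user function runs once for every pair of records from the two inputs that share a key, and each key's pairs form an independently parallelizable subset.

    // Match contract semantics: invoke the UDF on every same-key record pair.
    object MatchContract {
      def matchPairs[K, A, B, O](left: Seq[(K, A)], right: Seq[(K, B)])
                                (udf: (K, A, B) => O): Seq[O] = {
        val byKey = right.groupBy(_._1)
        for {
          (k, a) <- left
          (_, b) <- byKey.getOrElse(k, Seq.empty)
        } yield udf(k, a, b)
      }

      def main(args: Array[String]): Unit = {
        val orders = Seq((1, "order-a"), (2, "order-b"))
        val users  = Seq((1, "alice"), (1, "alfred"), (3, "carol"))
        println(matchPairs(orders, users)((k, o, u) => s"$k: $o x $u"))
        // List(1: order-a x alice, 1: order-a x alfred)
      }
    }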


ASTERIX: Scalable, Semi-structured Data Platform for Evolving-World Models

Distributed and Parallel Databases 2011

Evolving World Model

● As-of queries

○ What is the best route to get to the Olympic Stadium right now?

○ What is the traffic situation like on Saturday nights close to the city center?

○ How many visitors who visited City Hall during the past year also went for dinner at the nearby restaurant?


Data Model - Query Language

● Semi-structured data model, ADM

○ dataset ~ table: indexed, partitioned, replicated

○ dataverse ~ database

○ DDL: primary key, partitioning key

○ "open" data schemas

● AQL query language

○ declarative, inspired by Jaql and XQuery

○ logical plan -> DAG -> Hyracks job


System Overview


Dremel: Interactive Analysis of Web-Scale Datasets

VLDB 2010

Columnar Storage

● lossless representation

○ save field types, repetition/definition levels (worked example below)

● fast encoding

○ recursively traverses the record and computes levels

● efficient record assembly

○ use an FSM to reconstruct records
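
A tiny worked example of the levels for a single repeated field (hand-constructed here to illustrate the encoding, not Dremel's code): the repetition level r says at which repeated ancestor a value repeats (0 starts a new record), and the definition level d says how many optional/repeated fields on the path are actually present.

    // Column striping for: message Doc { repeated string tags }
    // doc1: tags = ["a", "b"]    doc2: tags = []    doc3: tags = ["c"]
    object RepetitionLevels {
      case class ColumnEntry(value: Option[String], r: Int, d: Int)

      val tagsColumn = Seq(
        ColumnEntry(Some("a"), r = 0, d = 1),  // r=0: starts a new record
        ColumnEntry(Some("b"), r = 1, d = 1),  // r=1: repeats within 'tags'
        ColumnEntry(None,      r = 0, d = 0),  // doc2: field absent, so d=0
        ColumnEntry(Some("c"), r = 0, d = 1)   // doc3
      )

      // Record assembly for this flat case: r == 0 opens a new record.
      def assemble(col: Seq[ColumnEntry]): Seq[List[String]] =
        col.foldLeft(Vector.empty[List[String]]) { (docs, e) =>
          if (e.r == 0) docs :+ e.value.toList
          else docs.init :+ (docs.last ++ e.value.toList)
        }

      def main(args: Array[String]): Unit =
        println(assemble(tagsColumn))  // Vector(List(a, b), List(), List(c))
    }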


Query Execution

● Language based on SQL

● Tree architecture

○ Root server

■ receives incoming queries

■ reads table metadata

■ routes queries to the next level of the tree

○ Leaf servers

■ communicate with the storage layer


Query Dispatcher

● Schedules queries to available slots

● Balances the load

● Ensures fault-tolerance

● Specifies what percentage of tablets must be scanned before returning a result


CIEL: a universal execution engine for distributed data-flow computing

NSDI 2011

Dynamic Task Graph


System Architecture


Skywriting Language

● Turing-complete

● Arbitrary data-dependent control flow

○ while loops

○ recursive functions

● Supports invocation of code written in other languages


References

www.citeulike.org/user/vasiakalavri
