Big Data Management at Scale: from data processing to architectures

Genoveva Vargas-Solar, Senior Scientist, French Council of Scientific Research, LIG-LAFMIA
[email protected]

Keystone, Santiago de Compostela, 17th-23rd July, 2016
http://vargas-solar.com/big-linked-data-keystone/

Data Access Patterns: Solutions

• MapReduce: data processing for complex BI and reporting
• Streaming: real-time processing and fulfillment
• Document (transactional): document storage for cohesive and large transactional data
• Relational (transactional): relational storage for highly structured transactional data
• Document (archival): document storage for archival

DBMS EVOLUTION

No more monolithic DBMS

Extensible, lightweight DBMS

Unbundled technology*

Component-based architectures* (thick-grain vs. fine-grain)

OO Frameworks

Components are providing Services

Blur the boundaries between OS & DBMS

Self-adaptive Systems

Multi-tier architectures, Web, P2P, GRID, CLOUD,…

* See Dittrich, Geppert, Eds, “Component Database Systems”, MK 2000

* Chaudhuri & Weikum, Rethinking Database System Architecture: Towards a Self-tuning RISC-style Database System, VLDB 2000

SERVICE ORIENTED DBMS ¹

Data services
Access services
Storage services
Additional extension services
Other services

Extension services: streaming, XML, procedures, queries, replication

¹ Ionut Subasu, Patrick Ziegler, and Klaus R. Dittrich. Towards service-based data management systems. In Workshop Proceedings of Datenbanksysteme in Business, Technologie und Web (BTW 2007). Klaus R. Dittrich and Andreas Geppert. Component database systems. Morgan Kaufmann, 2000.

Service level agreement: the contracted delivery time of the service or performance

Required SLA: agreements between the user and SDBMS expressed as a combination of weighted measures associated with a query

Service Level Agreement
• In the event of a corruption, or other disaster
  - the maximum amount of data loss is the last 15 minutes of transactions
  - the maximum amount of downtime the application can tolerate is 20 minutes
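The two bounds above can be read as a recovery point objective (at most 15 minutes of lost transactions) and a recovery time objective (at most 20 minutes of downtime). The sketch below checks an incident against them and scores a "required SLA" as a weighted combination of measures. The class, the function names, the measure names and the weights are hypothetical illustrations, not part of any SDBMS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoverySLA:
    """Recovery bounds taken from the slide (in minutes)."""
    max_data_loss_min: float = 15.0   # last 15 minutes of transactions
    max_downtime_min: float = 20.0    # tolerated application downtime

    def is_met(self, data_loss_min: float, downtime_min: float) -> bool:
        """True when an incident stayed within both bounds."""
        return (data_loss_min <= self.max_data_loss_min
                and downtime_min <= self.max_downtime_min)


def weighted_sla_score(measures: dict, weights: dict) -> float:
    """Combine normalized measures (0..1, higher is better) using weights.

    The measure names and the normalization are assumptions made for
    illustration; the slide only says the SLA is a weighted combination.
    """
    total_weight = sum(weights.values())
    return sum(weights[name] * measures[name] for name in weights) / total_weight


sla = RecoverySLA()
print(sla.is_met(data_loss_min=10, downtime_min=18))   # within both bounds
print(sla.is_met(data_loss_min=10, downtime_min=25))   # downtime too long
print(weighted_sla_score({"latency": 0.5, "cost": 1.0},
                         {"latency": 3.0, "cost": 1.0}))
```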

The cloud as data management environment

THE CLOUD

Promotes a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet

IaaS: allows customers to rent computers (virtual machines) on which to run their own computer applications.

Infrastructure as a service (IaaS)
Platform as a service (PaaS)
Software as a service (SaaS)

• Illusion of infinite resources
• No up-front cost
• Fine-grained billing (e.g. hourly)

THE CLOUD

Individual users & applications
Software as a service (SaaS), e.g., Salesforce, Google Apps
Platform as a service (PaaS), e.g., Microsoft Azure, Google App Engine
Infrastructure as a service (IaaS), e.g., Amazon EC2, GoGrid, Rackspace
Enabling technologies (hardware & software) [Furht, Escalante 2010]

THE CLOUD

Infrastructure as a service (IaaS), Platform as a service (PaaS), Software as a service (SaaS)

• Computing power is elastic, but only if the workload is parallelizable
  - Shared-nothing architecture
• Data is stored at untrusted hosts
  - Solution: encrypting data
• Data is replicated, across large geographic distances
  - Availability and durability

CLOUD DATA MANAGEMENT: SERVICES VIEWS

Data volume: Peta (10^15), Exa (10^18), Zetta (10^21), Yotta (10^24)

Hardware: magnetic tape, RAID, Cloud

• Storage (persistency)
• Efficient retrieval (indexing, caching)
• Fault tolerance (recovery, replication)
• Maintenance

• Definition
• Querying and exploiting
• Manipulation

DATA MANAGEMENT WITHOUT RESOURCE CONSTRAINTS

Reduce the cost to manage and exploit data sets according to unlimited storage, memory and computation resources

Systems and algorithms: COST-AWARE, ELASTIC

SQL AS A SERVICE

User applications
Relational Cloud storage service: relational model and SQL as a service, e.g., Amazon Relational Database Service (RDS), MS SQL Azure
Relational DBMS: implemented on top of parallel clusters of common DBMS servers, e.g., MySQL, MS SQL Server

CLOUD DATA MANAGEMENT: FUNCTIONS VIEW

Individual users & applications
• Query language: high level languages for accessing data and controlling processing
• Distributed processing system: performance for complex operations (SQL-like joins & grouping, data analysis)
• Structured data system: simple & flexible data model (key-value), basic access operations (lookup API)
• Distributed storage system: performance for data access, fault tolerance, availability, scalability

CLOUD DATA MANAGEMENT: FUNCTIONS VIEW

Individual users & applications
• Query language: HiveQL, JaQL, Pig on top of Hadoop Map-Reduce
• Distributed processing system: Google/Hadoop MapReduce
• Structured data system: Google BigTable & other BigTable implementations like HBase, Cassandra, Amazon SimpleDB
• Distributed storage system
  - Distributed file systems: Google File System, Hadoop Distributed File System, CloudStore
  - Cloud-based file service: Amazon S3
  - P2P-like file service: Amazon Dynamo

OPEN SOURCE BIG DATA STACKS

Notes:
• Giant byte sequence at the bottom
• Map, sort, shuffle, reduce layer in middle
• Possible storage layer in middle as well
• HLLs now at the top

From Mike Carey

ASTERIXDB PROJECT @ UCI

http://asterixdb.ics.uci.edu

"One Size Fits a Bunch": Semi-structured Data Management, Parallel Database Systems, Data-Intensive Computing

• Inside "Big Data Management": Ogres, Onions, or Parfaits?, Vinayak Borkar, Michael J. Carey, Chen Li, EDBT/ICDT 2012 Joint Conference, Berlin
• Data Services, Michael J. Carey, Nicola Onose, Michalis Petropoulos, CACM June 2012 (Vol. 55, No. 6)

THE ASTERIX SOFTWARE STACK

[Figure: the Asterix software stack, #AsterixDB]
• Languages and front ends: AsterixQL (Asterix Data Mgmt. System), HiveQL (Hivesterix), Piglet, other HLL compilers
• Algebricks Algebra Layer
• Jobs: Hadoop M/R job (Hadoop M/R Compatibility), Pregel job (Pregelix), IMRU job (IMRU), Hyracks job
• Hyracks Data-parallel Platform

GOOGLE BIGQUERY

Next generation of analytics data stack
• Berkeley Data Analytics Stack (BDAS)
• Released as open source

TERALAB

https://www.teralab-datascience.fr

Big Data platform for research and experimentation

FSN Big Data Call for academia and start ups

Target infrastructure
- Storage: 1.5 petabytes
- RAM: 16 terabytes
- Computing power [SPECint_rate2006]: 28000

Software as a Service: R(evolution), MapReduce, Impala, Hive, Pig, GRAPHLAB, KNIME, Rapid Miner, Alpine miner, Python tools (Pandas, IPython...)

Public data collections

DATABASE LANDSCAPE

CONCLUSIONS & PERSPECTIVES

Data collections
- New scales: bronto scale due to emerging IoT
- New types: thick, long, hot, cold
- New quality measures: QoS, QoE, SLA

Data processing & analytics
- Complex jobs and stream analytics are still open issues
- Economic cost model & business models (Big Data value & pay-as-you-go)

Multi-cloud: elasticity, quality, SLA

Genoveva Vargas-Solar, CR1, CNRS, [email protected]

http://vargas-solar.com/big-linked-data-keystone/

DISTRIBUTED FILE SYSTEM

Reliable distributed file system

Data kept in "chunks" spread across machines

Each chunk replicated on different machines
- Seamless recovery from disk or machine failure

[Figure: chunks of two files (C0 to C5, D0 and D1) replicated across chunk servers 1, 2, 3, ..., N]

Bring computation directly to the data!

Chunk servers also serve as compute servers

J. LESKOVEC, A. RAJARAMAN, J. ULLMAN: MINING OF MASSIVE DATASETS, HTTP://WWW.MMDS.ORG
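A minimal sketch of the chunk-and-replicate idea: a file is cut into chunks, each chunk is placed on several distinct servers, and the data stays readable while every chunk keeps at least one live replica. The round-robin placement, the server names and all function names are assumptions made for illustration; GFS and HDFS use their own (rack-aware) placement policies.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, servers: list, replicas: int) -> dict:
    """Assign each chunk to `replicas` distinct servers, round-robin."""
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        placement[chunk_id] = [servers[(chunk_id + r) % len(servers)]
                               for r in range(replicas)]
    return placement


def readable_after_failure(placement: dict, failed: set) -> bool:
    """True if every chunk still has at least one replica on a live server."""
    return all(any(s not in failed for s in holders)
               for holders in placement.values())


chunks = split_into_chunks(b"x" * 1000, chunk_size=256)        # 4 chunks
placement = place_replicas(len(chunks), ["s1", "s2", "s3", "s4"], replicas=3)
print(placement[0])                                            # ['s1', 's2', 's3']
print(readable_after_failure(placement, failed={"s1", "s2"}))  # True
```

With three replicas per chunk, any two simultaneous server failures leave at least one copy of every chunk, which is the "seamless recovery" property the slide refers to.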

PIG

"Pig Latin: A Not-So-Foreign Language for Data Processing"
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins (Yahoo! Research)
- http://www.sigmod08.org/program_glance.shtml#sigmod_industrial_program
- http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf

PIG

General description
• High level data flow language for exploring very large datasets
• Compiler that produces sequences of MapReduce programs
• Structure is amenable to substantial parallelization
• Operates on files in HDFS
• Metadata not required, but used when available
• Provides an engine for executing data flows in parallel on Hadoop

Key properties
• Ease of programming
  - Trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks
• Optimization opportunities
  - Allows the user to focus on semantics rather than efficiency
• Extensibility
  - Users can create their own functions to do special-purpose processing

EXAMPLE

Top 5 pages accessed by users between 18 and 25 years old

Data flow, in order:
• Load Users, Load Pages
• Filter by Age
• Join on Name
• Group on url
• Count Clicks
• Order by Clicks
• Take Top 5
• Save results
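The data flow above can be sketched in plain Python on in-memory lists, as a way to see what each step contributes before it is expressed in Pig or MapReduce. The record layouts `(name, age)` and `(user_name, url)`, the function name and the sample data are assumptions; the slide only names the steps.

```python
from collections import Counter


def top_pages(users, pages, min_age=18, max_age=25, n=5):
    """Return the n most-visited URLs among users aged min_age..max_age.

    users: iterable of (name, age); pages: iterable of (user_name, url).
    """
    # Filter by Age
    young = {name for name, age in users if min_age <= age <= max_age}
    # Join on Name, Group on url, Count Clicks
    clicks = Counter(url for user, url in pages if user in young)
    # Order by Clicks, Take Top n (ties broken by url for determinism)
    ranked = sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


users = [("ann", 19), ("bob", 40), ("cy", 25)]
pages = [("ann", "/a"), ("cy", "/a"), ("bob", "/b"), ("ann", "/c")]
print(top_pages(users, pages))   # [('/a', 2), ('/c', 1)]
```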

EQUIVALENT JAVA MAP REDUCE CODE

[Figure: Java MapReduce code listing]

Map reduce
The new software stack

MAP REDUCE

Challenges:
- How to distribute computation?
- Distributed/parallel programming is hard

Map-reduce addresses all of the above
- Google's computational/data manipulation model
- Elegant way to work with big data

SINGLE NODE ARCHITECTURE

[Figure: a single node with CPU, Memory and Disk. Machine Learning and Statistics work on data in Memory; "Classical" Data Mining works on data on Disk]

MOTIVATION: GOOGLE EXAMPLE

20+ billion web pages x 20KB = 400+ TB

1 computer reads 30-35 MB/sec from disk
- ~4 months to read the web

~1,000 hard drives to store the web

Takes even more to do something useful with the data!

Today, a standard architecture for such problems is emerging:
- Cluster of commodity Linux nodes
- Commodity network (ethernet) to connect them
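The figures on this slide follow from simple arithmetic, reproduced below as a quick check. Decimal units (1 KB = 10^3 bytes, 1 MB = 10^6 bytes) and the constant and function names are assumptions of this sketch.

```python
PAGES = 20e9                 # 20 billion web pages
PAGE_SIZE_BYTES = 20e3       # 20 KB per page (decimal units)
SECONDS_PER_DAY = 86_400


def total_bytes(pages: float = PAGES, page_size: float = PAGE_SIZE_BYTES) -> float:
    """Size of the whole collection in bytes."""
    return pages * page_size


def days_to_read(num_bytes: float, mb_per_sec: float) -> float:
    """Days needed to stream num_bytes sequentially at mb_per_sec."""
    return num_bytes / (mb_per_sec * 1e6) / SECONDS_PER_DAY


size = total_bytes()
print(size / 1e12)                     # 400.0 (TB)
print(round(days_to_read(size, 35)))   # about 132 days
print(round(days_to_read(size, 30)))   # about 154 days
```

At 30-35 MB/sec a single machine therefore needs between four and five months just to read the data once, and spreading 400 TB over about 1,000 drives corresponds to roughly 0.4 TB per drive.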

CLUSTER ARCHITECTURE

[Figure: racks of nodes, each node with CPU, Mem and Disk; one switch per rack, and a backbone switch connecting the rack switches]

Each rack contains 16-64 nodes

1 Gbps between any pair of nodes in a rack

2-10 Gbps backbone between racks

In 2011 it was estimated that Google had 1M machines, http://bit.ly/Shh0RO

LARGE-SCALE COMPUTING

For data mining problems on commodity hardware

Challenges:
- How do you distribute computation?
- How can we make it easy to write distributed programs?
- Machines fail:
  - One server may stay up 3 years (1,000 days)
  - If you have 1,000 servers, expect to lose 1/day
  - People estimated Google had ~1M machines in 2011
  - 1,000 machines fail every day!
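The failure figures are a rate calculation: if one server fails about once every 1,000 days, a fleet fails in proportion to its size. A small sketch follows; it assumes independent failures spread evenly over time, which simplifies real failure behaviour, and the function name is illustrative.

```python
def expected_failures_per_day(num_servers: int, mean_uptime_days: float) -> float:
    """Expected daily failures if each server fails about once per mean_uptime_days."""
    return num_servers / mean_uptime_days


print(expected_failures_per_day(1_000, 1_000))       # 1.0
print(expected_failures_per_day(1_000_000, 1_000))   # 1000.0
```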

IDEA AND SOLUTION

Issue: Copying data over a network takes time

Idea:
- Bring computation close to the data
- Store files multiple times for reliability

Map-reduce addresses these problems
- Google's computational/data manipulation model
- Elegant way to work with big data
- Storage infrastructure: file system
  - Google: GFS. Hadoop: HDFS
- Programming model
  - Map-Reduce

STORAGE INFRASTRUCTURE

Problem:
- If nodes fail, how to store data persistently?

Answer:
- Distributed File System:
  - Provides global file namespace
  - Google GFS; Hadoop HDFS

Typical usage pattern
- Huge files (100s of GB to TB)
- Data is rarely updated in place
- Reads and appends are common

PROGRAMMING MODEL: MAP REDUCE

Warm-up task:
- We have a huge text document
- Count the number of times each distinct word appears in the file

Sample application: Analyze web server logs to find popular URLs

TASK: WORD COUNT

Case 1:
- File too large for memory, but all <word, count> pairs fit in memory

Case 2:
- Count occurrences of words: words(doc.txt) | sort | uniq -c
- where words takes a file and outputs the words in it, one per line

Case 2 captures the essence of MapReduce
- Great thing is that it is naturally parallelizable
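The `words | sort | uniq -c` pipeline can be mirrored step by step in Python: produce words, sort them, then count each run of equal neighbours. This is only a sketch: the tokenization rule is an assumption, and `sorted()` works in memory, whereas the point of Case 2 is that the sort can be done externally or in parallel.

```python
import re
from itertools import groupby


def words(text: str):
    """Yield the words of a text one at a time, lower-cased."""
    # The tokenization rule (letters and apostrophes) is an assumption.
    for match in re.finditer(r"[A-Za-z']+", text):
        yield match.group().lower()


def sort_uniq_count(items):
    """Equivalent of `sort | uniq -c`: sort, then count each run of equal items."""
    return [(key, sum(1 for _ in run)) for key, run in groupby(sorted(items))]


print(sort_uniq_count(words("the crew of the space shuttle, the crew")))
```

The three stages line up with MapReduce: `words` plays the role of Map, the sort plays the role of the group-by-key step, and the per-run count plays the role of Reduce.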

MAP REDUCE: OVERVIEW

Sequentially read a lot of data

Map:
- Extract something you care about

Group by key: Sort and Shuffle

Reduce:
- Aggregate, summarize, filter or transform

Write the result

MAPREDUCE: THE MAP STEP

[Figure: each input key-value pair (k, v) is passed to map, which emits intermediate key-value pairs (k, v)]

MAP REDUCE: THE REDUCE STEP

[Figure: intermediate key-value pairs are grouped by key into key-value groups (k, v v v ...); reduce is applied to each group and emits output key-value pairs]

MORE SPECIFICALLY

Input: a set of key-value pairs

- Map(k, v) → <k', v'>*
  - Takes a key-value pair and outputs a set of key-value pairs
  - E.g., key is the filename, value is a single line in the file
  - There is one Map call for every (k, v) pair

- Reduce(k', <v'>*) → <k', v''>*
  - All values v' with same key k' are reduced together and processed in v' order
  - There is one Reduce function call per unique key k'
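The two signatures can be made concrete with a tiny single-process driver: it calls Map once per input pair, groups the intermediate pairs by key, and calls Reduce once per unique key. This is a sketch of the programming model only, not of Hadoop's or Google's runtime; values are not sorted within a group here, and the log format and all names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

KV = Tuple[Hashable, object]


def map_reduce(inputs: Iterable[KV],
               map_fn: Callable[[Hashable, object], Iterable[KV]],
               reduce_fn: Callable[[Hashable, List[object]], Iterable[KV]]) -> List[KV]:
    """Run Map, group by key, then Reduce, all in one process."""
    groups: Dict[Hashable, List[object]] = defaultdict(list)
    for k, v in inputs:                      # one Map call per (k, v) pair
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    output: List[KV] = []
    for k2 in sorted(groups):                # one Reduce call per unique key k'
        output.extend(reduce_fn(k2, groups[k2]))
    return output


# Example: popular URLs from (hypothetical) web server log lines.
def map_log(filename, line):
    yield line.split()[1], 1                 # assume the URL is the second field


def reduce_count(url, counts):
    yield url, sum(counts)


log = [("access.log", "GET /home 200"),
       ("access.log", "GET /about 200"),
       ("access.log", "GET /home 200")]
print(map_reduce(log, map_log, reduce_count))   # [('/about', 1), ('/home', 2)]
```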

MAP-REDUCE: WORD COUNTING

Big document (sequentially read the data; only sequential reads):
The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. "The work we're doing now -- the robotics we're doing -- is what we're going to need ...

MAP (provided by the programmer): read input and produce a set of key-value pairs
(The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) ...

Group by key: collect all pairs with same key
(crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) ...

Reduce (provided by the programmer): collect all values belonging to the key and output
(crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) ...

WORD COUNT USING MAP REDUCE

map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

MAP-REDUCE: ENVIRONMENT

Map-Reduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Performing the group by key step
- Handling machine failures
- Managing required inter-machine communication

MAP-REDUCE: A DIAGRAM

Big document

MAP: Read input and produce a set of key-value pairs

Group by key: Collect all pairs with same key (Hash merge, Shuffle, Sort, Partition)

Reduce: Collect all values belonging to the key and output
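The group-by-key step relies on partitioning: every intermediate pair with the same key must reach the same reduce task. One common way to get this is to hash the key modulo the number of reducers, sketched below. CRC32 is used only because it is deterministic across runs (Python's built-in `hash()` is randomized for strings); the function names are hypothetical and this is not Hadoop's partitioner.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Pick the reduce task for a key with a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers: int):
    """Route (key, value) pairs to reducers and group values by key, keys sorted."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [dict(sorted(b.items())) for b in buckets]


pairs = [("the", 1), ("crew", 1), ("the", 1), ("space", 1), ("the", 1)]
for reducer_id, groups in enumerate(shuffle(pairs, num_reducers=2)):
    print(reducer_id, groups)
```

Whatever the bucket assignment turns out to be, all three `("the", 1)` pairs end up in the same bucket, so a single Reduce call sees the complete list of values for that key.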

MAP-REDUCE: IN PARALLEL

All phases are distributed with many tasks doing the work

MAP REDUCE SUMMARY

Highly fault tolerant

Relatively easy to write "arbitrary" distributed computations over very large amounts of data

MR framework removes burden of dealing with failures from programmer

Schema embedded in application code

A lack of shared schema
- Makes sharing data between applications difficult
- Makes lots of DBMS "goodies" such as indices, integrity constraints, views, ... impossible

No declarative query language

Map reduce
Suited problems

MAP-REDUCE DESIGN PATTERNS

SUMMARIZATION
• Numerical: minimum, maximum, count, average, median, standard deviation
• Inverted index: Wikipedia inverted index
• Counting with counters: count number of records, a small number of unique instances, summations
  - Number of users per state

FILTERING
• Filtering: closer view of data, tracking event threads, distributed grep, data cleansing, simple random sampling, remove low scoring data
• Bloom: remove most of non-watched values, prefiltering data for a set membership check
  - Hot list, HBase query
• Top ten: outlier analysis, select interesting data, catchy dashboards
  - Top ten users by reputation
• Distinct: deduplicate data, getting distinct values, protecting from inner join explosion
  - Distinct user ids

DATA ORGANIZATION
• Structured to hierarchical: prejoining data, preparing data for HBase or MongoDB
  - Post/comment building for StackOverflow, Question/Answer building
• Partitioning: partitioning users by last access date
• Binning: binning by Hadoop-related tags
• Total order sorting: sort users by last visit
• Shuffling: anonymizing StackOverflow comments

JOIN
• Reduce side join: multiple large data sets joined by foreign key
  - User – comment join
• Reduce side join with bloom filter: reputable user – comment join
• Replicated join: replicated user – comment join
• Composite join: composite user – comment join
• Cartesian product: comment comparison
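As one worked instance of these patterns, the "top ten" filter can be sketched as follows: each mapper keeps only the best n records of its own split, and a single reducer merges those short lists into the global top n. The `(reputation, user)` records, the split layout and the function names are hypothetical; n is set to 2 to keep the example small.

```python
import heapq


def map_top_n(records, n):
    """Mapper: keep only the local top n (reputation, user) pairs of one split."""
    return heapq.nlargest(n, records)


def reduce_top_n(local_tops, n):
    """Reducer: merge the mappers' local lists into the global top n."""
    merged = [item for top in local_tops for item in top]
    return heapq.nlargest(n, merged)


# Hypothetical (reputation, user) records, split across two mappers.
split_a = [(120, "ann"), (15, "bob"), (300, "cy")]
split_b = [(45, "dee"), (999, "eve")]
local = [map_top_n(split_a, 2), map_top_n(split_b, 2)]
print(reduce_top_n(local, 2))   # [(999, 'eve'), (300, 'cy')]
```

The design choice is that each mapper ships at most n records to the reducer, so the data moved over the network does not grow with the input size.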

Pointers and further reading

IMPLEMENTATIONS

Google
- Not available outside Google

Hadoop
- An open-source implementation in Java
- Uses HDFS for stable storage
- Download: http://lucene.apache.org/hadoop/

Aster Data
- Cluster-optimized SQL Database that also implements MapReduce

READING

Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
- http://labs.google.com/papers/mapreduce.html

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System
- http://labs.google.com/papers/gfs.html

RESOURCES

Hadoop Wiki
- Introduction
  - http://wiki.apache.org/lucene-hadoop/
- Getting Started
  - http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
- Map/Reduce Overview
  - http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
  - http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
- Eclipse Environment
  - http://wiki.apache.org/lucene-hadoop/EclipseEnvironment

Javadoc
- http://lucene.apache.org/hadoop/docs/api/

RESOURCES

Releases from Apache download mirrors
- http://www.apache.org/dyn/closer.cgi/lucene/hadoop/

Nightly builds of source
- http://people.apache.org/dist/lucene/hadoop/nightly/

Source code from subversion
- http://lucene.apache.org/hadoop/version_control.html

Page 63: from data processing to architecturesvargas-solar.com/big-linked-data-keystone/wp... · e.g., Amazon EC2, GoGrid, Rackspace Platform as a service (PaaS) e.g., Microsoft Azure, Google

FURTHER READING

Programming model inspired by functional language primitives

Partitioning/shuffling similar to many large-scale sorting systems
- NOW-Sort ['97]

Re-execution for fault tolerance
- BAD-FS ['04] and TACC ['97]

Locality optimization has parallels with Active Disks/Diamond work
- Active Disks ['01], Diamond ['04]

Backup tasks similar to Eager Scheduling in Charlotte system
- Charlotte ['96]

Dynamic load balancing solves similar problem as River's distributed queues
- River ['99]


VISUAL GUIDE TO NOSQL SYSTEMS

[Figure: CAP triangle with vertices C, A and P and edges C-A, A-P and C-P; systems are placed along the edges and tagged by data model.]

Consistency: all clients always have the same view of the data

Availability: each client can always read & write

Partition tolerance: the system works well despite physical network partitions

Data models
- Relational
- Key-Value
- Column-oriented/Tabular
- Document-oriented

Systems shown (grouped as in the figure)
- Dynamo, Voldemort, Tokyo Cabinet, KAI
- Cassandra, SimpleDB, CouchDB, Riak
- BigTable, HyperTable, HBase
- MongoDB, TerraStore, Scalaris
- BerkeleyDB, MemcacheDB, Redis
- RDBMSs (MySQL, Postgres, etc.)
- Aster Data, GreenPlum, Vertica


NOSQL STORES CHARACTERISTICS

Simple operations
- Key lookups, reads and writes of one record or a small number of records
- No complex queries or joins
- Ability to dynamically add new attributes to data records

Horizontal scalability
- Distribute data and operations over many servers
- Replicate and distribute data over many servers
- No shared memory or disk

High performance
- Efficient use of distributed indexes and RAM for data storage
- Weak consistency model
- Limited transactions

Next generation databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable [http://nosql-database.org]
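The characteristics above (key lookups, attributes added on the fly, data distributed and replicated over servers that share neither memory nor disk) can be illustrated with a toy sketch. Hash-based placement is an assumption made for illustration, since the slides do not prescribe a partitioning scheme, and the names `Cluster`, `replicas_for`, `put` and `get` are hypothetical rather than any store's API.

```python
import hashlib


class Cluster:
    """Toy shared-nothing key-value cluster: each server is a private dict."""

    def __init__(self, n_servers, n_replicas=2):
        self.servers = [dict() for _ in range(n_servers)]
        self.n_replicas = n_replicas

    def replicas_for(self, key):
        # Stable hash so the same key always maps to the same servers.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.servers)
        return [(first + i) % len(self.servers) for i in range(self.n_replicas)]

    def put(self, key, record):
        # Write a copy of the record to every replica responsible for the key.
        for idx in self.replicas_for(key):
            self.servers[idx][key] = dict(record)

    def get(self, key):
        # Simple key lookup: read from the first replica that holds the key.
        for idx in self.replicas_for(key):
            if key in self.servers[idx]:
                return self.servers[idx][key]
        return None


cluster = Cluster(n_servers=4, n_replicas=2)
cluster.put("user:42", {"name": "Ada"})
# A new attribute is added without any schema change (illustrative data).
cluster.put("user:42", {"name": "Ada", "city": "Grenoble"})
```

Adding servers spreads keys over more machines, which is the scale-out behaviour the slide describes; a real system would also need rebalancing when the server count changes, which this sketch leaves out.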


Data stores designed to scale simple OLTP-style application loads

• Data model
• Consistency
• Storage
• Durability
• Availability
• Query support

Read/Write operations by thousands/millions of users


IMPORTANT DESIGN GOALS

Scale out: designed for scale
- Commodity hardware
- Low latency updates
- Sustain high update/insert throughput

Elasticity – scale up and down with load

High availability – downtime implies lost revenue
- Replication (with multi-mastering)
- Geographic replication
- Automated failure recovery


LOWER PRIORITIES

No complex querying functionality
- No support for SQL
- CRUD operations through database specific API

No support for joins
- Materialize simple join results in the relevant row
- Give up normalization of data?

No support for transactions
- Most data stores support single row transactions
- Tunable consistency and availability (e.g., Dynamo)

→ Achieve high scalability
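The "tunable consistency and availability (e.g., Dynamo)" point is commonly expressed with three knobs: N replicas per record, W replicas that must acknowledge a write, and R replicas consulted per read. Choosing R + W > N makes every read quorum overlap every write quorum, while smaller values trade that guarantee for latency and availability. The sketch below checks the rule by brute force; the function names are illustrative and not part of any store's API.

```python
from itertools import combinations


def quorums_always_intersect(n, r, w):
    """Brute force: does every read set of size r meet every write set of size w?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


def overlap_rule(n, r, w):
    """The quorum rule of thumb: reads see the latest write when r + w > n."""
    return r + w > n


# N = 3 replicas: (R=2, W=2) overlaps, (R=1, W=1) does not.
strict = quorums_always_intersect(3, 2, 2)
relaxed = quorums_always_intersect(3, 1, 1)
```

With N = 3, the setting R = W = 2 tolerates one unavailable replica for both reads and writes, whereas R = W = 1 answers faster but may return stale data.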


WHY SACRIFICE CONSISTENCY?

It is a simple solution
- nobody understands what sacrificing P means
- sacrificing A is unacceptable in the Web
- possible to push the problem to app developer

C not needed in many applications
- Banks do not implement ACID (classic example wrong)
- Airline reservation only transacts reads (Huh?)
- MySQL et al. ship by default in lower isolation level

Data is noisy and inconsistent anyway
- making it, say, 1% worse does not matter


CONSISTENCY MODEL

ACID semantics (transaction semantics in RDBMS)
- Atomicity: either the operation (e.g., write) is performed on all replicas or is not performed on any of them
- Consistency: after each operation all replicas reach the same state
- Isolation: no operation (e.g., read) can see the data from another operation (e.g., write) in an intermediate state
- Durability: once a write has been successful, that write will persist indefinitely

BASE semantics (modern Internet systems)
- Basically Available
- Soft-state (or scalable)
- Eventually consistent


CONSISTENCY MODELS

[Figure: processes A, B and C access a distributed storage system that holds D0; update(D) changes D0 → D1, and the processes then issue read(D).]

Strong consistency:
- After the update completes, every subsequent access from A, B, C will return D1

Weak consistency:
- Does not guarantee that subsequent accesses return D1; a number of conditions need to be met before D1 is returned

Eventual consistency: special form of weak consistency
- Guarantees that if no new updates are made, eventually all accesses will return D1
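The difference between these models can be seen in a toy simulation of the D0 → D1 scenario. This is a minimal sketch under strong assumptions: one replica per process, a single update, and propagation modelled as an explicit queue. The class and method names are hypothetical.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy store: a write lands on one replica and reaches the others later."""

    def __init__(self, n_replicas, initial):
        self.values = [initial] * n_replicas
        self.pending = deque()  # (target replica index, value) not yet applied

    def update(self, at, value):
        # Apply locally, then queue the update for every other replica.
        self.values[at] = value
        for i in range(len(self.values)):
            if i != at:
                self.pending.append((i, value))

    def read(self, at):
        return self.values[at]

    def propagate_one(self):
        """Deliver one queued update; return False when nothing is left."""
        if not self.pending:
            return False
        i, value = self.pending.popleft()
        self.values[i] = value
        return True


store = EventuallyConsistentStore(3, "D0")
store.update(at=0, value="D1")   # the update is applied at A's replica
stale = store.read(at=2)         # C reads before propagation: still D0
while store.propagate_one():     # no new updates arrive, the queue drains
    pass
converged = [store.read(at=i) for i in range(3)]
```

A strongly consistent store would not acknowledge the update until the queue had drained; handling several concurrent updates would additionally require versioning, which this sketch omits.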


VARIATIONS OF EVENTUAL CONSISTENCY

Causal consistency:
- If A notifies B about the update, B will read D1 (there is no such guarantee for C)

Read your writes:
- A will always read D1 after its own update

Session consistency:
- Read your writes inside a session

Monotonic reads:
- If a process has seen Dk, any subsequent access will never return any Di with i < k

Monotonic writes:
- Guarantees that the writes of the same process are serialized
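Read-your-writes and monotonic reads can be enforced on the client side by remembering the highest version a session has written or read and refusing replicas that are behind it. The sketch below assumes a single writer and integer version numbers; `VersionedReplica` and `Session` are hypothetical names, not an existing API.

```python
class VersionedReplica:
    """Replica holding one data item D together with its version number."""

    def __init__(self):
        self.version, self.value = 0, "D0"

    def apply(self, version, value):
        # Ignore updates that are not newer than what the replica holds.
        if version > self.version:
            self.version, self.value = version, value


class Session:
    """Client session enforcing read-your-writes and monotonic reads."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.min_version = 0  # highest version this session has written or read

    def write(self, replica_idx, value):
        replica = self.replicas[replica_idx]
        replica.apply(replica.version + 1, value)
        self.min_version = max(self.min_version, replica.version)

    def read(self, preferred_idx):
        # Try the preferred replica first, then any replica that is fresh enough.
        others = [i for i in range(len(self.replicas)) if i != preferred_idx]
        for i in [preferred_idx] + others:
            replica = self.replicas[i]
            if replica.version >= self.min_version:
                self.min_version = replica.version
                return replica.value
        raise RuntimeError("no replica is fresh enough for this session")


replicas = [VersionedReplica(), VersionedReplica()]
session = Session(replicas)
session.write(0, "D1")                  # replica 1 has not received D1 yet
value = session.read(preferred_idx=1)   # stale replica is skipped
other_session_value = Session(replicas).read(preferred_idx=1)
```

The guarantee is scoped to the session: a different session reading from the stale replica may still observe D0, which is what distinguishes session consistency from strong consistency.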


ACID VS BASE

ACID
- Strong consistency for transactions highest priority
- Availability less important
- Pessimistic
- Rigorous analysis
- Complex mechanisms

BASE
- Availability and scaling highest priorities
- Weak consistency
- Optimistic
- Best effort
- Simple and fast


EXAMPLE: JOIN BY MAP-REDUCE

Compute the natural join R(A,B) ⋈ S(B,C)

R and S are each stored in files

Tuples are pairs (a,b) or (b,c)

   R             S           R ⋈ S
A   B         B   C         A   C
a1  b1        b2  c1        a3  c1
a2  b1        b2  c2        a3  c2
a3  b2        b3  c3        a4  c3
a4  b3


MAP REDUCE COMPLEX JOBS

[Figure: join data flow. HDFS data blocks feed Mapper1 ... Mappern; the mapper output goes through Shuffling & Sorting to Reducer1 ... Reducern, each of which computes a join (⋈).]

- HDFS stores data blocks
- Each mapper processes one block
- Each mapper produces the join key & the record pairs
- Reducers perform the actual join


MAP-REDUCE JOIN

Use a hash function h from B-values to 1...k

A Map process turns:
- Each input tuple R(a,b) into key-value pair (b,(a,R))
- Each input tuple S(b,c) into key-value pair (b,(c,S))

Map processes send each key-value pair with key b to Reduce process h(b)
- Hadoop does this automatically; just tell it what k is.

Each Reduce process matches all the pairs (b,(a,R)) with all (b,(c,S)) and outputs (a,b,c).
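The steps above can be simulated in a single process. The sketch below is not Hadoop code: Python's built-in `hash` stands in for the hash function h, lists stand in for files, and the function names are illustrative. The input is the R and S example used earlier for the natural join.

```python
from collections import defaultdict


def map_r(a, b):
    # R(a, b) becomes (b, (a, "R")): the join attribute is the key.
    return (b, (a, "R"))


def map_s(b, c):
    # S(b, c) becomes (b, (c, "S")).
    return (b, (c, "S"))


def shuffle(pairs, k):
    """Send each pair to reducer h(key) and group the values by key."""
    reducers = [defaultdict(list) for _ in range(k)]
    for key, value in pairs:
        reducers[hash(key) % k][key].append(value)
    return reducers


def reduce_join(b, values):
    # Match every R-tagged value with every S-tagged value for this key.
    a_side = [x for x, tag in values if tag == "R"]
    c_side = [x for x, tag in values if tag == "S"]
    return [(a, b, c) for a in a_side for c in c_side]


def mapreduce_join(R, S, k=2):
    pairs = [map_r(a, b) for a, b in R] + [map_s(b, c) for b, c in S]
    out = []
    for reducer in shuffle(pairs, k):
        for b, values in reducer.items():
            out.extend(reduce_join(b, values))
    return out


R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]
result = sorted(mapreduce_join(R, S))
```

The result contains the triples (a3, b2, c1), (a3, b2, c2) and (a4, b3, c3); projecting out b gives the A, C pairs listed in the example. Tuples with b1 reach a reducer but find no S partner, so they produce no output.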


COST MEASURES FOR ALGORITHMS

In MapReduce we quantify the cost of an algorithm using

1. Communication cost = total I/O of all processes

2. Elapsed communication cost = max of I/O along any path

3. (Elapsed) computation cost analogous, but count only running time of processes

Note that here the big-O notation is not the most useful (adding more machines is always an option)


EXAMPLE: COST MEASURES

For a map-reduce algorithm:
- Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes.
- Elapsed communication cost is the sum of the largest input + output for any Map process, plus the same for any Reduce process.
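The two formulas can be written as small functions. The factor 2 reflects that each intermediate file is counted once as Map output and once as Reduce input. The job sizes used below are made-up numbers chosen only to exercise the formulas, and the function names are illustrative.

```python
def communication_cost(input_size, map_to_reduce_sizes, reduce_output_sizes):
    """Total communication cost; all sizes are in the same unit."""
    return input_size + 2 * sum(map_to_reduce_sizes) + sum(reduce_output_sizes)


def elapsed_communication_cost(map_io, reduce_io):
    """map_io and reduce_io hold one (input_size, output_size) pair per process."""
    largest_map = max(i + o for i, o in map_io)
    largest_reduce = max(i + o for i, o in reduce_io)
    return largest_map + largest_reduce


# Hypothetical job: 2 Map processes read 50 units each and write 4 intermediate
# files (30 + 20 and 25 + 25); 2 Reduce processes read 55 and 45, write 10 and 8.
total = communication_cost(100, [30, 20, 25, 25], [10, 8])
elapsed = elapsed_communication_cost([(50, 50), (50, 50)], [(55, 10), (45, 8)])
```

Here the total is 100 + 2 × 100 + 18 = 318 units, while the elapsed cost is (50 + 50) + (55 + 10) = 165 units, because only the busiest Map process and the busiest Reduce process lie on the critical path.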


WHAT COST MEASURES MEAN

Either the I/O (communication) or processing (computation) cost dominates
- Ignore one or the other

Total cost tells what you pay in rent from your friendly neighborhood cloud

Elapsed cost is wall-clock time using parallelism


COST OF MAP-REDUCE JOIN

Total communication cost = O(|R|+|S|+|R ⋈ S|)

Elapsed communication cost = O(s)
- We're going to pick k and the number of Map processes so that the I/O limit s is respected
- We put a limit s on the amount of input or output that any one process can have. s could be:
  - What fits in main memory
  - What fits on local disk

With proper indexes, computation cost is linear in the input + output size
- So computation cost is like comm. cost
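A rough way to see where O(|R|+|S|+|R ⋈ S|) comes from is to apply the general communication-cost formula to the join: the Map input is |R|+|S|, each tuple becomes exactly one key-value pair (counted twice), and the Reduce output is |R ⋈ S|. The sketch below measures sizes in tuples and ignores the overhead of the R/S tags. The process-count helper further assumes perfectly even splits and considers the input limit only, which skewed B-values would violate; both function names are hypothetical.

```python
import math


def join_communication_cost(r_size, s_size, join_size):
    """Total communication cost of the map-reduce join, in tuples."""
    map_input = r_size + s_size
    shuffled = r_size + s_size  # one key-value pair per input tuple
    return map_input + 2 * shuffled + join_size


def min_processes(r_size, s_size, s_limit):
    """Fewest processes whose input stays within s_limit, assuming even splits."""
    return math.ceil((r_size + s_size) / s_limit)


# Example relations from the join slide: |R| = 4, |S| = 3, |R join S| = 3.
cost = join_communication_cost(4, 3, 3)
k = min_processes(4, 3, s_limit=2)
```

The cost is 3(|R|+|S|) + |R ⋈ S|, which is 24 tuples for the example, so the constant factor disappears in the O-notation. With a limit of s = 2 tuples per process, at least 4 Reduce processes are needed under the even-split assumption.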
