
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Transcript
Page 1: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Transforming Big Data with Spark and Shark

Michael Franklin and Matei Zaharia – UC Berkeley

UC BERKELEY

Page 2: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Sources Driving Big Data

It’s All Happening Online

Every: click, ad impression, billing event, fast-forward/pause, friend request, transaction, network message, fault, …

User Generated (Web, Social & Mobile)

Internet of Things / M2M

Scientific Computing

Page 3: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Big Data: The Challenges

Volume: terabytes to petabytes+

Variety: structured and unstructured

Velocity: batch to real-time

Our view: More data should mean better answers

• Must deal with vertical and horizontal growth

• Must balance Cost, Time, and Answer Quality

Page 4: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

AMP Expedition


Page 5: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Resources for Making Sense at Scale

Algorithms: machine learning and analytics

Machines: cloud computing

People: crowdsourcing and human computation

All applied to massive and diverse data.

Page 6: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

The AMPLab Big Bets

• New “Big Data” stacks are limited by traditional intellectual borders
• Need Machine Learning / Systems / Database co-design, which requires cohabitation and real collaboration
• Opportunity to rethink fundamental design points: low latency, variable consistency, cloud-based elastic resources, and demand for new solutions in the marketplace
• Consider the role of people throughout the entire analytics lifecycle

Page 7: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

AMPLab Facts

An integration of faculty interests (*directors):

Alex Bayen (Mobile Sensing), Anthony Joseph (Security/Privacy), Ken Goldberg (Crowdsourcing), Randy Katz (Systems), *Michael Franklin (Databases), Dave Patterson (Systems), Armando Fox (Systems), *Ion Stoica (Systems), *Mike Jordan (Machine Learning), Scott Shenker (Networking)

+ ~50 amazing students, post-docs, staff & visitors

Organized for collaboration.

Page 8: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

AMP Facts (continued)

• Launched February 2011; 6-year duration
• Strong industry and government support (NSF Expedition and DARPA XData)
• BDAS stack components released as BSD/Apache open source (e.g. Spark, Shark, Mesos)

Page 9: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

App: Carat - Detection of Smartphone Energy Bugs

> 450,000 downloads

Page 10: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

App: Cancer Tumor Genomics

• Vision: personalized therapy

“…10 years from now, each cancer patient is going to want to get a genomic analysis of their cancer and will expect customized therapy based on that information.” - Director, The Cancer Genome Atlas (TCGA), Time Magazine, 6/13/11

• UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster @ TCGA: 5 PB = 20 cancers x 1000 genomes

• Sequencing costs (150x) → Big Data

[Chart: cost per genome ($K, log scale from $100,000 down to $0.1), 2001-2014]

David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011

• See Dave Patterson’s talk: Thursday 3-4, BDT205

Page 11: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

BDAS: The Berkeley Data Analytics System

[Stack diagram, layers from bottom to top:]
• Mesos (cluster resource manager)
• Shared RDDs (distributed memory)
• Spark, alongside Hadoop MR, MPI, GraphLab, etc.
• Shark (SQL) + Streaming
• BlinkDB (approximate query processing), MLBase (declarative machine learning)
• Storage: HDFS

Components are labeled AMPLab (released), AMPLab (in progress), or 3rd party.

Page 12: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

BDAS: Where We’re Going

• Time, cost, quality tradeoffs using sampling and the “Bag of Little Bootstraps”
• Refactoring the distributed memory layer for sharing
• Low-latency (real-time) processing via discretized streams
• Graph processing and asynchronous computation
• Declarative machine learning libraries that utilize these interfaces for scalability
• A “logical plan” level to serve as the narrow waist for these and future components
• Integration of the “People” component (e.g., CrowdDB)

Page 13: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

For More Information

amplab.cs.berkeley.edu
• Papers and project pages
• News updates and blogs

Spark User Group and Meetup
GitHub and Apache Mesos

Page 14: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Deep Dive: Spark and Shark

Page 15: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

What is Spark?

• Fast, MapReduce-like engine
• In-memory storage for very fast iterative queries
• General execution graphs
• Up to 100x faster than Hadoop (2-10x even for on-disk data)

• Compatible with Hadoop’s storage APIs
• Can access HDFS, HBase, S3, SequenceFiles, etc.

“Lightning-Fast Cluster Computing”

Page 16: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

What is Shark?

• Port of Apache Hive to run on Spark

• Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

• Can be more than 100x faster

Page 17: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Project History

• Spark started in 2009, open sourced 2010

• Shark released spring 2012

• In use at Yahoo!, Klout, Airbnb, Foursquare, Conviva, Quantifind & others

• 400+ member meetup, 20+ developers

Page 18: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Spark

• Language-integrated API in Scala, Java and soon Python

• Can be used interactively from Scala and Python shells

• Lets users manipulate distributed collections (“resilient distributed datasets”, or RDDs) with parallel operations

Page 19: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count

[Diagram: the driver ships tasks to workers; each worker reads one block of the file (Block 1-3), caches its filtered messages (Cache 1-3), and sends results back. "lines" is the base RDD, "errors" and "messages" are transformed RDDs, and "count" is an action.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
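The same filter/map/cache shape can be sketched as runnable code. Python is used here (matching the PySpark API shown later in the deck), an ordinary list stands in for the RDD, and the log lines are invented for illustration:

```python
# The slide's pipeline on an ordinary list: keep the ERROR lines,
# extract the third tab-separated field, materialize the result once
# ("cache"), then run repeated counts against the cached messages.
log_lines = [
    "INFO\tstartup\tok",
    "ERROR\tmysql\tconnection lost",
    "ERROR\thttp\tfoo handler crashed",
]

errors = (line for line in log_lines if line.startswith("ERROR"))
messages = (line.split("\t")[2] for line in errors)
cached_msgs = list(messages)  # analogous to .cache()

print(sum("foo" in m for m in cached_msgs))  # like .filter(...).count
```

Spark's version differs in that the transformations run in parallel across workers and the cache lives in cluster memory, but the dataflow is the same.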

Page 20: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data.

E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
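As a rough illustration of the idea (not Spark's actual classes), each miniature "RDD" below records only its parent and the function applied, so a lost result can be rebuilt by replaying the chain from the source:

```python
# Lineage sketch: each node remembers its parent and its function.
# Recomputing a lost partition is just compute() replayed from the source.
class SourceRDD:
    def __init__(self, data):
        self.data = data
    def compute(self):
        return list(self.data)

class FilteredRDD:
    def __init__(self, parent, f):
        self.parent, self.f = parent, f
    def compute(self):
        return [x for x in self.parent.compute() if self.f(x)]

class MappedRDD:
    def __init__(self, parent, f):
        self.parent, self.f = parent, f
    def compute(self):
        return [self.f(x) for x in self.parent.compute()]

source = SourceRDD(["INFO ok", "error\tdisk\tfull", "error\tnet\tdown"])
messages = MappedRDD(FilteredRDD(source, lambda s: "error" in s),
                     lambda s: s.split("\t")[2])
# Losing a cached copy is harmless: compute() replays the lineage.
print(messages.compute())  # -> ['full', 'down']
```

Real RDDs also record partitioning, so only the lost partitions are recomputed, not the whole dataset.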

Page 21: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Example: Logistic Regression

Goal: find the best line separating two sets of points

[Plot: “+” and “–” points in the plane, with a random initial line and the target separating line]

Page 22: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once

var w = Vector.random(D)  // initial parameter vector

for (i <- 1 to ITERATIONS) {
  // repeated MapReduce steps to do gradient descent
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
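For readers without a cluster, the same loop runs on plain Python lists. Everything below (the toy two-cluster dataset, D, ITERATIONS) is an illustrative stand-in for the slide's Spark version:

```python
# Gradient-descent loop from the slide on plain Python lists (no Spark).
# Synthetic data: label +1 clusters near (1, 1), label -1 near (-1, -1).
import math
import random

random.seed(42)
D, ITERATIONS = 2, 50

data = []  # list of (x, y) pairs: x is a length-D point, y is +1 or -1
for _ in range(100):
    y = 1.0 if random.random() < 0.5 else -1.0
    data.append(([y + 0.3 * random.gauss(0, 1) for _ in range(D)], y))

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

w = [random.gauss(0, 1) for _ in range(D)]  # random initial parameter vector
for _ in range(ITERATIONS):
    # "map": each point contributes (1/(1+e^(-y*w.x)) - 1) * y * x;
    # "reduce": sum the per-point contributions into one gradient.
    gradient = [0.0] * D
    for x, y in data:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j in range(D):
            gradient[j] += scale * x[j]
    w = [wj - gj for wj, gj in zip(w, gradient)]

print("Final w:", w)
```

Spark's win is that the map/reduce over `data` happens in parallel on a cached in-memory dataset, so each of the 50 passes is cheap.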

Page 23: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Logistic Regression Performance

[Chart: running time (min) vs. number of iterations (1-30) - Hadoop: 110 s / iteration; Spark: first iteration 80 s, further iterations 1 s]

Page 24: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Spark in Java and Python

Java API (out now):

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

PySpark (coming soon):

lines = sc.textFile(...)
lines.filter(lambda x: 'error' in x) \
     .count()

Page 25: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

User Applications

• In-memory analytics on Hive data (Conviva)

• Interactive queries on data streams (Quantifind)

• Business intelligence (Yahoo!)

• Traffic estimation w/ GPS data (Mobile Millennium)

• DNA sequence analysis (SNAP)

. . .

Page 26: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Conviva GeoReport

• Group aggregations on many keys with the same filter

• 40x gain over Hive from avoiding repeated reading, deserialization and filtering

[Chart: time (hours) - Spark: 0.5, Hive: 20]

Page 27: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Shark: SQL on Spark

• Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes

• Can we extend Hive to run on Spark?

Page 28: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Hive Architecture

[Diagram: a Client (CLI, JDBC) talks to the Driver, which runs the SQL Parser, Query Optimizer, Physical Plan, and Execution stages; queries execute as MapReduce jobs over HDFS, with a Metastore alongside.]

Page 29: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Shark Architecture

[Diagram: same layout as Hive - Client (CLI, JDBC) and a Driver with SQL Parser, Query Optimizer, Physical Plan, and Execution, plus a Cache Manager - but the physical plan executes on Spark instead of MapReduce, over HDFS and the Metastore.]

Page 30: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Column-Oriented Storage

• Caching Hive records as Java objects is inefficient
• Instead, use arrays of primitive types for columns
• Similar size to serialized form, but 5x faster to process
• Columnar compression can further reduce size by 5x

Row storage:
1 john 4.1
2 mike 3.5
3 sally 6.4

Column storage:
1 | 2 | 3
john | mike | sally
4.1 | 3.5 | 6.4
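A quick Python sketch of the two layouts above, using the standard array module as a stand-in for JVM primitive arrays (the table values come from the slide):

```python
# Row vs. column storage: each column becomes one tightly packed array
# instead of a field scattered across per-row objects.
from array import array

rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]  # row storage

# Column storage: one array per column (strings stay in a plain list).
ids    = array("i", (r[0] for r in rows))
names  = [r[1] for r in rows]
scores = array("d", (r[2] for r in rows))

# A scan over one column touches only that column's bytes:
print(sum(scores) / len(scores))
```

The point mirrors the slide: the numeric columns are stored as contiguous primitives with no per-record object headers, so a column scan is cache-friendly and avoids deserializing whole records.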

Page 31: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Other Shark Optimizations

• Dynamic join algorithm selection based on the data

• Runtime selection of # of reducers

• Partition pruning using range statistics

• Controllable table partitioning across nodes

Page 32: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Using Shark

CREATE TABLE latest_logs TBLPROPERTIES ("memory"="true")
AS SELECT * FROM logs WHERE date > now() - 3600;

Then, just run HiveQL against it!

Page 33: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Shark Results

[Chart: Selection query runtime in seconds (axis 0-100) - Shark (in-memory) bar labeled 1.1 s; Shark (disk) and Hive slower]

100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al)

SELECT pageURL, pageRank
FROM rankings
WHERE pageRank > X;

Page 34: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Shark Results: Group By

100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al)

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, 7);
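The aggregation itself is easy to mimic in plain Python; the uservisits rows below are invented samples, but the key extraction matches SUBSTR(sourceIP, 1, 7):

```python
# The benchmark's GROUP BY, mimicked with a dict: sum adRevenue per
# 7-character sourceIP prefix (the rows here are made-up samples).
from collections import defaultdict

uservisits = [
    ("18.34.112.9", 0.50),
    ("18.34.200.1", 0.25),
    ("66.249.70.3", 1.00),
]

revenue = defaultdict(float)
for source_ip, ad_revenue in uservisits:
    revenue[source_ip[:7]] += ad_revenue  # SUBSTR(sourceIP, 1, 7)

print(dict(revenue))
```

At benchmark scale this shuffle-heavy aggregation is exactly where the difference between Hive's per-query MapReduce jobs and Shark's in-memory Spark execution shows up.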

[Chart: Group By runtime in seconds (axis 0-600) - Shark (in-memory) bar labeled 32 s; Shark (disk) and Hive slower]

Page 35: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Shark Results: Join

100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al)

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS totalRevenue
FROM rankings r, uservisits v
WHERE r.pageURL = v.destURL
  AND v.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY v.sourceIP;

[Chart: Join runtime in seconds (axis 0-1800) - the in-memory Shark and Shark (copartitioned) bars are labeled 10 and 5 s; Shark (disk) and Hive are far slower]

Page 36: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

User Queries

Yahoo! and Conviva report 40-100x speedups over Hive.

[Charts: runtime in seconds for three queries on 100 m2.4xlarge nodes over a 1.7 TB Conviva dataset - Shark (in-memory) bars labeled 0.8 s (Query 1, axis 0-70), 0.7 s (Query 2, axis 0-70), and 1.0 s (Query 3, axis 0-100); Shark (disk) and Hive slower on each]

Page 37: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Getting Started

• Spark and Shark both have scripts for launching on EC2

• Work with data in HDFS, HBase, S3, and existing Hive warehouses and metastores

• Local execution mode for testing

spark-project.org amplab.cs.berkeley.edu


Page 38: Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.

