What is Spark? - Stanford University Talks/stanford-seminar.pdf · What is Spark? apps > ve s >...

Date post: 22-May-2020
New Developments in Spark

And Rethinking APIs for Big Data

Matei Zaharia and many others

What is Spark?

Unified computing engine for big data apps
> Batch, streaming and interactive

Collection of high-level APIs
> One of first widely used systems with a functional API
> Libraries for SQL, ML, graph, …

[Figure: Spark with its libraries: Streaming, SQL, MLlib, GraphX]

Project Growth

                      June 2013    January 2016
Lines of code         70,000       450,000
Total contributors    80           1000
Monthly contributors  20           140
Largest cluster       400 nodes    8000 nodes

Most active open source project in big data

This Talk

Original Spark vision

How did the vision hold up?

New APIs: DataFrames + Spark SQL

New capabilities under these APIs

Ongoing research

Original Spark Vision

1) Unified engine for big data processing
> Combines batch, interactive, streaming

2) Concise, language-integrated API
> Functional programming in Scala/Java/Python

Motivation: Unification

General batch processing: MapReduce

Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, ...

Hard to manage, tune, deploy
Hard to compose into pipelines

Motivation: Unification

General batch processing: MapReduce

Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, ...

Unified engine?

Motivation: Concise API

Much of data analysis is exploratory / interactive

Answer: Resilient Distributed Datasets (RDDs)
> Distributed collections with simple functional API

val lines = spark.textFile("hdfs://...")

val points = lines.map(line => parsePoint(line))

points.filter(p => p.x > 100).count()
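
The three-step shape of this example (load, map, then filter and count) can be imitated on one machine with plain Python generators, which are lazy in roughly the way RDD transformations are. This is only a sketch: the `Point` type, the `parse_point` helper, the comma-separated input format and the sample lines are assumptions made for illustration, and nothing here is distributed or uses Spark.

```python
from typing import Iterable, Iterator, NamedTuple


class Point(NamedTuple):
    x: float
    y: float


def parse_point(line: str) -> Point:
    """Parse a line such as "120.5,3.0" (hypothetical input format)."""
    x, y = line.split(",")
    return Point(float(x), float(y))


def count_points_beyond(lines: Iterable[str], threshold: float) -> int:
    # map: nothing is parsed yet, the generator is lazy
    points: Iterator[Point] = (parse_point(line) for line in lines)
    # filter: still lazy
    selected = (p for p in points if p.x > threshold)
    # count: the action that finally pulls data through the pipeline
    return sum(1 for _ in selected)


sample_lines = ["120.5,3.0", "99.0,1.0", "250.0,7.5"]
print(count_points_beyond(sample_lines, 100))
```

As with the RDD version, no work happens until the final counting step asks for results.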

How Did the Vision Hold Up?

Mostly well

Users really appreciate unification

Functional API causes some challenges, which we are now tackling

Libraries Built on Spark

Spark Core, with libraries on top:
> Spark Streaming: real-time
> Spark SQL: relational
> MLlib: machine learning
> GraphX: graph

Largest integrated library for big data

Which Libraries do People Use?

80% of users use more than one component
60% use three or more

Spark SQL   69%
Streaming   58%
MLlib       54%
GraphX      18%

Which Languages do People Use?

[Bar charts: 2014 Languages Used and 2015 Languages Used. Bar values as extracted, with the language labels lost: 84%, 38%, 38%, 71%, 31%, 58%, 18%]

Main Challenge: Functional API

Looks high-level, but hides many semantics of computation from the engine
> Functions passed in are arbitrary blocks of code
> Data stored is arbitrary Java/Python objects

Users can mix APIs in suboptimal ways

Which API Call Causes Most Tickets?

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

Answer: groupByKey

What People Do

val pairs = data.map(word => (word, 1))

val groups = pairs.groupByKey()

groups.map { case (k, vs) => (k, vs.sum) }

Materializes all groups as lists of integers:

("the", [1, 1, 1, 1, 1, 1]) ("quick", [1, 1]) ("fox", [1, 1])

Then sums each list:

("the", 6) ("quick", 2) ("fox", 2)

Better code: pairs.reduceByKey(_ + _)
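
The difference between the two approaches can be shown without Spark. The sketch below is a single-machine analogue in plain Python: the function names are made up for illustration, and it models only the memory behaviour (a list per key versus one running total per key), not Spark's shuffle or map-side combining.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def count_with_group_by_key(pairs: Iterable[Pair]) -> Dict[str, int]:
    # Step 1: materialize every group as a full list of values.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Step 2: only now collapse each list to a single number.
    return {key: sum(values) for key, values in groups.items()}


def count_with_reduce_by_key(pairs: Iterable[Pair]) -> Dict[str, int]:
    # Keep one running total per key; no per-key list is ever built.
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


words = ["the"] * 6 + ["quick"] * 2 + ["fox"] * 2
pairs = [(word, 1) for word in words]
print(count_with_group_by_key(pairs))
print(count_with_reduce_by_key(pairs))
```

Both functions return the same counts; the second never holds more than one integer per key.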

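Challenge: Data Representation

class User(name: String, friends: Array[Int])

[Figure: object graph for one User: a User object pointing to a String (backed by a char[] of length 5 holding "Bobby") and to an int[] of length 3 holding 0, 1, 2]

Object graphs much larger than underlying data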
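
The same effect can be observed in CPython, which also stores collections as pointers to separately allocated objects. The sketch below compares a list of integer objects with the raw payload of a packed array. It illustrates the general point only: the helper names are invented here, exact byte counts vary by interpreter and platform, and the JVM layout on the slide differs in detail.

```python
import sys
from array import array

friends_as_objects = [0, 1, 2]          # list of boxed Python ints
friends_packed = array("i", [0, 1, 2])  # contiguous C ints


def object_graph_bytes(values: list) -> int:
    # Size of the list object plus every int object it points to.
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


def payload_bytes(values: array) -> int:
    # Bytes taken by the raw values alone, with no per-element headers.
    return values.itemsize * len(values)


print(object_graph_bytes(friends_as_objects), payload_bytes(friends_packed))
```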

DataFrames and Spark SQL

Efficient library for working with structured data
> Two interfaces: SQL for data analysts + external apps, DataFrames for programmers

Optimized computation and storage

SIGMOD 2015

Spark SQL Architecture

[Architecture diagram. Components: SQL, DataFrames, Logical Plan, Optimizer, Catalog, Physical Plan, Code Generator, RDDs, Data Source API]

DataFrame API

DataFrames hold rows with a known schema and offer relational operations on them through a DSL

users = sql("select * from users")
ma_users = users[users.state == "MA"]
ma_users.count()
ma_users.groupBy("name").avg("age")
ma_users.map(lambda u: u.name.upper())

Expression AST: users.state == "MA" is captured as an expression tree rather than evaluated immediately
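
How a comparison in a DSL can yield an expression tree instead of a boolean is easiest to see with operator overloading. The following is a toy sketch, not Spark's implementation: the `Column` and `Compare` classes and the sample rows are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Compare:
    """AST node for `column <op> literal`."""
    column: str
    op: str
    literal: Any

    def evaluate(self, row: Dict[str, Any]) -> bool:
        value = row[self.column]
        if self.op == "==":
            return value == self.literal
        if self.op == ">":
            return value > self.literal
        raise ValueError(f"unsupported operator {self.op}")


class Column:
    """Comparisons on a Column return AST nodes, not booleans."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: Any) -> Compare:  # type: ignore[override]
        return Compare(self.name, "==", other)

    def __gt__(self, other: Any) -> Compare:
        return Compare(self.name, ">", other)


state = Column("state")
predicate = state == "MA"  # builds Compare("state", "==", "MA")
rows = [{"name": "ann", "state": "MA"}, {"name": "bob", "state": "CA"}]
print(predicate)
print([r["name"] for r in rows if predicate.evaluate(r)])
```

Because the predicate is data rather than an opaque function, an engine can inspect it, reorder it, or hand it to a data source before any row is read.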

API Details

Based on data frame concept in R and Python
> Spark is first system to make this API declarative

Integrated with the rest of Spark
> MLlib takes DataFrames as input/output
> Easily convert RDDs ↔ DataFrames

[Figure: Google Trends for "data frame"]

What DataFrames Enable

1.  Compact binary representation
    •  Columnar format outside Java heap

2.  Optimization across operators (join reordering, predicate pushdown, etc.)

3.  Runtime code generation

Performance

[Bar chart: time for aggregation benchmark (s), axis from 0 to 10, with bars for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, DataFrame SQL]

DataFrames vs SQL

Easier to compose into large programs: organize code into functions, classes, etc

“[DataFrames are] concise and declarative like SQL, but I can name intermediate values”

Spark 1.6 adds static typing over DataFrames (Datasets: tinyurl.com/spark-datasets)

New Capabilities under Spark SQL

Uniform and efficient access to data sources

Rich optimization across libraries

Data Sources

Having a uniform API for structured data lets apps migrate across data sources
> Hive, MySQL, Cassandra, JSON, …

API semantics allow query pushdown into sources (not possible with old RDD API)

[Figure: Spark SQL sits between the DataFrame expression users[users.age > 20] and the query select id from users sent to the source]
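
Query pushdown can be illustrated with Python's built-in `sqlite3` module standing in for an external source. This is a sketch under stated assumptions: the table contents and the two helper functions are invented, and a real engine would generate the pushed-down SQL from its query plan rather than have it written by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, 15), (2, 25), (3, 40), (4, 18)],
)


def ids_without_pushdown(min_age: int) -> tuple:
    # The source returns every row and column; the engine filters afterwards.
    rows = conn.execute("SELECT id, age FROM users ORDER BY id").fetchall()
    ids = [row_id for row_id, age in rows if age > min_age]
    return ids, len(rows)


def ids_with_pushdown(min_age: int) -> tuple:
    # The filter and the column selection travel to the source as SQL.
    rows = conn.execute(
        "SELECT id FROM users WHERE age > ? ORDER BY id", (min_age,)
    ).fetchall()
    return [row_id for (row_id,) in rows], len(rows)


# Each result is (matching ids, number of rows fetched from the source).
print(ids_without_pushdown(20))
print(ids_with_pushdown(20))
```

Both calls find the same ids, but the pushed-down version fetches only the matching rows and a single column.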

Examples

JSON (tweets.json):

{ "text": "hi", "user": { "name": "bob", "id": 15 } }

select user.id, text from tweets

JDBC:

select age from users where lang = "en"

Together:

select t.text, u.age from tweets t, users u where t.user.id = u.id and u.lang = "en"

[Figure: Spark SQL reads the JSON file and sends select id, age from users where lang="en" to the JDBC source]

Library Composition

One of our goals was to unify processing types

Problem: optimizing across libraries
> Big data is expensive to copy & scan
> Libraries are written in isolation

Not a problem for small data

Spark SQL gives more semantics to do this

[Figure: SQL, DataFrames, ML and Graph libraries all feed one Logical Plan]

Example: ML Pipelines

New API in MLlib that lets users express and optimize end-to-end workflows
> Feature preparation, training, evaluation
> Similar to scikit-learn, but declarative

tokenizer = Tokenizer()
tf = HashingTF(features=1000)
lr = LogisticRegression(r=0.1)
p = Pipeline(tokenizer, tf, lr)
p.fit(df)

[Figure: DataFrame flows through tokenizer, TF and LR to produce a model]
> Fused into one pass over data
> Filters pushed into data source

CrossValidator.fit(p, df, args)
> Repeated queries
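
What "fused into one pass over data" means can be sketched in plain Python. The stage functions below (`tokenize`, `hashing_tf`) and the two runners are hypothetical stand-ins written for this illustration and are not MLlib's API; the toy hash (sum of character codes) is chosen only so that results are repeatable.

```python
from typing import Any, Callable, Iterable, List

Stage = Callable[[Any], Any]


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def hashing_tf(tokens: List[str], num_features: int = 8) -> List[int]:
    # Toy term-frequency vector; the sum of character codes stands in for a
    # hash so that the result is the same on every run.
    vector = [0] * num_features
    for token in tokens:
        vector[sum(map(ord, token)) % num_features] += 1
    return vector


def run_unfused(stages: List[Stage], records: Iterable[Any]) -> List[Any]:
    # One full pass (and one intermediate list) per stage.
    data = list(records)
    for stage in stages:
        data = [stage(item) for item in data]
    return data


def run_fused(stages: List[Stage], records: Iterable[Any]) -> List[Any]:
    # A single pass: each record flows through every stage before the next
    # record is read, so no intermediate collection is built.
    def apply_all(item: Any) -> Any:
        for stage in stages:
            item = stage(item)
        return item

    return [apply_all(item) for item in records]


docs = ["Spark is fast", "spark SQL"]
stages = [tokenize, hashing_tf]
print(run_fused(stages, docs) == run_unfused(stages, docs))
```

Fusion is only possible when the system can see the stages as a declared sequence, which is the point of making the pipeline API declarative.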

The Problem

Hardware has changed a lot since big data systems were first designed

           2010             2015              Change
Storage    50+MB/s (HDD)    500+MB/s (SSD)    10x
Network    1Gbps            10Gbps            10x
CPU        ~3GHz            ~3GHz             !

CPU: new bottleneck in Spark, Hadoop, etc

To Make Matters Worse

In response to the slowdown of Moore’s Law, hardware is becoming more diverse

[Figure: App 1, App 2 and App 3, each targeting CPU, GPU and FPGA]

Have to optimize separately for each platform!

Observation

Many common algorithms can be written with “embarrassingly” data-parallel operations
> See how many run on MapReduce / Spark

Focus on optimizing these as opposed to general programs (e.g. C++)

The Goal

[Figure: machine learning, SQL and graph algorithms are expressed in a common intermediate language; transformations then target CPUs, GPUs, ...]

Nested Vector Language (NVL)

Functional-like parallel language
> Captures SQL, machine learning, and graphs, but very easy to analyze

Closed under composition (nested calls) and common transformations (e.g. loop fusion)
> Unlike relational algebra, OpenCL, NESL

Example Transformations

Original, row-oriented program:

def query(products: vec[{dept:int, price:int}]):
  sum = 0
  for p in products:
    if p.dept == 20:
      sum += p.price

After row-to-column:

def query(dept: vec[int], price: vec[int]):
  sum = 0
  for i in 0..len(dept):
    if dept[i] == 20:
      sum += price[i]

After vectorization:

  for i in 0..len(dept) by 4:
    sum += price[i..i+4] * (dept[i..i+4] == [20,20,20,20])
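
The three forms of the query can be mirrored in runnable Python to check that they compute the same result. This is an illustrative sketch, not NVL: the function names and the sample data are assumptions, and the "vectorized" version only imitates SIMD by processing fixed-width chunks with a 0/1 mask instead of a branch.

```python
from typing import Dict, List


def query_rows(products: List[Dict[str, int]]) -> int:
    # Row-oriented: one record at a time.
    total = 0
    for p in products:
        if p["dept"] == 20:
            total += p["price"]
    return total


def query_columns(dept: List[int], price: List[int]) -> int:
    # After row-to-column: two parallel arrays instead of records.
    total = 0
    for i in range(len(dept)):
        if dept[i] == 20:
            total += price[i]
    return total


def query_vectorized(dept: List[int], price: List[int], width: int = 4) -> int:
    # After vectorization: handle `width` elements per step and replace the
    # branch with a 0/1 mask, as a SIMD unit would.
    total = 0
    for start in range(0, len(dept), width):
        d = dept[start:start + width]
        p = price[start:start + width]
        mask = [int(x == 20) for x in d]
        total += sum(pi * mi for pi, mi in zip(p, mask))
    return total


products = [
    {"dept": 20, "price": 5}, {"dept": 10, "price": 7},
    {"dept": 20, "price": 11}, {"dept": 30, "price": 2},
    {"dept": 20, "price": 1},
]
dept = [p["dept"] for p in products]
price = [p["price"] for p in products]
print(query_rows(products), query_columns(dept, price), query_vectorized(dept, price))
```

The sample has five rows on purpose, so the last chunk is shorter than the vector width and the slicing has to handle the remainder.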

Results: TPC-H Q6

Runtime (sec)

Python            0.53
Java              0.14
C                 0.08
HyPer Database    0.11
NVL               0.03

Effect of Transformations

Runtime (sec)

Row-Oriented Program    0.23
After Row-To-Column     0.08
After Vectorization     0.03

Transformations usable on any NVL program

Library Composition API

Disjoint libraries can take & return “NVL objects” to build up a combined program

Example: optimize across Spark and NumPy

data = sql("select features from users where age>20")

scores = data.map(lambda vec: scoreMatrix * vec)

mean = scores.mean()

Conclusion

Large data volumes + changing hardware pose a formidable challenge for next-generation apps

Spark shows a unified API for data apps is useful

NVL targets a new range of optimizations and environments

