Making Big Data Processing Simple with Spark Matei Zaharia December 17, 2015

Making Big Data Processing Simple with Spark

Matei Zaharia December 17, 2015

What is Apache Spark?

Fast and general cluster computing engine that generalizes the MapReduce model

Makes it easy and fast to process large datasets • High-level APIs in Java, Scala, Python, R • Unified engine that can capture many workloads

A Unified Engine

Spark: • Spark Streaming (real-time) • Spark SQL (structured data) • MLlib (machine learning) • GraphX (graph)

A Large Community

[Chart: Contributors / Month to Spark, 2010 to 2015; y-axis 0 to 160]

Most active open source project for big data

Overview

Why a unified engine?

Spark programming model

Built-in libraries

Applications

History: Cluster Computing 2004

A general engine for batch processing

MapReduce

Beyond MapReduce

MapReduce was great for batch processing, but users quickly needed to do more: • More complex, multi-pass algorithms • More interactive ad-hoc queries • More real-time stream processing

Result: specialized systems for these workloads

Big Data Systems Today

General batch processing: MapReduce

Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, . . .

Problems with Specialized Systems

More systems to manage, tune, deploy

Can’t easily combine processing types • Even though most applications need to do this! • E.g. load data with SQL, then run machine learning

In many cases, data transfer between engines is a dominant cost!

Big Data Systems Today

General batch processing: MapReduce

Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, . . .

Unified engine: ?

Overview

Why a unified engine?

Spark programming model

Built-in libraries

Applications

Background

Recall the 3 workloads that were hard for MapReduce: • More complex, multi-pass algorithms • More interactive ad-hoc queries • More real-time stream processing

While these look different, all 3 need one thing that MapReduce lacks: efficient data sharing

Data Sharing in MapReduce

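[Diagram: an iterative job (iter. 1, iter. 2, . . .) reads its input from HDFS and performs an HDFS write and an HDFS read between iterations; interactive queries (query 1, query 2, query 3, . . .) each perform an HDFS read of the same input to produce result 1, result 2, result 3]

Slow due to replication and disk I/O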

What We’d Like

[Diagram: iterations (iter. 1, iter. 2, . . .) and queries (query 1, query 2, query 3, . . .) share data through distributed memory after one-time processing of the input]

10-100x faster than network and disk

Spark Programming Model

Resilient Distributed Datasets (RDDs) • Collections of objects stored in RAM or disk across cluster • Built via parallel transformations (map, filter, …) • Automatically rebuilt on failure

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

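lines = spark.textFile("hdfs://...")

errors = lines.filter(lambda s: s.startswith("ERROR"))

messages = errors.map(lambda s: s.split('\t')[2])

messages.cache()

messages.filter(lambda s: "MySQL" in s).count()

messages.filter(lambda s: "Redis" in s).count()

...

[Diagram: lines is the base RDD; errors and messages are transformed RDDs; count() is an action. The driver sends tasks to three workers, each holding one block of the file (Block 1 to Block 3); the workers return results and keep their part of messages in a cache (Cache 1 to Cache 3)]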

Example: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)
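
The effect of messages.cache() can be imitated on a single machine. The sketch below is plain Python, not Spark: SAMPLE_LOG, CachedPipeline, and the scan counter are hypothetical names introduced only for illustration. It shows the filter-and-split work running once, with later queries reusing the in-memory result.

```python
# Plain-Python imitation of the log-mining pattern (not the Spark API).
# SAMPLE_LOG and CachedPipeline are hypothetical, for illustration only.

SAMPLE_LOG = [
    "ERROR\t2015-12-17\tMySQL connection refused",
    "INFO\t2015-12-17\tjob started",
    "ERROR\t2015-12-17\tRedis timeout",
    "ERROR\t2015-12-17\tMySQL deadlock detected",
]


class CachedPipeline:
    """Computes error messages lazily and keeps them in memory afterwards."""

    def __init__(self, lines):
        self._lines = lines
        self._messages = None   # filled on first use, then reused
        self.scans = 0          # how many times the raw log was scanned

    def messages(self):
        if self._messages is None:
            self.scans += 1
            errors = (s for s in self._lines if s.startswith("ERROR"))
            # Keep only the third tab-separated field, as on the slide.
            self._messages = [s.split("\t")[2] for s in errors]
        return self._messages

    def count_matching(self, word):
        return sum(1 for m in self.messages() if word in m)


pipeline = CachedPipeline(SAMPLE_LOG)
mysql_count = pipeline.count_matching("MySQL")
redis_count = pipeline.count_matching("Redis")
print(mysql_count, redis_count, pipeline.scans)
```

Both queries are answered from the same in-memory list, so the raw log is scanned only once however many patterns are searched.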

Fault Tolerance

# kv is a (type, count) pair
(file.map(lambda rec: (rec.type, 1))
     .reduceByKey(lambda x, y: x + y)
     .filter(lambda kv: kv[1] > 10))

[Diagram: Input file → map → reduce → filter]

RDDs track lineage info to rebuild lost data
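
The idea of rebuilding lost data from lineage can be sketched with a toy model. This is plain Python, not Spark internals: ToyRDD and all of its methods are hypothetical names, and only per-partition transformations (map, filter) are modeled, not the reduce step shown on the slide. Each dataset remembers its parent and the function that derives a partition from it, so a dropped partition can be recomputed on demand.

```python
# Toy model of lineage-based recovery (plain Python, not Spark internals).
# ToyRDD and its methods are hypothetical names for this sketch.

class ToyRDD:
    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute      # how to build one partition
        self.parent = parent         # lineage: the dataset this one derives from
        self._store = {}             # partitions currently held "in memory"

    @classmethod
    def from_partitions(cls, partitions):
        return cls(len(partitions), lambda i: list(partitions[i]))

    def map(self, f):
        return ToyRDD(self.num_partitions,
                      lambda i: [f(x) for x in self.partition(i)],
                      parent=self)

    def filter(self, pred):
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if pred(x)],
                      parent=self)

    def partition(self, i):
        # Rebuild the partition from its lineage if it is not in memory.
        if i not in self._store:
            self._store[i] = self._compute(i)
        return self._store[i]

    def lose_partition(self, i):
        self._store.pop(i, None)     # simulate losing one partition

    def collect(self):
        return [x for i in range(self.num_partitions)
                for x in self.partition(i)]


base = ToyRDD.from_partitions([[1, 2, 3], [4, 5, 6]])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens_squared.collect()
evens_squared.lose_partition(1)      # drop one partition of the result
after = evens_squared.collect()      # recomputed from the parent via lineage
print(before, after)
```

Only the lost partition is recomputed; the surviving partition is served from memory, which is why no replicated copy of the derived data is needed in this model.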

Example: Logistic Regression

[Chart: running time (s), 0 to 4000, versus number of iterations (1, 5, 10, 20, 30), for Hadoop and Spark]

Hadoop: 110 s / iteration

Spark: first iteration 80 s, further iterations 1 s
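
The per-iteration figures imply total running times that can be worked out directly. The short calculation below assumes running time is exactly linear in the number of iterations, which is a simplification of the chart; the function names are hypothetical.

```python
# Total running time implied by the per-iteration figures on the slide,
# assuming time grows linearly with the number of iterations.
HADOOP_PER_ITER_S = 110      # Hadoop: 110 s per iteration
SPARK_FIRST_ITER_S = 80      # Spark: first iteration
SPARK_LATER_ITER_S = 1       # Spark: each further iteration


def hadoop_total(iterations):
    return HADOOP_PER_ITER_S * iterations


def spark_total(iterations):
    if iterations < 1:
        return 0
    return SPARK_FIRST_ITER_S + SPARK_LATER_ITER_S * (iterations - 1)


for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total(n), spark_total(n))
```

Under this assumption, 30 iterations take 3300 s on Hadoop and 109 s on Spark, a gap of about 30 times, while a single iteration differs by well under 2 times.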

On-Disk Performance: Time to sort 100TB

2013 Record: Hadoop, 2100 machines, 72 minutes

2014 Record: Spark, 207 machines, 23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org
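
The two records can be compared with a little arithmetic. The calculation below uses only the machine counts and times on the slide; machine-minutes is a rough measure introduced here for illustration, and it ignores any hardware differences between the two clusters, which the slide does not describe.

```python
# Ratios implied by the sort-benchmark figures on the slide.
hadoop_machines, hadoop_minutes = 2100, 72   # 2013 record
spark_machines, spark_minutes = 207, 23      # 2014 record

speedup = hadoop_minutes / spark_minutes
machine_ratio = hadoop_machines / spark_machines

# Machine-minutes: a rough measure of total cluster time consumed.
hadoop_machine_minutes = hadoop_machines * hadoop_minutes
spark_machine_minutes = spark_machines * spark_minutes

print(round(speedup, 1), round(machine_ratio, 1))
print(hadoop_machine_minutes, spark_machine_minutes)
```

By these figures the Spark run finished about 3 times sooner on about 10 times fewer machines.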

Libraries Built on Spark

Spark: • Spark Streaming (real-time) • Spark SQL (structured data) • MLlib (machine learning) • GraphX (graph)

Combining Processing Types

# Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
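
The same three steps (query, train, apply to a stream) can be walked through on one machine. The sketch below is plain Python, not the Spark API: the sample points, the fixed centroids standing in for a trained model, the event list standing in for a stream, and every helper name are hypothetical. It also uses consecutive, non-overlapping 5-second windows, which is a simplification.

```python
# Plain-Python sketch of the three steps (not the Spark API).
# All data, the fixed centroids, and the helper names are hypothetical.
import math
from collections import Counter

# Step 1 stand-in: rows a SQL query might return (latitude, longitude).
points = [(37.8, -122.4), (37.7, -122.5), (40.7, -74.0), (40.8, -73.9)]

# Step 2 stand-in: centroids a clustering model might learn from `points`.
centroids = [(37.75, -122.45), (40.75, -73.95)]


def predict(location):
    """Return the index of the nearest centroid."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(location, centroids[i]))


# Step 3 stand-in: a "stream" of (timestamp in seconds, location) events.
stream = [
    (0.5, (37.8, -122.4)),
    (1.2, (40.7, -74.0)),
    (4.9, (37.7, -122.5)),
    (6.0, (40.8, -73.9)),
    (9.0, (40.7, -74.0)),
]

WINDOW_S = 5


def counts_per_window(events):
    """Count events per cluster in consecutive 5-second windows."""
    windows = {}
    for ts, location in events:
        window_start = int(ts // WINDOW_S) * WINDOW_S
        windows.setdefault(window_start, Counter())[predict(location)] += 1
    return windows


result = counts_per_window(stream)
print(result)
```

The point of the sketch is that all three steps operate on ordinary in-process values, so nothing has to be written out and reloaded between the query, the model, and the stream.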

Combining Processing Types

Separate systems: HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → . . .

Spark: HDFS read → ETL → train → query → HDFS write

Performance vs Specialized Systems

[Chart, SQL: response time (sec), 0 to 50, for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem)]

[Chart, ML: response time (min), 0 to 60, for Mahout, GraphLab, Spark]

[Chart, Streaming: throughput (MB/s/node), 0 to 35, for Storm, Spark]

Some Recent Additions

DataFrame API (similar to R and Pandas) • Easy programmatic way to work with structured data

R interface (SparkR)

Machine learning pipelines (like scikit-learn)

Overview

Why a unified engine?

Spark programming model

Built-in libraries

Applications

Spark Community

Over 1000 deployments, clusters up to 8000 nodes

Many talks online at spark-summit.org

Top Applications

Fraud Detection / Security: 29%

User-Facing Services: 36%

Log Processing: 40%

Recommendation: 44%

Data Warehousing: 52%

Business Intelligence: 68%

Spark Components Used

MLlib + GraphX: 58%

Spark Streaming: 58%

DataFrames: 62%

Spark SQL: 69%

75% of users use more than one component

Learn More

Get started on your laptop: spark.apache.org

Resources and MOOCs: sparkhub.databricks.com

Spark Summit: spark-summit.org

Thank You

