Fast and Expressive Big Data Analytics with Python

Date post: 24-Feb-2016
Page 1: Fast and Expressive Big Data Analytics with Python

Matei Zaharia

Fast and Expressive Big Data Analytics with Python

UC BERKELEY

spark-project.org

UC Berkeley / MIT

Page 2: Fast and Expressive Big Data Analytics with Python

What is Spark?

Fast and expressive cluster computing system interoperable with Apache Hadoop

Improves efficiency through:
» In-memory computing primitives
» General computation graphs

Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell

Up to 100× faster (2-10× on disk)

Often 5× less code

Page 3: Fast and Expressive Big Data Analytics with Python

Project History

Started in 2009, open sourced 2010

17 companies now contributing code
» Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, …

Entered Apache incubator in June

Python API added in February

Page 4: Fast and Expressive Big Data Analytics with Python

An Expanding Stack

Spark is the basis for a wide set of projects in the Berkeley Data Analytics Stack (BDAS)

[Diagram: projects built on top of Spark]
» Spark Streaming (real-time)
» GraphX (graph)
» Shark (SQL)
» MLbase (machine learning)

More details: amplab.berkeley.edu

Page 5: Fast and Expressive Big Data Analytics with Python

This Talk

» Spark programming model
» Examples
» Demo
» Implementation
» Trying it out

Page 6: Fast and Expressive Big Data Analytics with Python

Why a New Programming Model?

MapReduce simplified big data processing, but users quickly found two problems:

Programmability: tangle of map/reduce functions

Speed: MapReduce is inefficient for apps that share data across multiple steps
» Iterative algorithms, interactive queries

Page 7: Fast and Expressive Big Data Analytics with Python

Data Sharing in MapReduce

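[Diagram: an iterative job (iter. 1, iter. 2, . . .) does an HDFS read and an HDFS write around every iteration; interactive queries (query 1, query 2, query 3, . . .) each do their own HDFS read of the input to produce result 1, result 2, result 3.]

Slow due to data replication and disk I/O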

Page 8: Fast and Expressive Big Data Analytics with Python

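What We’d Like

[Diagram: an iterative job (iter. 1, iter. 2, . . .) shares data between iterations through distributed memory; the input goes through one-time processing into distributed memory, which then serves query 1, query 2, query 3, . . .]

10-100× faster than network and disk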

Page 9: Fast and Expressive Big Data Analytics with Python

Spark Model

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
» Collections of objects that can be stored in memory or disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
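
As a rough illustration of "built via parallel transformations", here is a toy, single-machine sketch. `ToyDataset` is a hypothetical class written for this note, not part of Spark; it only shows the idea that transformations record work and an action triggers it.

```python
class ToyDataset:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, partitions=None, parent=None, transform=None):
        self.partitions = partitions  # list of lists; only set on source datasets
        self.parent = parent          # dataset this one was derived from
        self.transform = transform    # function mapping an iterator to an iterator

    def map(self, f):
        # Transformation: records what to do, computes nothing yet.
        return ToyDataset(parent=self, transform=lambda it: (f(x) for x in it))

    def filter(self, pred):
        # Transformation: also lazy.
        return ToyDataset(parent=self, transform=lambda it: (x for x in it if pred(x)))

    def num_partitions(self):
        if self.parent is None:
            return len(self.partitions)
        return self.parent.num_partitions()

    def compute(self, i):
        # Build partition i on demand by walking back to the source data.
        if self.parent is None:
            return iter(self.partitions[i])
        return self.transform(self.parent.compute(i))

    def collect(self):
        # Action: runs the recorded transformations, one partition at a time.
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


calls = []

def double(x):
    calls.append(x)  # lets us observe when the work actually happens
    return 2 * x

nums = ToyDataset(partitions=[[1, 2], [3, 4], [5]])
odd_doubled = nums.filter(lambda x: x % 2 == 1).map(double)
work_before_action = len(calls)   # still 0: nothing has run yet
result = odd_doubled.collect()    # the action triggers the computation
print(work_before_action, result)
```

In a real cluster each partition would be computed on a different worker; the sketch loops over them in one process.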

Page 10: Fast and Expressive Big Data Analytics with Python

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                      # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()             # Action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the Driver sends tasks to three Workers and receives results; each Worker reads one block (Block 1, Block 2, Block 3) and keeps its part of the data in memory (Cache 1, Cache 2, Cache 3).]

Result: full-text search of Wikipedia in 2 sec (vs 30 sec for on-disk data)

Result: scaled to 1 TB data in 7 sec (vs 180 sec for on-disk data)
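
To make each step of the pipeline concrete, here is a plain-Python sketch of the same filter, map and count sequence on a handful of made-up log lines. The sample lines and the three-field, tab-separated layout are assumptions for illustration; the list `messages` plays the role of the cached RDD.

```python
# Made-up, tab-separated log lines: level, date, message.
log_lines = [
    "ERROR\t2013-07-01\tdisk foo failed",
    "INFO\t2013-07-01\tjob started",
    "ERROR\t2013-07-02\tbar timeout",
    "WARN\t2013-07-02\tlow memory",
    "ERROR\t2013-07-03\tfoo and bar both unreachable",
]

# Like lines.filter(...): keep only the error lines.
errors = (s for s in log_lines if s.startswith("ERROR"))

# Like errors.map(...) followed by cache(): extract the third field and
# keep the result in memory so later queries do not re-read the log.
messages = [s.split("\t")[2] for s in errors]

# Like messages.filter(...).count(): repeated queries over the in-memory data.
foo_count = sum(1 for s in messages if "foo" in s)
bar_count = sum(1 for s in messages if "bar" in s)
print(foo_count, bar_count)
```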

Page 11: Fast and Expressive Big Data Analytics with Python

Fault Tolerance

RDDs track the transformations used to build them (their lineage) to recompute lost data

messages = (textFile(...).filter(lambda s: "ERROR" in s)
                         .map(lambda s: s.split("\t")[2]))

[Diagram: lineage chain]
HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)
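
A minimal sketch of lineage-based recovery, assuming a toy setup in which the source data is always re-readable and the lineage is a list of (kind, function) steps. All names here (`source_partitions`, `build_partition`, `get_partition`) are hypothetical, not Spark internals.

```python
# Source data split into partitions (stands in for blocks of an HDFS file).
source_partitions = [
    ["ERROR\ta\tdisk full", "INFO\tb\tok"],
    ["ERROR\tc\ttimeout"],
    ["INFO\td\tok", "ERROR\te\tbad block"],
]

# Lineage: the ordered transformations that turn a source partition into a
# partition of `messages`. Only this recipe is kept, not a replica of the data.
lineage = [
    ("filter", lambda s: "ERROR" in s),
    ("map", lambda s: s.split("\t")[2]),
]

def build_partition(index):
    """Recompute one partition of `messages` from the source via the lineage."""
    records = source_partitions[index]
    for kind, func in lineage:
        if kind == "filter":
            records = [r for r in records if func(r)]
        else:  # "map"
            records = [func(r) for r in records]
    return records

# Cache every partition, then simulate losing the node that held partition 1.
cache = {i: build_partition(i) for i in range(len(source_partitions))}
del cache[1]

def get_partition(index):
    """Return a partition, rebuilding only that partition if it was lost."""
    if index not in cache:
        cache[index] = build_partition(index)
    return cache[index]

recovered = get_partition(1)
print(recovered)
```

Only the lost partition is recomputed; partitions 0 and 2 stay in the cache untouched.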

Page 12: Fast and Expressive Big Data Analytics with Python

Example: Logistic Regression

Goal: find line separating two sets of points

[Figure: scatter of + and – points, with a random initial line and the target line that separates the two sets.]

Page 13: Fast and Expressive Big Data Analytics with Python

Example: Logistic Regression

import numpy
from math import exp

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    # Gradient of the logistic loss log(1 + exp(-y * w.x)), summed over all points
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print("Final w: %s" % w)
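
For readers without a cluster at hand, the same map-then-reduce loop can be sketched in plain Python. This is a stand-in under stated assumptions: the six points are made up, `Point`, `dot`, `point_gradient` and `add` are hypothetical helpers, labels are assumed to be -1 or +1, and the per-point term is the standard gradient of the logistic loss log(1 + exp(-y * w.x)).

```python
import math
from collections import namedtuple
from functools import reduce

Point = namedtuple("Point", ["x", "y"])  # x: feature list, y: label in {-1, +1}

# Tiny made-up dataset: positives up and to the right, negatives down and left.
data = [
    Point([2.0, 1.5], 1), Point([1.0, 2.5], 1), Point([3.0, 0.5], 1),
    Point([-1.5, -2.0], -1), Point([-2.5, -0.5], -1), Point([-0.5, -3.0], -1),
]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def point_gradient(p, w):
    # Gradient of log(1 + exp(-y * w.x)) with respect to w for one point.
    scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    return [scale * xi for xi in p.x]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

w = [0.0, 0.0]  # fixed start instead of a random one, for repeatability
for i in range(20):
    # map computes one gradient per point; reduce sums them up.
    gradient = reduce(add, map(lambda p: point_gradient(p, w), data))
    w = [wi - gi for wi, gi in zip(w, gradient)]

predictions = [1 if dot(w, p.x) > 0 else -1 for p in data]
print("Final w: %s" % w)
```

The point of caching in the slide's version is that `data` is reused on every pass of this loop.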

Page 14: Fast and Expressive Big Data Analytics with Python

Logistic Regression Performance

[Chart: running time (s), 0 to 4000, against number of iterations (1, 5, 10, 20, 30) for Hadoop and PySpark.]

Hadoop: 110 s / iteration

PySpark: first iteration 80 s, further iterations 5 s

Page 15: Fast and Expressive Big Data Analytics with Python

Demo

Page 16: Fast and Expressive Big Data Analytics with Python

Supported Operators

» map, filter, groupBy, union, join, leftOuterJoin, rightOuterJoin
» reduce, count, fold, reduceByKey, groupByKey, cogroup, flatMap
» take, first, partitionBy, pipe, distinct, save, ...
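
As a hedged illustration of what a few of these operators compute, here are plain-Python equivalents of flatMap, reduceByKey and groupByKey applied to a word count. The helper names `flat_map`, `reduce_by_key` and `group_by_key` are hypothetical; in Spark the operators are RDD methods that return new RDDs rather than lists and dicts.

```python
from collections import defaultdict
from itertools import chain
from operator import add

def flat_map(f, items):
    # Like flatMap: f returns a sequence per item; the results are concatenated.
    return list(chain.from_iterable(f(x) for x in items))

def reduce_by_key(f, pairs):
    # Like reduceByKey: merge all values that share a key with f.
    merged = {}
    for key, value in pairs:
        merged[key] = f(merged[key], value) if key in merged else value
    return merged

def group_by_key(pairs):
    # Like groupByKey: collect all values that share a key into a list.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

lines = ["to be or", "not to be"]
words = flat_map(str.split, lines)
pairs = [(word, 1) for word in words]   # like map(lambda w: (w, 1))
counts = reduce_by_key(add, pairs)
grouped = group_by_key(pairs)
print(counts)
```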

Page 17: Fast and Expressive Big Data Analytics with Python

Spark Community

» 1000+ meetup members
» 60+ contributors
» 17 companies contributing

Page 18: Fast and Expressive Big Data Analytics with Python

This Talk

» Spark programming model
» Examples
» Demo
» Implementation
» Trying it out

Page 19: Fast and Expressive Big Data Analytics with Python

Overview

Spark core is written in Scala

PySpark calls existing scheduler, cache and networking layer (2K-line wrapper)

No changes to Python

[Diagram: Your app, PySpark and the Spark client on one side; two Spark workers on the other, each with two Python child processes.]

Page 20: Fast and Expressive Big Data Analytics with Python


Main PySpark author: Josh Rosen

cs.berkeley.edu/~joshrosen

Page 21: Fast and Expressive Big Data Analytics with Python

Object Marshaling

Uses pickle library for both communication and cached data
» Much cheaper than Python objects in RAM

Lambda marshaling library by PiCloud
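
A short standard-library sketch of why lambdas need special handling: the standard `pickle` module round-trips plain data, but it pickles functions by reference (module plus name), so an anonymous lambda is rejected. The record contents are made up, and the sketch does not use the PiCloud library the slide refers to.

```python
import pickle

# Plain data round-trips through pickle as a byte string.
record = {"user": "alice", "clicks": [3, 1, 4]}
payload = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

# Functions are pickled by reference, so a lambda, which has no importable
# name, cannot be serialized by the standard pickler.
try:
    pickle.dumps(lambda s: "ERROR" in s)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickled = False

print(type(payload).__name__, restored == record, lambda_pickled)
```

Shipping a closure such as `lambda s: "ERROR" in s` to workers therefore requires a serializer that captures the function's code and captured variables by value.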

Page 22: Fast and Expressive Big Data Analytics with Python

Job Scheduler

Supports general operator graphs

Automatically pipelines functions

Aware of data locality and partitioning

[Diagram: operator graph over datasets A to G using groupBy, map, union and join, divided into Stage 1, Stage 2 and Stage 3; the legend marks cached data partitions.]
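
One way to picture "automatically pipelines functions" is with Python iterators: consecutive per-record steps are chained so that each record flows through all of them before the next record is read, with no intermediate dataset materialized. This is a toy sketch; `run_stage` and the step format are hypothetical, not the scheduler's real interface.

```python
trace = []

def parse(x):
    trace.append(("parse", x))
    return int(x)

def is_even(n):
    trace.append(("is_even", n))
    return n % 2 == 0

def run_stage(partition, steps):
    """Chain per-record steps lazily, then pull records through once."""
    stream = iter(partition)
    for kind, func in steps:
        if kind == "map":
            stream = map(func, stream)
        elif kind == "filter":
            stream = filter(func, stream)
        else:
            raise ValueError("unknown step kind: %r" % kind)
    return list(stream)  # a single pass over the partition

result = run_stage(["1", "2", "3", "4"], [("map", parse), ("filter", is_even)])
print(result)
print(trace[:4])
```

The trace shows the two functions interleaved record by record, rather than all parse calls followed by all is_even calls.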

Page 23: Fast and Expressive Big Data Analytics with Python

Interoperability

Runs in standard CPython, on Linux / Mac
» Works fine with extensions, e.g. NumPy

Input from local file system, NFS, HDFS, S3
» Only text files for now

Works in IPython, including notebook

Works in doctests – see our tests!

Page 24: Fast and Expressive Big Data Analytics with Python

Getting Started

Visit spark-project.org for video tutorials, online exercises, docs

Easy to run in local mode (multicore), standalone clusters, or EC2

Training camp at Berkeley in August (free video): ampcamp.berkeley.edu

Page 25: Fast and Expressive Big Data Analytics with Python

Getting Started

Easiest way to learn is the shell:

$ ./pyspark

>>> nums = sc.parallelize([1,2,3]) # make RDD from array

>>> nums.count()
3

>>> nums.map(lambda x: 2 * x).collect()
[2, 4, 6]

Page 26: Fast and Expressive Big Data Analytics with Python

Conclusion

PySpark provides a fast and simple way to analyze big datasets from Python

Learn more or contribute at spark-project.org

Look for our training camp on August 29-30!

My email: [email protected]

Page 27: Fast and Expressive Big Data Analytics with Python

Behavior with Not Enough RAM

[Chart: iteration time (s) against % of working set in memory]

% of working set in memory    Iteration time (s)
Cache disabled                68.8
25%                           58.1
50%                           40.7
75%                           29.7
Fully cached                  11.5

Page 28: Fast and Expressive Big Data Analytics with Python

The Rest of the Stack

Spark is the foundation for a wide set of projects in the Berkeley Data Analytics Stack (BDAS)

[Diagram: projects built on top of Spark]
» Spark Streaming (real-time)
» GraphX (graph)
» Shark (SQL)
» MLbase (machine learning)

More details: amplab.berkeley.edu

Page 29: Fast and Expressive Big Data Analytics with Python

Performance Comparison

[Chart, SQL: response time (s), 0 to 25, for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem).]

[Chart, Streaming: throughput (MB/s/node), 0 to 35, for Storm and Spark.]

[Chart, Graph: response time (min), 0 to 30, for Hadoop, Giraph, GraphLab, GraphX.]

