Transcript
Page 1: PySpark

PySpark: Next-generation cloud computing engine using Python

Wisely Chen, Yahoo! Taiwan Data Team

Page 2: PySpark

Who am I?

• Wisely Chen ( [email protected] )

• Sr. Engineer on the Yahoo! Taiwan data team

• Loves to promote open source tech

• Hadoop Summit 2013 San Jose

• Jenkins Conf 2013 Palo Alto

• Coscup 2006, 2012, 2013; OSDC 2007, 2014; Webconf 2013; PHPConf 2012; RubyConf 2012

Page 3: PySpark

Taiwan Data Team

Data Highway

BI Report

Serving API

Data Mart

ETL / Forecast

Machine Learning

Page 4: PySpark

Agenda

• What is Spark?

• What is PySpark?

• How to write PySpark applications?

• PySpark demo

• Q&A

Page 5: PySpark

What is Spark?

[Stack diagram: Spark replaces MapReduce as the computing engine, running on top of YARN (resource management) and HDFS (storage).]

Page 6: PySpark

What is Spark?

• The leading candidate for “successor to MapReduce” today is Apache Spark

• "No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason."

• From the Cloudera CTO: http://0rz.tw/y3OfM

Page 7: PySpark

Spark is 3x~25x faster than MapReduce!

From Matei's paper: http://0rz.tw/VVqgP

[Bar charts, running time in seconds:]

Logistic regression: MapReduce 76, Spark 3
KMeans: MapReduce 106, Spark 33
PageRank: MapReduce 171, Spark 23

Page 8: PySpark

Most machine learning algorithms require iterative computation

Page 9: PySpark

PageRank

[Diagram: PageRank on a four-node graph (a, b, c, d). Every node starts with rank 1.0; each iteration produces a temporary rank result that feeds the next one (for example, a's rank moves from 1.0 to 1.85 to 1.31 across the 1st, 2nd, and 3rd iterations).]
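For reference, a minimal PySpark sketch of this iteration pattern. It assumes an existing SparkContext `sc`; the toy four-node graph and the 0.85 damping factor are illustrative assumptions, not the exact code behind the slide.

# Minimal PySpark PageRank sketch over a small four-node graph.
links = sc.parallelize(
    [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"]), ("d", ["a", "c"])]).cache()
ranks = links.map(lambda pair: (pair[0], 1.0))        # every page starts at rank 1.0

for i in range(3):                                    # 1st, 2nd, 3rd iteration
    # Each page splits its current rank evenly among its outgoing links...
    contribs = links.join(ranks).flatMap(
        lambda pair: [(dest, pair[1][1] / len(pair[1][0])) for dest in pair[1][0]])
    # ...and each page's new rank is a damped sum of the contributions it received.
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())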

Page 10: PySpark

HDFS is 100x slower than memory

MapReduce: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → Tmp (HDFS) → … → Iter N

Spark: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → Tmp (Mem) → … → Iter N
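A rough sketch of what this difference looks like in PySpark code, assuming an existing SparkContext `sc` (the HDFS path is the same placeholder used later in the deck):

data = sc.textFile("hdfs://...") \
         .map(lambda line: float(line)) \
         .cache()                  # keep the parsed RDD in memory

for i in range(10):
    # After the first pass materializes the cache, every later pass
    # reads the RDD from memory instead of going back to HDFS.
    print(data.sum())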

Page 11: PySpark

PageRank on 1 billion URL records:

First iteration (HDFS) takes 200 sec
Second iteration (mem) takes 7.4 sec
Third iteration (mem) takes 7.7 sec

Page 12: PySpark

What is PySpark?

Page 13: PySpark

Spark API

• Multi-language API

• JVM: Scala, Java

• PySpark: Python

Page 14: PySpark

PySpark

• Process via Python

• CPython

• Python libs (NumPy, SciPy, …) (see the sketch after this list)

• Storage and data transfer handled by Spark

• HDFS access / networking / fault recovery

• scheduling / broadcast / checkpointing
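A small sketch of what "CPython + Python libs" buys you: worker-side code runs in an ordinary CPython process, so libraries such as NumPy can be used directly inside RDD operations. Assumes an existing SparkContext `sc`; the data is made up.

import numpy as np

rows = sc.parallelize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# NumPy runs inside each worker's Python process, just like in a local script.
norms = rows.map(lambda row: float(np.linalg.norm(np.array(row))))
print(norms.collect())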

Page 15: PySpark

Spark Architecture

[Diagram: a Client submits work to the Master (JVM); each Worker (JVM) runs a Task against its local data block (Block1, Block2, Block3).]

Page 16: PySpark

PySpark Architecture

[Diagram: the driver's Python code talks to the Master (JVM); each Worker (JVM) holds one data block (Block1, Block2, Block3) and launches a Python process (Py Proc) to run the user's code on it.]

Page 17: PySpark

PySpark Architecture

[Diagram: on the driver side, the Python code communicates with the Master (JVM) through Py4J, with sockets and the local FS used to move data; the Workers (JVM) hold Block1, Block2, and Block3.]

Page 18: PySpark

PySpark Architecture

[Diagram: the driver's Python code is shipped from the Master (JVM) to each Worker (JVM), which holds its data block (Block1, Block2, Block3) and receives the Py code.]

Python functions and closures are serialized using PiCloud's CloudPickle module.
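A standalone sketch of what that serialization looks like, using the separately packaged cloudpickle module rather than the copy bundled inside Spark; the function and values are made up for illustration.

import pickle
import cloudpickle   # pip install cloudpickle; Spark ships its own bundled copy

factor = 3
def scale(x):
    return x * factor              # closes over `factor`

payload = cloudpickle.dumps(scale)   # serialize the function together with its closure
restored = pickle.loads(payload)     # roughly what a worker does on the other side
print(restored(7))                   # 21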

Page 19: PySpark

PySpark Architecture

[Diagram: each Worker (JVM) holds its data block (Block1, Block2, Block3) and runs a Python process (Py Proc).]

On launch, each worker spawns Python subprocesses and communicates with them over pipes, sending the user's code and the data to be processed.

Page 20: PySpark

A lot of Python processes

Page 21: PySpark

How to write a PySpark application?

Page 22: PySpark

Python Word Count

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Access data via the Spark API; process it with Python (a self-contained version is sketched below).
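For completeness, a self-contained version of the same job; the app name is an assumption and the input/output paths stay elided as on the slide.

from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")

file = sc.textFile("hdfs://...")                       # input path elided, as above
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")                    # output path elided, as above

sc.stop()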

Page 23: PySpark

Python Word Count

counts = file.flatMap(lambda line: line.split(" ")) \

Original text:
"You can find the latest Spark documentation, including the guide"

→ List:
['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']

Page 24: PySpark

Python Word Count

.map(lambda word: (word, 1))

List:
['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']

→ Tuple list:
[('You', 1), ('can', 1), ('find', 1), ('the', 1), ..., ('the', 1), ('guide', 1)]

Page 25: PySpark

Python Word Count

.reduceByKey(lambda a, b: a + b)

Tuple list:
[('You', 1), ('can', 1), ('find', 1), ('the', 1), ..., ('the', 1), ('guide', 1)]

→ Reduced tuple list:
[('You', 1), ('can', 1), ('find', 1), ('the', 2), ..., ('guide', 1)]
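The same reduction written as plain Python may make the per-key behaviour clearer (toy data only):

pairs = [('You', 1), ('can', 1), ('the', 1), ('the', 1), ('guide', 1)]
reduced = {}
for word, count in pairs:
    # reduceByKey combines values that share a key with the given function,
    # here lambda a, b: a + b
    reduced[word] = reduced.get(word, 0) + count
print(reduced)   # {'You': 1, 'can': 1, 'the': 2, 'guide': 1}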

Page 26: PySpark

Can I use Python ML libraries with PySpark?

Page 27: PySpark

PySpark + scikit-learn

sgd = lm.SGDClassifier(loss='log')            # use scikit-learn in single mode (on the master)
for ii in range(ITERATIONS):
    sgd = sc.parallelize(…) \                 # cluster operation
        .mapPartitions(lambda x: …) \         # use scikit-learn in cluster mode, on partial data
        .reduce(lambda x, y: merge(x, y))

Source code is from: http://0rz.tw/o2CHT
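A fuller, single-pass sketch of that pattern, with the elided pieces filled in by guesswork rather than taken from the linked source: the toy data, train_partition, and the coefficient-averaging merge are all assumptions, and the slide additionally wraps this in an ITERATIONS loop. Assumes an existing SparkContext `sc`.

import numpy as np
from sklearn import linear_model as lm

def train_partition(iterator):
    # Train one scikit-learn model on the rows that landed in this partition.
    data = list(iterator)
    X = np.array([features for features, label in data])
    y = np.array([label for features, label in data])
    clf = lm.SGDClassifier(loss='log')
    clf.fit(X, y)
    yield clf

def merge(a, b):
    # Naive merge: average the two partial models' parameters.
    a.coef_ = (a.coef_ + b.coef_) / 2.0
    a.intercept_ = (a.intercept_ + b.intercept_) / 2.0
    return a

points = [([0.0, 1.0], 0), ([1.0, 0.0], 1)] * 50       # toy labelled data
sgd = sc.parallelize(points, 4) \
        .mapPartitions(train_partition) \
        .reduce(merge)
print(sgd.coef_, sgd.intercept_)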

Page 28: PySpark

PySpark supports MLlib

• MLlib is Spark's built-in machine learning library

• Example: KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")

• Check it out at http://0rz.tw/M35Rz
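A runnable sketch around that call, assuming an existing SparkContext `sc`; the toy points are made up, and the keyword names follow the Spark 1.x MLlib Python API.

from numpy import array
from pyspark.mllib.clustering import KMeans

parsedData = sc.parallelize([array([0.0, 0.0]), array([0.1, 0.1]),
                             array([9.0, 9.0]), array([9.1, 9.1])])
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=30, initializationMode="random")
print(clusters.centers)   # two cluster centres, one near (0, 0) and one near (9, 9)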

Page 29: PySpark

DEMO 1: Recommendation using ALS

(Data: MovieLens)
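The demo code itself is not reproduced in the deck; as a hint of its shape, a hedged MLlib ALS sketch on MovieLens-style data. The path, column layout, and parameters are assumptions, and an existing SparkContext `sc` is assumed.

from pyspark.mllib.recommendation import ALS, Rating

raw = sc.textFile("hdfs://.../u.data")          # MovieLens format: user \t movie \t rating \t timestamp
ratings = raw.map(lambda line: line.split('\t')) \
             .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

model = ALS.train(ratings, rank=10, iterations=10)
print(model.predict(1, 50))                     # predicted rating of movie 50 for user 1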

Page 30: PySpark

DEMO 2: Interactive Shell
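For flavour, the kind of session an interactive demo might show; the shell is started with ./bin/pyspark, which provides a ready-made SparkContext `sc`, and the example line is an assumption.

# inside the PySpark REPL
sc.parallelize(range(10)).map(lambda x: x * x).sum()    # 285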

Page 31: PySpark

Conclusion

Page 32: PySpark

Join Us

• Our team's work has been highlighted at top conferences worldwide:

• Hadoop Summit San Jose 2013

• Hadoop Summit Amsterdam 2014

• MSTR World Las Vegas 2014

• Spark Summit San Francisco 2014

• Jenkins Conf Palo Alto 2013

Page 33: PySpark

Thank you