Transcript
Page 1: PySpark

PySpark: Next-generation cloud computing engine using Python

Wisely Chen, Yahoo! Taiwan Data Team

Page 2: PySpark

Who am I?

• Wisely Chen ( [email protected] )

• Sr. Engineer on the Yahoo! Taiwan data team

• Loves to promote open source tech

• Hadoop Summit 2013 San Jose

• Jenkins Conf 2013 Palo Alto

• Coscup 2006, 2012, 2013; OSDC 2007, 2014; Webconf 2013; PHPConf 2012; RubyConf 2012

Page 3: PySpark

Taiwan Data Team

Data Highway

BI Report

Serving API

Data Mart

ETL / Forecast

Machine Learning

Page 4: PySpark

Agenda

• What is Spark?

• What is PySpark?

• How to write PySpark applications?

• PySpark demo

• Q&A

Page 5: PySpark

What is Spark?

[Stack diagram: Spark replaces MapReduce as the computing engine, running on top of YARN (resource management) and HDFS (storage).]

Page 6: PySpark

What is Spark?

• The leading candidate for “successor to MapReduce” today is Apache Spark

• "No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason."

• From the Cloudera CTO: http://0rz.tw/y3OfM

Page 7: PySpark

Spark is 3x~25x faster than MapReduce!

From Matei's paper: http://0rz.tw/VVqgP

[Bar charts, running time in seconds:]

Logistic regression: MapReduce 76, Spark 3
KMeans: MapReduce 106, Spark 33
PageRank: MapReduce 171, Spark 23

Page 8: PySpark

Most machine learning algorithms require iterative computation

Page 9: PySpark

PageRank

[Diagram: PageRank on a four-node graph (a, b, c, d). Every node starts with rank 1.0; each iteration produces a temporary rank result that feeds the next one (for example, a's rank moves from 1.0 to 1.85 to 1.31 across the 1st, 2nd, and 3rd iterations).]
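For reference, a minimal PySpark sketch of this iteration pattern. It assumes an existing SparkContext `sc`; the toy four-node graph and the 0.85 damping factor are illustrative assumptions, not the exact code behind the slide.

# Minimal PySpark PageRank sketch over a small four-node graph.
links = sc.parallelize(
    [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"]), ("d", ["a", "c"])]).cache()
ranks = links.map(lambda pair: (pair[0], 1.0))        # every page starts at rank 1.0

for i in range(3):                                    # 1st, 2nd, 3rd iteration
    # Each page splits its current rank evenly among its outgoing links...
    contribs = links.join(ranks).flatMap(
        lambda pair: [(dest, pair[1][1] / len(pair[1][0])) for dest in pair[1][0]])
    # ...and each page's new rank is a damped sum of the contributions it received.
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())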

Page 10: PySpark

HDFS is 100x slower than memory

MapReduce: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → Tmp (HDFS) → … → Iter N

Spark: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → Tmp (Mem) → … → Iter N
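A rough sketch of what this difference looks like in PySpark code, assuming an existing SparkContext `sc` (the HDFS path is the same placeholder used later in the deck):

data = sc.textFile("hdfs://...") \
         .map(lambda line: float(line)) \
         .cache()                  # keep the parsed RDD in memory

for i in range(10):
    # After the first pass materializes the cache, every later pass
    # reads the RDD from memory instead of going back to HDFS.
    print(data.sum())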

Page 11: PySpark

PageRank on 1 billion URL records:

First iteration (HDFS) takes 200 sec
Second iteration (mem) takes 7.4 sec
Third iteration (mem) takes 7.7 sec

Page 12: PySpark

What is PySpark?

Page 13: PySpark

Spark API

• Multi-language API

• JVM: Scala, Java

• PySpark: Python

Page 14: PySpark

PySpark

• Process via Python

• CPython

• Python libs (NumPy, SciPy, …) (see the sketch after this list)

• Storage and data transfer handled by Spark

• HDFS access / networking / fault recovery

• scheduling / broadcast / checkpointing
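A small sketch of what "CPython + Python libs" buys you: worker-side code runs in an ordinary CPython process, so libraries such as NumPy can be used directly inside RDD operations. Assumes an existing SparkContext `sc`; the data is made up.

import numpy as np

rows = sc.parallelize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# NumPy runs inside each worker's Python process, just like in a local script.
norms = rows.map(lambda row: float(np.linalg.norm(np.array(row))))
print(norms.collect())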

Page 15: PySpark

Spark Architecture

[Diagram: a Client submits work to the Master (JVM); each Worker (JVM) runs a Task against its local data block (Block1, Block2, Block3).]

Page 16: PySpark

PySpark Architecture

[Diagram: the driver's Python code talks to the Master (JVM); each Worker (JVM) holds one data block (Block1, Block2, Block3) and launches a Python process (Py Proc) to run the user's code on it.]

Page 17: PySpark

PySpark Architecture

[Diagram: on the driver side, the Python code communicates with the Master (JVM) through Py4J, with sockets and the local FS used to move data; the Workers (JVM) hold Block1, Block2, and Block3.]

Page 18: PySpark

PySpark Architecture

[Diagram: the driver's Python code is shipped from the Master (JVM) to each Worker (JVM), which holds its data block (Block1, Block2, Block3) and receives the Py code.]

Python functions and closures are serialized using PiCloud's CloudPickle module.
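A standalone sketch of what that serialization looks like, using the separately packaged cloudpickle module rather than the copy bundled inside Spark; the function and values are made up for illustration.

import pickle
import cloudpickle   # pip install cloudpickle; Spark ships its own bundled copy

factor = 3
def scale(x):
    return x * factor              # closes over `factor`

payload = cloudpickle.dumps(scale)   # serialize the function together with its closure
restored = pickle.loads(payload)     # roughly what a worker does on the other side
print(restored(7))                   # 21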

Page 19: PySpark

PySpark Architecture

[Diagram: each Worker (JVM) holds its data block (Block1, Block2, Block3) and runs a Python process (Py Proc).]

On launch, each worker spawns Python subprocesses and communicates with them over pipes, sending the user's code and the data to be processed.

Page 20: PySpark

A lot of Python processes

Page 21: PySpark

How to write a PySpark application?

Page 22: PySpark

Python Word Count

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Access data via the Spark API; process it with Python (a self-contained version is sketched below).
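For completeness, a self-contained version of the same job; the app name is an assumption and the input/output paths stay elided as on the slide.

from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")

file = sc.textFile("hdfs://...")                       # input path elided, as above
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")                    # output path elided, as above

sc.stop()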

Page 23: PySpark

Python Word Count

counts = file.flatMap(lambda line: line.split(" ")) \

Original text:
"You can find the latest Spark documentation, including the guide"

→ List:
['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']

Page 24: PySpark

Python Word Count

.map(lambda word: (word, 1))

List:
['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']

→ Tuple list:
[('You', 1), ('can', 1), ('find', 1), ('the', 1), ..., ('the', 1), ('guide', 1)]

Page 25: PySpark

Python Word Count

.reduceByKey(lambda a, b: a + b)

Tuple list:
[('You', 1), ('can', 1), ('find', 1), ('the', 1), ..., ('the', 1), ('guide', 1)]

→ Reduced tuple list:
[('You', 1), ('can', 1), ('find', 1), ('the', 2), ..., ('guide', 1)]
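The same reduction written as plain Python may make the per-key behaviour clearer (toy data only):

pairs = [('You', 1), ('can', 1), ('the', 1), ('the', 1), ('guide', 1)]
reduced = {}
for word, count in pairs:
    # reduceByKey combines values that share a key with the given function,
    # here lambda a, b: a + b
    reduced[word] = reduced.get(word, 0) + count
print(reduced)   # {'You': 1, 'can': 1, 'the': 2, 'guide': 1}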

Page 26: PySpark

Can I use Python ML libraries with PySpark?

Page 27: PySpark

PySpark + scikit-learn

sgd = lm.SGDClassifier(loss='log')            # use scikit-learn in single mode (on the master)
for ii in range(ITERATIONS):
    sgd = sc.parallelize(…) \                 # cluster operation
        .mapPartitions(lambda x: …) \         # use scikit-learn in cluster mode, on partial data
        .reduce(lambda x, y: merge(x, y))

Source code is from: http://0rz.tw/o2CHT
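A fuller, single-pass sketch of that pattern, with the elided pieces filled in by guesswork rather than taken from the linked source: the toy data, train_partition, and the coefficient-averaging merge are all assumptions, and the slide additionally wraps this in an ITERATIONS loop. Assumes an existing SparkContext `sc`.

import numpy as np
from sklearn import linear_model as lm

def train_partition(iterator):
    # Train one scikit-learn model on the rows that landed in this partition.
    data = list(iterator)
    X = np.array([features for features, label in data])
    y = np.array([label for features, label in data])
    clf = lm.SGDClassifier(loss='log')
    clf.fit(X, y)
    yield clf

def merge(a, b):
    # Naive merge: average the two partial models' parameters.
    a.coef_ = (a.coef_ + b.coef_) / 2.0
    a.intercept_ = (a.intercept_ + b.intercept_) / 2.0
    return a

points = [([0.0, 1.0], 0), ([1.0, 0.0], 1)] * 50       # toy labelled data
sgd = sc.parallelize(points, 4) \
        .mapPartitions(train_partition) \
        .reduce(merge)
print(sgd.coef_, sgd.intercept_)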

Page 28: PySpark

PySpark supports MLlib

• MLlib is Spark's built-in machine learning library

• Example: KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")

• Check it out at http://0rz.tw/M35Rz
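A runnable sketch around that call, assuming an existing SparkContext `sc`; the toy points are made up, and the keyword names follow the Spark 1.x MLlib Python API.

from numpy import array
from pyspark.mllib.clustering import KMeans

parsedData = sc.parallelize([array([0.0, 0.0]), array([0.1, 0.1]),
                             array([9.0, 9.0]), array([9.1, 9.1])])
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=30, initializationMode="random")
print(clusters.centers)   # two cluster centres, one near (0, 0) and one near (9, 9)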

Page 29: PySpark

DEMO 1: Recommendation using ALS

(Data: MovieLens)
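The demo code itself is not reproduced in the deck; as a hint of its shape, a hedged MLlib ALS sketch on MovieLens-style data. The path, column layout, and parameters are assumptions, and an existing SparkContext `sc` is assumed.

from pyspark.mllib.recommendation import ALS, Rating

raw = sc.textFile("hdfs://.../u.data")          # MovieLens format: user \t movie \t rating \t timestamp
ratings = raw.map(lambda line: line.split('\t')) \
             .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

model = ALS.train(ratings, rank=10, iterations=10)
print(model.predict(1, 50))                     # predicted rating of movie 50 for user 1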

Page 30: PySpark

DEMO 2: Interactive Shell
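For flavour, the kind of session an interactive demo might show; the shell is started with ./bin/pyspark, which provides a ready-made SparkContext `sc`, and the example line is an assumption.

# inside the PySpark REPL
sc.parallelize(range(10)).map(lambda x: x * x).sum()    # 285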

Page 31: PySpark

Conclusion

Page 32: PySpark

Join Us

• Our team's work has been highlighted at top conferences worldwide:

• Hadoop Summit San Jose 2013

• Hadoop Summit Amsterdam 2014

• MSTR World Las Vegas 2014

• Spark Summit San Francisco 2014

• Jenkins Conf Palo Alto 2013

Page 33: PySpark

Thank you