
Basic Spark Programming and Performance Diagnosis

Jinliang Wei, 15-719 Spring 2017

Recitation

Today’s Agenda

•  PySpark shell and submitting jobs
•  Basic Spark programming – Word Count
•  How does Spark execute your program?
•  Spark monitoring web UI
•  What is shuffle and how does it work?
•  Spark programming caveats
•  Generally good practices
•  Important configuration parameters
•  Basic performance diagnosis

PySpark shell and submitting jobs

Launch A Spark + HDFS Cluster on EC2

•  First, set the environment variables:
    –  AWS_SECRET_ACCESS_KEY
    –  AWS_ACCESS_KEY_ID

•  Get spark-ec2-setup

•  Launch a cluster with 4 slave nodes:

./spark-ec2 -k <key-id> -i <identity-file> \
    -t m4.xlarge -s 4 -a ami-6d15ec7b \
    --ebs-vol-size=200 --ebs-vol-num=1 \
    --ebs-vol-type=gp2 \
    --spot-price=<proper-price> \
    launch SparkCluster

•  Log in as root

•  Replace launch with destroy to terminate the cluster

Your Standalone Spark Cluster

[Figure: cluster diagram with a Master node and two workers, Worker1 and Worker2]

•  Spark master is the cluster manager (analogous to YARN/Mesos).

•  Workers are sometimes referred to as slaves.

•  When your application is submitted, worker nodes run executors, which are processes that run computations and store data for your application.

•  By default, an executor uses all cores on a worker node.

•  Configurable via spark.executor.cores (normally left as default unless there are too many cores per node)

Standalone Spark Master Web UI

http://[master-node-public-ip]:8080

For an overview of the cluster and the state of each worker.

PySpark Shell

•  Spark is installed under /root/spark

•  Launch PySpark shell /root/spark/bin/pyspark

Simple math using PySpark Shell

•  Define a list of numbers: a = [1, 3, 7, 4, 2]

•  Create an RDD from that list: rdd_a = sc.parallelize(a)

•  Double each element: rdd_b = rdd_a.map(lambda x: x * 2)

•  Sum the elements up: c = rdd_b.reduce(lambda x, y: x + y)

Submit Applications to Spark

•  Suppose you have a Spark program named word_count.py. Submit it by running:

/root/spark/bin/spark-submit \
    [optional arguments to spark-submit] \
    word_count.py \
    [arguments to your program]

What happens when you submit your application?

•  Your program (driver program) runs in “client” mode – a client outside of the Spark master.

•  Spark launches executors on the worker nodes.

•  SparkContext sends tasks to the executors to run.

Basic Spark Programming – Word Count

How to implement a word count w/ map-reduce?

•  Problem: given a document, count the occurrences of each word

•  Map: take in a chunk of the document, output a list of pairs of (word, 1)

•  Shuffle: group KV pairs by their key (word), assign each group to a reducer

•  Reduce: sum up the values of each group
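The three phases above can be sketched in plain Python (no Spark required). This is only an illustration: the function names and the toy partitioner are assumptions made for this sketch, not part of any MapReduce or Spark API.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of the document."""
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(mapped_chunks, num_reducers):
    """Shuffle: group pairs by key and assign each group to one reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_chunks:
        for word, one in pairs:
            # Deterministic toy partitioner (an assumption, not Spark's hash).
            reducer_id = sum(word.encode("utf-8")) % num_reducers
            reducers[reducer_id][word].append(one)
    return reducers

def reduce_phase(groups):
    """Reduce: sum up the values of each group."""
    return {word: sum(values) for word, values in groups.items()}

def word_count(chunks, num_reducers=2):
    mapped = [map_phase(chunk) for chunk in chunks]
    counts = {}
    for groups in shuffle_phase(mapped, num_reducers):
        counts.update(reduce_phase(groups))
    return counts

chunks = ["a b d", "a b c"]
print(word_count(chunks))
```

Every occurrence of a word ends up at the same reducer, so each reducer can sum its groups independently of the others.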

How to implement it using Spark?

import pyspark

if __name__ == "__main__":
    conf = pyspark.SparkConf().setAppName("WordCount")
    sc = pyspark.SparkContext(conf=conf)

    # Read the input file as an RDD of lines.
    text_rdd = sc.textFile("/README.md")
    # Map each line to a list of (word, 1) pairs.
    tokens_rdd = text_rdd.flatMap(
        lambda x: [(a, 1) for a in x.split()])
    # Sum the counts of each word (this causes a shuffle).
    count_rdd = tokens_rdd.reduceByKey(lambda x, y: x + y)

    # Action: bring the results back to the driver.
    tokens_count = count_rdd.collect()

    sc.stop()

    # Print the 10 most frequent words.
    tokens_count.sort(key=lambda x: x[1], reverse=True)
    for token_tuple in tokens_count[:10]:
        print("(%s, %d)" % token_tuple)

Lazy Evaluation

•  Two kinds of operations on RDDs:
    –  Transformation: RDD_A -> RDD_B, e.g. flatMap
    –  Action: RDD_A -> outside Spark, e.g. collect

•  Transformations are lazily evaluated.
    –  The dependency information is recorded when a transformation is called.
    –  The RDD is evaluated only when necessary.

•  An action causes the RDD and the ones it depends on to be computed.
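A toy model of lazy evaluation in plain Python may help here. LazyDataset is a hypothetical class written for this sketch and is not how Spark implements RDDs; it only shows that a transformation records a dependency while an action triggers the computation.

```python
class LazyDataset:
    """Toy stand-in for an RDD: remembers its parent and function, computes on demand."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # concrete data, only set for the root dataset
        self.parent = parent   # dependency recorded when a transformation is called
        self.fn = fn           # per-record function to apply later
        self.evaluations = 0   # how many times this dataset was actually computed

    def flat_map(self, fn):
        # Transformation: nothing is computed here, only the dependency is recorded.
        return LazyDataset(parent=self, fn=fn)

    def _compute(self):
        self.evaluations += 1
        if self.parent is None:
            return list(self.source)
        # Computing this dataset recursively computes its parent first.
        return [out for rec in self.parent._compute() for out in self.fn(rec)]

    def collect(self):
        # Action: triggers evaluation and returns the records to the caller.
        return self._compute()

lines = LazyDataset(source=["a b", "c"])
tokens = lines.flat_map(lambda line: line.split())
evaluations_before_action = tokens.evaluations   # still 0: nothing has run yet
result = tokens.collect()
print(evaluations_before_action, result, tokens.evaluations)
```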

How does Spark execute your program?

Why should you care? Because you may need to do performance diagnosis and understand the terminology to interpret the Spark monitoring UIs.

The lineage graph is built when transformations are invoked

[Figure: lineage graph. Partitions: partition1, partition2, partition3. RDDs: text_rdd, tokens_rdd. Edge labels: narrow dependence, wide dependence.]

Pipelined execution: a sequence of transformations is applied to each record (partition), executed independently of other records (partitions)

Shuffle: every node reads from every other node; this might cause a global barrier

An action causes the actual evaluation

•  Spark calls it a job.

•  If the RDD on which the action was invoked exists, then compute the action, else compute the RDD.

•  Computing an RDD recursively computes its parent RDDs.

Build a DAG of stages from the lineage graph

•  RDDs with narrow dependences between them are grouped into the same stage.

•  Stage boundaries are shuffles.

•  Each task is scheduled to a core.

[Figure: stage DAG with stage 1 and stage 2; each box within a stage is a task]

A stage is computed as a set of parallel tasks

•  Each partition is computed by one task

•  You may control the number of partitions of an RDD
    –  partitionBy(num_partitions)
    –  Some operations allow you to explicitly specify the number of partitions
    –  Configuration parameter: spark.default.parallelism

•  This is where most of the parallelism comes from

What’s the proper number of partitions for an RDD?

•  Want to have sufficient parallelism and balanced load.
    –  Rule of thumb: at least 2 times the number of cores

•  Don’t want too many tasks, otherwise most of the time will be spent on setting up the tasks.
    –  Rule of thumb: at least hundreds of milliseconds per task

•  Make sure each partition can fit in memory.
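These rules of thumb can be turned into a rough starting point. The helper below is hypothetical (not a Spark API), and the core count, input size and per-partition budget in the example are assumed values chosen only for illustration.

```python
import math

def suggest_num_partitions(total_cores, data_bytes, max_partition_bytes):
    """Hypothetical helper: a starting point for the number of partitions."""
    # At least 2 partitions (tasks) per core for parallelism and load balance.
    for_parallelism = 2 * total_cores
    # Enough partitions that each one stays below the per-partition size budget.
    for_memory = math.ceil(data_bytes / max_partition_bytes)
    return max(for_parallelism, for_memory)

# Example: 16 cores in total, 10 GiB of input, 128 MiB per partition (all assumed).
print(suggest_num_partitions(16, 10 * 1024**3, 128 * 1024**2))
```

The result is only a lower bound to start from; the per-task time shown in the monitoring UI tells you whether tasks have become too short.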

How does Spark run Python code?

•  Your Python UDFs are executed in Python processes

•  RDD records need to be transferred between the JVM and Python

•  Serialization could be a performance problem.

PySpark “pipelines” Python functions automatically

•  If you apply multiple transformations in a series, Spark “fuses” the Python UDFs to avoid multiple transfers between Python and the JVM.

•  Example: rdd_x.map(foo).map(bar)
    –  Function foo(x) takes in a record x and outputs a record y
    –  Function bar(y) takes in a record y and outputs a record z
    –  Spark automatically creates a function foo_bar(x) that takes in a record x and outputs a record z, which is essentially bar(foo(x)).
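The fusion idea can be illustrated with ordinary function composition. This is a plain-Python sketch, not PySpark's internal code; fuse is a hypothetical helper, and the bodies of foo and bar are made up for the example.

```python
def fuse(*funcs):
    """Compose per-record functions left to right into a single function."""
    def fused(record):
        for f in funcs:
            record = f(record)
        return record
    return fused

def foo(x):      # record x -> record y
    return x + 1

def bar(y):      # record y -> record z
    return y * 10

foo_bar = fuse(foo, bar)   # behaves like bar(foo(x))

records = [1, 2, 3]
two_passes = [bar(y) for y in [foo(x) for x in records]]   # intermediate list of y records
one_pass = [foo_bar(x) for x in records]                   # no intermediate list
print(one_pass)
```

Both versions produce the same records; the fused one touches each record once, which is what avoids the extra transfer between processes.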

Spark Monitoring Web UIs

Live Monitoring Web UI

http://[master-node-public-ip]:4040

How is my running application doing?

History Server

http://[master-node-public-ip]:18080

Visualizing the logs of completed applications.

The job view

•  Jobs – why is there only one job?

Details for a job

•  Stages: what operations are in stage 1 and 2?

Understand the DAG Visualization

•  Dots are RDDs.

•  Dots inside the blue box are RDDs in the JVM.

•  Text labels are the transformations that generate the RDDs
    –  Problem: PySpark uses some transformations to implement other transformations (reduceByKey is implemented with partitionBy and mapPartitions), so the labels are not exactly the same as your code
    –  But if you know the stage boundaries, you can figure out which operations belong to which stage

Details for a stage

Event Timeline

Recap: Stage DAG

•  Each RDD partition corresponds to a task

•  The number of RDD partitions can often be controlled

[Figure: stage DAG with stage 1 and stage 2]

What is shuffle write and shuffle read?

What is shuffle and how does it work?

What is shuffle and what is it used for?

•  Informally, a mechanism that redistributes the partitioned RDD records.

•  Informally, it is needed whenever you need records that satisfy a certain condition (e.g. the same key) to reside in the same partition.

Before the shuffle:
    partition 1: (“a”, 1), (“b”, 1), (“d”, 1)
    partition 2: (“a”, 1), (“b”, 1), (“c”, 1)

After the shuffle:
    partition 1: (“a”, 1), (“a”, 1), (“c”, 1)
    partition 2: (“b”, 1), (“b”, 1), (“d”, 1)

Operations that may cause a shuffle

•  partitionBy
•  reduceByKey
•  groupByKey
•  …
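A minimal sketch of what a shuffle does to keyed records, in plain Python. The partitioner here is a deterministic toy chosen for this example (an assumption, not Spark's hash function), so which partition a key lands in is arbitrary; what matters is that equal keys end up together.

```python
def partition_of(key, num_partitions):
    """Toy deterministic partitioner (an assumption; not Spark's hash function)."""
    return sum(key.encode("utf-8")) % num_partitions

def shuffle(partitions, num_partitions):
    """Redistribute (key, value) records so equal keys land in the same partition."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            # Any source partition may send records to any destination partition.
            out[partition_of(key, num_partitions)].append((key, value))
    return out

before = [
    [("a", 1), ("b", 1), ("d", 1)],
    [("a", 1), ("b", 1), ("c", 1)],
]
after = shuffle(before, num_partitions=2)
for i, part in enumerate(after):
    print(i, part)
```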

How is shuffle implemented?

•  Two implementations: hash shuffle and sort shuffle

•  You don’t need to know the details for this project. If curious, read this blog post: https://0x0fff.com/spark-architecture-shuffle/

•  You need to know:
    –  Mappers (sources) serialize RDD records and write them to local disk (shuffle write)
    –  Reducers (destinations) read their partition of records from remote disks over the network (shuffle read)

Shuffle is expensive

•  Data is serialized, written to local disk, and communicated over the network
    –  Serialization takes time; disk and network are slow

•  Everyone depends on everyone else: if there is a straggler, everyone has to wait

•  Minimize the number of shuffles in your program

Spark Programming Caveats

Understanding Closures

•  Informally, a closure is a function together with its surrounding environment at the time the closure is created.

•  The driver program sends closures to executors to have them executed.

•  RDD operations (closures) that modify variables outside of their scope often cause confusion (generally don’t do that).

What’s the behavior of this code?

•  The driver’s counter is captured when the closure is created and is then visible to executors.

•  The global counter that the executor modifies is the executor’s local variable, i.e. writes are not seen by the driver.
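The behavior described above can be simulated without Spark. The sketch below is a hypothetical example of this kind of code (a counter updated inside an RDD operation); run_on_executor and the deep copy are stand-ins, assumed for illustration, for Spark serializing the closure and shipping it to an executor.

```python
import copy

def run_on_executor(func, captured_env, records):
    """Simulate an executor: it works on its own copy of the captured variables."""
    executor_env = copy.deepcopy(captured_env)   # stands in for serializing the closure
    for record in records:
        func(executor_env, record)
    return executor_env

def increment_counter(env, record):
    env["counter"] += record   # modifies whichever copy it is handed

driver_env = {"counter": 0}
executor_env = run_on_executor(increment_counter, driver_env, [1, 2, 3])

print("executor copy:", executor_env["counter"])
print("driver value:", driver_env["counter"])
```

The executor's copy changes while the driver's value stays at its initial value, which is why results should come back through actions or accumulators instead.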

Broadcast variable

•  broadcastVar.value can be read by any worker anytime after it’s created

•  Read-only variable (to avoid dealing with concurrent writes)

•  One copy per executor

Ways to communicate values from driver to executors or tasks

•  Create RDDs

•  Closure

•  Broadcast variable

•  Question: when should you use each one?
    –  Closure: values of small size that are only useful for this function
    –  Broadcast variable: more efficient for larger variables and when you want to reuse the values across stages
    –  RDDs: when the variable is too large

How do executors send values to the driver?

•  Use RDD actions

•  Accumulators
    –  Only allow associative and commutative operations
    –  Because concurrent writes can then be dealt with easily
    –  Read the Spark programming guide for details
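Why associativity and commutativity matter can be checked with a small plain-Python experiment. The per-task values and the merge_partials helper are assumptions for this sketch; it simply merges partial results in every possible arrival order.

```python
import itertools

def merge_partials(partials, op, zero):
    """Combine per-task partial results in the given arrival order."""
    total = zero
    for p in partials:
        total = op(total, p)
    return total

partials = [3, 5, 7]   # hypothetical per-task partial results

# Addition is associative and commutative: every arrival order agrees.
add_results = {merge_partials(order, lambda a, b: a + b, 0)
               for order in itertools.permutations(partials)}

# "Last writer wins" is not commutative: the result depends on arrival order.
overwrite_results = {merge_partials(order, lambda a, b: b, 0)
                     for order in itertools.permutations(partials)}

print(add_results, overwrite_results)
```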

RDD Persistence

•  Spark is in-memory – what does it mean?

•  Spark is capable of persisting (or caching) an RDD in memory across actions (jobs).
    –  Hadoop can’t.
    –  Spark may persist RDDs on disk too.

•  If RDDs are not persisted, they are recomputed for different actions.
    –  RDDs are computed at most once per job.

•  But you need to tell Spark which RDDs to persist.

•  Spark sometimes persists an RDD automatically, but this is not very well specified.

Persisting an RDD

•  persist() options:
    –  MEMORY_ONLY: default; if there is not enough memory, recompute it
    –  MEMORY_AND_DISK: if there is not enough memory, persist on disk
    –  DISK_ONLY: persist on disk
    –  A few others

•  cache() is persist(MEMORY_ONLY)
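The effect of persisting can be shown with a toy model in plain Python. ToyRDD is a hypothetical class for this sketch and ignores storage levels and memory limits; it only counts how often the dataset is recomputed when two actions are run on it.

```python
class ToyRDD:
    """Toy model: a dataset defined by a compute function, optionally cached."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn
        self.persisted = False
        self.cached = None
        self.compute_calls = 0

    def persist(self):
        # Only marks the dataset; the data is cached when it is first computed.
        self.persisted = True
        return self

    def _materialize(self):
        if self.cached is not None:
            return self.cached
        self.compute_calls += 1
        data = self.compute_fn()
        if self.persisted:
            self.cached = data
        return data

    def count(self):    # first action
        return len(self._materialize())

    def total(self):    # second action
        return sum(self._materialize())

def build():
    return [x * x for x in range(5)]

plain = ToyRDD(build)
plain.count()
plain.total()

kept = ToyRDD(build).persist()
kept.count()
kept.total()

print(plain.compute_calls, kept.compute_calls)
```

Without persist, each action recomputes the dataset; with it, the second action reuses the cached data.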

Generally Good Practices

•  Generally, avoid shuffles if you can
    –  A shuffle might be worth doing if it increases parallelism
        •  E.g. more partitions, better load balancing

•  For shuffles, pick the right operators – avoid transferring the entire RDD over the network

•  Some operations do local aggregation before shuffling
    •  E.g. groupByKey() + mapValues() vs. reduceByKey()
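The difference can be made concrete by counting how many records would cross the network in each case. This is a plain-Python sketch under simplifying assumptions (every record counts as one unit and all of them are sent remotely); the function names are hypothetical.

```python
from collections import Counter

def records_shuffled_group_by_key(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle."""
    return sum(len(part) for part in partitions)

def records_shuffled_reduce_by_key(partitions):
    """reduceByKey-style: each partition pre-aggregates, sending one record per distinct key."""
    shuffled = 0
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value     # local aggregation before the shuffle
        shuffled += len(local)
    return shuffled

partitions = [
    [("a", 1), ("a", 1), ("a", 1), ("b", 1)],
    [("b", 1), ("b", 1), ("c", 1), ("c", 1)],
]
print(records_shuffled_group_by_key(partitions),
      records_shuffled_reduce_by_key(partitions))
```

The gap grows with the number of repeated keys per partition, which is the common case in a word count.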

Spark Properties – Per Application Properties

The ones that you should understand

•  spark.executor.memory: amount of memory to use per executor process (JVM heap size)
    –  Optional reading - Spark memory management: http://spark.apache.org/docs/latest/tuning.html#memory-management-overview

•  spark.default.parallelism: default number of partitions in RDDs returned by certain operations, when not set by the user
    –  You can explicitly control the number of partitions in most cases

•  More details (optional for project 2): http://spark.apache.org/docs/latest/configuration.html#spark-properties

How to set those properties

•  When calling spark-submit, use the option --conf "config.property=value". One property per --conf.

•  Can be set programmatically using SparkConf when creating the SparkContext (doesn’t work for all properties)

•  conf/spark-defaults.conf (don’t do that for Project 2)

Basic Performance Diagnosis: What do I do if my application is running slow?

Q1: Which job and stage is the bottleneck?

•  Check the Spark monitoring UIs

•  Identify the bottlenecking stage

Possible sources of bottleneck

•  CPUs are not fully utilized
    –  Network I/O
    –  Disk I/O
    –  Insufficient parallelism
    –  Imbalance

•  CPUs are highly utilized

Q1: Are you fully utilizing your CPUs?

•  vmstat 2 20
    –  One update every 2 seconds, for 20 updates
    –  The first line is an average since the machine was booted
    –  Good for a quick overview of the machine

Q2: Why are my CPUs not fully utilized?

•  Generally you can find answers in the monitoring web UI

•  Insufficient parallelism or imbalance?
    –  Check the per-stage timeline

•  Blocked on network or disk I/O?
    –  Shuffle reads and writes

•  How to optimize for those problems?

Q2: My CPUs are highly utilized, so?

•  Which functions do your CPUs spend their time in? Answer: profile your code.

•  Spark Python profiler:
    --conf "spark.python.profile=true"
    --conf "spark.python.profile.dump=/root/spark_profile"

•  More details:
    http://spark.apache.org/docs/latest/configuration.html
    https://docs.python.org/2/library/profile.html

•  If most of the time is spent in the JVM, this is not useful and it’s beyond your control.

Basic Performance Diagnosis: What do I do if I get Out-Of-Memory (OOM) exceptions?

OOM can manifest as other exceptions.

A Common Pitfall

•  Driver and executor memory sizes are configurable, and the defaults are 1g

•  You can configure them
    –  spark.driver.memory
    –  spark.executor.memory

Size of a partition matters

•  Informally, for each task, the executor loads the corresponding partition into memory.

•  If the partition cannot fit in memory, you get OOMs.

•  Then you want more and smaller partitions.

•  RDD partitioning is in units of records; if a single record is huge, then repartitioning won’t help.
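The last point can be checked with a small plain-Python experiment. The round-robin split below is an assumed stand-in for repartitioning (not Spark's partitioner), and the record sizes are made up for illustration.

```python
def split_into_partitions(records, num_partitions):
    """Round-robin records into partitions; a record is never split."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

def largest_partition_bytes(parts):
    """Size in bytes of the biggest partition."""
    return max(sum(len(rec) for rec in part) for part in parts)

small_records = [b"x" * 10] * 100    # 100 records of 10 bytes each
one_huge_record = [b"x" * 1000]      # 1 record of 1000 bytes

for n in (1, 10):
    print(n,
          largest_partition_bytes(split_into_partitions(small_records, n)),
          largest_partition_bytes(split_into_partitions(one_huge_record, n)))
```

With many small records, more partitions shrink the largest partition; with one huge record, the largest partition stays the same size no matter how many partitions there are.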
