Debugging & Tuning in Spark

Date post: 24-Jan-2018
Upload: shiao-an-yuan
Debugging & Tuning in Spark Shiao-An Yuan @sayuan 2016-08-11



Spark Overview

● Cluster Manager (aka Master)
● Worker (aka Slave)
● Driver
● Executor


RDD (Resilient Distributed Dataset)

A fault-tolerant collection of elements that can be operated on in parallel

Word Count

import org.apache.spark.SparkContext

val sc: SparkContext = ...

val result = sc.textFile(file)    // RDD[String]
  .flatMap(_.split(" "))          // RDD[String]
  .map(_ -> 1)                    // RDD[(String, Int)]
  .groupByKey()                   // RDD[(String, Iterable[Int])]
  .map(x => (x._1, x._2.sum))     // RDD[(String, Int)]
  .collect()                      // Array[(String, Int)]

Lazy, Transformation, Action, Job

[Diagram: flatMap → map → groupByKey → map → collect]

Partition, Shuffle

[Diagram: flatMap → map → groupByKey → map → collect]

Stage, Task

[Diagram: flatMap → map → groupByKey → map → collect]

DAG (Directed Acyclic Graph)

● RDD operations
  ○ Transformation
  ○ Action
● Lazy
● Job
● Shuffle
● Stage
● Partition
● Task


1. A correct and parallelizable algorithm
2. Parallelism
3. Reduce the overhead from parallelization

Correctness and Parallelizability

● Use small input
● Run locally
  ○ --master local
  ○ --master local[4]
  ○ --master local[*]
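One way to act on the bullets above is to run the same job on a tiny in-memory input with one local thread and again with several, then compare the results. This is only a sketch: the object name `LocalCheck` and its helper are made up for illustration, and the master is set in code here rather than with `--master`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper: runs the word count from the earlier slide on a small sample.
object LocalCheck {
  def wordCount(master: String, lines: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setAppName("local-check").setMaster(master)
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(lines)
        .flatMap(_.split(" "))
        .map(_ -> 1)
        .groupByKey()
        .map(x => (x._1, x._2.sum))
        .collect()
        .toMap
    } finally {
      sc.stop() // only one SparkContext may be active per JVM
    }
  }
}
```

If the result under `local` differs from the result under `local[4]`, the algorithm probably depends on ordering or shared state and is not safely parallelizable.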

Non-RDD Operations

● Avoid long blocking on driver

Data Skew

● Does repartition() come to the rescue?
● Hotspots
  ○ Choose another partitioning key
  ○ Filter out unreasonable data
● Trace the skew to its source
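To find hotspots before choosing another key, it can help to count records per key and look at the heaviest ones. A minimal sketch, assuming a pair RDD; `SkewCheck` and `hottestKeys` are illustrative names, not part of Spark.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object SkewCheck {
  // Returns the n keys with the most records, heaviest first.
  def hottestKeys[K: ClassTag, V](pairs: RDD[(K, V)], n: Int): Array[(K, Long)] =
    pairs
      .map { case (k, _) => (k, 1L) }
      .reduceByKey(_ + _) // counts are combined map-side, so this pass is itself skew-tolerant
      .top(n)(Ordering.by[(K, Long), Long](_._2))
}
```

A key whose count dwarfs the rest (often an empty or default value) is a candidate for filtering or for a different partitioning key.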

Prefer reduceByKey() over groupByKey()

● reduceByKey() combines values per key on the map side before shuffling the data
● Also consider aggregateByKey()
● Use groupByKey() only if you really know what you are doing
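Applied to the word count from the earlier slide, the advice above might look like the sketch below. The second function shows `aggregateByKey()` for a case where the result type differs from the value type (a per-key mean); the object and function names are illustrative.

```scala
import org.apache.spark.rdd.RDD

object Aggregations {
  // Word count without groupByKey(): partial sums are combined before the shuffle.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split(" "))
      .map(_ -> 1)
      .reduceByKey(_ + _)

  // Per-key mean: the accumulator is a (sum, count) pair, so reduceByKey() does not fit directly.
  def meanByKey(pairs: RDD[(String, Double)]): RDD[(String, Double)] =
    pairs
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1), // fold one value into a partition-local accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2)  // merge accumulators from different partitions
      )
      .mapValues { case (sum, n) => sum / n }
}
```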

Shuffle Spill

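● Increase partition count
● spark.shuffle.spill (ignored since Spark 1.6; shuffle always spills to disk when necessary)
● spark.shuffle.memoryFraction (legacy; superseded by spark.memory.fraction since Spark 1.6)
● spark.executor.memory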


● partitionBy()
● repartitionAndSortWithinPartitions()
● spark.sql.autoBroadcastJoinThreshold (default 10 MB)
● Join it manually by mapPartitions()
  ○ Broadcast small RDD
    ■ http://stackoverflow.com/a/17690254/406803
  ○ Query data from database
    ■ https://groups.google.com/a/lists.datastax.com/d/topic/spark-connector-user/63ILfPqPRYI/discussion
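For the `partitionBy()` bullet, one common pattern is to partition and cache the large side once, so that later joins against it only shuffle the other side. A sketch under that assumption; `CoPartitioned`, `prepare` and `joinWith` are illustrative names.

```scala
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

object CoPartitioned {
  // Partition the reused side once and keep it cached with its partitioner.
  def prepare[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, V)] =
    rdd.partitionBy(new HashPartitioner(numPartitions)).persist()

  // join() reuses the partitioner of `prepared`, so only `other` is shuffled.
  def joinWith[K: ClassTag, A: ClassTag, B: ClassTag](
      prepared: RDD[(K, A)], other: RDD[(K, B)]): RDD[(K, (A, B))] =
    prepared.join(other)
}
```

The gain shows up when `prepared` is joined several times; for a single join the shuffle cost is roughly the same either way.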

Broadcast Small RDD

val smallRdd = ...
val largeRdd = ...

// Ship the small side to every executor once, as a read-only map.
val smallBroadcast = sc.broadcast(smallRdd.collectAsMap())

val joined = largeRdd.mapPartitions(iter => {
  val m = smallBroadcast.value
  for {
    (k, v) <- iter
    if m.contains(k)  // inner join: drop keys missing from the small side
  } yield (k, (v, m(k)))
}, preservesPartitioning = true)

Query Data from Cassandra

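import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "")
val connector = CassandraConnector(conf)

val joined = rdd.mapPartitions(iter => {
  connector.withSessionDo(session => {
    val stmt = session.prepare("SELECT value FROM table WHERE key=?")
    iter.map {
      case (k, v) => (k, (v, session.execute(stmt.bind(k)).one()))
    }.toVector.iterator  // iter.map is lazy: run the queries while the session is still open
  })
})
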
● Kryo serialization
  ○ Much faster
  ○ Registration needed
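Enabling Kryo and registering classes could look like the following sketch. `Click` is a made-up application class used only for illustration; the configuration key and `registerKryoClasses` are standard `SparkConf` features.

```scala
import org.apache.spark.SparkConf

// Hypothetical record type standing in for an application's own classes.
case class Click(userId: Long, url: String)

object KryoSetup {
  def conf(): SparkConf =
    new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written as small numeric IDs instead of full class names.
      .registerKryoClasses(Array(classOf[Click]))
}
```

Setting `spark.kryo.registrationRequired=true` makes Spark fail on unregistered classes, which is one way to find the ones that were missed.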


Common Failures

● Large shuffle blocks
  ○ java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    ■ Increase partition count
  ○ MetadataFetchFailedException, FetchFailedException
    ■ Increase partition count
    ■ Increase `spark.executor.memory`
    ■ …
  ○ java.lang.OutOfMemoryError: GC overhead limit exceeded
    ■ May be caused by shuffle spill

java.lang.OutOfMemoryError: Java heap space

● Driver
  ○ Increase `spark.driver.memory`
  ○ Avoid collect() on large RDDs
    ■ Use take()
    ■ Use saveAsTextFile()
● Executor
  ○ Increase `spark.executor.memory`
  ○ More nodes
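The driver-side alternatives to `collect()` listed above can be sketched as two small helpers. The names `DriverFriendly`, `preview` and `writeOut` are illustrative, and the output path is whatever the caller supplies.

```scala
import org.apache.spark.rdd.RDD

object DriverFriendly {
  // Pull only a bounded sample to the driver instead of the whole RDD.
  def preview[T](rdd: RDD[T], n: Int = 20): Array[T] = rdd.take(n)

  // Let executors write their own partitions; nothing passes through the driver.
  def writeOut(rdd: RDD[String], outputDir: String): Unit =
    rdd.saveAsTextFile(outputDir)
}
```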

java.io.IOException: No space left on device

● SPARK_WORKER_DIR
● SPARK_LOCAL_DIRS, spark.local.dir
● Shuffle files
  ○ Only deleted after the RDD object has been garbage-collected

Other Tips

● Event logs
  ○ spark.eventLog.enabled=true
  ○ ${SPARK_HOME}/sbin/start-history-server.sh


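● Rule of thumb: ~128 MB per partition
● If the partition count is just under 2000, bump it to just over 2000 (Spark switches to a more compact shuffle status format above 2000 partitions)
● Increase the partition count with repartition()
● Decrease the partition count with coalesce()
● spark.sql.shuffle.partitions (default 200)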
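The sizing rule and the two resizing calls above can be combined in a small helper. This is a sketch: `Partitions`, `targetCount` and `resize` are illustrative names, and the input size is assumed to be known by the caller.

```scala
import org.apache.spark.rdd.RDD

object Partitions {
  private val DefaultBytesPerPartition = 128L * 1024 * 1024 // ~128 MB rule of thumb

  // Number of partitions needed so that each holds roughly bytesPerPartition.
  def targetCount(inputBytes: Long, bytesPerPartition: Long = DefaultBytesPerPartition): Int =
    math.max(1, math.ceil(inputBytes.toDouble / bytesPerPartition).toInt)

  def resize[T](rdd: RDD[T], target: Int): RDD[T] =
    if (target > rdd.partitions.length) rdd.repartition(target) // full shuffle
    else rdd.coalesce(target)                                   // merges partitions, no shuffle by default
}
```

For example, `Partitions.targetCount(1024L * 1024 * 1024)` gives 8 partitions for 1 GB of input.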


Executors, Cores, Memory!?

● 32 nodes
● 16 cores each
● 64 GB of RAM each
● If an application needs 32 cores, what is the correct setting?


Why Is Spark Debugging / Tuning Hard?

● Distributed
● Lazy
● Hard to benchmark
● Spark is sensitive


● When in doubt, repartition!
● Avoid shuffle if you can
● Choose a reasonable partition count
● "Premature optimization is the root of all evil" -- Donald Knuth
