Post on 16-Apr-2017
transcript
© Cloudera, Inc. All rights reserved.
5 Spark Tips in 5 Minutes
Imran Rashid | Cloudera Engineer, Apache Spark PMC
#1: Name Cached RDDs and Accumulators
• rdd.cache(); rdd.setName(…)
• BAD: sc.accumulator(0L)
• GOOD: sc.accumulator(0L, "my counter")
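A minimal sketch of this tip, assuming an existing SparkContext `sc` (the input path and names are illustrative):

```scala
// Naming a cached RDD and an accumulator makes them identifiable
// in the Spark UI instead of showing up as anonymous entries.
val events = sc.textFile("hdfs:///data/events")   // illustrative path
events.setName("rawEvents").cache()               // setName returns the RDD, so it chains

// A named accumulator appears by name in the stage details of the UI.
val badRecords = sc.accumulator(0L, "bad records")
```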
#1b: MEMORY_AND_DISK
• BAD: rdd.cache()
  • If a partition is dropped, it is recomputed from scratch
• GOOD: rdd.persist(StorageLevel.MEMORY_AND_DISK)
(Diagram: Huge Raw Data → filter → flatMap → … → cache)
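As a sketch of the pipeline above (the input path and transformations are illustrative, assuming an existing SparkContext `sc`):

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK spills evicted partitions to local disk instead of
// dropping them, so they are re-read from disk rather than recomputed
// from the original lineage.
val words = sc.textFile("hdfs:///data/huge")   // illustrative path
  .filter(_.nonEmpty)
  .flatMap(_.split("\\s+"))
words.persist(StorageLevel.MEMORY_AND_DISK)
```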
#2: Use Spark's UI
• DAG visualization
• Key metrics
  • Data read / written
  • Shuffle read / write
  • Stragglers / outliers
• Cache utilization
#3: Debug Counters
• Use sample code
• Count errors
• Sample errors
• SparkListener to output updates
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4

import scala.util.control.NonFatal

val parseErrors = ErrorTracker("parsing errors", sc)
val allParsed: RDD[T] = sc.textFile(inputFile).flatMap { line =>
  try {
    val r = Some(parser(line))
    parseErrors.localValue.ok()
    r
  } catch {
    case NonFatal(ex) =>
      parseErrors.localValue.error(line)
      None
  }
}
#4: Avoid Driver Bottlenecks
• rdd.collect()
  • GOOD: exploratory data analysis; merging a small set of results
  • BAD: sequentially scanning the entire data set on the driver; no parallelism, OOM on the driver (rdd.toLocalIterator is better, but still not good)
• rdd.reduce()
  • GOOD: summarizing the results from a small dataset
  • BAD: big data structures from lots of partitions
• sc.accumulator()
  • GOOD: small data types, e.g., counters
  • BAD: big data structures from lots of partitions, e.g., a set of a million "most interesting" user ids from each partition
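The collect() caveat can be softened with toLocalIterator, which pulls one partition at a time to the driver. A minimal sketch (the `processed` RDD is an assumed placeholder):

```scala
// Iterate over results on the driver without materializing the whole
// RDD at once: only one partition is held in driver memory at a time.
// Still sequential on the driver, so use it for inspection, not bulk work.
// Assumes an existing RDD[String] named `processed`.
val it: Iterator[String] = processed.toLocalIterator
it.take(100).foreach(println)   // inspect a small prefix of the results
```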
#5: Dev Environment
• Try Scala!
  • Much simpler code
  • KISS
  • sbt: ~compile, ~test-quick
  • Template project with giter8
• Use spark-testing-base
  • Talk Wednesday by Holden K
• Run Spark locally
  • But try at scale periodically (you may hit bottlenecks)
#6: Code for Fast Iterations
• I write bugs; you write bugs; Spark has bugs
• Long pipelines should be restartable
  • BAD: a bug in Stage 18 after 5 hours means rerunning from scratch
  • GOOD: write to stable storage (e.g., HDFS) periodically, then restart from stage 17
• DiskCachedRDD
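A sketch of checkpointing an intermediate result to stable storage so a later failure can resume from it (paths and the `MyRecord` type are illustrative, assuming an existing SparkContext `sc`):

```scala
// Persist an intermediate RDD to HDFS so a bug or crash in a later
// stage does not force a rerun of the whole pipeline from scratch.
intermediate.saveAsObjectFile("hdfs:///checkpoints/stage17")

// On restart, skip the earlier stages and resume from the saved data:
val resumed = sc.objectFile[MyRecord]("hdfs:///checkpoints/stage17")
```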
#7: Narrow Joins & HDFS
• Narrow joins
  • Much cheaper
  • Possible anytime RDDs share a Partitioner
• What about when reading from HDFS?
  • SPARK-1061
  • Read from HDFS
  • "Remember" the data was written with a partitioner
(Diagram: Wide Join vs. Narrow Join)
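A sketch of getting a narrow join by sharing a Partitioner (the `users` and `events` pair RDDs and the partition count are illustrative):

```scala
import org.apache.spark.HashPartitioner

// Pre-partitioning both sides with the same Partitioner lets Spark join
// co-located partitions directly, with no shuffle for the join itself.
// Assumes existing pair RDDs `users` and `events` keyed by user id.
val part = new HashPartitioner(100)
val usersByKey  = users.partitionBy(part).persist()
val eventsByKey = events.partitionBy(part).persist()
val joined = usersByKey.join(eventsByKey)   // narrow join: no shuffle
```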
Thank you