Post on 16-Apr-2017
transcript
© Cloudera, Inc. All rights reserved.
5 Spark Tips in 5 Minutes
Imran Rashid | Cloudera Engineer, Apache Spark PMC
#1: Name Cached RDDs and Accumulators
• rdd.cache(); rdd.setName(…)
• BAD: sc.accumulator(0L)
• GOOD: sc.accumulator(0L, "my counter")
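A minimal sketch of this tip, assuming an existing SparkContext `sc` (the input path and names are illustrative):

```scala
// Naming a cached RDD and an accumulator makes them identifiable
// in the Spark UI instead of showing up as anonymous entries.
val events = sc.textFile("hdfs:///data/events")   // illustrative path
events.setName("rawEvents").cache()               // setName returns the RDD, so it chains

// A named accumulator appears by name in the stage details of the UI.
val badRecords = sc.accumulator(0L, "bad records")
```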
#1b: MEMORY_AND_DISK
• BAD: rdd.cache()
  • If a partition is dropped, it is recomputed from scratch
• GOOD: rdd.persist(StorageLevel.MEMORY_AND_DISK)
(Diagram: Huge Raw Data → filter → flatMap → … → cache)
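As a sketch of the pipeline above (the input path and transformations are illustrative, assuming an existing SparkContext `sc`):

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK spills evicted partitions to local disk instead of
// dropping them, so they are re-read from disk rather than recomputed
// from the original lineage.
val words = sc.textFile("hdfs:///data/huge")   // illustrative path
  .filter(_.nonEmpty)
  .flatMap(_.split("\\s+"))
words.persist(StorageLevel.MEMORY_AND_DISK)
```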
#2: Use Spark's UI
• DAG visualization
• Key metrics
  • Data read / written
  • Shuffle read / write
  • Stragglers / outliers
• Cache utilization
#3: Debug Counters
• Use sample code
• Count errors
• Sample errors
• SparkListener to output updates
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4

import scala.util.control.NonFatal

val parseErrors = ErrorTracker("parsing errors", sc)
val allParsed: RDD[T] = sc.textFile(inputFile).flatMap { line =>
  try {
    val r = Some(parser(line))
    parseErrors.localValue.ok()
    r
  } catch {
    case NonFatal(ex) =>
      parseErrors.localValue.error(line)
      None
  }
}
#4: Avoid Driver Bottlenecks
• rdd.collect()
  • GOOD: exploratory data analysis; merging a small set of results
  • BAD: sequentially scanning the entire data set on the driver; no parallelism, OOM on the driver (rdd.toLocalIterator is better, but still not good)
• rdd.reduce()
  • GOOD: summarizing the results from a small dataset
  • BAD: big data structures from lots of partitions
• sc.accumulator()
  • GOOD: small data types, e.g., counters
  • BAD: big data structures from lots of partitions, e.g., a set of a million "most interesting" user ids from each partition
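The collect() caveat can be softened with toLocalIterator, which pulls one partition at a time to the driver. A minimal sketch (the `processed` RDD is an assumed placeholder):

```scala
// Iterate over results on the driver without materializing the whole
// RDD at once: only one partition is held in driver memory at a time.
// Still sequential on the driver, so use it for inspection, not bulk work.
// Assumes an existing RDD[String] named `processed`.
val it: Iterator[String] = processed.toLocalIterator
it.take(100).foreach(println)   // inspect a small prefix of the results
```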
#5: Dev Environment
• Try Scala!
  • Much simpler code
  • KISS
  • sbt: ~compile, ~test-quick
  • Template project with giter8
• Use spark-testing-base
  • Talk Wednesday by Holden K
• Run Spark locally
  • But try at scale periodically (you may hit bottlenecks)
#6: Code for Fast Iterations
• I write bugs; you write bugs; Spark has bugs
• Long pipelines should be restartable
  • BAD: a bug in Stage 18 after 5 hours means rerunning from scratch
  • GOOD: write to stable storage (e.g., HDFS) periodically, then restart from stage 17
• DiskCachedRDD
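A sketch of checkpointing an intermediate result to stable storage so a later failure can resume from it (paths and the `MyRecord` type are illustrative, assuming an existing SparkContext `sc`):

```scala
// Persist an intermediate RDD to HDFS so a bug or crash in a later
// stage does not force a rerun of the whole pipeline from scratch.
intermediate.saveAsObjectFile("hdfs:///checkpoints/stage17")

// On restart, skip the earlier stages and resume from the saved data:
val resumed = sc.objectFile[MyRecord]("hdfs:///checkpoints/stage17")
```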
#7: Narrow Joins & HDFS
• Narrow joins
  • Much cheaper
  • Possible anytime RDDs share a Partitioner
• What about when reading from HDFS?
  • SPARK-1061
  • Read from HDFS
  • "Remember" the data was written with a partitioner
(Diagram: Wide Join vs. Narrow Join)
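A sketch of getting a narrow join by sharing a Partitioner (the `users` and `events` pair RDDs and the partition count are illustrative):

```scala
import org.apache.spark.HashPartitioner

// Pre-partitioning both sides with the same Partitioner lets Spark join
// co-located partitions directly, with no shuffle for the join itself.
// Assumes existing pair RDDs `users` and `events` keyed by user id.
val part = new HashPartitioner(100)
val usersByKey  = users.partitionBy(part).persist()
val eventsByKey = events.partitionBy(part).persist()
val joined = usersByKey.join(eventsByKey)   // narrow join: no shuffle
```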
Thank you