Spark Tuning for Enterprise System Administrators

Post on 16-Apr-2017



Anya T. Bida, PhD Rachel B. Warren

Don't worry about missing something...

Video: https://www.youtube.com/watch?v=DNWaMR8uKDc&feature=youtu.be
Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren

About Anya: Operations Engineer
About Rachel: Spark & Scala Enthusiast / Data Engineer at Alpine Data (alpinenow.com)

About You*: Spark practitioners

[Diagram: mySparkApp success progresses from Intermittent → Reliable → Optimal]

Default != Recommended. Example: by default, spark.executor.memory = 1g. 1g allows small jobs to finish out of the box; Spark assumes you'll increase this parameter.
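One way to raise it, as a minimal sketch in spark-defaults.conf (the 4g value is illustrative, not a recommendation from the talk):

```
# spark-defaults.conf — raise the 1g default; size to your workload
spark.executor.memory  4g
```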


Which parameters are important?

How do I configure them?


Default != Recommended

• Filter* data before an expensive reduce or aggregation; consider* coalesce()
• Use* data structures that require less memory (tuning.html#tuning-data-structures)
• Serialize*: PySpark — serializing is built-in; Scala/Java — persist(StorageLevel.[*]_SER); recommended: KryoSerializer

* See "Optimize partitions.", "GC investigation.", and "Checkpointing." in:
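The KryoSerializer recommendation above is applied through Spark configuration; a minimal sketch, assuming spark-defaults.conf (the registered class name is a hypothetical placeholder):

```
# spark-defaults.conf
spark.serializer              org.apache.spark.serializer.KryoSerializer
# Optional: register frequently serialized classes for a more compact encoding
spark.kryo.classesToRegister  com.example.MyCaseClass
```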

The Spark Tuning Cheat-Sheet

[Diagram: mySparkApp success path — Initial config → Memory trouble → Intermittent → Reliable → Optimal]


How many in the audience have their own cluster?


Fair Schedulers

YARN:

<allocations>
  <queue name="sample_queue">
    <minResources>4000 mb,0vcores</minResources>
    <maxResources>8000 mb,8vcores</maxResources>
    <maxRunningApps>10</maxRunningApps>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>

SPARK:

<allocations>
  <pool name="sample_queue">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>


Use these parameters!
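For the Spark pool definitions above to take effect, fair scheduling is enabled in the application's configuration; a sketch using standard Spark properties (the file path is illustrative):

```
spark.scheduler.mode             FAIR
spark.scheduler.allocation.file  /path/to/fairscheduler.xml
```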

Fair Schedulers


YARN:

<allocations>
  <user name="sample_user">
    <maxRunningApps>6</maxRunningApps>
  </user>
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>


What is the memory limit for mySparkApp?


Max memory in "pool" x 3/4 = mySparkApp_mem_limit


Limitation: <maxResources>___mb</maxResources>


Reserve 25% for overhead


mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)


Limitation: Driver must not be larger than a single node.


Parameter              Default        Recommended
spark.executor.cores   1 (YARN mode)  5 or less


executors per node = (cores per node) / (5 cores per executor)


executor.memory = (memory per node) / (executors per node)


maxExecutors = (executors per node) x (num nodes)
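As a worked example of this arithmetic, here is a sketch in Python. The cluster numbers (10 nodes, 16 cores and 64 GB per node, 4 GB driver, 512 GB pool) are hypothetical, not from the talk:

```python
# Hypothetical cluster; plug in your own numbers.
cores_per_node = 16
mem_per_node_gb = 64
num_nodes = 10
driver_memory_gb = 4
pool_max_memory_gb = 512  # the pool's <maxResources> cap

# Recommended: 5 or fewer cores per executor
executors_per_node = cores_per_node // 5                     # 3
executor_memory_gb = mem_per_node_gb // executors_per_node   # 21
max_executors = executors_per_node * num_nodes               # 30

# Reserve 25% of the pool for overhead
app_mem_limit_gb = pool_max_memory_gb * 3 / 4                # 384.0

# Verify the calculations respect the limit
app_mem_gb = driver_memory_gb + executor_memory_gb * max_executors  # 634
fits = app_mem_gb <= app_mem_limit_gb  # False: shrink executor.memory or maxExecutors
```

With these (made-up) numbers the app would exceed its pool limit, which is exactly the check the next slide asks you to make.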


Verify my calculations respect this limitation.


[Diagram: mySparkApp success path — Initial config → Memory trouble → Intermittent → Reliable → Optimal]

mySparkApp memory issues

Reduce the memory needed for mySparkApp. How?
• persist(StorageLevel.[*]_SER)
• Recommended: KryoSerializer*
• Spark SQL's optimizer

Gracefully handle memory limitations. How? Here, let's talk about one scenario.

Symptoms:
• mySparkApp has been running for several hours.
• A container is lost.
• Several executors are lost.
• Behavior is intermittent (sometimes succeeds, sometimes fails).

Potential Solution: RDD.checkpoint()


Use in these cases:
• high-traffic cluster
• network blips
• preemption
• disk space nearly full

Function:
• saves the RDD to stable storage (e.g. HDFS or S3)

How-to: cache first!
SparkContext.setCheckpointDir(directory: String)
RDD.checkpoint()

[Diagram: mySparkApp success path — Initial config → Memory trouble → Intermittent → Reliable → Optimal]

Instead of 2.5 hours, myApp completes in 1 hour.

Cheat-sheet: techsuppdiva.github.io/


HighPerformanceSpark.com

Further Reading:
• Spark Tuning Cheat-sheet: techsuppdiva.github.io
• Apache Spark documentation: https://spark.apache.org/docs/latest
• Checkpointing: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing and https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-rdd-checkpointing.adoc

• Learning Spark, by H. Karau, A. Konwinski, P. Wendell, M. Zaharia, 2015


More Questions?


Video: https://www.youtube.com/watch?v=DNWaMR8uKDc&feature=youtu.be
Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren

Thanks!

Extra slides

"Checkpointing"*

Checkpoint* reliably using RDD.checkpoint()

• Need better driver failure recovery?* → Metadata checkpoint*
• Using stateful transformations?* → RDD checkpoint*

SPARK-9947: Separate Metadata and State Checkpoint