Post on 16-Apr-2017
transcript
Spark Tuning for Enterprise System Administrators
Anya T. Bida, PhD, and Rachel B. Warren
Don't worry about missing something...
Video: https://www.youtube.com/watch?v=DNWaMR8uKDc&feature=youtu.be
Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren
About Anya: Operations Engineer
About Rachel: Spark & Scala Enthusiast / Data Engineer, Alpine Data (alpinenow.com)
About You: Spark practitioners
[Roadmap diagram: Intermittent, Reliable, Optimal → mySparkApp Success]
Default != Recommended
Example: By default, spark.executor.memory = 1g. One gigabyte lets small jobs finish out of the box, but Spark assumes you'll increase this parameter for real workloads.
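As a minimal sketch of overriding that default (the app name and the 4g figure are illustrative assumptions, not recommendations for any particular cluster), the setting can go on the SparkConf, or equivalently be passed as --conf spark.executor.memory=4g to spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

// Raise the 1g default to fit the workload; "4g" is a hypothetical value.
val conf = new SparkConf()
  .setAppName("mySparkApp")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)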
Which parameters are important?
How do I configure them?
Default != Recommended
• Filter data before an expensive reduce or aggregation; consider coalesce().
• Use data structures that require less memory (see tuning.html#tuning-data-structures).
• Serialize: in PySpark, serializing is built-in; in Scala/Java, use persist(StorageLevel.[*]_SER). Recommended: KryoSerializer (see the sketch below).
• See "Optimize partitions," "GC investigation," and "Checkpointing."
The Spark Tuning Cheat-Sheet
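For Scala/Java, a minimal sketch of the serialization recommendation above, assuming a hypothetical input path:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("mySparkApp")
  // use Kryo instead of default Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs:///some/input")   // hypothetical path
// store partitions as serialized bytes: smaller memory footprint, at some CPU cost
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count()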
[Roadmap diagram: Initial config, Memory trouble, Intermittent, Reliable, Optimal → mySparkApp Success]
How many in the audience have their own cluster?
Fair Schedulers

YARN:
<allocations>
  <queue name="sample_queue">
    <minResources>4000 mb,0vcores</minResources>
    <maxResources>8000 mb,8vcores</maxResources>
    <maxRunningApps>10</maxRunningApps>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>

SPARK:
<allocations>
  <pool name="sample_queue">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
Use these parameters!
Fair Schedulers

YARN:
<allocations>
  <user name="sample_user">
    <maxRunningApps>6</maxRunningApps>
  </user>
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
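On the Spark side, the pool definition above only takes effect if fair scheduling is enabled. A minimal sketch, assuming the allocation file lives at a hypothetical /etc/spark/fairscheduler.xml:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("mySparkApp")
  .set("spark.scheduler.mode", "FAIR")
  // hypothetical path to the <allocations> file shown above
  .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
val sc = new SparkContext(conf)

// jobs submitted from this thread run in the sample_queue pool
sc.setLocalProperty("spark.scheduler.pool", "sample_queue")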
What is the memory limit for mySparkApp?
Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
Limitation: <maxResources>___mb</maxResources>
Reserve 25% for overhead
mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
Limitation: Driver must not be larger than a single node.
Parameter              Default          Recommended
spark.executor.cores   1 (YARN mode)    5 or less
executors per node = (cores per node) / (5 cores per executor)
executor.memory = (memory per node) / (executors per node)
maxExecutors = (executors per node) x (num nodes)
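To make the formulas concrete, here is a purely illustrative walk-through; the cluster size, memory figures, and driver.memory below are hypothetical, not recommendations (in practice you would also subtract some per-node memory for the OS and YARN overhead before dividing):

Hypothetical cluster: 10 worker nodes, 16 cores and 64 GB memory per node, driver.memory = 4g.
executors per node = 16 / 5 ≈ 3
executor.memory = 64 GB / 3 ≈ 21g
maxExecutors = 3 x 10 = 30
mySparkApp_mem_limit = 4g + (21g x 30) = 634g
So the pool's max memory would need to be at least 634g / (3/4) ≈ 846 GB for this configuration to fit.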
Verify my calculations respect this limitation.
[Roadmap diagram: Initial config, Memory trouble, Intermittent, Reliable, Optimal → mySparkApp Success]
mySparkApp memory issues

Reduce the memory needed for mySparkApp. How?
• persist(StorageLevel.[*]_SER)
• Recommended: KryoSerializer
• Spark SQL's Optimizer (see the sketch below)
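As a hedged illustration of the "Spark SQL's Optimizer" point (Spark 2.x style; the input path and column names are hypothetical): expressing the job with DataFrames lets the optimizer push filters down before the aggregation and keep rows in Spark SQL's compact binary format rather than as JVM objects.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mySparkApp").getOrCreate()
import spark.implicits._

// hypothetical input and columns
val events = spark.read.json("hdfs:///some/events")
events.filter($"status" === "error")   // filter applied before the aggregation
  .groupBy($"host")
  .count()
  .show()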
Gracefully handle memory limitations. How?
Here, let's talk about one scenario.
Symptoms:
• mySparkApp has been running for several hours; a container is lost.
• Several executors are lost.
• Behavior is intermittent (sometimes succeeds, sometimes fails).
Potential Solution: RDD.checkpoint()
Use in these cases:
• high-traffic cluster
• network blips
• preemption
• disk space nearly full

Function:
• saves the RDD to stable storage (e.g. HDFS or S3)

How-to: Cache first!
SparkContext.setCheckpointDir(directory: String)
RDD.checkpoint()
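A minimal sketch of the how-to above (the checkpoint directory, input path, and map function are hypothetical stand-ins):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("mySparkApp"))

// checkpoint directory on stable storage (HDFS or S3); path is hypothetical
sc.setCheckpointDir("hdfs:///checkpoints/mySparkApp")

val expensive = sc.textFile("hdfs:///some/input")
  .map(line => line.split(",").length)          // stand-in for hours of work
  .persist(StorageLevel.MEMORY_AND_DISK_SER)    // cache first, so checkpointing doesn't recompute the lineage
expensive.checkpoint()                          // mark for saving to the checkpoint dir; lineage is truncated afterwards
expensive.count()                               // the first action materializes both the cache and the checkpoint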
[Roadmap diagram: Initial config, Memory trouble, Intermittent, Reliable, Optimal → mySparkApp Success]
Instead of 2.5 hours, mySparkApp now completes in 1 hour.
Cheat-sheet: techsuppdiva.github.io/
[Roadmap diagram: Initial config, Memory trouble, Intermittent, Reliable, Optimal → mySparkApp Success]
HighPerformanceSpark.com
Further Reading:
• Spark Tuning Cheat-sheet: techsuppdiva.github.io
• Apache Spark Documentation: https://spark.apache.org/docs/latest
• Checkpointing: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing and https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-rdd-checkpointing.adoc
• Learning Spark, by H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, 2015
More Questions?
Thanks!
Video: https://www.youtube.com/watch?v=DNWaMR8uKDc&feature=youtu.be
Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren
Extra slides
"Checkpointing"*
Checkpoint* reliably using RDD.checkpoint()
Need better Driver failure recovery?*
Metadata Checkpoint*
Using stateful transformations?*
RDD Checkpoint*
SPARK-9947 Separate Metadata and State Checkpoint