Date posted: 13-Apr-2017 · Category: Data & Analytics · Uploaded by: anya-bida
Video: https://www.youtube.com/watch?v=DNWaMR8uKDc&feature=youtu.be
Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren
About Anya: Operations Engineer
About Rachel: Spark & Scala Enthusiast / Data Engineer
Alpine Data (alpinenow.com)
Default != Recommended
Example: by default, spark.executor.memory = 1g. One gigabyte allows small jobs to finish out of the box; Spark assumes you'll increase this parameter.
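For instance, the default could be raised in spark-defaults.conf or at submit time (the 4g value and the jar name are illustrative, not recommendations):

```properties
# spark-defaults.conf
spark.executor.memory  4g
```

Equivalently at submit time: spark-submit --conf spark.executor.memory=4g myApp.jar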
Reduce memory needs:
• Filter* data before an expensive reduce or aggregation; consider* coalesce()
• Use* data structures that require less memory
• Serialize*: in PySpark, serialization is built-in; in Scala/Java, persist(StorageLevel.[*]_SER). Recommended: KryoSerializer (see tuning.html#tuning-data-structures)
* See "Optimize partitions," "GC investigation," and "Checkpointing" in the cheat-sheet.
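The serialization tip above can be sketched as configuration (a sketch; the class name is a placeholder for your own types):

```properties
# spark-defaults.conf: switch from Java serialization to Kryo
spark.serializer  org.apache.spark.serializer.KryoSerializer
# optionally pre-register application classes to shrink serialized output
spark.kryo.classesToRegister  com.example.MyRecord
```

Then persist RDDs with a serialized storage level, e.g. persist(StorageLevel.MEMORY_ONLY_SER), so cached data is stored in the compact Kryo form.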
The Spark Tuning Cheat-Sheet
Fair Schedulers
YARN fair-scheduler.xml:
<allocations>
  <queue name="sample_queue">
    <minResources>4000 mb,0vcores</minResources>
    <maxResources>8000 mb,8vcores</maxResources>
    <maxRunningApps>10</maxRunningApps>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>

Spark fairscheduler.xml:
<allocations>
  <pool name="sample_queue">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
Use these parameters!
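To wire a Spark application to such a pool, configuration along these lines is typically involved (the file path is a placeholder):

```properties
# spark-defaults.conf: enable the fair scheduler and point at the pools file
spark.scheduler.mode  FAIR
spark.scheduler.allocation.file  /path/to/fairscheduler.xml
```

At runtime, a job can then be placed in a pool with sc.setLocalProperty("spark.scheduler.pool", "sample_queue").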
YARN fair-scheduler.xml (per-user limits):
<allocations>
  <user name="sample_user">
    <maxRunningApps>6</maxRunningApps>
  </user>
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
Sidebar: Spark Architecture
[Figure: Driver, Cluster Manager, Executors]
Mark Grover: http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska
What is the memory limit for mySparkApp?
Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
(Reserve 25% for overhead. The pool's maximum comes from <maxResources>___mb</maxResources>.)
mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
Limitation: the driver must not be larger than a single node.
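The pool-limit arithmetic above can be sanity-checked with a short sketch (parameter names follow the slide; all values are illustrative):

```python
# Sketch of the slide's memory-limit arithmetic (values are illustrative).

def mem_limit_mb(pool_max_mb):
    """Reserve 25% of the pool for overhead: limit = pool max * 3/4."""
    return pool_max_mb * 3 / 4

def fits(driver_mb, executor_mb, max_executors, pool_max_mb):
    """mySparkApp fits if driver.memory plus executor.memory times
    dynamicAllocation.maxExecutors stays under the limit."""
    return driver_mb + executor_mb * max_executors < mem_limit_mb(pool_max_mb)

# Example: an 8000 mb pool yields a 6000 mb limit.
# driver 1024 + 4 executors x 1024 = 5120 mb < 6000 mb, so it fits.
print(fits(1024, 1024, 4, 8000))
```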
Verify my calculations respect this limitation.
mySparkApp memory issues:
• Reduce the memory needed for mySparkApp. How?
• Gracefully handle memory limitations. How?
One scenario for reducing memory: persist(StorageLevel.[*]_SER), with KryoSerializer recommended.
Spark 1.1-1.5 recommendation: increase spark.storage.memoryFraction. Spark 1.6+ recommendation: rely on the UnifiedMemoryManager.
Alexey Grishchenko: https://0x0fff.com/spark-memory-management/
Sandy Ryza: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
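On pre-1.6 versions, that suggestion would look like the following (0.7 is an illustrative value; the legacy default was 0.6):

```properties
# spark-defaults.conf, Spark 1.1-1.5 (legacy memory manager):
# give a larger share of the heap to cached/storage memory
spark.storage.memoryFraction  0.7
```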
[Figure: Executor Container memory layout. The YARN container, capped by yarn.nodemanager.resource.memory-mb, holds spark.executor.memory plus spark.yarn.executor.memoryOverhead.]
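That container layout implies a simple sizing check (a sketch; the overhead default varies by Spark version, so it is passed in explicitly here):

```python
# Sketch: an executor's YARN container must fit on the node (values illustrative).

def container_fits(executor_memory_mb, memory_overhead_mb, node_memory_mb):
    """The container holds spark.executor.memory plus
    spark.yarn.executor.memoryOverhead; the total must fit within
    yarn.nodemanager.resource.memory-mb."""
    return executor_memory_mb + memory_overhead_mb <= node_memory_mb

# Example: a 4096 mb executor + 384 mb overhead fits on an 8192 mb node.
print(container_fits(4096, 384, 8192))
```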
Sidebar: Spark Architecture
[Figure: Driver, Cluster Manager, and Executors; each Executor runs in a YARN container sized by yarn.nodemanager.resource.memory-mb, spark.executor.memory, and spark.yarn.executor.memoryOverhead.]
mySparkApp Success: Initial config → Memory trouble → Intermittent → Reliable → Optimal.
Instead of 2.5 hours, myApp completes in 1 hour.
Further Reading:
• Spark Tuning Cheat-sheet: techsuppdiva.github.io
• High Performance Spark: HighPerformanceSpark.com
• Apache Spark Documentation: https://spark.apache.org/docs/latest
• Checkpointing: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing and https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-rdd-checkpointing.adoc
• Learning Spark, by H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, 2015
More Questions?
Thanks!
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren