Transcript

Spark Tuning for Enterprise System Administrators

Anya T. Bida, PhD Rachel B. Warren

Don't worry about missing something...

Video: https://www.youtube.com/watch?v=DNWaMR8uKDc&feature=youtu.be
Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren

About Anya: Operations Engineer
About Rachel: Spark & Scala Enthusiast / Data Engineer, Alpine Data (alpinenow.com)

About You*: Spark practitioners
Diagram: mySparkApp success progresses from Intermittent → Reliable → Optimal.

Default != Recommended
Example: By default, spark.executor.memory = 1g. That 1g allows small jobs to finish out of the box; Spark assumes you'll increase this parameter for real workloads (see the sketch below).
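For instance, a minimal sketch (Scala) of raising the default before the SparkContext starts; the 4g value is illustrative, not a recommendation from the deck:

  import org.apache.spark.{SparkConf, SparkContext}

  // spark.executor.memory must be set before executors are requested;
  // size it for your workload and queue rather than copying 4g.
  val conf = new SparkConf()
    .setAppName("mySparkApp")
    .set("spark.executor.memory", "4g")
  val sc = new SparkContext(conf)

The same value can also go in spark-defaults.conf or on the spark-submit command line via --executor-memory.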

Which parameters are important?
How do I configure them?

Default != Recommended
Filter* data before an expensive reduce or aggregation; consider* coalesce().
Use* data structures that require less memory.
Serialize*: in PySpark, serializing is built-in; for Scala/Java, use persist(StorageLevel.[*]_SER). Recommended: KryoSerializer.* (A sketch follows below.)

* See https://spark.apache.org/docs/latest/tuning.html#tuning-data-structures, and "Optimize partitions," "GC investigation," and "Checkpointing" in the Spark Tuning Cheat-Sheet.
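A minimal Scala sketch of the serialization points above; the record class, input path, and filter predicate are placeholders, not from the deck:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  case class MyRecord(id: Long, msg: String)  // placeholder class to register with Kryo

  val conf = new SparkConf()
    .setAppName("mySparkApp")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[MyRecord]))
  val sc = new SparkContext(conf)

  // Filter before the expensive aggregation, then persist in serialized form
  // so the cached blocks take less heap.
  val errors = sc.textFile("hdfs:///placeholder/path")
    .filter(_.contains("ERROR"))
    .persist(StorageLevel.MEMORY_ONLY_SER)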

Diagram: mySparkApp success path (Intermittent → Reliable → Optimal), annotated with the topics Initial config and Memory trouble.

How many in the audience have their own cluster?

Fair Schedulers

YARN:
<allocations>
  <queue name="sample_queue">
    <minResources>4000 mb,0vcores</minResources>
    <maxResources>8000 mb,8vcores</maxResources>
    <maxRunningApps>10</maxRunningApps>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>

SPARK:
<allocations>
  <pool name="sample_queue">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>

Use these parameters!

Fair Schedulers

YARN:
<allocations>
  <user name="sample_user">
    <maxRunningApps>6</maxRunningApps>
  </user>
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
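To route an application into these pools and queues, a sketch like the following works; the allocation-file path is a placeholder, while spark.scheduler.mode, spark.scheduler.allocation.file, spark.yarn.queue, and setLocalProperty("spark.scheduler.pool", ...) are standard Spark settings and APIs:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("mySparkApp")
    .set("spark.scheduler.mode", "FAIR")                                  // enable Spark's fair scheduler
    .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // placeholder path to the pool XML above
    .set("spark.yarn.queue", "sample_queue")                              // YARN fair-scheduler queue from the first XML
  val sc = new SparkContext(conf)

  // Jobs submitted from this thread land in the matching Spark pool.
  sc.setLocalProperty("spark.scheduler.pool", "sample_queue")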

What is the memory limit for mySparkApp?

Sidebar: Spark Architecture. Diagram: the Driver, the Cluster Manager, and the Executors.
(Credit: Mark Grover, http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska)

Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
Reserve 25% for overhead. The pool's max memory comes from the fair scheduler allocation: <maxResources>___mb</maxResources>

mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors)

Limitation: the Driver must not be larger than a single node.
Diagram: the Driver Container (spark.driver.memory) must fit inside a node's yarn.nodemanager.resource.memory-mb.

Verify my calculations respect this limitation. (A worked example follows below.)
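A quick worked example with illustrative numbers (reusing the 8000 mb <maxResources> from sample_queue above; the Spark settings are made up for the arithmetic):

  8000 mb x 3/4 = 6000 mb = mySparkApp_mem_limit
  driver.memory (1g) + executor.memory (1g) x dynamicAllocation.maxExecutors (4) = 5g
  5g < 6000 mb, so this configuration respects the limit.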

Diagram (recap): mySparkApp success path (Intermittent → Reliable → Optimal); Initial config done, Memory trouble next.

mySparkApp memory issues
Reduce the memory needed for mySparkApp. How? For example, persist(StorageLevel.[*]_SER) with the recommended KryoSerializer.*
Gracefully handle memory limitations. How? Here let's talk about one scenario.

Spark 1.1-1.5, Recommendation: increase spark.storage.memoryFraction (the pre-1.6 setting).
Spark 1.6, Recommendation: UnifiedMemoryManager.

Alexey Grishchenko: https://0x0fff.com/spark-memory-management/
Sandy Ryza: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
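A minimal sketch of the knobs involved (Scala; the values are illustrative, not recommendations). Pre-1.6, the legacy manager splits the heap with fixed fractions; in 1.6 the unified manager lets storage and execution borrow from each other, so the defaults usually suffice:

  import org.apache.spark.SparkConf

  // Spark 1.1-1.5 (legacy memory management): give persisted blocks a larger slice.
  val legacyConf = new SparkConf()
    .set("spark.storage.memoryFraction", "0.7")  // default 0.6; illustrative value
    .set("spark.shuffle.memoryFraction", "0.2")  // default 0.2

  // Spark 1.6 (UnifiedMemoryManager): usually leave the defaults alone;
  // spark.memory.storageFraction only sets the eviction-immune share of the unified region.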

Diagram: an Executor Container holds spark.executor.memory plus spark.yarn.executor.memoryOverhead, and the container must fit within the node's yarn.nodemanager.resource.memory-mb.
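A short worked example of container sizing (illustrative values; the roughly 10% / 384 mb overhead default is an approximation for Spark 1.x on YARN, verify it for your version):

  spark.executor.memory = 4g; spark.yarn.executor.memoryOverhead ≈ max(384 mb, 10% of 4096 mb) ≈ 410 mb
  Executor Container ≈ 4096 mb + 410 mb ≈ 4.4 gb
  With yarn.nodemanager.resource.memory-mb = 16384, at most 3 such containers fit per node.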

Sidebar: Spark Architecture. Diagram: the Driver and the Cluster Manager coordinate multiple Executors; each Executor container is sized by spark.executor.memory plus spark.yarn.executor.memoryOverhead inside yarn.nodemanager.resource.memory-mb.

Diagram (recap): mySparkApp success path (Intermittent → Reliable → Optimal), with Initial config and Memory trouble addressed.

Instead of 2.5 hours, myApp completes in 1 hour.

Cheat-sheet techsuppdiva.github.io/

Diagram (recap): the mySparkApp success path one more time: Intermittent → Reliable → Optimal (Initial config, Memory trouble).

HighPerformanceSpark.com

Further Reading:
• Spark Tuning Cheat-sheet: techsuppdiva.github.io
• Apache Spark Documentation: https://spark.apache.org/docs/latest
• Checkpointing: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing and https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-rdd-checkpointing.adoc
• Learning Spark, by H. Karau, A. Konwinski, P. Wendell, M. Zaharia, 2015

More Questions?

Video: https://www.youtube.com/watch?v=DNWaMR8uKDc&feature=youtu.be
Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren

Thanks!

