Spark Tuning For Enterprise System Administrators, Spark Summit East 2016


Spark Tuning for Enterprise System Administrators

Anya T. Bida, PhD Rachel B. Warren

Don't worry about missing something...

Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren

About Anya: Operations Engineer
About Rachel: Spark & Scala Enthusiast / Data Engineer
About Alpine Data: alpinenow.com

Alpine deploys Spark in Production for our Enterprise Customers

About You*: Enterprise System Administrators

[Diagram: mySparkApp success progresses from Intermittent → Reliable → Optimal]

Default != Recommended. Example: by default, spark.executor.memory = 1g. 1g allows small jobs to finish out of the box; Spark assumes you'll increase this parameter.
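As an illustrative sketch (the 4g value and the app name are assumptions, not a recommendation for your cluster), the parameter can be raised either cluster-wide or per job:

```shell
# In spark-defaults.conf (illustrative value):
#   spark.executor.memory   4g

# Or per job, on the spark-submit command line (myapp.py is a hypothetical job):
spark-submit --executor-memory 4g myapp.py
```

Per-job settings override spark-defaults.conf, which is usually what you want on a shared cluster.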


Which parameters are important? !

How do I configure them?


Default != Recommended
• Filter* your data before an expensive reduce or aggregation; consider* coalesce().
• Use* data structures that require less memory (see tuning.html#tuning-data-structures).
• Serialize*: in PySpark, serializing is built-in; in Scala/Java, use persist(StorageLevel.[*]_SER). Recommended: KryoSerializer.

* See "Optimize partitions.", "GC investigation.", and "Checkpointing." in the cheat-sheet.

The Spark Tuning Cheat-Sheet
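A sketch of the serializer settings above as spark-defaults.conf entries (the class name is Spark's built-in Kryo serializer; the buffer value is an illustrative assumption, not a recommendation):

```
# spark-defaults.conf
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  128m   # illustrative; raise if Kryo reports buffer-overflow errors
```

With this in place, Scala/Java RDDs persisted with a [*]_SER storage level are serialized with Kryo instead of Java serialization, which typically shrinks the in-memory footprint.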

[Diagram: Intermittent → Reliable → Optimal; current focus: mySparkApp memory issues on a shared cluster]

Fair Schedulers


YARN:
<allocations>
  <queue name="sample_queue">
    <minResources>4000 mb,0vcores</minResources>
    <maxResources>8000 mb,8vcores</maxResources>
    <maxRunningApps>10</maxRunningApps>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>

SPARK:
<allocations>
  <pool name="sample_queue">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>


Configure these parameters too!

Fair Schedulers


YARN:
<allocations>
  <user name="sample_user">
    <maxRunningApps>6</maxRunningApps>
  </user>
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>

What is the memory limit for mySparkApp?


Max Memory in "pool" x 3/4 = mySparkApp_mem_limit

Example: with <maxResources>8000 mb</maxResources>, the limit is about 6000 mb.

Reserve 25% for overhead.


Max Memory in "pool" x 3/4 = mySparkApp_mem_limit

mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
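The two formulas above can be sanity-checked with plain arithmetic. A minimal sketch, using the 8000 mb pool from the YARN example and illustrative driver/executor sizes (all values are assumptions for the example):

```python
# Pool cap from the fair-scheduler example: <maxResources>8000 mb</maxResources>
pool_max_mb = 8000

# Reserve 25% for overhead: mySparkApp_mem_limit = pool max x 3/4
app_mem_limit_mb = pool_max_mb * 3 // 4

# mySparkApp_mem_limit must cover driver.memory + executor.memory x maxExecutors
driver_memory_mb = 1024     # illustrative spark.driver.memory = 1g
executor_memory_mb = 1024   # illustrative spark.executor.memory = 1g
max_executors = 4           # illustrative spark.dynamicAllocation.maxExecutors

requested_mb = driver_memory_mb + executor_memory_mb * max_executors

# The request (5120 mb) fits under the limit (6000 mb)
assert requested_mb <= app_mem_limit_mb
print(app_mem_limit_mb, requested_mb)
```

If the request exceeds the limit, shrink executor.memory or maxExecutors rather than assuming the scheduler will absorb the difference.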


Limitation: each driver or executor must not be larger than a single node.

executor.memory ~ (yarn.nodemanager.resource.memory-mb - 1 GB) / (number of executors per node)
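Worked through with illustrative numbers (the 32 GB node size and 4 executors per node are assumptions for the example, not a sizing recommendation):

```python
# Illustrative node size: yarn.nodemanager.resource.memory-mb = 32 GB
node_memory_mb = 32 * 1024

# Leave roughly 1 GB per node for the OS and the NodeManager itself
usable_mb = node_memory_mb - 1024

# executor.memory ~ usable memory / executors per node
executors_per_node = 4
executor_memory_mb = usable_mb // executors_per_node
print(executor_memory_mb)  # 7936 mb, i.e. just under 8g per executor
```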


Limitation: maxExecutors should not exceed the pool allocation.
YARN: <maxResources>8vcores</maxResources>


I want a little more information...
• Top 5 Mistakes When Writing Spark Applications, by Mark Grover and Ted Malaska of Cloudera: http://www.slideshare.net/hadooparchbook/top-5-mistakes-when-writing-spark-applications
• How-to: Tune Your Apache Spark Jobs (Part 2), by Sandy Ryza of Cloudera: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

I want lots more...


[Diagram: Intermittent → Reliable → Optimal; current focus: mySparkApp memory issues on a shared cluster]

Reduce the memory needed for mySparkApp. How?

Gracefully handle memory limitations. How?

Reduce the memory needed for mySparkApp. How?
persist(StorageLevel.[*]_SER)
Recommended: KryoSerializer

Gracefully handle memory limitations. How?
Here, let's talk about one scenario.

Symptoms:


• mySparkApp has been running for several hours when a container is lost.
• One container fails, then the rest fail one by one.
• The first container to fail was the driver.
• The driver is a SPOF (single point of failure).

Investigate:


• Driver failures are often caused by collecting unbounded data to the driver.
• I verified only bounded data is brought to the driver, but still the driver fails intermittently.

Potential Solution: RDD.checkpoint()


Use in these cases:
• high-traffic cluster
• network blips
• preemption
• disk space nearly full

Function: saves the RDD to stable storage (e.g. HDFS or S3).

How-to:
SparkContext.setCheckpointDir(directory: String)
RDD.checkpoint()
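The recipe above in PySpark, as a sketch (the paths and app name are illustrative assumptions, and this needs a running Spark installation, so it is shown untested):

```python
from pyspark import SparkContext

sc = SparkContext(appName="mySparkApp")  # illustrative app name

# Point checkpoints at stable storage (HDFS or S3), not local disk
sc.setCheckpointDir("hdfs:///tmp/mySparkApp/checkpoints")  # illustrative path

rdd = sc.textFile("hdfs:///data/input")  # illustrative input
rdd.checkpoint()  # marks the RDD; truncates its lineage once materialized
rdd.count()       # the first action after checkpoint() writes to the checkpoint dir
```

Because the checkpointed copy lives in stable storage, a lost container can recompute from it instead of replaying the full lineage.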

[Diagram: Intermittent → Reliable → Optimal; mySparkApp memory issues on the shared cluster resolved]

Instead of 2.5 hours, myApp completes in 1 hour.

Cheat-sheet techsuppdiva.github.io/

[Diagram: Intermittent → Reliable → Optimal; mySparkApp success]

HighPerformanceSpark.com

Further Reading:
• Learning Spark, by H. Karau, A. Konwinski, P. Wendell, M. Zaharia, 2015, O'Reilly: https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
• Scheduling: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
• Tuning the Spark Conf: Mark Grover and Ted Malaska (Cloudera): http://www.slideshare.net/hadooparchbook/top-5-mistakes-when-writing-spark-applications; Sandy Ryza (Cloudera): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
• Checkpointing: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
• Troubleshooting: Miklos Christine (Databricks): https://spark-summit.org/east-2016/events/operational-tips-for-deploying-spark/
• High Performance Spark, by R. Warren and H. Karau, coming in 2016, O'Reilly: http://highperformancespark.com/


More Questions?


Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren