Page 1: Spark on Yarn

Spark on Yarn

Spark Meetup

Oct 17, 2015

Page 2: Spark on Yarn

Agenda

• Autoscaling Spark Apps
• Yarn
• Cluster Management
• Interfaces/APIs for Running Spark Jobs
• JobServer
• Persistent History Server
• Hive Integration with Spark

Page 3: Spark on Yarn

Autoscaling Spark Applications

Page 4: Spark on Yarn

Spark Provisioning: Problems

• A Spark application starts with a fixed number of resources and holds on to them for as long as it is alive

• Sometimes it is difficult to estimate the resources required by a job, since the AM is long-running

• This becomes especially limiting when YARN clusters can autoscale.

Page 5: Spark on Yarn

Dynamic Provisioning

• Speed up Spark commands by using free resources in the YARN cluster, and release resources back to the RM when they are no longer needed.

Page 6: Spark on Yarn

Spark on Yarn basics

[Diagrams: cluster mode — Driver and AM in one JVM with Executor-1 … Executor-n; client mode — Driver and AM in separate JVMs with Executor-1 … Executor-n]

• Cluster Mode: the Driver and AM run in the same JVM (the YARN ApplicationMaster container)

• Client Mode: the Driver and AM run in separate JVMs

• The Driver and AM talk using actors, so both cases are handled the same way

Page 7: Spark on Yarn

Dynamic Provisioning: Problem Statement

• Two parts:
  – The Spark AM has no way to ask for additional containers or to give up free containers
  – Automating the process of requesting and releasing containers. Cached data in containers makes this difficult

Page 8: Spark on Yarn

Dynamic Provisioning: Part1

Page 9: Spark on Yarn

Dynamic Provisioning: Part1

• Implementation of two new APIs:

  // Request 5 extra executors
  sc.requestExecutors(5)

  // Kill executors with IDs 1, 15, and 16
  sc.killExecutors(Seq("1", "15", "16"))

Page 10: Spark on Yarn

requestExecutors

[Diagram: Driver ↔ AM reporter thread ↔ executors E1 … En]

• The AM has a reporter thread that keeps a count of the target number of executors

• The reporter thread was already used to restart executors that died

• The driver increments the executor count when sc.requestExecutors is called (a simplified sketch of this bookkeeping follows below)
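A hypothetical sketch of that bookkeeping, illustrative only and not Spark's actual internals:

  // Driver asks for more executors; the AM's reporter thread reconciles the
  // target against what is currently running and requests the difference.
  class ExecutorTargetTracker(initialExecutors: Int) {
    @volatile private var target = initialExecutors
    // Driver side: called when sc.requestExecutors(n) arrives.
    def addExecutors(n: Int): Unit = { target += n }
    // Reporter-thread side: how many containers to ask YARN for this round.
    def missing(runningExecutors: Int): Int = math.max(0, target - runningExecutors)
  }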

Page 11: Spark on Yarn

removeExecutors

• To kill executors, one must specify exactly which executors need to be killed

• The driver maintains a list of all executors, which can be obtained with:
  sc.executorStorageStatuses.foreach(x => println(x.blockManagerId.executorId))

• What is cached in each executor is also available, using:
  sc.executorStorageStatuses.foreach(x => println(s"memUsed=${x.memUsed} diskUsed=${x.diskUsed}"))
  (both snippets are combined in the sketch below)
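A hedged sketch combining the two snippets, assuming the storage-status API named above: pick executors with nothing cached and release them.

  // Find executors holding no cached data and ask the AM to release them.
  val idleIds = sc.executorStorageStatuses
    .filter(s => s.memUsed == 0 && s.diskUsed == 0)            // nothing cached
    .map(_.blockManagerId.executorId)
    .filterNot(id => id == "driver" || id == "<driver>")       // skip the driver's own block manager
  if (idleIds.nonEmpty) sc.killExecutors(idleIds.toSeq)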

Page 12: Spark on Yarn

Removing Executors Tradeoffs

• The BlockManager in each executor can hold cached RDDs, shuffle data, and broadcast data

• Killing an executor with shuffle data will require its stage to rerun.

• To avoid this, use the external shuffle service introduced in Spark 1.2 (see the configuration sketch below)
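A minimal configuration sketch using standard open-source Spark settings (assumes the spark_shuffle auxiliary service is installed on the NodeManagers):

  import org.apache.spark.SparkConf

  // The external shuffle service keeps shuffle files alive after an executor
  // is removed, so downscaling does not force stages to rerun.
  val conf = new SparkConf()
    .set("spark.shuffle.service.enabled", "true")     // shuffle service runs in the NodeManager
    .set("spark.dynamicAllocation.enabled", "true")   // let Spark grow/shrink executors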

Page 13: Spark on Yarn

Dynamic Provisioning: Part2

Page 14: Spark on Yarn

Upscaling Heuristics

• Request as many executors as there are pending tasks
• Request executors in rounds if there are pending tasks, doubling the number of executors added in each round, bounded by an upper limit (see the sketch below)
• Request executors by estimating the workload
• Introduced --max-executors as an extra parameter

Page 15: Spark on Yarn

Downscaling Heuristics

• Remove executors when they are idle
• Remove executors if they have been idle for X seconds (see the sketch below)
• Can't downscale executors holding shuffle data or broadcast data
• --num-executors acts as the minimum number of executors
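A minimal sketch of the idle-timeout rule, with --num-executors acting as the floor; again the names are illustrative only:

  // Pick executors that have been idle longer than the timeout, but never
  // drop below the minimum (--num-executors).
  def executorsToRemove(idleSeconds: Map[String, Long],   // executorId -> seconds idle
                        totalExecutors: Int,
                        minExecutors: Int,
                        idleTimeoutSecs: Long): Seq[String] = {
    val candidates = idleSeconds.filter(_._2 >= idleTimeoutSecs).keys.toSeq
    val removable  = math.max(0, totalExecutors - minExecutors)
    candidates.take(removable)
  }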

Page 16: Spark on Yarn

Scope

• Kill executors on spot nodes first
• Flag to avoid killing executors if they have shuffle data

Page 17: Spark on Yarn

Yarn

Page 18: Spark on Yarn

Hadoop1

Page 19: Spark on Yarn

Disadvantages of hadoop1

• Limited to only MapReduce
• Separate Map and Reduce slots => underutilization
• The JobTracker has multiple responsibilities: job scheduling, monitoring, and resource allocation.

Page 20: Spark on Yarn

Yarn Overview

Page 21: Spark on Yarn

Advantages of Spark on Yarn

• A general cluster for running multiple workflows; the AM can have custom scheduling logic

• The AM can ask for more containers when required and give up containers when they are free

• This becomes even better when YARN clusters can autoscale

• Features like spot nodes become available, which bring additional challenges

Page 22: Spark on Yarn

Cloud Cluster Management in Qubole

Page 23: Spark on Yarn

Cluster management

• Clusters run in customer accounts
• Support for VPC, multiple regions, and multiple clouds
• Various node types supported
• Full SSH access to clusters for customers
• Ability to run custom bootstrap code on node start

Page 24: Spark on Yarn

Cluster Management Interface

Page 25: Spark on Yarn

Interfaces/APIs to submit Spark Jobs

Page 26: Spark on Yarn

Using SparkSQL - Command UI

Page 27: Spark on Yarn

Using SparkSQL - Results

Page 28: Spark on Yarn

Using SparkSQL - Notebook

• SQL, Python, and Scala code can be input

Page 29: Spark on Yarn

Using SparkSQL - REST api - scala

curl --silent -X POST \
  -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{ "program" : "val s = new org.apache.spark.sql.hive.HiveContext(sc); s.sql(\"show tables\").collect.foreach(println)", "language" : "scala", "command_type" : "SparkCommand" }' \
  https://api.qubole.net/api/latest/commands

Page 30: Spark on Yarn

Using SparkSQL - REST api - sql

curl --silent -X POST \
  -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{ "program" : "show tables", "language" : "sql", "command_type" : "SparkCommand" }' \
  https://api.qubole.net/api/latest/commands

NOT RELEASED YET

Page 31: Spark on Yarn

Using SparkSQL - qds-sdk-py / java

from qds_sdk.commands import SparkCommand

with open("test_spark.py") as f:
    code = f.read()
cmd = SparkCommand.run(language="python", label="spark", program=code)
results = cmd.get_results()

Page 32: Spark on Yarn

Using SparkSQL - Cluster config

Page 33: Spark on Yarn

Spark UI container info

Page 34: Spark on Yarn

JobServer

Page 35: Spark on Yarn

Persistent History Server

Page 36: Spark on Yarn

Spark Hive Integration

Page 37: Spark on Yarn

What is involved?

• Spark programs should be able to access the Hive metastore

• Other Qubole services (Hive, Presto, Pig, etc.) can be producers or consumers of the data and metadata

Page 38: Spark on Yarn

Basic cluster organization

• DB instance in Qubole account
• SSH tunnel from the master to the metastore DB
• Metastore server running on the master on port 10000
• On master and slave nodes, hive-site.xml contains:
  hive.metastore.uris=thrift://master_ip:10000
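With that in place, a Spark program can reach the metastore through HiveContext, mirroring the earlier "show tables" REST example (assumes hive-site.xml is on the classpath):

  // Picks up hive.metastore.uris from hive-site.xml on the classpath.
  val hive = new org.apache.spark.sql.hive.HiveContext(sc)
  hive.sql("show tables").collect().foreach(println)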

Page 39: Spark on Yarn

Hosted metastore

Page 40: Spark on Yarn

Questions

Page 41: Spark on Yarn

Problems

• YARN overhead should be 20% (TPC-H)
• Parquet needs a higher PermGen
• Cached tables use the actual table
• alter table recover partitions not supported
• VPC cluster has slow access to the metastore
• SchemaRDD gone - old jars don't run
• Hive jars needed on the system classpath

