Page 1

Basic Spark Programming and Performance Diagnosis

Jinliang Wei
15-719 Spring 2017
Recitation

Page 2

Today's Agenda

• PySpark shell and submitting jobs
• Basic Spark programming – Word Count
• How does Spark execute your program?
• Spark monitoring web UI
• What is shuffle and how does it work?
• Spark programming caveats
• Generally good practices
• Important configuration parameters
• Basic performance diagnosis

Page 3

PySpark shell and submitting jobs

Page 4

Launch a Spark + HDFS Cluster on EC2

• First, set environment variables:
  – AWS_SECRET_ACCESS_KEY
  – AWS_ACCESS_KEY_ID
• Get spark-ec2-setup
• Launch a cluster with 4 slave nodes:
    ./spark-ec2 -k <key-id> -i <identity-file> \
      -t m4.xlarge -s 4 -a ami-6d15ec7b \
      --ebs-vol-size=200 --ebs-vol-num=1 \
      --ebs-vol-type=gp2 \
      --spot-price=<proper-price> \
      launch SparkCluster
• Login as root
• Replace launch with destroy to terminate the cluster

Page 5

Your Standalone Spark Cluster

[Diagram: Master node coordinating Worker1 and Worker2]

• Spark master is the cluster manager (analogous to YARN/Mesos).
• Workers are sometimes referred to as slaves.
• When your application is submitted, worker nodes run executors, which are processes that run computations and store data for your application.
• By default, an executor uses all cores on a worker node.
• Configurable via spark.executor.cores (normally left as default unless too many cores per node).

Page 6

Standalone Spark Master Web UI
http://[master-node-public-ip]:8080
For an overview of the cluster and the state of each worker.

Page 7

PySpark Shell

• Spark is installed under /root/spark
• Launch the PySpark shell:
    /root/spark/bin/pyspark

Page 8

Simple math using the PySpark Shell

• Define a list of numbers: a = [1, 3, 7, 4, 2]
• Create an RDD from that list: rdd_a = sc.parallelize(a)
• Double each element: rdd_b = rdd_a.map(lambda x: x * 2)
• Sum the elements up: c = rdd_b.reduce(lambda x, y: x + y)

Page 9

Submit Applications to Spark

• Suppose you have a Spark program named word_count.py; submit it by running:
    /root/spark/bin/spark-submit \
      [optional arguments to spark-submit] \
      word_count.py \
      [arguments to your program]

Page 10

What happens when you submit your application?

• Your program (driver program) runs in "client" mode – a client outside of the Spark master.
• Spark launches executors on the worker nodes.
• SparkContext sends tasks to the executors to run.

Page 11

Basic Spark Programming – Word Count

Page 12

How to implement a word count w/ map-reduce?

• Problem: given a document, count the occurrences of each word
• Map: take in a chunk of the document, output a list of pairs of (word, 1)
• Shuffle: group KV pairs by their key (word), assign each group to a reducer
• Reduce: sum up the values of each group

Page 13

How to implement it using Spark?

import pyspark

if __name__ == "__main__":
    conf = pyspark.SparkConf().setAppName("WordCount")
    sc = pyspark.SparkContext(conf=conf)

    # Read the input and emit a (word, 1) pair for every token
    text_rdd = sc.textFile("/README.md")
    tokens_rdd = text_rdd.flatMap(
        lambda x: [(a, 1) for a in x.split()])
    # Sum the counts for each word
    count_rdd = tokens_rdd.reduceByKey(lambda x, y: x + y)

    # Bring the results back to the driver
    tokens_count = count_rdd.collect()

    sc.stop()

    # Print the 10 most frequent words
    tokens_count.sort(key=lambda x: x[1], reverse=True)
    count = 0
    for token_tuple in tokens_count:
        print "(%s, %d)" % token_tuple
        count += 1
        if count >= 10:
            break

Page 14

Lazy Evaluation

• Two kinds of operations on an RDD:
  – Transformation: RDD_A -> RDD_B, e.g. flatMap
  – Action: RDD_A -> outside Spark, e.g. collect
• Transformations are "lazily evaluated":
  – The dependency information is recorded when they are called.
  – They are evaluated only when necessary.
• An action causes the RDD and the ones it depends on to be computed.
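For example, here is a minimal sketch of lazy evaluation in the pyspark shell (it assumes the shell's built-in sc; the RDD name is illustrative):

    # The transformation returns immediately; no data is touched yet
    squares_rdd = sc.parallelize(range(10)).map(lambda x: x * x)
    # Only the action below triggers evaluation of the whole lineage
    print squares_rdd.count()   # 10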

Page 15

How does Spark execute your program?

Why should you care? Because you may need to do performance diagnosis and understand the terminology to interpret the Spark monitoring UIs.

Page 16

The lineage graph is built when transformations are invoked

[Diagram: partitions 1–3 of text_rdd flowing through tokens_rdd (narrow dependence) into count_rdd (wide dependence)]

Pipelined execution: a sequence of transformations applied to each record (partition), executed independently of other records (partitions).

Shuffle: every node reads from every other node; might cause a global barrier.

Page 17

An action causes the actual evaluation

• Spark calls this a job.
• If the RDD on which the action was invoked already exists, compute the action directly; otherwise compute the RDD first.
• Computing an RDD recursively computes its parent RDDs.

Page 18

Build a DAG of stages from the lineage graph

• RDDs with narrow dependence between them are grouped into the same stage.
• Stage boundaries are shuffles.
• Each task is scheduled to a core.

[Diagram: stage 1 covers text_rdd -> tokens_rdd; stage 2 covers count_rdd; one task per partition]

Page 19

A stage is computed as a set of parallel tasks

• Each partition is a task.
• You may control the number of partitions of an RDD (see the sketch below):
  – partitionBy(num_partitions)
  – Some operations allow you to explicitly specify the number of partitions
  – Configuration parameter: spark.default.parallelism
• This is where most of the parallelism comes from.
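As a rough sketch of the first two options (reusing the word-count RDD names from earlier; the partition count 32 is just an illustrative value):

    # Some operations take an explicit partition count
    count_rdd = tokens_rdd.reduceByKey(lambda x, y: x + y, numPartitions=32)

    # A pair RDD can be explicitly repartitioned by key
    count_rdd_32 = count_rdd.partitionBy(32)

When neither is given, spark.default.parallelism decides the partition count for shuffle operations.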

Page 20

What's the proper number of partitions for an RDD?

• You want sufficient parallelism and balanced load.
  – Rule of thumb: at least 2 times the number of cores
• You don't want too many tasks, otherwise most of the time will be spent on setting up the tasks.
  – Rule of thumb: at least hundreds of milliseconds per task
• Make sure each partition can fit in memory.
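For example, on the 4-slave m4.xlarge cluster launched earlier (4 vCPUs per node, so 16 worker cores in total), the first rule of thumb suggests using at least 2 x 16 = 32 partitions.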

Page 21

How does Spark run Python code?

• Your Python UDFs are executed in Python processes.
• RDD records need to be transferred between the JVM and Python.
• Serialization could be a performance problem.

Page 22

PySpark "pipelines" Python functions automatically

• If you apply multiple transformations in a series, Spark "fuses" the Python UDFs to avoid multiple transfers between Python and the JVM.
• Example: rdd_x.map(foo).map(bar) (see the sketch below)
  – Function foo(x) takes in a record x and outputs a record y
  – Function bar(y) takes in a record y and outputs a record z
  – Spark automatically creates a function foo_bar(x) that takes in a record x and outputs a record z, which is essentially bar(foo(x)).
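A minimal sketch of what this fusion amounts to (the functions are illustrative and rdd_x is assumed to exist; this shows the effect, not Spark's internal code):

    def foo(x):
        return x + 1            # record x -> record y

    def bar(y):
        return y * 2            # record y -> record z

    # Conceptually, PySpark runs both maps in one pass per record,
    # as if a single fused function had been written:
    def foo_bar(x):
        return bar(foo(x))      # record x -> record z

    rdd_z = rdd_x.map(foo).map(bar)   # each record goes through foo then bar in one Python worker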

Page 23

Spark Monitoring Web UIs

Page 24

Live Monitoring Web UI
http://[master-node-public-ip]:4040
How is my running application doing?

Page 25

History Server
http://[master-node-public-ip]:18080
Visualizing the logs of completed applications.

Page 26

The job view

• Jobs – why is there only one job?

Page 27

Details for a job

• Stages: what operations are in stages 1 and 2?

Page 28

Understand the DAG Visualization

• Dots are RDDs.
• Dots inside the blue box are RDDs in the JVM.
• Text labels are the transformations that generate the RDDs.
  – Problem: PySpark uses some transformations to implement other transformations (reduceByKey is implemented with partitionBy and mapPartitions), so the labels are not exactly the same as your code.
  – But if you know the stage boundaries, you can figure out which operations belong to which stage.

Page 29

Details for a stage

Page 30

Event Timeline

Page 31

Recap: Stage DAG

• Each RDD partition corresponds to a task.
• The number of RDD partitions can often be controlled.

[Diagram: stage 1 covers text_rdd -> tokens_rdd; stage 2 covers count_rdd]

Page 32

What is shuffle write and shuffle read?

What is shuffle and how does it work?

Page 33

What is shuffle and what is it used for?

• Informally, a mechanism that redistributes the partitioned RDD records.
• Informally, it is needed whenever you need records that satisfy a certain condition (e.g. having the same key) to reside in the same partition.

[Example: partitions ("a", 1), ("b", 1), ("d", 1) and ("a", 1), ("b", 1), ("c", 1) are redistributed so equal keys co-reside: ("a", 1), ("a", 1), ("c", 1) and ("b", 1), ("b", 1), ("d", 1)]

Page 34

Operations that may cause a shuffle

• partitionBy
• reduceByKey
• groupByKey
• ...

Page 35

How is shuffle implemented?

• Two implementations: hash shuffle and sort shuffle
• You don't need to know the details for this project. If curious, read this blog post:
  https://0x0fff.com/spark-architecture-shuffle/
• You need to know:
  – Mappers (sources) serialize RDD records and write them to local disk (shuffle write)
  – Reducers (destinations) read their partition of records from remote disks over the network (shuffle read)

Page 36

Shuffle is expensive

• Data is serialized, written to local disk, and communicated over the network.
  – Serialization takes time; disk and network are slow.
• Everyone depends on everyone else: if there is a straggler, everyone has to wait.
• Minimize the number of shuffles in your program.

Page 37

Spark Programming Caveats

Page 38

Understanding Closures

• Informally, a closure is a function together with its surrounding environment at the time the closure is created.
• The driver program sends closures to executors to have them executed.
• RDD operations (closures) that modify variables outside of their scope often cause confusion (generally, don't do that).

Page 39

What's the behavior of this code?

• The driver's counter is captured when the closure is created and is then visible to the executors.
• The global counter that an executor modifies is the executor's local copy, i.e. writes are not seen by the driver.
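The code the slide refers to is not reproduced in this transcript; a minimal reconstruction of that kind of counter example (following the standard closure pitfall in the Spark programming guide, assuming sc from the pyspark shell) might look like:

    counter = 0
    rdd = sc.parallelize(range(10))

    def increment(x):
        global counter
        counter += x     # updates the copy shipped to the executor, not the driver's variable

    rdd.foreach(increment)
    print counter        # still 0 on the driver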

Page 40

Broadcast variable

• broadcastVar.value can be read by any worker anytime after it's created (see the sketch below).
• Read-only variable (to avoid dealing with concurrent writes).
• One copy per executor.
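A minimal broadcast sketch (assuming sc from the pyspark shell; the lookup table is illustrative):

    # Ship a read-only lookup table to each executor once
    lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

    rdd = sc.parallelize(["a", "b", "c", "a"])
    # Tasks read the shared value instead of capturing the dict in every closure
    mapped = rdd.map(lambda k: lookup.value.get(k, 0))
    print mapped.collect()   # [1, 2, 3, 1]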

Page 41

Ways to communicate values from the driver to executors or tasks

• Create RDDs
• Closure
• Broadcast variable
• Question: when should you use each one?
  – Closure: small values that are only useful for this function
  – Broadcast variable: more efficient for larger variables and when you want to reuse the values across stages
  – RDDs: when the variable is too large

Page 42

How do executors send values to the driver?

• Use RDD actions
• Accumulators (see the sketch below)
  – Only allow associative and commutative operations
  – Because concurrent writes can then be dealt with easily
  – Read the Spark programming guide for details
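A minimal accumulator sketch (assuming sc from the pyspark shell), following the pattern in the Spark programming guide:

    # Executors may only add to an accumulator; the driver reads the result
    accum = sc.accumulator(0)
    sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
    print accum.value   # 10, visible on the driver once the action completes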

Page 43

RDD Persistence

• Spark is in-memory – what does that mean?
• Spark is capable of persisting (or caching) an RDD in memory across actions (jobs).
  – Hadoop can't.
  – Spark may persist RDDs on disk too.
• If RDDs are not persisted, they are recomputed for each action.
  – RDDs are computed at most once per job.
• But you need to tell Spark which RDDs to persist.
• Spark sometimes persists an RDD automatically, but this is not very well specified.

Page 44

Persisting an RDD

• persist() options (see the sketch below):
  – MEMORY_ONLY: default; if not enough memory, recompute it
  – MEMORY_AND_DISK: if not enough memory, persist on disk
  – DISK_ONLY: persist on disk
  – A few others
• cache() is persist(MEMORY_ONLY)
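A minimal persistence sketch (reusing count_rdd from the word-count example; the storage level is just one of the options above):

    from pyspark import StorageLevel

    # Keep the counts around so later actions reuse them instead of
    # re-reading and re-shuffling the input
    count_rdd.persist(StorageLevel.MEMORY_AND_DISK)

    top10 = count_rdd.takeOrdered(10, key=lambda x: -x[1])  # first action computes and persists
    total = count_rdd.count()                               # second action reads the persisted copy

    count_rdd.unpersist()   # release the storage when done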

Page 45

Generally Good Practices

• Generally, avoid shuffles if you can.
  – A shuffle might be worth doing if it increases parallelism, e.g. more partitions, better load balancing.
• For shuffles, pick the right operators (see the sketch below).
  – Avoid transferring the entire RDD over the network.
  – Some operations do local aggregation before shuffling.
  – E.g. groupByKey() + mapValues() vs. reduceByKey()
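A minimal sketch of that comparison (reusing tokens_rdd, the (word, 1) pairs from the word-count example):

    # Shuffles every (word, 1) record, then sums on the reducer side
    counts_slow = tokens_rdd.groupByKey().mapValues(lambda ones: sum(ones))

    # Sums within each partition first (map-side combine), so far fewer
    # records cross the network during the shuffle
    counts_fast = tokens_rdd.reduceByKey(lambda x, y: x + y)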

Page 46

Spark Properties – Per-Application Properties

Page 47

The ones that you should understand

• spark.executor.memory: amount of memory to use per executor process (JVM heap size)
  – Optional reading on Spark memory management: http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
• spark.default.parallelism: default number of partitions in RDDs returned by certain operations, when not set by the user
  – You can explicitly control the number of partitions in most cases
• More details (optional for Project 2): http://spark.apache.org/docs/latest/configuration.html#spark-properties

Page 48

How to set those properties

• When calling spark-submit, use the option --conf "config.property=value". One property per --conf.
• Can be set programmatically using SparkConf when creating the SparkContext (doesn't work for all properties); see the sketch below.
• conf/spark-defaults.conf (don't do that for Project 2)
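A minimal sketch of the programmatic route (the property values are illustrative):

    import pyspark

    conf = pyspark.SparkConf() \
        .setAppName("WordCount") \
        .set("spark.executor.memory", "4g") \
        .set("spark.default.parallelism", "32")
    sc = pyspark.SparkContext(conf=conf)

The same properties could instead be passed on the spark-submit command line, one per --conf flag, e.g. --conf "spark.executor.memory=4g".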

Page 49

Basic Performance Diagnosis: What do I do if my application is running slowly?

Page 50

Q1: Which job and stage is the bottleneck?

• Check the Spark monitoring UIs
• Identify the bottleneck stage

Page 51

Possible sources of bottleneck

• CPUs are not fully utilized
  – Network I/O
  – Disk I/O
  – Insufficient parallelism
  – Imbalance
• CPUs are highly utilized

Page 52

Q1: Are you fully utilizing your CPUs?

• vmstat 2 20
  – One update every 2 seconds, for 20 updates
  – The first line is an average since the machine was booted
  – Good for a quick overview of the machine

Page 53

Q2: Why are my CPUs not fully utilized?

• Generally you can find answers in the monitoring web UI.
• Insufficient parallelism or imbalance?
  – Check the per-stage timeline.
• Blocked on network or disk I/O?
  – Check the shuffle reads and writes.
• How do you optimize for those problems?

Page 54

Q2: My CPUs are highly utilized, so?

• Which functions do your CPUs spend their time on? Answer: profile your code.
• Spark Python profiler:
  --conf "spark.python.profile=true"
  --conf "spark.python.profile.dump=/root/spark_profile"
• More details:
  http://spark.apache.org/docs/latest/configuration.html
  https://docs.python.org/2/library/profile.html
• If most of the time is spent in the JVM, this is not useful and it's beyond your control.

Page 55

Basic Performance Diagnosis: What do I do if I get Out-Of-Memory (OOM) exceptions?

OOM can manifest as other exceptions.

Page 56

A Common Pitfall

• Driver and executor memory sizes are configurable, and the defaults are 1g.
• You can configure them:
  – spark.driver.memory
  – spark.executor.memory

Page 57

Size of a partition matters

• Informally, for each task, the executor loads the corresponding partition into memory.
• If the partition cannot fit in memory, you get OOMs.
• Then you want more and smaller partitions.
• RDD partitioning is in units of records; if a single record is huge, repartitioning won't help.

