Apache Spark - Duke University · 2016-04-05 · References: [1] Rubao Lee, Tian Luo, Yin Huai,...

Date post: 22-May-2020
Apache Spark
Prajakta Kalmegh
Duke University


What is Spark?

Not a modified version of Hadoop

Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
» Up to 100x faster (2-10x on disk)

Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

Borrowed slide


Applications

In-memory analytics & anomaly detection (Conviva)

Interactive queries on data streams (Quantifind)

Exploratory log analysis (Foursquare)

Traffic estimation w/ GPS data (Mobile Millennium)

Twitter spam classification (Monarch)

...

Borrowed slide


Why a New Programming Model?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (graph algorithms, machine learning)
» More interactive ad-hoc queries
» More real-time online processing

All three of these apps require fast data sharing across parallel jobs

Borrowed slide

NOTE: What were the workarounds in the MapReduce world? YSmart [1], Stubby [2], PTF [3], HaLoop [4], Twister [5]


Interactive speed

[Figure: monthly query workload of one 3000-node Dremel instance. x-axis: execution time (sec); y-axis: percentage of queries.]

Most queries complete under 10 sec

Source: Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10

Borrowed slide


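Data Sharing in MapReduce

[Figure: iterative jobs (iter. 1, iter. 2, ...) do an HDFS read of their input and an HDFS write of their output at every iteration; interactive queries (query 1, query 2, query 3, ...) each do an HDFS read of the same input to produce result 1, result 2, result 3, ...]

Slow due to replication, serialization, and disk IO

Borrowed slide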


Data Sharing in Spark

[Figure: iterations (iter. 1, iter. 2, ...) share data through distributed memory; after one-time processing of the input, query 1, query 2, query 3, ... run against distributed memory.]

10-100× faster than network and disk

Borrowed slide


RDD: Spark Programming Model

Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory or on disk across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure

Interface
» Clean language-integrated API in Scala
» Can be used interactively from the Scala console

Borrowed slide


More on RDDs

• Immutable: An RDD is a read-only, partitioned collection of records

‣ Checkpointing RDDs with long lineage chains can be done in the background.

‣ Handling stragglers: We can use backup tasks to recompute transformations on RDDs

• Transformations: RDDs are created through deterministic operations on either

‣ data in stable storage or

‣ other RDDs

• Lineage: An RDD has enough information about how it was derived from other datasets to recompute its partitions

• Persistence level: Users can choose a storage strategy for RDDs they will reuse (caching in memory, storing the RDD only on disk, or replicating it across machines) and can also choose a persistence priority for data spills

• Partitioning: Users can ask that an RDD's elements be partitioned across machines based on a key in each record
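
Key-based partitioning can be illustrated with a small sketch in plain Scala, with no Spark dependency. The names `PartitionSketch`, `partitionFor`, and `partitionByKey`, and the sample records, are made up for this illustration; the code mimics hash partitioning on a record's key and should not be read as Spark's own implementation.

```scala
// Toy model of key-based partitioning, in plain Scala (no Spark required).
// All names here are hypothetical and exist only for this sketch.
object PartitionSketch {
  // Map a key to a partition index in [0, numPartitions).
  // hashCode may be negative, so shift the remainder into range.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Group (key, value) records by the partition their key hashes to.
  def partitionByKey[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (key, _) => partitionFor(key, numPartitions) }

  // Invented sample data: (user, page visited).
  val visits: Seq[(String, String)] =
    Seq(("alice", "/home"), ("bob", "/cart"), ("alice", "/pay"), ("carol", "/home"))

  def main(args: Array[String]): Unit = {
    val parts = partitionByKey(visits, numPartitions = 3)
    // Records that share a key always land in the same partition.
    parts.toSeq.sortBy(_._1).foreach { case (index, rows) =>
      println(s"partition $index: ${rows.mkString(", ")}")
    }
  }
}
```

Because every record with the same key ends up in one partition, a later per-key aggregation or join on that key can run without moving data between machines, which is the motivation for letting users control partitioning.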


RDD Transformations and Actions

*http://www.tothenew.com/blog/spark-1o3-spark-internals/

*https://spark.apache.org/docs/1.0.1/cluster-overview.html

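Note: Lazy evaluation is a very important concept. Transformations only record how to build an RDD; nothing is computed until an action asks for a result.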
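
Lazy evaluation can be sketched in plain Scala without Spark. `LazySketch`, `LazyDataset`, and `fromSeq` are hypothetical names invented for this illustration and are not Spark types; the sketch only shows the idea that transformations build a recipe while an action runs it.

```scala
// Toy model of lazy evaluation, in plain Scala (no Spark required).
// `LazyDataset` is a hypothetical class for this sketch, not a Spark type.
object LazySketch {
  // A dataset is just a recipe: a function that produces the elements on demand.
  final class LazyDataset[A](compute: () => Iterator[A]) {
    // Transformations return a new recipe and do no work themselves.
    def map[B](f: A => B): LazyDataset[B] =
      new LazyDataset[B](() => compute().map(f))
    def filter(p: A => Boolean): LazyDataset[A] =
      new LazyDataset[A](() => compute().filter(p))

    // An action runs the recipe and returns a concrete result.
    def collect(): List[A] = compute().toList
  }

  def fromSeq[A](data: Seq[A]): LazyDataset[A] =
    new LazyDataset[A](() => data.iterator)

  def main(args: Array[String]): Unit = {
    var calls = 0 // counts how often the map function actually runs

    val squaresOfEvens = fromSeq(1 to 10)
      .filter(_ % 2 == 0)
      .map { n => calls += 1; n * n }

    println(s"after transformations: calls = $calls") // nothing has run yet
    println(squaresOfEvens.collect())                  // the action triggers the work
    println(s"after collect: calls = $calls")
  }
}
```

In this toy model, calling `collect()` a second time replays the whole recipe from the start, which is the cost that an explicit persistence level is meant to avoid for datasets that are reused.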


DAG of RDDs

*https://trongkhoanguyenblog.wordpress.com/2014/11/27/understand-rdd-operations-transformations-and-actions/


Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))

Borrowed slide
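
The lineage idea can be sketched in plain Scala without Spark. The `Source`, `Filtered`, and `Mapped` classes below are hypothetical stand-ins for the nodes in a lineage chain, not Spark's real RDD classes, and the log lines are invented sample data. The sketch assumes each node knows only its parent and the function to apply, which is enough to rebuild any single partition.

```scala
// Toy model of lineage-based recovery, in plain Scala (no Spark required).
// Class names are hypothetical stand-ins, not Spark's real RDD classes.
object LineageSketch {
  sealed trait Node[A] {
    def compute(partition: Int): Seq[A] // rebuild one partition from the parent
    def describe: String                // human-readable lineage
  }

  // Stands in for data in stable storage, already split into partitions.
  final case class Source(partitions: Vector[Seq[String]]) extends Node[String] {
    def compute(partition: Int): Seq[String] = partitions(partition)
    def describe: String = "Source"
  }

  final case class Filtered[A](parent: Node[A], p: A => Boolean) extends Node[A] {
    def compute(partition: Int): Seq[A] = parent.compute(partition).filter(p)
    def describe: String = parent.describe + " -> Filtered"
  }

  final case class Mapped[A, B](parent: Node[A], f: A => B) extends Node[B] {
    def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
    def describe: String = parent.describe + " -> Mapped"
  }

  // Invented tab-separated log lines, two partitions.
  val logs: Source = Source(Vector(
    Seq("10:01\terror\tdisk full", "10:02\tinfo\tstarted"),
    Seq("10:03\terror\ttimeout")
  ))

  // Keep error lines, then take the third tab-separated field.
  val messages: Node[String] =
    Mapped[String, String](
      Filtered[String](logs, _.contains("error")),
      line => line.split('\t')(2)
    )

  def main(args: Array[String]): Unit = {
    // Pretend both partitions were cached and partition 1 was then lost.
    val cache = scala.collection.mutable.Map(
      0 -> messages.compute(0),
      1 -> messages.compute(1)
    )
    cache.remove(1)
    // Only the lost partition is recomputed, by replaying its lineage.
    val recovered = cache.getOrElseUpdate(1, messages.compute(1))
    println(messages.describe)
    println(recovered)
  }
}
```

The point of the design is that recovery needs no replicated copy of the derived data: the small description of how the data was built is kept, and only the missing partition is recomputed from its parents.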


Representing RDDs

• Graph-based representation. Five components:


Representing RDDs (Dependencies)


Representing RDDs (An example)


Advantages of the RDD model


Other Engine Features: Implementation

• Not covered in detail

• Summary:

• Spark local vs Spark Standalone vs Spark cluster (resource sharing handled by YARN/Mesos)

• Job Scheduling: DAGScheduler vs TaskScheduler

• Interpreter Integration: Ship external instances of variables referenced in a closure along with the closure class to worker nodes in order to give them access to these variables

• Memory Management: deserialized in-memory (fastest) vs serialized in-memory vs on-disk persistence

• Support for Checkpointing: Tradeoff between using lineage to recompute partitions vs checkpointing partitions to stable storage


End of Lecture 20


References:

[1] Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, and Xiaodong Zhang. 2011. YSmart: Yet Another SQL-to-MapReduce Translator. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems (ICDCS '11). IEEE Computer Society, Washington, DC, USA, 25-36.

[2] Harold Lim, Herodotos Herodotou, and Shivnath Babu. 2012. Stubby: a transformation-based optimizer for MapReduce workflows. Proc. VLDB Endow. 5, 11 (July 2012), 1196-1207.

[3] PTF: http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive

[4] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. 2010. HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3, 1-2 (September 2010), 285-296.

[5] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. 2010. Twister: a runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10). ACM, New York, NY, USA, 810-818.

[6] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.

[7] Spark and Shark: <http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data?related=1>

[8] Spark SQL: <http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=2>
