Transcript
Page 1: Lightning Fast Cluster Computing (people.apache.org/~marmbrus/talks/Spark.UIUC.2015.pdf)

Lightning Fast Cluster Computing

Michael Armbrust - @michaelarmbrust Reflections | Projections 2015

Page 2

What is Apache Spark?

Page 3

What is Apache Spark?

Fast and general computing engine for clusters, created by students at UC Berkeley
•  Makes it easy to process large (GB-PB) datasets
•  Support for Java, Scala, Python, R
•  Libraries for SQL, streaming, machine learning, …
•  100x faster than Hadoop Map/Reduce for some applications

Page 4

Spark Model

Write programs in terms of transformations on distributed datasets (a minimal PySpark sketch follows below).

Resilient Distributed Datasets (RDDs)
> Collections of objects that can be stored in memory or on disk across a cluster
> Parallel functional transformations (map, filter, …)
> Automatically rebuilt on failure
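
A minimal PySpark sketch of these ideas, assuming only a running SparkContext named sc (not shown on the slide):

# Build an RDD from a local collection, apply lazy transformations in parallel,
# then trigger execution with an action.
nums = sc.parallelize(range(1, 1001))           # distributed collection of 1..1000
squares = nums.map(lambda x: x * x)             # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)    # transformation (lazy)
print(evens.count())                            # action: runs the computation across the cluster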

Page 5

Example: Log Mining

Load messages from a log file into memory, then interactively search for the problem.

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split("\t")[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()
messages.filter(lambda x: "bar" in x).count()
. . .

[Diagram: the driver sends tasks to three workers; each worker reads one block of the file (Block 1-3), caches its partition of messages in memory (Cache 1-3), and returns results. lines is the base RDD, messages a transformed RDD, count() an action.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 6

Fault Tolerance

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)

[Diagram: lineage graph: input file → map → reduce → filter]

RDDs track lineage info to rebuild lost data (see the sketch below)
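
A minimal sketch of inspecting that lineage from PySpark, assuming a running SparkContext named sc; the input path and tab-separated layout are hypothetical:

# Each RDD remembers the transformations that produced it; toDebugString()
# prints that lineage, and Spark replays it to recompute partitions lost to a failure.
file = sc.textFile("hdfs://...")
counts = file.map(lambda rec: (rec.split("\t")[0], 1)) \
             .reduceByKey(lambda x, y: x + y) \
             .filter(lambda kv: kv[1] > 10)
print(counts.toDebugString())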


Page 8

Speed-up ML Using Memory

[Chart: running time (s), 0 to 4000, vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.]

Page 9

On-Disk Sort Record: Time to sort 100 TB

2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes

Also sorted 1 PB in 4 hours

Source: Daytona GraySort benchmark, sortbenchmark.org

Page 10

Higher-Level Libraries

On top of core Spark:
•  Spark Streaming: real-time
•  Spark SQL: structured data
•  MLlib: machine learning
•  GraphX: graph

Page 11

Seamlessly switch components

// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)

Page 12

Powerful Stack – Agile Development

[Chart: non-test, non-example source lines, 0 to 140,000, for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark]


Page 16

Powerful Stack – Agile Development

[Same chart, with Spark's bar broken out to show the Streaming, SparkSQL, and GraphX libraries added on top of the core ... Your App?]

Page 17

Open Source Ecosystem

[Logos of applications, environments, and data sources that work with Spark]

Page 18

Spark Community

Over 1000 production users, clusters up to 8000 nodes
Many talks online at spark-summit.org

Page 19

Page 20

Get Involved

Check us out and contribute code through GitHub: https://github.com/apache/spark
The best way to get started is to fix a bug
Don't forget to write a test!

Page 21

About Databricks

•  The hardest part of using Spark is managing 100s of machines.
•  Databricks makes this easy.

Founded by the creators of Spark and still the largest contributor.

Page 22

Demo

Using Spark to analyze emoji use on Twitter

Page 23

What's next for Spark?

Page 24

Spark + declarative programming

Creating and Running Spark Programs Faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work

Page 25

DataFrame noun – [dey-tuh-freym]

1.  A distributed collection of rows organized into named columns.

2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
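
A minimal PySpark sketch of the idea, assuming a SQLContext named sqlCtx (as on the surrounding slides) and a hypothetical people.json file with name and age fields:

df = sqlCtx.read.json("people.json")   # distributed rows with named columns
df.printSchema()                       # shows the column names and inferred types

# Select, filter, and aggregate structured data.
df.filter(df.age > 21) \
  .groupBy("name") \
  .avg("age") \
  .show()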

Page 26

Write Less Code: Compute an Average

Hadoop MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (Python):

data = sc.textFile(...).split("\t")
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Page 27

Write Less Code: Compute an Average

Using RDDs

data = sc.textFile(...).split("\t")
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames

sqlCtx.table("people") \
    .groupBy("name") \
    .agg("name", avg("age")) \
    .collect()

Using SQL

SELECT name, avg(age) FROM people GROUP BY name
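
A minimal sketch of running the SQL version from PySpark, assuming the same sqlCtx and that "people" is already registered as a table:

result = sqlCtx.sql("SELECT name, avg(age) FROM people GROUP BY name")
result.collect()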

Page 28

Not Just Less Code: Faster Implementations

[Chart: time to aggregate 10 million int pairs (secs), 0 to 10, for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL]

Page 29

Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram: df0 → tokenizer → df1 → hashingTF → df2 → lr → df3; fitting the Pipeline produces a Pipeline Model containing tokenizer, hashingTF, and lr.model]
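
A minimal follow-on sketch, assuming the fitted model above, the same sqlCtx, and a hypothetical test DataFrame with a "text" column:

# PipelineModel.transform runs the whole fitted pipeline: tokenizer -> hashingTF -> lr.model.
test_df = sqlCtx.createDataFrame([("spark is fast",), ("hello world",)], ["text"])
predictions = model.transform(test_df)
predictions.select("text", "prediction").show()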

Page 30

Optimization happens as late as possible, so Spark SQL can optimize across functions.

Page 31

def add_demographics(events):
    u = sqlCtx.table("users")                        # Load Hive table
    return (events
        .join(u, events.user_id == u.user_id)        # Join on user_id
        .withColumn("city", zipToCity(events.zip)))  # udf adds city column

events = add_demographics(sqlCtx.load("/data/events", "json"))

training_data = events.where(events.city == "Champaign") \
                      .select(events.timestamp).collect()

[Logical Plan: filter on top of a join of the events file with the users table; the join is expensive. Physical Plan: join of scan(events) with a filter over scan(users), so only the relevant users are joined.]

Page 32

def add_demographics(events):
    u = sqlCtx.table("users")                        # Load partitioned Hive table
    return (events
        .join(u, events.user_id == u.user_id)        # Join on user_id
        .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))

training_data = events.where(events.city == "Champaign") \
                      .select(events.timestamp).collect()

[Logical Plan: filter on top of a join of the events file with the users table. Physical Plan: join of scan(events) with a filter over scan(users). Physical Plan with Predicate Pushdown and Column Pruning: join of an optimized scan(events) with an optimized scan(users). The resulting plan can be inspected as sketched below.]
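
A minimal sketch of checking this from PySpark, assuming the events DataFrame defined above; DataFrame.explain(True) prints the logical and physical plans, so the predicate pushdown and column pruning can be inspected directly:

events.where(events.city == "Champaign") \
      .select(events.timestamp) \
      .explain(True)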

Page 33

Plan Optimization & Execution

[Diagram: SQL AST or DataFrame → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]

DataFrames and SQL share the same optimization/execution pipeline

Page 34

Writing Rules as Tree Transformations

1.  Find filters on top of projections.
2.  Check that the filter can be evaluated without the result of the project.
3.  If so, switch the operators.

[Diagram: an Original Plan over the People table with Project(name), Project(id, name), and Filter(id = 1); Filter Push-Down moves the filter below the projection it does not depend on]

Page 35

Prior Work: Optimizer Generators

Volcano / Cascades:
•  Create a custom language for expressing rules that rewrite trees of relational operators.
•  Build a compiler that generates executable code for these rules.

Cons: Developers need to learn this custom language. Language might not be powerful enough.

Page 36

Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}

Page 37

Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}

Partial Function Tree

Page 38

Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}

Find Filter on Project

Page 39

Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}

Check that the filter can be evaluated without the result of the project.

Page 40

Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}

If so, switch the order.

Page 41

Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}

Scala: Pattern Matching

Page 42

Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}

Catalyst: Attribute Reference Tracking

Page 43

Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}

Scala: Copy Constructors

Page 44

Optimizing with Rules

[Diagram: an Original Plan over the People table with Project(name), Project(id, name), and Filter(id = 1) is simplified by Filter Push-Down and then Combine Projection to Project(name) → Filter(id = 1) → People; the final Physical Plan is IndexLookup(id = 1, return: name)]

Page 45

Coming Soon: Datasets

•  Type-safe: operate on domain objects with compiled lambda functions
•  Fast: code-generated encoders for fast serialization
•  Interoperable: easily convert DataFrames to Datasets without boilerplate

val df = ctx.read.json("people.json")

// Convert to custom objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)

// Compute histogram of age by name.
ds.groupBy(_.name).mapGroups {
  case (name, people) =>
    val buckets = new Array[Int](10)
    people.map(_.age).foreach { a =>
      buckets(a / 10) += 1
    }
    (name, buckets)
}

Page 46

Questions?

https://databricks.com/company/careers https://github.com/apache/spark
