
Intro to Spark and Spark SQL

Description:
"Intro to Spark and Spark SQL" talk by Michael Armbrust of Databricks at AMP Camp 5
Transcript
Page 1: Intro to Spark and Spark SQL

Intro to Spark and Spark SQL

Michael Armbrust - @michaelarmbrust

AMP Camp 2014

Page 2: Intro to Spark and Spark SQL

What is Apache Spark?

Fast and general cluster computing system, interoperable with Hadoop and included in all major distros.

Improves efficiency through:
>  In-memory computing primitives
>  General computation graphs
Up to 100× faster (2-10× on disk)

Improves usability through:
>  Rich APIs in Scala, Java, Python
>  Interactive shell
2-5× less code

Page 3: Intro to Spark and Spark SQL

Spark Model

Write programs in terms of transformations on distributed datasets.

Resilient Distributed Datasets (RDDs)
>  Collections of objects that can be stored in memory or disk across a cluster
>  Parallel functional transformations (map, filter, …)
>  Automatically rebuilt on failure
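A minimal sketch of this model (assuming an existing SparkContext named sc, as in the later examples): build an RDD, transform it lazily, cache it, then run an action.

val nums    = sc.parallelize(1 to 1000000)   // distributed collection of Ints
val doubled = nums.map(_ * 2)                // transformation (lazy)
val big     = doubled.filter(_ > 10)         // another lazy transformation
big.cache()                                  // keep the result in memory across actions
println(big.count())                         // action: triggers the actual computation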

Page 4: Intro to Spark and Spark SQL

More than Map & Reduce

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
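A minimal sketch of two of the richer operations, reduceByKey and join (assuming an existing SparkContext sc; the pair-RDD import is needed outside the shell in this era of Spark):

import org.apache.spark.SparkContext._       // pair-RDD operations (reduceByKey, join)

val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
                .reduceByKey(_ + _)          // ("a", 2), ("b", 1)
val ages   = sc.parallelize(Seq(("a", 30), ("b", 25)))
val joined = counts.join(ages)               // ("a", (2, 30)), ("b", (1, 25))
joined.collect().foreach(println)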

Page 5: Intro to Spark and Spark SQL

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")             // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))     // Transformed RDD
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("foo")).count()           // Action
messages.filter(_.contains("bar")).count()
. . .

[Diagram: the driver ships tasks to workers, each holding a block of the input lines and a cached partition of messages, and collects the results.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 6: Intro to Spark and Spark SQL

A General Stack

[Stack diagram: the Spark core engine with Spark Streaming (real-time), Spark SQL, GraphX (graph), MLlib (machine learning), and more built on top.]

Spark SQL

Page 7: Intro to Spark and Spark SQL

Powerful Stack – Agile Development

[Bar chart: non-test, non-example source lines for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark.]

Page 8: Intro to Spark and Spark SQL

Powerful Stack – Agile Development

[Same bar chart of non-test, non-example source lines, now with Spark Streaming shown on top of Spark.]

Page 9: Intro to Spark and Spark SQL

Powerful Stack – Agile Development

[Same bar chart, now with Streaming and Spark SQL shown on top of Spark.]

Page 10: Intro to Spark and Spark SQL

Powerful Stack – Agile Development

[Same bar chart, now with Streaming, Spark SQL, and GraphX shown on top of Spark.]

Page 11: Intro to Spark and Spark SQL

Powerful Stack – Agile Development

[Same bar chart, now with Streaming, Spark SQL, GraphX, and "Your App?" shown on top of Spark.]

Page 12: Intro to Spark and Spark SQL

Community Growth

[Chart: growth across Spark releases 0.6.0, 0.7.0, 0.8.0, 0.9.0, 1.0.0, and 1.1.0 (y-axis 0-200).]

Page 13: Intro to Spark and Spark SQL

Not Just for In-Memory Data

              Hadoop Record    Spark 100 TB    Spark 1 PB
Data Size     102.5 TB         100 TB          1000 TB
Time          72 min           23 min          234 min
# Cores       50,400           6,592           6,080
Rate          1.42 TB/min      4.27 TB/min     4.27 TB/min
Environment   Dedicated        Cloud (EC2)     Cloud (EC2)

Page 14: Intro to Spark and Spark SQL

SQL Overview

•  Newest component of Spark, initially contributed by Databricks (< 1 year old)
•  Tightly integrated way to work with structured data (tables with rows/columns)
•  Transform RDDs using SQL
•  Data source integration: Hive, Parquet, JSON, and more

Page 15: Intro to Spark and Spark SQL

Relationship to Shark

Shark modified the Hive backend to run over Spark, but had two challenges:
>  Limited integration with Spark programs
>  Hive optimizer not designed for Spark

Spark SQL reuses the best parts of Shark:

Borrows
•  Hive data loading
•  In-memory column store

Adds
•  RDD-aware optimizer
•  Rich language interfaces

Page 16: Intro to Spark and Spark SQL

Adding Schema to RDDs

Spark + RDDs Functional transformations on partitioned collections of opaque objects.

SQL + SchemaRDDs Declarative transformations on partitioned collections of tuples.

[Diagram: an RDD as a partitioned collection of opaque User objects vs. a SchemaRDD as a partitioned collection of (Name, Age, Height) rows.]
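A minimal illustration of the contrast, borrowing the Person RDD and sqlContext defined on the later "RDDs as Relations (Scala)" slide:

// Functional transformation on opaque Person objects:
val adultNames    = people.filter(p => p.age >= 18).map(_.name)
// Declarative transformation on rows with a known schema:
val adultNamesSql = sqlContext.sql("SELECT name FROM people WHERE age >= 18")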

Page 17: Intro to Spark and Spark SQL

SchemaRDDs: More than SQL

[Diagram: the SchemaRDD as a unified interface for structured data, connecting SQL and Hive QL front ends, JSON and Parquet data, MLlib, and JDBC/ODBC. Image credit: http://barrymieny.deviantart.com/]

Page 18: Intro to Spark and Spark SQL

Getting Started: Spark SQL

SQLContext / HiveContext
•  Entry point for all SQL functionality
•  Wraps/extends the existing SparkContext

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

Page 19: Intro to Spark and Spark SQL

Example Dataset

A text file filled with people's names and ages:

Michael, 30
Andy, 31
…

Page 20: Intro to Spark and Spark SQL

RDDs as Relations (Python)

# Load a text file and convert each line to a Row.
from pyspark.sql import Row

lines = sc.textFile("examples/…/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the SchemaRDD as a table.
peopleTable = sqlCtx.inferSchema(people)
peopleTable.registerAsTable("people")

Page 21: Intro to Spark and Spark SQL

RDDs as Relations (Scala)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people =
  sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")

Page 22: Intro to Spark and Spark SQL

RDDs as Relations (Java)

public class Person implements Serializable {
  private String _name;
  private int _age;
  public String getName() { return _name; }
  public void setName(String name) { _name = name; }
  public int getAge() { return _age; }
  public void setAge(int age) { _age = age; }
}

JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(sc);
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
  new Function<String, Person>() {
    public Person call(String line) throws Exception {
      String[] parts = line.split(",");
      Person person = new Person();
      person.setName(parts[0]);
      person.setAge(Integer.parseInt(parts[1].trim()));
      return person;
    }
  });
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);

Page 23: Intro to Spark and Spark SQL

Querying Using SQL

# SQL can be run over SchemaRDDs that have been registered as a table.
teenagers = sqlCtx.sql("""
  SELECT name FROM people WHERE age >= 13 AND age <= 19""")

# The results of SQL queries are RDDs and support all the normal
# RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)

Page 24: Intro to Spark and Spark SQL

Existing Tools, New Data Sources

Spark SQL includes a server that exposes its data using JDBC/ODBC
•  Query data from HDFS/S3
•  Including formats like Hive/Parquet/JSON*
•  Support for caching data in-memory

* Coming in Spark 1.2

Page 25: Intro to Spark and Spark SQL

Caching Tables In-Memory

Spark SQL can cache tables using an in-memory columnar format:
•  Scan only required columns
•  Fewer allocated objects (less GC)
•  Automatically selects the best compression

cacheTable("people")
schemaRDD.cache()  (*requires Spark 1.2)
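A minimal sketch using the Scala sqlContext and the "people" table registered on the earlier slides (cacheTable/uncacheTable mark the table for columnar caching and release it again):

sqlContext.cacheTable("people")              // mark the table for in-memory columnar caching
val teens = sqlContext.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")
teens.collect()                              // scans populate and then reuse the cached columns
sqlContext.uncacheTable("people")            // release the cached columns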

Page 26: Intro to Spark and Spark SQL

Caching Comparison

Spark MEMORY_ONLY caching vs. SchemaRDD columnar caching

[Diagram: MEMORY_ONLY caching stores whole User objects with boxed fields such as java.lang.String, while columnar caching packs the Name, Age, and Height columns into ByteBuffers.]

Page 27: Intro to Spark and Spark SQL

Language Integrated UDFs

registerFunction("countMatches",
  lambda (pattern, text): re.subn(pattern, '', text)[1])

sql("SELECT countMatches('a', text)…")
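The same idea sketched in Scala, assuming the registerFunction method available on SQLContext in this era of Spark (the documents table and text column are hypothetical):

sqlContext.registerFunction("countMatches",
  (pattern: String, text: String) => pattern.r.findAllIn(text).length)
sqlContext.sql("SELECT countMatches('a', text) FROM documents")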

Page 28: Intro to Spark and Spark SQL

SQL and Machine Learning

training_data_table = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
    FROM Users  u
    JOIN Events e ON u.userId = e.userId""")

def featurize(u):
  return LabeledPoint(u.action, [u.age, u.latitude, u.longitude])

# SQL results are RDDs, so they can be used directly in MLlib.
training_data = training_data_table.map(featurize)
model = LogisticRegressionWithSGD.train(training_data)

Page 29: Intro to Spark and Spark SQL

Machine Learning Pipelines

// training: {eventId: String, features: Vector, label: Int}
val training = parquetFile("/path/to/training")
val lr = new LogisticRegression().fit(training)

// event: {eventId: String, features: Vector}
val event = parquetFile("/path/to/event")
val prediction =
  lr.transform(event).select('eventId, 'prediction)

prediction.saveAsParquetFile("/path/to/prediction")

Page 30: Intro to Spark and Spark SQL

Reading Data Stored in Hive

from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)

hiveCtx.hql("""
  CREATE TABLE IF NOT EXISTS src (key INT, value STRING)""")

hiveCtx.hql("""
  LOAD DATA LOCAL INPATH 'examples/…/kv1.txt' INTO TABLE src""")

# Queries can be expressed in HiveQL.
results = hiveCtx.hql("FROM src SELECT key, value").collect()

Page 31: Intro to Spark and Spark SQL

Parquet Compatibility

Native support for reading data in Parquet:
•  Columnar storage avoids reading unneeded data
•  RDDs can be written to Parquet files, preserving the schema
•  Convert other, slower formats into Parquet for repeated querying

Page 32: Intro to Spark and Spark SQL

Using Parquet

# SchemaRDDs can be saved as Parquet files, maintaining the
# schema information.
peopleTable.saveAsParquetFile("people.parquet")

# Read in the Parquet file created above. Parquet files are
# self-describing, so the schema is preserved. The result of
# loading a Parquet file is also a SchemaRDD.
parquetFile = sqlCtx.parquetFile("people.parquet")

# Parquet files can be registered as tables and used in SQL.
parquetFile.registerAsTable("parquetFile")
teenagers = sqlCtx.sql("""
  SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19""")

Page 33: Intro to Spark and Spark SQL

{JSON} Support

•  Use jsonFile or jsonRDD to convert a collection of JSON objects into a SchemaRDD
•  Infers and unions the schema of each record
•  Maintains nested structures and arrays

Page 34: Intro to Spark and Spark SQL

{JSON} Example

# Create a SchemaRDD from the file(s) pointed to by path
people = sqlContext.jsonFile(path)

# Visualize the inferred schema with printSchema().
people.printSchema()
# root
#  |-- age: integer
#  |-- name: string

# Register this SchemaRDD as a table.
people.registerTempTable("people")

Page 35: Intro to Spark and Spark SQL

Data Sources API

Allows easy integration with new sources of structured data:

CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (
  path "./episodes.avro"
)

https://github.com/databricks/spark-avro

Page 36: Intro to Spark and Spark SQL

Efficient Expression Evaluation

Interpreting expressions (e.g., "a + b") can be very expensive on the JVM:
•  Virtual function calls
•  Branches based on expression type
•  Object creation due to primitive boxing
•  Memory consumption by boxed primitive objects

Page 37: Intro to Spark and Spark SQL

Interpreting "a + b"

1.  Virtual call to Add.eval()
2.  Virtual call to a.eval()
3.  Return boxed Int
4.  Virtual call to b.eval()
5.  Return boxed Int
6.  Integer addition
7.  Return boxed result

[Diagram: an Add expression node with Attribute a and Attribute b as children.]
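An illustrative sketch (simplified, not the actual Catalyst classes) of the interpreted expression tree behind that list; every eval() is a virtual call returning a boxed value:

trait Expression { def eval(row: Array[Any]): Any }

case class Attribute(ordinal: Int) extends Expression {
  def eval(row: Array[Any]): Any = row(ordinal)               // boxed value out of the row
}

case class Add(left: Expression, right: Expression) extends Expression {
  def eval(row: Array[Any]): Any =                            // two more virtual calls,
    left.eval(row).asInstanceOf[Int] + right.eval(row).asInstanceOf[Int]  // unbox, add, re-box
}

// Interpreting "a + b" over a two-column row:
val expr = Add(Attribute(0), Attribute(1))
println(expr.eval(Array[Any](1, 2)))                          // 3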

Page 38: Intro to Spark and Spark SQL

Using Runtime Reflection

def generateCode(e: Expression): Tree = e match {
  case Attribute(ordinal) =>
    q"inputRow.getInt($ordinal)"
  case Add(left, right) =>
    q"""
      {
        val leftResult = ${generateCode(left)}
        val rightResult = ${generateCode(right)}
        leftResult + rightResult
      }
    """
}

Page 39: Intro to Spark and Spark SQL

Code Generating "a + b"

val left: Int = inputRow.getInt(0)
val right: Int = inputRow.getInt(1)
val result: Int = left + right
resultRow.setInt(0, result)

•  Fewer function calls
•  No boxing of primitives

Page 40: Intro to Spark and Spark SQL

Performance Comparison

[Bar chart, milliseconds: evaluating 'a+a+a' one billion times with interpreted evaluation, hand-written code, and code generated with Scala reflection.]

Page 41: Intro to Spark and Spark SQL

Performance

[Bar chart: TPC-DS query runtimes, Shark vs. Spark SQL, across queries spanning deep analytics and interactive reporting workloads (queries 34, 46, 59, 79, 19, 42, 52, 55, 63, 68, 73, 98, 27, 3, 43, 53, 7, 89; y-axis 0-450).]

Page 42: Intro to Spark and Spark SQL

What’s Coming in Spark 1.2?

•  MLlib pipeline support for SchemaRDDs
•  New APIs for developers to add external data sources
•  Full support for Decimal and Date types
•  Statistics for in-memory data
•  Lots of bug fixes and improvements to Hive compatibility

Page 43: Intro to Spark and Spark SQL

Questions?

