Simplifying Big Data Analytics with Apache Spark

Transcript
Page 1: Simplifying Big Data Analytics with Apache Spark

Simplifying Big Data Analysis with Apache Spark Matei Zaharia April 27, 2015

Page 2: Simplifying Big Data Analytics with Apache Spark

What is Apache Spark?

Fast and general cluster computing engine interoperable with Apache Hadoop

Improves efficiency through: –  In-memory data sharing –  General computation graphs

Improves usability through: –  Rich APIs in Java, Scala, Python –  Interactive shell

Up to 100× faster

2-5× less code

Page 3: Simplifying Big Data Analytics with Apache Spark

A General Engine

Spark Core, with built-in libraries on top:
–  Spark SQL (structured data)
–  Spark Streaming (real-time)
–  MLlib (machine learning)
–  GraphX (graph)
…

Page 4: Simplifying Big Data Analytics with Apache Spark

A Large Community

[Bar charts comparing MapReduce, YARN, HDFS, Storm, and Spark: commits in the past year (scale 0–4,500) and lines of code changed in the past year (scale 0–800,000)]

Page 5: Simplifying Big Data Analytics with Apache Spark

About Databricks

Founded by the creators of Spark; remains the largest contributor

Offers a hosted service, Databricks Cloud –  Spark on EC2 with notebooks, dashboards, scheduled jobs

Page 6: Simplifying Big Data Analytics with Apache Spark

This Talk

Introduction to Spark
Built-in libraries
New APIs in 2015
–  DataFrames
–  Data sources
–  ML Pipelines

Page 7: Simplifying Big Data Analytics with Apache Spark

Why a New Programming Model?

MapReduce simplified big data analysis

But users quickly wanted more:
– More complex, multi-pass analytics (e.g. ML, graph)
– More interactive ad-hoc queries
– More real-time stream processing

All 3 need faster data sharing in parallel apps

Page 8: Simplifying Big Data Analytics with Apache Spark

Data Sharing in MapReduce

[Diagram: iterative jobs must write results to HDFS and read them back between iterations, and each ad-hoc query re-reads the input from HDFS to produce its result]

Slow due to data replication and disk I/O

Page 9: Simplifying Big Data Analytics with Apache Spark

What We'd Like

[Diagram: iterations and queries share data through distributed memory after one-time processing of the input]

10-100× faster than network and disk

Page 10: Simplifying Big Data Analytics with Apache Spark

Spark Model

Write programs in terms of transformations on datasets

Resilient Distributed Datasets (RDDs)
–  Collections of objects that can be stored in memory or on disk across a cluster
–  Built via parallel transformations (map, filter, …)
–  Automatically rebuilt on failure
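A minimal sketch of the model (assuming a running SparkContext named sc; the path is illustrative):

lines = sc.textFile("hdfs://path/to/input")   # base RDD
words = lines.flatMap(lambda l: l.split())    # transformation
longs = words.filter(lambda w: len(w) > 5)    # transformation
longs.persist()                               # keep this RDD in memory across uses
print(longs.count())                          # action: triggers computation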

Page 11: Simplifying Big Data Analytics with Apache Spark

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

[Diagram: the driver sends tasks to workers; each worker caches a block of messages (Base RDD → Transformed RDD) and returns results to the driver]

messages.filter(lambda s: "foo" in s).count()   # count() is an action
messages.filter(lambda s: "bar" in s).count()
. . .

Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Page 12: Simplifying Big Data Analytics with Apache Spark

Fault Tolerance

RDDs track the transformations used to build them (their lineage) to recompute lost data

messages = textFile(...).filter(lambda s: "ERROR" in s).map(lambda s: s.split("\t")[2])

HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)
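To inspect this lineage from code, a quick sketch (assuming a SparkContext sc; toDebugString prints the chain of RDDs an RDD was built from):

messages = sc.textFile("hdfs://...") \
             .filter(lambda s: "ERROR" in s) \
             .map(lambda s: s.split("\t")[2])
print(messages.toDebugString())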

Page 13: Simplifying Big Data Analytics with Apache Spark

Example: Logistic Regression

[Chart: running time (s, 0–4,000) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 110 s / iteration; Spark: first iteration 80 s, later iterations 1 s]

Page 14: Simplifying Big Data Analytics with Apache Spark

On-Disk Performance: Time to Sort 100 TB

2013 record: Hadoop – 2,100 machines, 72 minutes
2014 record: Spark – 207 machines, 23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org

Page 15: Simplifying Big Data Analytics with Apache Spark

Supported Operators

map, filter, groupBy, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, flatMap, take, first, partitionBy, pipe, distinct, save, ...
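A small illustration combining a few of these operators (a sketch, assuming a SparkContext sc):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
names = sc.parallelize([("a", "alpha"), ("b", "beta")])
sums = pairs.reduceByKey(lambda x, y: x + y)      # ("a", 4), ("b", 2)
print(sums.join(names).collect())                 # [('a', (4, 'alpha')), ('b', (2, 'beta'))]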

Page 16: Simplifying Big Data Analytics with Apache Spark

Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();

Page 17: Simplifying Big Data Analytics with Apache Spark

User Community

Over 500 production users
Clusters of up to 8,000 nodes, processing 1 PB/day
Single jobs over 1 PB

Page 18: Simplifying Big Data Analytics with Apache Spark

This Talk

Introduction to Spark
Built-in libraries
New APIs in 2015
–  DataFrames
–  Data sources
–  ML Pipelines

Page 19: Simplifying Big Data Analytics with Apache Spark

Built-in Libraries

Spark Core, with built-in libraries on top:
–  Spark SQL (structured data)
–  Spark Streaming (real-time)
–  MLlib (machine learning)
–  GraphX (graph)

Page 20: Simplifying Big Data Analytics with Apache Spark

Key Idea

Instead of having separate execution engines for each task, all libraries work directly on RDDs

Caching + the DAG execution model is enough to run them efficiently

Combining libraries into one program is much faster

Page 21: Simplifying Big Data Analytics with Apache Spark

Spark SQL

Represents tables as RDDs: Tables = Schema + Data

From Hive:

c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):

{"text": "hi", "user": { "name": "matei", "id": 123 }}

c.jsonFile("tweets.json").registerTempTable("tweets")
c.sql("select text, user.name from tweets")

Page 22: Simplifying Big Data Analytics with Apache Spark

Spark Streaming

[Diagram: an input stream chopped into batches along a time axis]

Page 23: Simplifying Big Data Analytics with Apache Spark

Represents streams as a series of RDDs over time

sc.twitterStream(...)
  .map(lambda t: (t.username, 1))
  .reduceByWindow("30s", lambda a, b: a + b)
  .print()
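The code above is deck shorthand; a runnable sketch with the Spark 1.x Python streaming API (assuming a plain text source on localhost:9999 in place of Twitter, one username per line) might look like:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                      # 1-second batches
ssc.checkpoint("checkpoint")                       # required for windowed state
names = ssc.socketTextStream("localhost", 9999)
counts = names.map(lambda name: (name, 1)) \
              .reduceByKeyAndWindow(lambda a, b: a + b,
                                    lambda a, b: a - b,   # inverse function for efficiency
                                    30, 10)               # 30s window, sliding every 10s
counts.pprint()
ssc.start()
ssc.awaitTermination()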

Page 24: Simplifying Big Data Analytics with Apache Spark

MLlib

Vectors, Matrices = RDD[Vector]
Iterative computation

points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10)
model.predict(newPoint)
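A runnable version of this sketch (assuming data.txt holds space-separated numeric vectors; parsePoint is a user-supplied helper, not a library function):

from pyspark.mllib.clustering import KMeans

def parsePoint(line):
    # "1.0 2.5 3.1" -> [1.0, 2.5, 3.1]
    return [float(x) for x in line.split()]

points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10)           # k = 10 clusters
print(model.predict([1.0, 2.5, 3.1]))      # cluster index for a new point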

Page 25: Simplifying Big Data Analytics with Apache Spark

Represents graphs as RDDs of vertices and edges

GraphX


Page 27: Simplifying Big Data Analytics with Apache Spark

Performance vs. Specialized Engines

[Charts comparing Spark with specialized engines:
SQL – response time (sec, 0–50) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem)
ML – response time (min, 0–60) for Mahout, GraphLab, Spark
Streaming – throughput (MB/s/node, 0–35) for Storm and Spark]

Page 28: Simplifying Big Data Analytics with Apache Spark

Combining Processing Types

# Load data using SQL
points = ctx.sql("select latitude, longitude from hive_tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)

Page 29: Simplifying Big Data Analytics with Apache Spark

Combining Processing Types

Separate engines:
[Diagram: prepare, train, and apply each run in a different engine, with an HDFS write and HDFS read between every stage]

Spark:
[Diagram: one HDFS read, then prepare → train → apply in memory, plus interactive analysis]

Page 30: Simplifying Big Data Analytics with Apache Spark

This Talk

Introduction to Spark
Built-in libraries
New APIs in 2015
–  DataFrames
–  Data sources
–  ML Pipelines

Page 31: Simplifying Big Data Analytics with Apache Spark

Main Directions in 2015

Data Science: making it easier for a wider class of users

Platform Interfaces: scaling the ecosystem

Page 32: Simplifying Big Data Analytics with Apache Spark

From MapReduce to Spark

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

The same job in Spark (Scala):

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 33: Simplifying Big Data Analytics with Apache Spark

Beyond MapReduce Experts

Early adopters: users who understand MapReduce & functional APIs
New users: data scientists, statisticians, R users, PyData, …

Page 34: Simplifying Big Data Analytics with Apache Spark

Data Frames

The de facto data processing abstraction for data science (R and Python)

[Chart: Google Trends for "dataframe"]

Page 35: Simplifying Big Data Analytics with Apache Spark

From RDDs to DataFrames


Page 36: Simplifying Big Data Analytics with Apache Spark

Spark DataFrames

Collections of structured data similar to R and pandas data frames

Automatically optimized via Spark SQL
–  Columnar storage
–  Code-generated execution

df = jsonFile("tweets.json")
df[df["user"] == "matei"]
  .groupBy("date")
  .sum("retweets")

[Chart: running time compared for Python, Scala, and DataFrame implementations of the same computation]
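A runnable sketch of the example with the Spark 1.3-era Python API (assuming a SQLContext and a tweets.json with user, date, and retweets fields):

from pyspark.sql import SQLContext

ctx = SQLContext(sc)
df = ctx.jsonFile("tweets.json")           # infer schema from JSON records
df.filter(df["user"] == "matei") \
  .groupBy("date") \
  .sum("retweets") \
  .show()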

Page 37: Simplifying Big Data Analytics with Apache Spark

Optimization via Spark SQL

DataFrame expressions are relational queries, letting Spark inspect them

Spark automatically performs expression optimization, join algorithm selection, columnar storage, and compilation to Java bytecode
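One way to peek at this (a sketch; df as defined above): explain() prints the plan Spark SQL produced for a query.

df.filter(df["user"] == "matei").select("date", "retweets").explain()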

Page 38: Simplifying Big Data Analytics with Apache Spark

Machine Learning Pipelines

High-level API similar to scikit-learn
Operates on DataFrames
Grid search to tune params across a whole pipeline

tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline([tokenizer, tf, lr])

model = pipe.fit(df)

[Diagram: DataFrame → tokenizer → TF → LR → model]
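A hedged sketch of the full wiring with the pyspark.ml API (column names are illustrative; assumes df has text and label columns):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(numFeatures=1000, inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipe = Pipeline(stages=[tokenizer, tf, lr])

model = pipe.fit(df)               # fits all stages as one unit
predictions = model.transform(df)  # adds prediction columns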

Page 39: Simplifying Big Data Analytics with Apache Spark

Spark R Interface

Exposes DataFrames and ML pipelines in R
Parallelizes calls to R code

df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei",],
    "date"),
  sum("retweets"))

Target: Spark 1.4 (June)

Page 40: Simplifying Big Data Analytics with Apache Spark

Main Directions in 2015

Data Science: making it easier for a wider class of users

Platform Interfaces: scaling the ecosystem

Page 41: Simplifying Big Data Analytics with Apache Spark

Data Sources API

Allows plugging smart data sources into Spark
Returns DataFrames usable in Spark apps or SQL
Pushes logic into sources

[Diagram: Spark exchanging DataFrames with external sources such as {JSON}]

Page 42: Simplifying Big Data Analytics with Apache Spark

For example, in a federated query Spark pushes a filter down into a MySQL source:

SELECT * FROM mysql_users u JOIN hive_logs h
WHERE u.lang = 'en'

[Diagram: Spark sends SELECT * FROM users WHERE lang='en' to the MySQL source and joins the result with hive_logs]
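A hedged sketch of consuming such a source from Python with the Spark 1.3-era load API (connection details are illustrative):

from pyspark.sql import SQLContext

ctx = SQLContext(sc)
# Load a JDBC table as a DataFrame; eligible filters can be pushed to the database
users = ctx.load(source="jdbc",
                 url="jdbc:mysql://dbhost/mydb",
                 dbtable="users")
users.filter(users["lang"] == "en").show()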

Page 43: Simplifying Big Data Analytics with Apache Spark


Current Data Sources

Built-in: Hive, JSON, Parquet, JDBC

Community: CSV, Avro, ElasticSearch, Redshift, Cloudant, Mongo, Cassandra, SequoiaDB

List at spark-packages.org

Page 44: Simplifying Big Data Analytics with Apache Spark


Goal: unified engine across data sources, workloads and environments

Page 45: Simplifying Big Data Analytics with Apache Spark

To Learn More

Downloads & docs: spark.apache.org
Try Spark in Databricks Cloud: databricks.com
Spark Summit: spark-summit.org

