Apache Spark Bricks.pdf · Apache Spark Easy and Fast Big Data Analytics Pat McDonough. ......

Date post: 21-May-2020
Apache Spark Easy and Fast Big Data Analytics Pat McDonough
Transcript
Page 1: Apache Spark Bricks.pdf · Apache Spark Easy and Fast Big Data Analytics Pat McDonough. ... Databricks & Datastax Apache Spark is packaged as part of Datastax Enterprise Analytics

Apache Spark: Easy and Fast Big Data Analytics

Pat McDonough

Page 2:

Founded by the creators of Apache Spark, out of UC Berkeley’s AMPLab

Fully committed to 100% open source Apache Spark

Support and Grow the Spark Community and Ecosystem

Building Databricks Cloud

Page 3:

Databricks & Datastax

Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5

Databricks & Datastax Have Partnered for Apache Spark Engineering and Support

Page 4:

Big Data Analytics: Where We’ve Been

• 2003 & 2004 - Google GFS & MapReduce Papers are Precursors to Hadoop

• 2006 & 2007 - Google Bigtable and Amazon Dynamo Papers are Precursors to Cassandra, HBase, Others

Page 5:

Big Data Analytics: A Zoo of Innovation

Page 9:

What's Working?

Many Excellent Innovations Have Come From Big Data Analytics:

• Distributed & Data Parallel is disruptive ... because we needed it

• We Now Have Massive throughput… Solved the ETL Problem

• The Data Hub/Lake Is Possible

Page 10:

What Needs to Improve? Go Beyond MapReduce

MapReduce is a Very Powerful and Flexible Engine

Processing Throughput Previously Unobtainable on Commodity Equipment

But MapReduce Isn’t Enough:

• Essentially Batch-only

• Inefficient with respect to memory use, latency

• Too Hard to Program

Page 11:

What Needs to Improve? Go Beyond (S)QL

SQL Support Has Been A Welcome Interface on Many Platforms

And in many cases, a faster alternative

But SQL Is Often Not Enough:

• Sometimes you want to write real programs (Loops, variables, functions, existing libraries) but don’t want to build UDFs.

• Machine Learning (see above, plus iterative)

• Multi-step pipelines

• Often an Additional System

Page 12:

What Needs to Improve? Ease of Use

Big Data Distributions Provide a number of Useful Tools and Systems

Choices are Good to Have

But This Is Often Unsatisfactory:

• Each new system has its own configs, APIs, and management; coordinating multiple systems is challenging

• A typical solution requires stringing together disparate systems - we need unification

• Developers want the full power of their programming language

Page 13:

What Needs to Improve? Latency

Big Data systems are throughput-oriented

Some new SQL Systems provide interactivity

But We Need More:

• Interactivity beyond SQL interfaces

• Repeated access of the same datasets (i.e. caching)

Page 14:

Can Spark Solve These Problems?

Page 15:

Apache Spark

Originally developed in 2009 in UC Berkeley’s AMPLab

Fully open sourced in 2010 – now at Apache Software Foundation

http://spark.apache.org

Page 16:

Project Activity

                          June 2013    June 2014
total contributors               68          255
companies contributing           17           50
total lines of code          63,000      175,000

Page 19:

Compared to Other Projects

[Chart: Commits (axis 0 to 1200) and Lines of Code Changed (axis 0 to 300,000); activity in past 6 months]

Spark is now the most active project in the Hadoop ecosystem

Page 20:

Spark on GitHub

So active on GitHub, sometimes we break it

Over 1200 Forks (can’t display Network Graphs)

~80 commits to master each week

So many PRs, we built our own PR UI

Page 22:

Apache Spark - Easy to Use And Very Fast

Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra

Improved Efficiency:

• In-memory computing primitives

• General computation graphs

Up to 100× faster (2-10× on disk)

Improved Usability:

• Rich APIs

• Interactive shell

2-5× less code

Page 23:

Apache Spark - A Robust SDK for Big Data Applications

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Unified System With Libraries to Build a Complete Solution

Full-featured Programming Environment in Scala, Java, Python…

Very developer-friendly, Functional API for working with Data

Runtimes available on several platforms

Page 24:

Spark Is A Part Of Most Big Data Platforms

• All Major Hadoop Distributions Include Spark

• Spark Is Also Integrated With Non-Hadoop Big Data Platforms like DSE

• Spark Applications Can Be Written Once and Deployed Anywhere

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Deploy Spark Apps Anywhere

Page 25:

Cassandra + Spark: A Great Combination

Both are Easy to Use

Spark Can Help You Bridge Your Hadoop and Cassandra Systems

Use Spark Libraries, Caching on top of Cassandra-stored Data

Combine Spark Streaming with Cassandra Storage

Datastax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector

Page 26:

Easy: Get Started Immediately

Interactive Shell

Multi-language support

Python

    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

Scala

    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()

Java

    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("ERROR"); }
    }).count();

Page 27:

Easy: Clean API

Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

• Automatically rebuilt on failure

Operations

• Transformations (e.g. map, filter, groupBy)

• Actions (e.g. count, collect, save)

Write programs in terms of transformations on distributed datasets
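The split between transformations and actions can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark's implementation: `ToyRDD` and `load_lines` are hypothetical names invented here, and the sample log lines are made up. Transformations only record what to do; an action runs the recorded chain.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: it records transformations
    and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source  # callable that returns an iterable of records
        self._steps = steps    # tuple of (kind, function) pairs

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, f):
        return ToyRDD(self._source, self._steps + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._source, self._steps + (("filter", f),))

    def _compute(self):
        # Inside a method, map and filter refer to the Python built-ins.
        data = iter(self._source())
        for kind, f in self._steps:
            data = map(f, data) if kind == "map" else filter(f, data)
        return data

    # Actions: run the whole chain and return a plain value.
    def count(self):
        return sum(1 for _ in self._compute())

    def collect(self):
        return list(self._compute())


reads = []  # records every time the source is actually read


def load_lines():
    reads.append("read")
    return ["INFO ok", "ERROR disk", "ERROR net"]


errors = ToyRDD(load_lines).filter(lambda s: s.startswith("ERROR")).map(str.lower)
reads_before_action = len(reads)   # transformations alone read nothing

error_count = errors.count()       # action: runs the chain
error_lines = errors.collect()     # action: runs the chain again
reads_after_actions = len(reads)   # one read per action in this toy
```

In this toy, each action re-reads the source because nothing is kept between actions, which is the cost that caching is meant to avoid.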

Page 28:

Easy: Expressive API

map reduce

Page 29:

Easy: Expressive API

map filter groupBy sort union join leftOuterJoin rightOuterJoin

reduce count fold reduceByKey groupByKey cogroup cross zip

sample take first partitionBy mapWith pipe save ...

Page 30:

Easy: Example – Word Count

Hadoop MapReduce

    public static class WordCountMapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

    public static class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Spark

    val spark = new SparkContext(master, appName, [sparkHome], [jars])
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
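The three steps of the Spark word count (flatMap, map, reduceByKey) can be traced on a single machine. This is a rough sketch in plain Python with no Spark involved; the two sample `lines` are invented for illustration, and the comments name the Spark operation each step stands in for.

```python
from collections import defaultdict
from itertools import chain

# Made-up input standing in for the lines of an HDFS file.
lines = ["to be or not to be", "to see or not to see"]

# flatMap: split every line and flatten the results into one stream of words.
words = chain.from_iterable(line.split(" ") for line in lines)

# map: pair every word with the count 1.
pairs = ((word, 1) for word in words)

# reduceByKey(_ + _): add up the counts that share the same word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

word_counts = dict(counts)
```

On a cluster the same three steps run per partition, with a shuffle before the per-key sum; here everything happens in one process.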

Page 32:

Easy: Works Well With Hadoop

Data Compatibility

• Access your existing Hadoop Data

• Use the same data formats

• Adheres to data locality for efficient processing

Deployment Models

• “Standalone” deployment

• YARN-based deployment

• Mesos-based deployment

• Deploy on existing Hadoop cluster or side-by-side

Page 33:

Example: Logistic Regression

    import numpy
    from math import exp

    # readPoint, D, and iterations are assumed to be defined elsewhere
    data = spark.textFile(...).map(readPoint).cache()

    w = numpy.random.rand(D)

    for i in range(iterations):
        # gradient of the logistic loss, summed over all points
        gradient = data.map(
            lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda x, y: x + y)
        w -= gradient

    print("Final w: %s" % w)
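The same map-then-reduce loop can be run locally to see what each iteration computes. Below is a sketch using only the Python standard library. The four `points` are made-up toy data, `dot` and `point_gradient` are helper names introduced here, and the per-point gradient is the usual logistic-loss form `(1 / (1 + exp(-y * w·x)) - 1) * y * x` with labels in {-1, +1}.

```python
import math
import random

random.seed(0)

# Hypothetical toy data: (features x, label y) with y in {-1, +1}.
points = [
    ((1.0, 2.0), 1.0),
    ((2.0, 1.5), 1.0),
    ((-1.0, -2.0), -1.0),
    ((-2.0, -1.0), -1.0),
]
D = 2
w = [random.random() for _ in range(D)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, x, y):
    # Logistic-loss gradient contributed by a single point.
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


for _ in range(10):
    # "map": one gradient per point; "reduce": element-wise sum.
    grads = [point_gradient(w, x, y) for x, y in points]
    gradient = [sum(column) for column in zip(*grads)]
    w = [wi - gi for wi, gi in zip(w, gradient)]

predictions = [1.0 if dot(w, x) > 0 else -1.0 for x, _ in points]
```

Every iteration scans the full dataset, which is why keeping `data` cached in memory matters for this kind of workload.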

Page 34:

Fast: Using RAM, Operator Graphs

In-memory Caching

• Data Partitions read from RAM instead of disk

Operator Graphs

• Scheduling Optimizations

• Fault Tolerance

[Diagram: operator graph of RDDs A to F connected by map, filter, groupBy, and join, divided into Stage 1, Stage 2, and Stage 3; the legend marks RDDs and cached partitions]

Page 35:

Fast: Logistic Regression Performance

[Chart: Running Time (s), 0 to 4000, versus Number of Iterations (1, 5, 10, 20, 30), for Hadoop and Spark]

Hadoop: 110 s / iteration

Spark: first iteration 80 s, further iterations 1 s

Page 36:

Fast: Scales Down Seamlessly

% of working set in cache    Execution time (s)
Cache disabled               68.8414
25%                          58.0614
50%                          40.7407
75%                          29.7471
Fully cached                 11.5304

Page 37:

Easy: Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data

    msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                    .map(lambda s: s.split("\t")[2]))

[Diagram: HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
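Lineage-based recovery can be pictured as "keep the recipe, not a replica". The sketch below is a plain-Python toy, not how Spark is implemented: `source_partitions`, `lineage`, and `compute_partition` are names made up for this illustration, and the log lines are invented. When one computed partition is lost, only that partition is rebuilt by replaying the recorded steps on its source data.

```python
# Made-up source data, split into two partitions (as if two HDFS blocks).
source_partitions = [
    ["ERROR\tdb\tmysql timeout", "INFO\tweb\tok"],
    ["ERROR\tweb\tphp crash", "WARN\tdb\tslow"],
]

# Lineage: the ordered per-partition steps that derive the result.
lineage = [
    lambda part: [s for s in part if s.startswith("ERROR")],  # filter
    lambda part: [s.split("\t")[2] for s in part],            # map
]


def compute_partition(index):
    """Replay the lineage on one source partition."""
    part = source_partitions[index]
    for step in lineage:
        part = step(part)
    return part


# Materialize every partition once.
materialized = {i: compute_partition(i) for i in range(len(source_partitions))}

# Simulate losing the worker that held partition 1 ...
del materialized[1]

# ... and rebuild only that partition from its lineage.
materialized[1] = compute_partition(1)
```

Partition 0 is never touched during recovery, which is the sense in which recomputation is efficient compared with redoing the whole job.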

Page 38:

How Spark Works

Page 42:

Working With RDDs

[Diagram: RDD → Transformations → RDD → Action → Value]

    textFile = sc.textFile("SomeFile.txt")

    linesWithSpark = textFile.filter(lambda line: "Spark" in line)

    linesWithSpark.count()
    # 74

    linesWithSpark.first()
    # Apache Spark

Pages 43-59:

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(lambda s: s.startswith("ERROR"))
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "mysql" in s).count()
    messages.filter(lambda s: "php" in s).count()

[Diagram, built up step by step across these slides: a Driver and three Workers; each Worker holds one HDFS block (Block 1 to Block 3) and, after the first query, one cached partition (Cache 1 to Cache 3)]

First query ("mysql"): count() is an Action, so the Driver sends tasks to the Workers. Each Worker reads its HDFS block, processes and caches the data, and sends results back to the Driver.

Second query ("php"): the Driver sends tasks again. Each Worker processes from its cache and sends results back to the Driver.

Cache your data ➔ Faster Results

Full-text search of Wikipedia

• 60GB on 20 EC2 machines

• 0.5 sec from cache vs. 20s for on-disk
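The effect of caching on repeated queries can be shown with a small local experiment. This is a plain-Python sketch rather than Spark code: `read_log` and `error_messages` are hypothetical helpers, the log lines are invented, and a counter stands in for the cost of reading from HDFS.

```python
reads = {"count": 0}  # how many times the "log file" was read


def read_log():
    """Hypothetical stand-in for reading the log from HDFS."""
    reads["count"] += 1
    return [
        "ERROR\tdb\tmysql connection refused",
        "INFO\tweb\trequest served",
        "ERROR\tweb\tphp fatal error",
        "ERROR\tdb\tmysql deadlock",
    ]


def error_messages():
    # filter on "ERROR", then keep the third tab-separated field
    return [s.split("\t")[2] for s in read_log() if s.startswith("ERROR")]


# Without caching: each query re-reads and re-filters the log.
uncached_mysql = sum("mysql" in m for m in error_messages())
uncached_php = sum("php" in m for m in error_messages())
reads_without_cache = reads["count"]

# With caching: materialize the messages once, then query them from memory.
reads["count"] = 0
cached_messages = error_messages()
mysql_hits = sum("mysql" in m for m in cached_messages)
php_hits = sum("php" in m for m in cached_messages)
reads_with_cache = reads["count"]
```

Both variants return the same counts; the difference is that the cached variant touches the source once no matter how many queries follow.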

Page 60:

Spark’s Libraries

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Page 61:

Spark SQL

Page 62:

What is Spark SQL?

• Out of the box APIs built on the same system

• SQL interfaces, SchemaRDDs, and a LINQ-like DSL for end users

• An optimizer framework for manipulating trees of relational operators.

• Native support for executing relational queries (SQL) in Spark.

• Optimized integration with external sources

Page 63:

SparkSQL Architecture

Page 64:

Relationship to Shark

Borrows

• Hive data loading code / in-memory columnar representation

• Hardened Spark execution engine

Adds

• RDD-aware optimizer / query planner

• Execution engine

• Language interfaces

Catalyst/SparkSQL is a nearly-from-scratch rewrite that leverages the best parts of Shark

Page 65:

Hive Compatibility

Interfaces to access data and code in the Hive ecosystem:

• Support for writing queries in HQL

• Catalog that interfaces with the Hive MetaStore

• Tablescan operator that uses Hive SerDes

• Wrappers for Hive UDFs, UDAFs, UDTFs

Page 66:

Parquet Support

Native support for reading data stored in Parquet:

• Columnar storage avoids reading unneeded data.

• Nested Data support

• RDDs can be written to Parquet files, preserving the schema.

• Predicate push-down support

Page 67:

JSON Support

Native support for reading data stored in JSON:

• Schema-inference through sampling

• Nested data support
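To make "schema inference through sampling" with nested data concrete, here is a toy sketch in plain Python. It is not Spark SQL's algorithm: `infer_type`, `merge`, and `infer_schema` are names invented here, the type names are illustrative, and widening conflicting types to "string" is an assumption made only for this example.

```python
import json
import random


def infer_type(value):
    """Map a parsed JSON value to an illustrative type name (nested dicts recurse)."""
    if isinstance(value, dict):
        return {key: infer_type(item) for key, item in value.items()}
    if isinstance(value, list):
        return "array"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "null"
    return "string"


def merge(a, b):
    """Combine two inferred types; conflicting types widen to "string" (an assumption)."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, item in b.items():
            merged[key] = merge(a[key], item) if key in a else item
        return merged
    return a if a == b else "string"


def infer_schema(json_lines, sample_size=100, seed=0):
    """Infer a schema from a random sample of JSON records instead of all of them."""
    rng = random.Random(seed)
    sample = rng.sample(json_lines, min(sample_size, len(json_lines)))
    schema = {}
    for line in sample:
        schema = merge(schema, infer_type(json.loads(line)))
    return schema


records = [
    '{"name": "a", "age": 30, "geo": {"lat": 1.5}}',
    '{"name": "b", "geo": {"lat": 2.0, "lon": 3.0}}',
]
schema = infer_schema(records)
```

Sampling trades completeness for speed: a field that appears only in unsampled records would be missing from the inferred schema.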

Page 68:

Built-in Driver

JDBC available OOTB as of Spark 1.1

Page 69:

Optimizations

• In addition to the standard Spark framework’s optimizations…

• Predicate push-down

• Partition pruning

• Code gen

• Automatic Broadcasts (based on statistics)

Page 70:

Example: SparkSQL, Core APIs, and MLlib Working Together

    val trainingDataTable = sql("""
      SELECT e.action, u.age, u.latitude, u.longitude
      FROM Users u
      JOIN Events e ON u.userId = e.userId""")

    // Since `sql` returns an RDD, the results can be easily used in MLlib
    val trainingData = trainingDataTable.map { row =>
      val features = Array[Double](row(1), row(2), row(3))
      LabeledPoint(row(0), features)
    }

    val model = new LogisticRegressionWithSGD().run(trainingData)

Page 71:

Recent Roadmap Updates

Performance and Usability Improvements

• Disk spilling for skewed blocks during cache operations

• Disk spilling during aggregations for PySpark

• “sort-based shuffle”

• Usability improvements for monitoring the performance of long-running or complex jobs

Page 72:

Recent Roadmap Updates

SparkSQL

• JDBC/ODBC server built-in

• Support for loading JSON data directly into Spark’s SchemaRDD format, including automatic schema inference.

• Dynamic bytecode generation significantly speeding up execution for queries that perform complex expression evaluation.

• This release also adds support for registering Python, Scala, and Java lambda functions as UDFs

• Spark 1.1 adds a public types API to allow users to create SchemaRDDs from custom data sources.

• Many, many optimizations (Parquet-specific, cost-based, …)

Page 73:

Recent Roadmap Updates

MLlib

• New library of statistical packages which provides exploratory analytic functions (stratified sampling, correlations, chi-squared tests, creating random datasets…)

• Utilities for feature extraction (Word2Vec and TF-IDF) and feature transformation (normalization and standard scaling).

• Support for nonnegative matrix factorization and SVD via Lanczos.

• Decision tree algorithm has been added in Python and Java.

• Tree aggregation primitive

• Performance improves across the board, with improvements of around 2-3X for many algorithms and up to 5X for large scale decision tree problems.

Page 74:

Recent Roadmap Updates

Spark Streaming

• New data source for Amazon Kinesis

• Apache Flume: a new pull-based mode (simplifying deployment and providing high availability)

• The first of a set of streaming machine learning algorithms is introduced with streaming linear regression.

• Rate limiting has been added for streaming inputs

Page 75:

Thank You!

Visit http://databricks.com: Blogs, Tutorials and more

Questions?

