Intro to Spark - for Denver Big Data Meetup

Posted on 08-Sep-2014


Transcript

1

Introduction to Spark
Gwen Shapira, Solutions Architect

2

Spark is next-generation MapReduce

3

MapReduce has been around for a while.
It made distributed compute easier.

But, can we do better?

4

MapReduce Issues

• Launching mappers and reducers takes time
• One MR job can rarely do a full computation
• Writing to disk (in triplicate!) between each job
• Going back to the queue between jobs
• No in-memory caching
• No iterations
• Very high latency
• Not the greatest APIs, either

5

Spark: Easy to Develop, Fast to Run

6

Spark Features

• In-memory cache
• General execution graphs
• APIs in Scala, Java and Python
• Integrates with, but does not depend on, Hadoop

7

Why is it better?

• (Much) faster than MR
• Iterative programming – a must-have for ML
• Interactive – allows rapid exploratory analytics
• Flexible execution graph:
  • Map, map, reduce, reduce, reduce, map (see the sketch below)
• High productivity compared to MapReduce
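To make the "flexible execution graph" point concrete, here is a minimal sketch of a multi-stage pipeline (map, map, reduce, map, reduce) expressed as a single Spark job rather than a chain of MapReduce jobs. It assumes an existing SparkContext named sc; the input path and the re-keying by prefix are hypothetical.

val events = sc.textFile("hdfs://.../events")
val counts = events
  .map(_.split(","))                          // map: parse each line
  .map(fields => (fields(0), 1L))             // map: key by the first field
  .reduceByKey(_ + _)                         // reduce: count per key
  .map { case (key, n) => (key.take(3), n) }  // map: re-key by a key prefix
  .reduceByKey(_ + _)                         // reduce: aggregate again
counts.collect().foreach(println)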

8

Word Count

val file = spark.textFile("hdfs://…")

file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

Remember MapReduce WordCount?

9

Agenda

• Concepts
• Examples
• Streaming
• Summary

10

Concepts

11

AMPLab BDAS (Berkeley Data Analytics Stack)

12

CDH5 (simplified)

[Stack diagram: Spark Streaming and MLlib run on Spark; Spark, MR and Impala run on YARN; all on top of HDFS with an in-memory cache]

13

How Spark runs on a Cluster

[Diagram: the Driver sends Tasks to Worker nodes, each holding Data in RAM; the Workers return Results to the Driver]

14

Workflow

• SparkContext in the driver connects to the Master
• The Master allocates resources for the app on the cluster
• The SparkContext (SC) acquires executors on worker nodes
• SC sends the app code (JAR) to the executors
• SC sends tasks to the executors (a minimal driver sketch follows)
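A minimal driver-program sketch of this workflow, in Spark 1.x style: creating the SparkContext is what connects to the master and acquires executors. The master URL, app name and input path are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")  // or "local[*]", "yarn-client", ...
      .setAppName("MyApp")
    val sc = new SparkContext(conf)           // connects to the master, acquires executors

    // Actions from here on are broken into tasks and shipped to the executors
    val lineCount = sc.textFile("hdfs://.../input").count()
    println("Lines: " + lineCount)

    sc.stop()
  }
}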

15

RDD – Resilient Distributed Dataset

• Collection of elements
• Read-only
• Partitioned
• Fault-tolerant
• Supports parallel operations

16

RDD Types

• Parallelized collection
  • parallelize(Seq)
• HDFS files
  • Text, Sequence or any InputFormat
• Both support the same operations (see the sketch below)
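A short sketch of creating both RDD types, assuming an existing SparkContext named sc and a placeholder HDFS path:

// Parallelized collection -- distribute a local Scala Seq across the cluster
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// HDFS file -- one element per line of text
val lines = sc.textFile("hdfs://.../logs")

// Both support the same operations
println(numbers.map(_ * 2).count())
println(lines.filter(_.nonEmpty).count())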

17

Operations

Transformations
• Map
• Filter
• Sample
• Join
• ReduceByKey
• GroupByKey
• Distinct

Actions
• Reduce
• Collect
• Count
• First, Take
• SaveAs
• CountByKey
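A small illustration of the split, assuming an existing SparkContext sc and a hypothetical output path: transformations build new RDDs lazily, actions launch jobs and return results.

val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))

// Transformations (nothing runs yet)
val pairs   = words.map(word => (word, 1))
val counts  = pairs.reduceByKey(_ + _)
val popular = counts.filter { case (_, n) => n > 1 }

// Actions (each one triggers execution)
println(counts.count())                           // number of distinct words
popular.collect().foreach(println)                // bring results back to the driver
counts.saveAsTextFile("hdfs://.../word-counts")   // placeholder output path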

18

Transformations are lazy

19

Lazy transformation

• Find all lines that mention "MySQL"
• Keep only the timestamp portion of the line
• Set the date and hour as the key, 1 as the value
• Now reduce by key and sum the values
• Return the result as an Array so I can print it

For each transformation Spark only notes "find lines, get timestamp…"; only the final step, an action, makes it say "Aha! Finally something to do!" and run the job (sketched in code below).
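A sketch of those steps, assuming log lines start with a timestamp such as "2014-09-08 13:45:12", an existing SparkContext sc, and a placeholder path:

val lines = sc.textFile("hdfs://.../app.log")

val mysqlPerHour = lines
  .filter(_.contains("MySQL"))               // find all lines that mention "MySQL"
  .map(_.split(" ").take(2).mkString(" "))   // keep only the timestamp portion
  .map(ts => (ts.take(13), 1))               // key = date + hour, value = 1
  .reduceByKey(_ + _)                        // sum the values per key

// Everything above only builds a plan; collect() is the action that finally
// runs the job and returns the result as an Array on the driver.
mysqlPerHour.collect().foreach(println)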

20

Persistence / Caching

• Store an RDD in memory for later use
• Each node persists a partition
• persist() marks an RDD for caching
• It will be cached the first time an action is performed
• Use for iterative algorithms (see the sketch below)
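A sketch of caching the working set of an iterative computation. parsePoint and computeLoss are hypothetical functions; without cache(), every pass would re-read and re-parse the input file.

val points = sc.textFile("hdfs://.../points")
  .map(parsePoint)
  .cache()                    // marked for caching; materialized on the first action

for (i <- 1 to 10) {
  val loss = points.map(computeLoss).reduce(_ + _)   // reuses the in-memory partitions
  println("iteration " + i + ": " + loss)
}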

21

Caching – Storage Levels

• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2, …
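For reference, a small sketch of choosing a level explicitly (cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)); the input path is a placeholder.

import org.apache.spark.storage.StorageLevel

val sessions = sc.textFile("hdfs://.../sessions")

sessions.persist(StorageLevel.MEMORY_AND_DISK)       // spill partitions that don't fit in RAM
// sessions.persist(StorageLevel.MEMORY_ONLY_SER)    // serialized: less RAM, more CPU
// sessions.persist(StorageLevel.MEMORY_AND_DISK_2)  // also replicate to a second node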

22

Fault Tolerance

• Lost partitions can be re-computed from the source data
• Because we remember all transformations

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Lineage: HDFS File → filter(func = startsWith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD]

23

Examples

24

Word Count

val file = spark.textFile("hdfs://…")

file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

Remember MapReduce WordCount?

25

Log Mining

• Load error messages from a log into memory
• Interactively search for patterns

26

Log Mining

val lines = spark.textFile("hdfs://…")                 // base RDD
val errors = lines.filter(_.startsWith("ERROR"))       // transformed RDD
val messages = errors.map(_.split('\t')(2))

val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count()           // action
cachedMsgs.filter(_.contains("bar")).count()
…

27

Logistic Regression

• Read two sets of points
• Look for a plane w that separates them
• Perform gradient descent:
  • Start with a random w
  • On each iteration, sum a function of w over the data
  • Move w in a direction that improves it

28

Intuition

29

Logistic Regression

val points = spark.textFile(…).map(parsePoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= gradient
}

println("Final separating plane: " + w)

30

Conviva Use-Case

• Monitor online video consumption
• Analyze trends

Need to run tens of queries like this a day:

SELECT videoName, COUNT(1)
FROM summaries
WHERE date = '2011_12_12' AND customer = 'XYZ'
GROUP BY videoName;

31

Conviva With Spark

val sessions = sparkContext.sequenceFile[SessionSummary,NullWritable](pathToSessionSummaryOnHdfs)

val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache

val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn : (Long, Long) => Long = { (a, b) => a + b }

val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap

32

Streaming

33

What is it?

• Extension of the Spark API
• For high-throughput, fault-tolerant processing of live data streams

34

Sources & Outputs

Sources:
• Kafka
• Flume
• Twitter
• JMS queues
• TCP sockets

Outputs:
• HDFS
• Databases
• Dashboards
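A minimal sketch wiring one source (a TCP socket) to one output (files on HDFS). The host, port and path are placeholders, an existing SparkContext sc is assumed, and Kafka/Flume/Twitter sources come from separate spark-streaming-* artifacts.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))      // 10-second micro-batches

val events = ssc.socketTextStream("stream-host", 9999)
events.saveAsTextFiles("hdfs://.../events/batch")    // one output directory per interval

ssc.start()
ssc.awaitTermination()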

35

Architecture

[Diagram: Input → Streaming Context → Spark Context]

36

DStreams

• The stream is broken down into micro-batches
• Each micro-batch is an RDD
• This means any Spark function or library can apply to a stream
  • Including MLlib, graph processing, etc.
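A brief sketch of that idea, assuming lines is an existing DStream[String] (for example from socketTextStream): ordinary RDD-style code applies directly, and foreachRDD exposes the underlying RDD of each micro-batch.

val errorCounts = lines
  .filter(_.contains("ERROR"))
  .map(line => (line.split(" ")(0), 1))
  .reduceByKey(_ + _)

// Drop down to the RDD of each micro-batch and use any RDD operation
errorCounts.foreachRDD { rdd =>
  rdd.take(10).foreach(println)
}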

37

Processing DStreams

38

Processing DStreams - Stateless

39

Processing DStreams - Stateful

40

DStream Operators

• Transformation: produce a DStream from one or more parent streams
  • Stateless (independent per interval): map, reduce
  • Stateful (share data across intervals): window, incremental aggregation, time-skewed join
• Output: write data to an external system (save, foreach), e.g. save an RDD to HDFS
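One operator of each kind, as a sketch that assumes words is an existing DStream[String] and a placeholder output path:

import org.apache.spark.streaming.Seconds

// Stateless transformation -- each interval is processed independently
val perBatch = words.map(w => (w, 1)).reduceByKey(_ + _)

// Stateful transformation -- a 30-second window sliding every 10 seconds
val windowed = words.map(w => (w, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

// Output operator -- write each interval's result to HDFS
windowed.saveAsTextFiles("hdfs://.../word-counts")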

41

Fault Recovery

• Input from TCP, Flume or Kafka is stored on 2 nodes
• In case of failure, missing RDDs will be re-computed from the surviving nodes
• RDDs are deterministic, so any computation will lead to the same result
• Transformations can therefore guarantee exactly-once semantics, even through failure

42

Key Question: How fast can the system recover?

43

Example – Streaming WordCount

import org.apache.spark.streaming.{Seconds, StreamingContext}
import StreamingContext._
...

// Create the context and set up a network input stream
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream(args(1), args(2).toInt)

// Split the lines into words, count them,
// and print some of the counts on the master
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()

// Start the computation
ssc.start()

44

Shark

45

Shark Architecture

• Identical to Hive
  • Same CLI, JDBC, SQL parser, Metastore
• Replaced the optimizer, plan generator and execution engine
• Added a cache manager
• Generates Spark code instead of MapReduce

46

Hive Compatibility

• MetaStore
• HQL
• UDF / UDAF
• SerDes
• Scripts

47

Dynamic Query Plans

• Hive metadata often lacks statistics
• Join types often require hinting
• Shark gathers statistics per partition
  • While materializing map output
  • Partition sizes, record counts, skew, histograms
• Alters the plan accordingly

48

Columnar Memory Store

• Better compression
• CPU efficiency
• Cache locality

49

Spark + Shark Integration

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}

val trainedVector = logRegress(features.cache())

50

Summary

51

Why Spark?

• Flexible
• High performance
• Machine learning, iterative algorithms
• Interactive data exploration
• Developer productivity

52

Why not Spark?

• Still immature
• Uses *lots* of memory
• Equivalent functionality in Impala, Storm, etc.

53

How Spark Works

• RDDs: resilient distributed datasets
• Lazy transformations
• Fault-tolerant caching
• Streams: micro-batches of RDDs

54