Spark - Philly JUG

Spark

Brian O’Neill (@boneill42), Monetate

Agenda
● History / Context
  ○ Hadoop
  ○ Lambda
● Spark Basics
  ○ RDDs, DataFrames, SQL, Streaming
● Play along / Demo

We work at Monetate...

(Platform diagram: Client (e.g. Retailer), Decision Engine, Analytics Engine, Data, consumer, marketer, Dashboard, Warehouse, Meta, Observations)

We call it a...
Personalization Platform

Not so hard until...

m’s → B’s (sessions / month)
100ms’s → 10ms’s (response times)
days → minutes (analytics lag)

HISTORY

History - Hadoop

map / reduce

tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]

Word Count - The Code

def map(doc)
  doc.each do |word|
    emit(word, 1)
  end
end

def reduce(key, values[])
  sum = values.inject { |sum, x| sum + x }
  emit(key, sum)
end

The Run

doc1 = “boy meets girl”
doc2 = “girl likes boy”

map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
map(doc2) -> (girl, 1), (likes, 1), (boy, 1)
reduce(boy, [1, 1]) -> (boy, 2)
reduce(girl, [1, 1]) -> (girl, 2)
reduce(likes, [1]) -> (likes, 1)
reduce(meets, [1]) -> (meets, 1)

Jobs on top of jobs...

Real-time? Different hammer.

Let’s invent some terminology...

Traditional lambda...

Can we collapse the lambda?

Spark - FTW!

Lambda on Spark (e.g.)

(Diagram: S3, Kafka, and MySQL feeding into RDDs and a DataFrame in Spark, landing in Druid)

SPARK BASICS

Concept: RDDs

“Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.”

http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
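Both creation paths look roughly like this in the Java API (a minimal sketch; the collection, path, and variable names are made up, and an existing JavaSparkContext sc is assumed):

// 1. Parallelize an existing collection in the driver program
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);

// 2. Reference a dataset in external storage (local fs, HDFS, S3, ...)
JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/path/to/file.txt");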

Concept: Transformations & Actions

Transformation: RDD(s) → RDD
e.g. map, filter, groupBy, etc.

Action: RDD → value
e.g. reduce, count, etc.
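A minimal Java sketch of the distinction (illustrative names only, using Java 8 lambdas and assuming an existing JavaSparkContext sc):

JavaRDD<String> lines = sc.textFile("events.log");
JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR"));  // transformation: RDD → RDD
long errorCount = errors.count();                                 // action: RDD → value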

Code: RDDs

JavaPairRDD<Integer, Product> productsRDD = javaFunctions(sc)
    .cassandraTable("java_api", "products", productReader)
    .keyBy(new Function<Product, Integer>() {
        @Override
        public Integer call(Product product) throws Exception {
            return product.getId();
        }
    });

DAGs

Lazily evaluated!
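Chaining transformations only records a lineage (the DAG); no work happens until an action runs. A small sketch of that behavior (file name is made up, sc assumed as above):

JavaRDD<Integer> lengths = sc.textFile("docs.txt")
    .map(line -> line.length());               // nothing has executed yet
System.out.println(lengths.toDebugString());   // prints the recorded lineage / DAG
long n = lengths.count();                      // the action triggers the actual job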

Concept: DataFrames

DataFrames = RDD + Schema

“A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.”

http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
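For instance, in the Spark 1.x Java API a DataFrame can come straight from a structured file (a sketch; the file and column names are invented):

SQLContext sqlContext = new SQLContext(sc);
DataFrame people = sqlContext.read().json("people.json");  // schema inferred from the JSON
people.printSchema();
people.select("name", "age").show();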

Concept: Spark SQL

SELECT min(event_time) AS start_time,
       max(event_time) AS end_time,
       account_id
FROM events
GROUP BY account_id

Code: SQL + Dataframes

StructType schema = configuration.getSchemaForProduct();
DataFrame dataFrame = sqlContext.createDataFrame(productsRDD, schema);
sqlContext.registerDataFrameAsTable(dataFrame, "products");
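With the table registered, SQL like the query above runs directly against it (a minimal sketch; the query and column name are illustrative):

DataFrame results = sqlContext.sql("SELECT count(*) AS product_count FROM products");
results.show();                            // print a few rows
List<Row> rows = results.collectAsList();  // or pull the results back to the driver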

And remember Uncle Ben…

“With great power, comes great responsibility.”

Concept: Streaming

.foreachRDD

Code: Streaming

JavaStreamingContext streamingContext = new JavaStreamingContext(getSparkConf(),
    SessionizerState.getConfig().getSparkStreamingBatchDuration());

JavaReceiverInputDStream<byte[]> kinesisStream = KinesisUtils.createStream(...);

kinesisStream.foreachRDD(new VoidFunction<JavaRDD<byte[]>>() {
    @Override
    public void call(JavaRDD<byte[]> rdd) throws Exception {
        JavaRDD<String> lines = rdd.map(new Function<byte[], String>() {
            public String call(byte[] bytes) throws IOException {
                return new String(bytes, Charset.forName("UTF-8"));
            }
        });
        processRdd(lines);
    }
});
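One step the snippet leaves implicit: nothing is received or processed until the streaming context is started. The usual boilerplate (a sketch, not necessarily the exact production code):

streamingContext.start();             // begin pulling records and scheduling micro-batches
streamingContext.awaitTermination();  // block the driver until the stream stops or fails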

DEPLOYMENT

Basic Architecture

http://spark.apache.org/docs/latest/cluster-overview.html

YARN!

Kinesis / Streaming Architecture

Amazon’s EMR

Play along demo.

Get stuff...

Get Spark... http://spark.apache.org/downloads.html

Get Cassandra…http://cassandra.apache.org/download/

Get Code…https://github.com/boneill42/spark-on-cassandra-quickstart

Configure stuff...

$spark/conf> cp spark-env.sh.template spark-env.sh
$spark/conf> echo "SPARK_MASTER_IP=127.0.0.1" >> spark-env.sh

Start stuff...

# Start Master
$spark> sbin/start-master.sh
$spark> tail -f logs/*

# Start Worker
$spark> bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://127.0.0.1:7077

Build and launch stuff...

# Build
$code> mvn clean install
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

# Launch
$code> spark-submit --class com.github.boneill42.JavaDemo \
    --master spark://127.0.0.1:7077 \
    target/spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://127.0.0.1:7077 127.0.0.1

A message from our sponsor

Advertisements...

https://github.com/monetate/koupler

https://github.com/monetate/ectou-metadata

https://github.com/monetate/ectou-export