Spark - Philly JUG

Page 1: Spark  - Philly JUG

Spark

Brian O’Neill (@boneill42), Monetate

Page 2: Spark  - Philly JUG

Agenda

● History / Context
  ○ Hadoop
  ○ Lambda
● Spark Basics
  ○ RDDs, DataFrames, SQL, Streaming
● Play along / Demo

Page 3: Spark  - Philly JUG

We work at Monetate...

[Architecture diagram: Client (e.g. Retailer), Decision Engine, Analytics Engine, Data, Dashboard, Warehouse, Meta, Observations; actors: consumer, marketer]

Page 4: Spark  - Philly JUG

We call it a... Personalization Platform

Not so hard until...
● m’s → B’s (sessions / month)
● 100ms’s → 10ms’s (response times)
● days → minutes (analytics lag)

Page 5: Spark  - Philly JUG

HISTORY

Page 6: Spark  - Philly JUG

history - hadoop

Page 7: Spark  - Philly JUG

map / reduce

tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]

Page 8: Spark  - Philly JUG

word count

The Code

def map(doc)
  doc.each do |word|
    emit(word, 1)
  end
end

def reduce(key, values)
  sum = values.inject {|sum, x| sum + x }
  emit(key, sum)
end

The Run

doc1 = "boy meets girl"
doc2 = "girl likes boy"

map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
map(doc2) -> (girl, 1), (likes, 1), (boy, 1)

reduce(boy, [1, 1]) -> (boy, 2)
reduce(girl, [1, 1]) -> (girl, 2)
reduce(likes, [1]) -> (likes, 1)
reduce(meets, [1]) -> (meets, 1)

Page 9: Spark  - Philly JUG

Jobs on top of jobs...

Page 10: Spark  - Philly JUG

Real-time? Different hammer.

Page 11: Spark  - Philly JUG

Let’s invent some terminology...

Page 12: Spark  - Philly JUG

Traditional lambda...

Page 13: Spark  - Philly JUG

Can we collapse the lambda?

Page 14: Spark  - Philly JUG

Spark - FTW!

Page 15: Spark  - Philly JUG

Lambda on Spark (e.g.)

[Pipeline diagram: sources (S3, Kafka, MySQL) feed RDDs, which combine into a DataFrame and load Druid]

Page 16: Spark  - Philly JUG

SPARK BASICS

Page 17: Spark  - Philly JUG

Concept : RDDs“Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.”

http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
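
Both creation paths look like this in the Java API; a minimal sketch, assuming a JavaSparkContext named sc and a placeholder HDFS path:

// 1. Parallelize an existing collection in the driver program
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

// 2. Reference a dataset in external storage (path is a placeholder)
JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/data/events.log");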

Page 18: Spark  - Philly JUG

Concept : Transformations & Actions

Transformation:
RDD(s) → RDD
e.g. map, filter, groupBy, etc.

Action:
RDD → value
e.g. reduce, count, etc.
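
In code the distinction looks like this; a minimal sketch, again assuming a JavaSparkContext named sc:

// Transformations (map, filter, ...) return a new RDD; nothing executes yet
JavaRDD<String> docs = sc.parallelize(Arrays.asList("boy meets girl", "girl likes boy"));
JavaRDD<Integer> lengths = docs.map(new Function<String, Integer>() {
    public Integer call(String doc) { return doc.length(); }
});

// An action (reduce, count, ...) triggers the computation and returns a value
Integer total = lengths.reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});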

Page 19: Spark  - Philly JUG

Code: RDDs

JavaPairRDD<Integer, Product> productsRDD = javaFunctions(sc)
    .cassandraTable("java_api", "products", productReader)
    .keyBy(new Function<Product, Integer>() {
        @Override
        public Integer call(Product product) throws Exception {
            return product.getId();
        }
    });

Page 20: Spark  - Philly JUG

DAGs

Lazily evaluated!
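
Transformations only record lineage; Spark builds the DAG and runs nothing until an action fires. You can inspect the plan it has built with toDebugString(); a sketch reusing the docs RDD from the previous example:

JavaRDD<String> girls = docs.filter(new Function<String, Boolean>() {
    public Boolean call(String doc) { return doc.contains("girl"); }
}); // returns immediately, no job runs

System.out.println(girls.toDebugString()); // prints the RDD lineage (the DAG)
long matches = girls.count(); // count() is an action: the job executes here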

Page 21: Spark  - Philly JUG

Concept : DataFrames

DataFrames = RDD + Schema

“A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.”

http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
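
Besides wrapping an existing RDD (next two slides), a DataFrame can be read straight from a structured source; a sketch in the Spark 1.x Java API, assuming an SQLContext named sqlContext and a placeholder JSON path:

DataFrame people = sqlContext.read().json("examples/people.json"); // schema is inferred
people.printSchema(); // named columns, like a relational table
people.show();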

Page 22: Spark  - Philly JUG

Concept : Spark SQL

SELECT min(event_time) AS start_time,
       max(event_time) AS end_time,
       account_id
FROM events
GROUP BY account_id

Page 23: Spark  - Philly JUG

Code: SQL + DataFrames

StructType schema = configuration.getSchemaForProduct();
DataFrame dataFrame = sqlContext.createDataFrame(productsRDD, schema);
sqlContext.registerDataFrameAsTable(dataFrame, "products");
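
With the table registered, SQL like the query on Page 22 runs against it; a sketch:

DataFrame results = sqlContext.sql("SELECT * FROM products");
results.show(); // show() is an action: the query actually executes here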

Page 24: Spark  - Philly JUG

And remember Uncle Ben…

“With great power comes great responsibility.”

Page 25: Spark  - Philly JUG

Concept : Streaming

.foreachRDD

Page 26: Spark  - Philly JUG

Code: Streaming

JavaStreamingContext streamingContext = new JavaStreamingContext(getSparkConf(),
        SessionizerState.getConfig().getSparkStreamingBatchDuration());

JavaReceiverInputDStream<byte[]> kinesisStream = KinesisUtils.createStream(...);

kinesisStream.foreachRDD(new VoidFunction<JavaRDD<byte[]>>() {
    @Override
    public void call(JavaRDD<byte[]> rdd) throws Exception {
        JavaRDD<String> lines = rdd.map(new Function<byte[], String>() {
            public String call(byte[] bytes) throws IOException {
                return new String(bytes, Charset.forName("UTF-8"));
            }
        });
        processRdd(lines);
    }
});
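
One step the snippet elides: nothing flows until the streaming context is started, and the driver then blocks awaiting termination:

streamingContext.start(); // begin receiving and processing micro-batches
streamingContext.awaitTermination(); // block until the streaming job is stopped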

Page 27: Spark  - Philly JUG

DEPLOYMENT

Page 28: Spark  - Philly JUG

Basic Architecture

http://spark.apache.org/docs/latest/cluster-overview.html

YARN!

Page 29: Spark  - Philly JUG

Kinesis / Streaming Architecture

Page 30: Spark  - Philly JUG

Amazon’s EMR

Page 31: Spark  - Philly JUG

Play along demo.

Page 32: Spark  - Philly JUG

Get stuff...

Get Spark...
http://spark.apache.org/downloads.html

Get Cassandra...
http://cassandra.apache.org/download/

Get Code...
https://github.com/boneill42/spark-on-cassandra-quickstart

Page 33: Spark  - Philly JUG

Configure stuff...

$spark/conf> cp spark-env.sh.template spark-env.sh
$spark/conf> echo "SPARK_MASTER_IP=127.0.0.1" >> spark-env.sh

Page 34: Spark  - Philly JUG

Start stuff...

# Start Master
$spark> sbin/start-master.sh
$spark> tail -f logs/*

# Start Worker
$spark> bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://127.0.0.1:7077

Page 35: Spark  - Philly JUG

Build and launch stuff...

# Build
$code> mvn clean install
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

# Launch
$code> spark-submit --class com.github.boneill42.JavaDemo \
    --master spark://127.0.0.1:7077 \
    target/spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://127.0.0.1:7077 127.0.0.1

Page 36: Spark  - Philly JUG

A message from our sponsor

Page 37: Spark  - Philly JUG

Advertisements...

https://github.com/monetate/koupler

https://github.com/monetate/ectou-metadata

https://github.com/monetate/ectou-export

