
Kafka pours and Spark resolves

Date post: 21-Apr-2017
Upload: alexey-zinoviev
Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM
Transcript
Page 1: Kafka pours and Spark resolves

Kafka pours and Spark resolves!

Alexey Zinovyev, Java/BigData Trainer in EPAM

Page 2: Kafka pours and Spark resolves

About

With IT since 2007

With Java since 2009

With Hadoop since 2012

With Spark since 2014

With EPAM since 2015

Page 3: Kafka pours and Spark resolves

3Spark Streaming from Zinoviev Alexey

Contacts

E-mail : [email protected]

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs

Page 4: Kafka pours and Spark resolves


Spark Family

Page 6: Kafka pours and Spark resolves


Pre-summary

• Before RealTime

• Spark + Cassandra

• Sending messages with Kafka

• DStream Kafka Consumer

• Structured Streaming in Spark 2.1

• Kafka Writer in Spark 2.2

Page 7: Kafka pours and Spark resolves


< REAL-TIME

Page 8: Kafka pours and Spark resolves


Modern Java in 2016

Big Data in 2014

Page 9: Kafka pours and Spark resolves


Batch jobs produce reports. More and more of them...

Page 10: Kafka pours and Spark resolves


But the customer can wait forever (OK, 1 hour)

Page 11: Kafka pours and Spark resolves


Big Data in 2017

Page 12: Kafka pours and Spark resolves


Machine Learning EVERYWHERE

Page 13: Kafka pours and Spark resolves


Data Lake in promotional brochure

Page 14: Kafka pours and Spark resolves


Data Lake in production

Page 15: Kafka pours and Spark resolves


Simple Flow in Reporting/BI systems

Page 16: Kafka pours and Spark resolves


Let’s use Spark. It’s fast!

Page 17: Kafka pours and Spark resolves


MapReduce vs Spark

Page 19: Kafka pours and Spark resolves


Simple Flow in Reporting/BI systems with Spark

Page 20: Kafka pours and Spark resolves


Spark handles last year's logs with ease

Page 21: Kafka pours and Spark resolves


Where can we store events?

Page 22: Kafka pours and Spark resolves


Let’s use Cassandra to store events!

Page 23: Kafka pours and Spark resolves


Let’s use Cassandra to read events!

Page 24: Kafka pours and Spark resolves


Cassandra

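-- Keyspace for the demo; a single replica is enough for a local node
CREATE KEYSPACE mySpace
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Switch to the keyspace that was just created
USE mySpace;

-- One partition per application, rows ordered by time within it
CREATE TABLE logs (
  application TEXT,
  time TIMESTAMP,
  message TEXT,
  PRIMARY KEY (application, time)
);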

Page 25: Kafka pours and Spark resolves


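Cassandra to Spark

// Load the logs table through the Spark Cassandra connector.
// Unquoted CQL identifiers are stored in lower case, so the keyspace
// may have to be given as "myspace" here.
val dataSet = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "logs", "keyspace" -> "mySpace"))
  .load()

// Keep only the rows with the given message and print them
dataSet
  .filter("message = 'Log message'")
  .show()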

Page 26: Kafka pours and Spark resolves


Simple Flow in Pre-Real-Time systems

Page 27: Kafka pours and Spark resolves


Spark cluster over Cassandra Cluster

Page 28: Kafka pours and Spark resolves


More events every second!

Page 29: Kafka pours and Spark resolves


SENDING MESSAGES

Page 30: Kafka pours and Spark resolves


Your Father’s Messaging System

Page 32: Kafka pours and Spark resolves


Your Father’s Messaging System

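import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.naming.InitialContext;

// Look up the connection factory in JNDI, then open and start a connection
InitialContext ctx = new InitialContext();
QueueConnectionFactory f = (QueueConnectionFactory) ctx.lookup("qCFactory");
QueueConnection con = f.createQueueConnection();
con.start();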

Page 33: Kafka pours and Spark resolves


KAFKA

Page 38: Kafka pours and Spark resolves


Kafka

• messaging system

• distributed

• supports Publish-Subscribe model

• persists messages on disk

• replicates within the cluster (integrated with Zookeeper)

Page 39: Kafka pours and Spark resolves


The main benefits of Kafka

Scalability with zero downtime

Zero data loss due to replication

Page 40: Kafka pours and Spark resolves


Kafka Cluster consists of …

• brokers (leader or follower)

• topics ( >= 1 partition)

• partitions

• partition offsets

• replicas of partition

• producers/consumers
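
The relationship between topics, partitions and offsets listed above can be sketched with a toy in-memory model. This is not the Kafka client API: the names TopicModel, Topic, send and read are made up for illustration, and the key is hashed with hashCode as a simplification (Kafka's default partitioner hashes the serialized key bytes instead).

```scala
import scala.collection.mutable.ArrayBuffer

// A toy, in-memory model of a topic. All names here are hypothetical.
object TopicModel {
  // Each partition is an append-only log; a record's offset is its index in that log.
  final class Topic(val name: String, numPartitions: Int) {
    require(numPartitions >= 1, "a topic has at least one partition")

    private val partitions: Vector[ArrayBuffer[String]] =
      Vector.fill(numPartitions)(ArrayBuffer.empty[String])

    // Simplification: hashCode stands in for Kafka's hash of the serialized key.
    def partitionFor(key: String): Int = Math.floorMod(key.hashCode, numPartitions)

    // Appends the value and returns (partition, offset) of the new record.
    def send(key: String, value: String): (Int, Long) = {
      val p = partitionFor(key)
      partitions(p) += value
      (p, (partitions(p).size - 1).toLong)
    }

    // Reads everything in one partition, starting from the given offset.
    def read(partition: Int, fromOffset: Long): List[String] =
      partitions(partition).drop(fromOffset.toInt).toList
  }
}
```

Records with the same key land in the same partition, so their relative order is preserved, and a consumer only has to remember one offset per partition to resume reading.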

Page 41: Kafka pours and Spark resolves


Kafka Components with topic “messages” #1

[Diagram: producer threads #1, #2 and #3 send data to the topic Messages; ZooKeeper sits alongside.]

Page 42: Kafka pours and Spark resolves


Kafka Components with topic “messages” #2

[Diagram: producer threads #1, #2 and #3 send data to the topic Messages, which is split into Part #1 and Part #2; Broker #1 holds both partitions, each marked as leader; ZooKeeper sits alongside.]

Page 43: Kafka pours and Spark resolves


Kafka Components with topic “messages” #3

[Diagram: producer threads #1, #2 and #3 send data to the topic Messages, which is split into Part #1 and Part #2; Broker #1 and Broker #2 each hold a copy of Part #1 and Part #2; ZooKeeper sits alongside.]

Page 44: Kafka pours and Spark resolves


Why do we need Zookeeper?

Page 45: Kafka pours and Spark resolves


Kafka

Demo

Page 46: Kafka pours and Spark resolves


REAL TIME WITH DSTREAMS

Page 47: Kafka pours and Spark resolves


RDD Factory

Page 48: Kafka pours and Spark resolves


From socket to console with DStreams

Page 49: Kafka pours and Spark resolves


DStream

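import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the socket receiver, one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
// Every 1-second batch of input becomes one RDD
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
// Nothing runs until the context is started
ssc.start()
ssc.awaitTermination()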
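
One way to picture what the DStream code does: the stream is cut into batches by arrival time, and the same word-count job runs on each batch independently. The sketch below models that in plain Scala without Spark; MicroBatchModel, Event and toBatches are hypothetical names, and arrival times are supplied by hand.

```scala
// A toy model of micro-batching; it does not use Spark.
object MicroBatchModel {
  // An input line together with its arrival time in milliseconds.
  final case class Event(arrivalMs: Long, line: String)

  // Assigns each event to the batch interval it arrived in.
  def toBatches(events: Seq[Event], batchMs: Long): Map[Long, Seq[String]] =
    events
      .groupBy(e => e.arrivalMs / batchMs)
      .map { case (batch, es) => batch -> es.map(_.line) }

  // The per-batch job: split into words, pair each with 1, sum per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, ones) => word -> ones.map(_._2).sum }
}
```

Counts here are per batch, not cumulative, which mirrors reduceByKey on a DStream; running totals across batches need explicit state.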

Page 54: Kafka pours and Spark resolves


Kafka as a main entry point for Spark

Page 55: Kafka pours and Spark resolves


DStreams

Demo

Page 56: Kafka pours and Spark resolves


How can we avoid DStreams and their RDD-like API?

Page 57: Kafka pours and Spark resolves


SPARK 2.2 DISCUSSION

Page 58: Kafka pours and Spark resolves


Continuous Applications

Page 59: Kafka pours and Spark resolves


Use cases for Continuous Applications

• Updating data that will be served in real time

• Extract, transform and load (ETL)

• Creating a real-time version of an existing batch job

• Online machine learning

Page 60: Kafka pours and Spark resolves


The main concept of Structured Streaming

You can express your streaming computation the same way you would express a batch computation on static data.

Page 61: Kafka pours and Spark resolves


Batch, Spark 2.2

// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")

// Transform with the DataFrame API and save as Parquet
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")

Page 62: Kafka pours and Spark resolves


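Real Time, Spark 2.2

// Read JSON continuously from S3. Streaming file sources need an explicit schema;
// logSchema is assumed to be a StructType defined elsewhere.
val logsDF = spark.readStream.schema(logSchema).json("s3://logs")

// Transform with the DataFrame API and write Parquet continuously.
// The checkpoint path is an assumed example; the file sink requires one.
logsDF.select("user", "url", "date")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3://checkpoints")
  .start("s3://out")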

Page 66: Kafka pours and Spark resolves


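WordCount from Socket

// spark is the SparkSession; its implicits provide the encoder for .as[String]
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

// Don't forget to start streaming: nothing runs until start() is called
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()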

Page 67: Kafka pours and Spark resolves


Unbounded Table

Page 68: Kafka pours and Spark resolves


WordCount with Structured Streaming [Complete Mode]

Page 69: Kafka pours and Spark resolves


Kafka -> Structured Streaming -> Console

Page 70: Kafka pours and Spark resolves


Kafka to Console Demo

Page 71: Kafka pours and Spark resolves


OPERATIONS

Page 72: Kafka pours and Spark resolves


You can …

• filter

• sort

• aggregate

• join

• foreach

• explain

Page 73: Kafka pours and Spark resolves


Operators

Demo

Page 74: Kafka pours and Spark resolves


How does it work?

Page 75: Kafka pours and Spark resolves


Deep Diving in Spark Internals

[Diagram, built up step by step across slides 75-81: Dataset, Logical Plan, Optimized Logical Plan, Physical Plans, Selected Physical Plan, Planner, and the incremental executions #1, #2 and #3.]

Page 82: Kafka pours and Spark resolves


Incremental Execution: Planner polls

Planner

Page 83: Kafka pours and Spark resolves


Incremental Execution: Planner runs

[Diagram: the Planner runs Incremental #1 over offsets 1-5. Count: 5]

Page 84: Kafka pours and Spark resolves


Incremental Execution: Planner runs #2

[Diagram: the Planner runs Incremental #1 over offsets 1-5 (count: 5) and Incremental #2 over offsets 6-9 (count: 4).]

Page 85: Kafka pours and Spark resolves


Aggregation with State

[Diagram: the Planner runs Incremental #1 over offsets 1-5 (count: 5); the state 5 is carried into Incremental #2 over offsets 6-9 (count: 5 + 4 = 9).]
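
The difference between the per-run counts and the stateful aggregation on these slides can be sketched in plain Scala. The offset ranges (1-5 and 6-9) come from the slides; IncrementalModel, OffsetRange and the two functions are illustrative names, not Spark internals.

```scala
// A toy model of incremental execution; names are hypothetical.
object IncrementalModel {
  // The offsets covered by one incremental run, both ends inclusive.
  final case class OffsetRange(from: Long, to: Long) {
    def count: Long = to - from + 1
  }

  // Without state: every run reports only the records it has just read.
  def perRunCounts(runs: Seq[OffsetRange]): Seq[Long] =
    runs.map(_.count)

  // With state: every run adds its own count to the total carried over so far.
  def runningCounts(runs: Seq[OffsetRange]): Seq[Long] =
    runs.scanLeft(0L)((state, run) => state + run.count).tail
}
```

For the two runs on the slides, the stateless version would yield 5 and 4, while the stateful version would yield 5 and 9.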

Page 86: Kafka pours and Spark resolves


DataSet.explain()

== Physical Plan ==
Project [avg(price)#43,carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43,color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)

Page 87: Kafka pours and Spark resolves


What’s the difference between Complete and Append output modes?

Page 88: Kafka pours and Spark resolves


COMPLETE, APPEND & UPDATE

Page 91: Kafka pours and Spark resolves


There are two main modes and one planned for the future

• append (default)

• complete

• update [in dreams]
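
What each mode would hand to the sink after a trigger can be sketched as a toy model over a result table of word counts. OutputModeModel and its functions are hypothetical names. The append function is a deliberate simplification: it treats a key as final as soon as it appears, which a real streaming aggregation cannot assume on its own.

```scala
// A toy model of output modes; it does not use Spark.
object OutputModeModel {
  // The result table: word -> count.
  type Table = Map[String, Long]

  // Complete: the whole result table is written after every trigger.
  def complete(previous: Table, current: Table): Table = current

  // Update: only rows that are new or whose value changed since the last trigger.
  def update(previous: Table, current: Table): Table =
    current.filter { case (key, value) => !previous.get(key).contains(value) }

  // Append (simplified): only rows whose key was not in the table before.
  def append(previous: Table, current: Table): Table =
    current.filter { case (key, _) => !previous.contains(key) }
}
```

With a previous table of a=1, b=1 and a current table of a=2, b=1, c=1, complete would emit all three rows, update would emit a and c, and the simplified append would emit only c.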

Page 92: Kafka pours and Spark resolves


Aggregation with watermarks
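
The idea behind a watermark can be sketched in plain Scala: track the largest event time seen so far, subtract an allowed delay, and ignore events older than that threshold. This is a toy model under simplifying assumptions (the threshold moves after every event, and the names WatermarkModel, Event and countWithWatermark are hypothetical); it is not how Spark schedules or stores state.

```scala
// A toy model of windowed counting with a watermark; it does not use Spark.
object WatermarkModel {
  final case class Event(eventTimeMs: Long, word: String)

  // Counts words per (window start, word), processing events in arrival order.
  // An event is dropped when its event time is below maxEventTime - delayMs.
  def countWithWatermark(events: Seq[Event], windowMs: Long, delayMs: Long): Map[(Long, String), Long] = {
    var maxEventTime = Long.MinValue
    var counts = Map.empty[(Long, String), Long]
    for (e <- events) {
      // No watermark exists until the first event has been seen.
      val watermark = if (maxEventTime == Long.MinValue) Long.MinValue else maxEventTime - delayMs
      if (e.eventTimeMs >= watermark) {
        val windowStart = e.eventTimeMs / windowMs * windowMs
        val key = (windowStart, e.word)
        counts = counts.updated(key, counts.getOrElse(key, 0L) + 1L)
        maxEventTime = math.max(maxEventTime, e.eventTimeMs)
      }
    }
    counts
  }
}
```

The delay bounds how long old windows have to be kept: once the threshold has passed a window, late events for it are ignored and its state could be discarded.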

Page 93: Kafka pours and Spark resolves


SOURCES & SINKS

Page 94: Kafka pours and Spark resolves


Spark Streaming is a brick in the Big Data Wall

Page 95: Kafka pours and Spark resolves


Let's save to Parquet files

Page 98: Kafka pours and Spark resolves


Can we write to Kafka?

Page 99: Kafka pours and Spark resolves


Nightly Build

Page 100: Kafka pours and Spark resolves


Kafka-to-Kafka

Page 101: Kafka pours and Spark resolves


J-K-S-K-S-C

Page 102: Kafka pours and Spark resolves


Pipeline

Demo

Page 103: Kafka pours and Spark resolves


I didn’t find sink/source for XXX…

Page 104: Kafka pours and Spark resolves


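Console Foreach Sink

import org.apache.spark.sql.ForeachWriter

// Custom sink that prints every record of the stream to the console
val customWriter = new ForeachWriter[String] {
  // Returning true means "process this partition"
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(value: String): Unit = println(value)
  override def close(errorOrNull: Throwable): Unit = {}
}

// stream is assumed to be a streaming Dataset[String]
stream.writeStream
  .queryName("ForeachOnConsole")
  .foreach(customWriter)
  .start()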

Page 105: Kafka pours and Spark resolves


Pinch of wisdom

• check checkpointLocation

• don’t use MemoryStream

• think about GC pauses

• be careful with nightly builds

• use .groupBy.count() instead of count()

• use the console sink instead of the .show() function

Page 106: Kafka pours and Spark resolves


We have no ability to…

• join two streams

• work with update mode

• make full outer join

• take first N rows

• sort without pre-aggregation

Page 107: Kafka pours and Spark resolves


Roadmap 2.2

• Support other data sources (not only S3 + HDFS)

• Transactional updates

• Dataset is one DSL for all operations

• GraphFrames + Structured MLLib

• KafkaWriter

• TensorFrames

Page 108: Kafka pours and Spark resolves


IN CONCLUSION

Page 109: Kafka pours and Spark resolves


Final point

A scalable, fault-tolerant real-time pipeline with Spark & Kafka is ready for use

Page 110: Kafka pours and Spark resolves


A few papers about Spark Streaming and Kafka

Introduction to Spark + Kafka

http://bit.ly/2mJjE4i

Page 111: Kafka pours and Spark resolves


Source code from Demo

Spark Tutorial

https://github.com/zaleslaw/Spark-Tutorial

Page 113: Kafka pours and Spark resolves


Any questions?

