
Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Date post: 12-Jan-2017
Upload: edureka
Transcript
Page 1: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

www.edureka.co/apache-Kafka

Apache Kafka with Spark Streaming - Real Time Analytics Redefined

Page 2: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Agenda

At the end of this webinar, we will be able to understand:

• What is Kafka?
• Why do we need Kafka?
• Kafka components
• How Kafka works
• Which companies are using Kafka?
• Kafka and Spark integration hands-on

Page 3: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Why Kafka?

Page 4: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Why Kafka? We already have other messaging systems.

Aren’t they good?

Kafka vs. other message brokers?

Page 5: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

They are all good, but not for all use cases.

Page 6: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

• Transportation of logs
• Activity streams in real time
• Collection of performance metrics
  – CPU/IO/memory usage
  – Application-specific metrics:
    • Time taken to load a web page
    • Time taken by multiple services while building a web page
    • Number of requests
    • Number of hits on a particular page/URL

So what are my use cases…

Page 7: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

What is Common?

Scalable: needs to be highly scalable. A lot of data; it can be billions of messages.

Reliable: what if I lose a small number of messages? Is that fine for my use case?

Distributed: multiple producers, multiple consumers.

High-throughput: does not need to follow the JMS standard, which may be overkill for some use cases such as transportation of logs. Under JMS, each message has to be acknowledged back, and an exactly-once delivery guarantee requires a two-phase commit.

Page 8: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Why did LinkedIn build Kafka?

To collect its growing data, LinkedIn developed many custom data pipelines for streaming and queueing data, for example:

• To flow data into the data warehouse
• To send batches of data into the Hadoop workflow for analytics
• To collect and aggregate logs from every service
• To collect tracking events like page views
• To queue the InMail messaging system
• To keep the people-search system up to date whenever someone updated their profile

As the site needed to scale, each individual pipeline needed to scale as well, and many more pipelines were needed.

Something had to give: Kafka.

Page 9: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

The number has been growing ever since.

Source: Confluent

Page 10: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

http://gigaom.com/2013/12/09/netflix-open-sources-its-data-traffic-cop-suro/

A diagram of LinkedIn’s data architecture as of February 2013, including everything from Kafka to Teradata.


Page 11: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Kafka?

Built with speed and scalability in mind:

• Enabled near real-time access to any data source
• Empowered Hadoop jobs
• Allowed us to build real-time analytics

Apache Kafka hit 1.1 trillion messages per day (September 2015).

Kafka is a distributed pub-sub messaging platform: a universal pipeline, built around the concept of a commit log; Kafka as a universal stream broker.

Page 12: Apache Kafka with Spark Streaming: Real-time Analytics Redefined


Kafka Benchmarks

Page 13: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Kafka Producer/Consumer Performance

Processes hundreds of thousands of messages per second.

Page 14: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

How fast is Kafka?

• “Up to 2 million writes/sec on 3 cheap machines”
  – Using 3 producers on 3 different machines, 3x async replication
  – Only 1 producer per machine because the NIC was already saturated
• Sustained throughput as stored data grows
  – Slightly different test configuration than the 2M writes/sec above
• Test setup
  – Kafka trunk as of April 2013, but 0.8.1+ should be similar
  – 3 machines: 6-core Intel Xeon 2.5 GHz, 32 GB RAM, 6x 7200 rpm SATA, 1 GigE

Page 15: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

• Fast writes: while Kafka persists all data to disk, essentially all writes go to the OS page cache, i.e. RAM (cf. hardware specs and OS tuning; we cover this later).
• Fast reads: very efficient to transfer data from the page cache to a network socket; on Linux, via the sendfile() system call.
• The combination of the two = fast Kafka! Example (operations): on a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as they serve data entirely from cache.

http://kafka.apache.org/documentation.html#persistence

Why is Kafka so fast?
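The sendfile() path can be seen from Python's stdlib wrapper. A minimal sketch (Linux-specific; on other platforms os.sendfile may be absent or restricted to sockets), copying a file's bytes without a round trip through user space; the file names and sizes are illustrative only:

```python
import os
import tempfile

# Zero-copy transfer via sendfile(2): the kernel moves bytes from the OS
# page cache straight to the destination file descriptor, with no copy
# through user-space buffers. This is the same call Kafka relies on (on
# Linux) to serve consumers directly from cached log segments.
src_path = tempfile.NamedTemporaryFile(delete=False).name
dst_path = tempfile.NamedTemporaryFile(delete=False).name

with open(src_path, "wb") as f:
    f.write(b"x" * 4096)  # pretend this is a log segment

with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
    # sendfile(out_fd, in_fd, offset, count) -> bytes transferred
    sent = os.sendfile(fout.fileno(), fin.fileno(), 0, 4096)

print(sent)
```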

Page 16: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

• Example: loggly.com, who run Kafka & Co. on Amazon AWS
  – “99.99999% of the time our data is coming from disk cache and RAM; only very rarely do we hit the disk.”
  – “One of our consumer groups (8 threads) which maps a log to a customer can process about 200,000 events per second draining from 192 partitions spread across 3 brokers.”
  – Brokers run on m2.xlarge Amazon EC2 instances backed by provisioned IOPS

http://www.developer-tech.com/news/2014/jun/10/why-loggly-loves-apache-kafka-how-unbreakable-infinitely-scalable-messaging-makes-log-management-better/

Why is Kafka so fast?

Page 17: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

How does it work?

Page 18: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

• The who is who
  – Producers write data to brokers.
  – Consumers read data from brokers.
  – All of this is distributed.
• The data
  – Data is stored in topics.
  – Topics are split into partitions, which are replicated.

A first look

Page 19: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Topics

• Topic: feed name to which messages are published
  – Example: “zerg.hydra”

[Diagram: producers A1 … An always append new messages to the “tail” of a Kafka topic on the broker(s) (think: appending to a file); Kafka prunes the “head” based on age, max size, or “key”, so older messages sit toward the head and newer messages at the tail.]
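The append-only commit log above can be sketched as a toy in-memory model. This is illustrative only: real Kafka persists segments to disk and prunes by retention time and size, while this hypothetical `PartitionLog` prunes by a simple message-count cap:

```python
class PartitionLog:
    """Toy model of one Kafka partition: an append-only sequence of messages."""

    def __init__(self, max_size=1000):
        self.messages = []     # retained messages, oldest first
        self.base_offset = 0   # offset of the oldest retained message
        self.max_size = max_size

    def append(self, message):
        """Producers always append to the tail; returns the message's offset."""
        self.messages.append(message)
        offset = self.base_offset + len(self.messages) - 1
        # Prune the head once the retention cap is exceeded.
        while len(self.messages) > self.max_size:
            self.messages.pop(0)
            self.base_offset += 1
        return offset

    def read(self, offset):
        """Consumers read from any retained offset onward."""
        start = offset - self.base_offset
        if start < 0:
            raise KeyError(f"offset {offset} already pruned")
        return self.messages[start:]

log = PartitionLog(max_size=3)
for msg in ["m0", "m1", "m2", "m3"]:
    log.append(msg)
print(log.read(1))  # "m0" was pruned; offsets 1..3 remain readable
```

Note that reading never removes messages: consumers just pick an offset, which is what lets multiple consumer groups read the same topic independently.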

Page 20: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Topics

[Diagram: the same producers append to the tail of the topic (older msgs … newer msgs); consumer groups C1 and C2 each use an “offset pointer” to track/control their read progress and decide the pace of consumption.]

Page 21: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Topics

• A topic consists of partitions.
• Partition: an ordered, immutable sequence of messages that is continually appended to.

Page 22: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Topics

• The number of partitions of a topic is configurable.
• The number of partitions determines the maximum consumer (group) parallelism.
  – Consumer group A, with 2 consumers, reads from a 4-partition topic.
  – Consumer group B, with 4 consumers, reads from the same topic.
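The parallelism cap can be illustrated with a simple round-robin assignor. This is a sketch under simplifying assumptions: real Kafka ships pluggable assignment strategies (range, round-robin, and others), but the invariant is the same, at most one consumer of a group per partition:

```python
def assign_partitions(num_partitions, consumers):
    """Round-robin assignment of a topic's partitions to one group's consumers.

    Each partition goes to exactly one consumer in the group, so the
    number of partitions bounds the group's parallelism.
    """
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# Group A: 2 consumers over a 4-partition topic -> 2 partitions each.
group_a = assign_partitions(4, ["a1", "a2"])
# Group B: 4 consumers over the same topic -> 1 partition each.
group_b = assign_partitions(4, ["b1", "b2", "b3", "b4"])
print(group_a)
print(group_b)
# A 5th consumer in group B would sit idle: only 4 partitions exist.
```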

Page 23: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Topics

• Offset: messages in the partitions are each assigned a unique (per-partition) sequential id called the offset.
  – Consumers track their pointers via (offset, partition, topic) tuples.

[Diagram: consumer group C1 reading at its own offsets.]
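The per-consumer bookkeeping over (offset, partition, topic) tuples can be sketched as below. This is an illustrative model only; real consumers commit their offsets to ZooKeeper or, in later Kafka versions, to an internal Kafka topic:

```python
class ConsumerPosition:
    """Sketch of one consumer's read positions across topic-partitions."""

    def __init__(self):
        self.offsets = {}  # (topic, partition) -> next offset to read

    def position(self, topic, partition):
        """Where this consumer will read next; 0 if nothing consumed yet."""
        return self.offsets.get((topic, partition), 0)

    def advance(self, topic, partition, num_messages):
        """Move the pointer forward after consuming a batch."""
        key = (topic, partition)
        self.offsets[key] = self.position(topic, partition) + num_messages

pos = ConsumerPosition()
pos.advance("zerg.hydra", 0, 5)  # consumed 5 messages from partition 0
pos.advance("zerg.hydra", 0, 2)  # then 2 more
print(pos.position("zerg.hydra", 0))  # -> 7
print(pos.position("zerg.hydra", 1))  # -> 0, untouched partition
```

Because the pointer lives with the consumer (group), not the broker, two groups can read the same partition at entirely different paces.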

Page 24: Apache Kafka with Spark Streaming: Real-time Analytics Redefined


http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/

Partition

Page 25: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Kafka Broker

[Diagram: a Producer publishes to the Kafka broker; Consumer1/Consumer2 (Group1) and Consumer3/Consumer4 (Group2) get the Kafka broker address from ZooKeeper, fetch streaming messages, and update the consumed-message offset; the same broker serves both a queue topology and a topic topology.]

The broker does not push messages to the consumer; the consumer polls messages from the broker.
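The pull model can be sketched in a few lines; here a plain Python list stands in for a partition, and the `poll` function name and `max_records` cap are illustrative, not the Kafka client API:

```python
def poll(partition, position, max_records=2):
    """Pull model: the consumer asks for the next batch at its own pace.

    The broker never pushes; it just answers fetch requests starting at
    the offset the consumer supplies, and the consumer advances its own
    position by however many records it received.
    """
    batch = partition[position:position + max_records]
    return batch, position + len(batch)

partition = ["e1", "e2", "e3", "e4", "e5"]
position = 0
while True:
    batch, position = poll(partition, position)
    if not batch:
        break  # caught up; a real consumer would simply keep polling
    print(batch)
```

The key consequence: a slow consumer only slows itself down. The broker keeps appending, and the consumer drains at whatever rate it can sustain.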

Page 26: Apache Kafka with Spark Streaming: Real-time Analytics Redefined


http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/

Putting it all together

Page 27: Apache Kafka with Spark Streaming: Real-time Analytics Redefined


Kafka + Spark = Real Time Analytics

Page 28: Apache Kafka with Spark Streaming: Real-time Analytics Redefined


Analytics Flow

Page 29: Apache Kafka with Spark Streaming: Real-time Analytics Redefined


Data Ingestion Source

Page 30: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Real-Time Analysis with Spark Streaming

Page 31: Apache Kafka with Spark Streaming: Real-time Analytics Redefined


Analytics Result Displayed/Stored

Page 32: Apache Kafka with Spark Streaming: Real-time Analytics Redefined


Streaming In Detail

Credit: http://spark.apache.org
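Spark Streaming's micro-batch model, where a DStream is a sequence of small batches cut at every batch interval, can be simulated in pure Python. This is not the Spark API; the function name, the sample events, and the 2-second interval are all illustrative:

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-size time windows.

    Mimics the micro-batch idea: instead of processing each event as it
    arrives, the stream is cut into small batches every `interval`
    seconds, and each batch is processed as a unit.
    """
    batches = {}
    for ts, value in events:
        batch_start = (ts // interval) * interval  # window this event falls in
        batches.setdefault(batch_start, []).append(value)
    return batches

# Page-view events tagged with arrival time in seconds.
events = [(0.5, "home"), (1.2, "home"), (2.1, "search"), (3.9, "home")]
batches = micro_batches(events, 2)
counts = {start: len(vals) for start, vals in batches.items()}
print(counts)  # {0.0: 2, 2.0: 2}
```

In real Spark Streaming each batch would be an RDD and the per-batch count would be a transformation on the DStream; the batching logic itself is what the sketch shows.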

Page 33: Apache Kafka with Spark Streaming: Real-time Analytics Redefined
Page 34: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Kafka adoption and use cases

• LinkedIn: activity streams, operational metrics, data bus
  – 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014
• Netflix: real-time monitoring and event processing
• Twitter: as part of their Storm real-time data pipelines
• Spotify: log delivery (from 4 h down to 10 s), Hadoop
• Loggly: log collection and processing
• Mozilla: telemetry data
• Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, …

https://cwiki.apache.org/confluence/display/KAFKA/Powered+By

Page 35: Apache Kafka with Spark Streaming: Real-time Analytics Redefined

Questions


Page 36: Apache Kafka with Spark Streaming: Real-time Analytics Redefined


Your feedback is vital for us, be it a compliment, a suggestion, or a complaint. It helps us make your experience better!

Please spare a few minutes to take the survey after the webinar.

Survey

