Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Date post: 16-Apr-2017
Upload: hortonworks
Transcript
Page 1: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Harnessing Data-in-Motion with Hortonworks DataFlow

Apache NiFi, Kafka and Storm Better Together

Bryan Bende, Sr. Software Engineer

Haimo Liu, Product Manager

Page 2: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Agenda

• Introduction to Hortonworks DataFlow

• Introduction to Apache projects

• Better together

• Best Practices

• Demo

Page 3: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Connected Data Platforms

Page 4: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

HDF 2.0 – Data in Motion Platform

[Diagram: platform layers are Stream Processing, Flow Management, and Enterprise Services (Registries/Catalogs, Governance (Security/Compliance), Operations), with Security and Visualization alongside. Deployment: at the edge, on premises, in the cloud.]

Page 5: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

HDF 2.0 – Data in Motion Platform

[Diagram: data in motion moves from IoT data sources through MiNiFi agents and NiFi (flow management), then through NiFi, Kafka, and Storm (flow management + stream processing), to data at rest in AWS, Azure, Google Cloud, Hadoop, and others. Enterprise Services: Ambari, Ranger, other services.]

Page 6: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Introduction to Apache Projects

Page 7: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

What is Apache NiFi?

• Created to address the challenges of global enterprise dataflow

• Key features:

– Visual Command and Control

– Data Lineage (Provenance)

– Data Prioritization

– Data Buffering/Back-Pressure

– Control Latency vs. Throughput

– Secure Control Plane / Data Plane

– Scale Out Clustering

– Extensibility

Page 8: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Apache NiFi

What is Apache NiFi used for?

• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
  – Conversion between formats
  – Extraction/Parsing
  – Routing decisions

What is Apache NiFi NOT used for?

• Distributed Computation
• Complex Event Processing
• Complex Rolling Window Operations

Page 9: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

NiFi Terminology

FlowFile

• Unit of data moving through the system
• Content + Attributes (key/value pairs)

Processor

• Performs the work, can access FlowFiles

Connection

• Links between processors
• Queues that can be dynamically prioritized
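As an illustration only, the three terms can be modeled in a few lines of Python. The `FlowFile` and `Connection` classes, the `uppercase_processor` function, and the `priority` attribute are hypothetical names for this sketch and are not NiFi's Java API; real prioritizers are configured on each connection.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FlowFile:
    """Unit of data: opaque content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Queue between two processors, ordered by a 'priority' attribute.

    Lower numbers are dequeued first; ties keep arrival order.
    """

    def __init__(self):
        self._heap = []
        self._arrival = count()

    def enqueue(self, flowfile):
        # Assumption: FlowFiles without a priority get a middle value of 5.
        priority = int(flowfile.attributes.get("priority", 5))
        heapq.heappush(self._heap, (priority, next(self._arrival), flowfile))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def uppercase_processor(flowfile):
    """A toy processor: transforms content and records what it did."""
    attrs = dict(flowfile.attributes, transformed="uppercase")
    return FlowFile(flowfile.content.upper(), attrs)


queue = Connection()
queue.enqueue(FlowFile(b"routine reading", {"priority": "9"}))
queue.enqueue(FlowFile(b"engine alert", {"priority": "1"}))

first = uppercase_processor(queue.dequeue())
print(first.content, first.attributes)
```

The alert is dequeued first even though it arrived second, which is the effect a prioritized connection is meant to have.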

Page 10: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

What is Apache Kafka?

• Distributed streaming platform that allows publishing and subscribing to streams of records

• Streams of records are organized into categories called topics

• Topics can be partitioned and/or replicated

• Records consist of a key, value, and timestamp

http://kafka.apache.org/intro

[Diagram: three producers publish to a Kafka cluster; three consumers subscribe to it.]

Page 11: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Kafka: Anatomy of a Topic

[Diagram: a topic with three partitions (Partition 0, Partition 1, Partition 2). Each partition is an ordered sequence of records numbered from 0 (old) upward (new); writes are appended at the new end.]

Partitioning allows topics to scale beyond a single machine/node

Topics can also be replicated, for high availability.
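A toy in-memory model can make the anatomy concrete. This is a sketch, not Kafka's implementation: the `Topic` class is hypothetical, replication is left out, and hashing the key with CRC32 is an assumption chosen for determinism (Kafka's own partitioner uses a different hash).

```python
import time
import zlib
from collections import namedtuple

Record = namedtuple("Record", ["key", "value", "timestamp"])


class Topic:
    """Toy topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Same key -> same partition, so per-key ordering is preserved.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a record and return (partition, offset)."""
        p = self.partition_for(key)
        log = self.partitions[p]
        log.append(Record(key, value, time.time()))
        return p, len(log) - 1  # offset = position within the partition


topic = Topic("truck-events", num_partitions=3)
placements = [topic.append(f"truck-{i % 4}", f"reading {i}") for i in range(12)]

for p, log in enumerate(topic.partitions):
    print(f"partition {p}: offsets 0..{len(log) - 1}" if log else f"partition {p}: empty")
```

Offsets are only meaningful within one partition, which is why ordering is guaranteed per partition rather than across the whole topic.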


Page 12: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

NiFi and Kafka Are Complementary

NiFi: Provide dataflow solution

• Centralized management, from edge to core
• Great traceability, event-level data provenance starting when data is born
• Interactive command and control – real-time operational visibility
• Dataflow management, including prioritization, back pressure, and edge intelligence
• Visual representation of global dataflow

Kafka: Provide durable stream store

• Low latency
• Distributed data durability
• Decentralized management of producers & consumers

Page 13: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

What is Apache Storm?

• Distributed, low-latency, fault-tolerant stream processing platform
• Provides processing guarantees
• Key concepts include:
  – Tuples
  – Streams
  – Spouts
  – Bolts
  – Topology

Page 14: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Storm - Tuples and Streams

• What is a Tuple?
  – Fundamental data structure in Storm
  – Named list of values that can be of any data type

• What is a Stream?
  – An unbounded sequence of tuples
  – The core abstraction in Storm; streams are what you “process”

Page 15: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Storm - Spouts

• What is a Spout?
  – Source of data
  – E.g.: JMS, Twitter, Log, Kafka Spout
  – Can spin up multiple instances of a Spout and dynamically adjust as needed

Page 16: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Storm - Bolts

• What is a Bolt?
  – Processes any number of input streams and produces output streams
  – Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic
  – Can spin up multiple instances of a Bolt and dynamically adjust as needed

Page 17: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Storm - Topology

• What is a Topology?
  – A network of spouts and bolts wired together into a workflow
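The relationship between tuples, streams, spouts, bolts, and a topology can be sketched with plain Python generators. This is an analogy only: it runs in a single process, uses none of Storm's API, and the function names and sample sentences are made up for the illustration.

```python
def sentence_spout():
    """Spout: the source of a stream of tuples."""
    for line in ["nifi moves data", "kafka stores data", "storm processes data"]:
        yield {"sentence": line}


def split_bolt(stream):
    """Bolt: consumes one stream and emits a new stream of word tuples."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def count_bolt(stream):
    """Bolt: aggregates the incoming stream into running counts."""
    counts = {}
    for tup in stream:
        counts[tup["word"]] = counts.get(tup["word"], 0) + 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}


# Topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
print(results[-1])
```

In Storm each spout and bolt would run as many parallel instances across a cluster; the generator chain only shows how the pieces are wired.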

Page 18: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

NiFi and Storm Are Complementary

NiFi: Simple event processing

• Manages flow of data between producers and consumers across the enterprise
• Data enrichment, splitting, aggregation, format conversion, schema translation…
• Scale out to handle gigabytes per second, or scale down to a Raspberry Pi handling tens of thousands of events per second

Storm: Complex and distributed processing

• Complex processing from multiple streams (JOIN operations)
• Analyzing data across time windows (rolling window aggregation, standard deviation, etc.)
• Scale out to thousands of nodes if needed

Page 19: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Better Together

Page 20: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Key Integration Points

• NiFi - Kafka
  – NiFi Kafka Producer
  – NiFi Kafka Consumer

• Storm - Kafka
  – Storm Kafka Consumer
  – Storm Kafka Producer

Page 21: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Key Integration Points – NiFi & Kafka

[Diagram: MiNiFi agents feed NiFi, which publishes to Kafka; Consumer 1, Consumer 2, … Consumer N read from Kafka.]

• Producer Processors
  – PutKafka (0.8 Kafka Client)
  – PublishKafka (0.9 Kafka Client)
  – PublishKafka_0_10 (0.10 Kafka Client)

Page 22: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Key Integration Points – NiFi & Kafka

[Diagram: Producer 1, Producer 2, … Producer N write to Kafka; NiFi consumes from Kafka and delivers to Destination 1, Destination 2, and Destination 3.]

• Consumer Processors
  – GetKafka (0.8 Kafka Client)
  – ConsumeKafka (0.9 Kafka Client)
  – ConsumeKafka_0_10 (0.10 Kafka Client)

Page 23: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Key Integration Points – Storm & Kafka

• storm-kafka module
  – KafkaSpout (Core & Trident) & KafkaBolt
  – Compatible with Kafka 0.8 and 0.9 client
  – Kafka client declared by topology developer

• storm-kafka-client module
  – KafkaSpout & KafkaSpoutTuplesBuilder
  – Compatible with Kafka 0.9 and 0.10 client
  – Kafka client declared by topology developer

[Diagram: in Storm, KafkaSpout reads from the Incoming Topic in Kafka and KafkaBolt writes to the Results Topic.]

Page 24: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Better Together

[Diagram: MiNiFi agents feed NiFi; NiFi's PublishKafka writes to the Incoming Topic in Kafka; Storm reads the Incoming Topic and writes to the Results Topic; NiFi's ConsumeKafka reads the Results Topic and delivers to destinations.]

• MiNiFi – Collection, filtering, and prioritization at the edge
• NiFi – Central data flow management, routing, enriching, and transformation
• Kafka – Central messaging bus for subscription by downstream consumers
• Storm – Streaming analytics focused on complex event processing

Page 25: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Best Practices

Page 26: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

NiFi PublishKafka

[Diagram: two NiFi nodes (Node 1, Node 2), each running PublishKafka with its concurrent tasks, publish to Topic 1 (Partition 1, Partition 2) in Apache Kafka.]

• Each NiFi node runs an instance of PublishKafka

• Each instance has one or more concurrent tasks (threads)

• Each concurrent task is an independent producer that sends data round-robin to the partitions of a topic
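The round-robin behavior can be pictured with a small simulation. This is a sketch only: `make_producer` is a hypothetical stand-in for one concurrent task, two tasks per node is an arbitrary choice, and starting every task at partition 0 is an assumption made for readability. No Kafka client is involved.

```python
from collections import Counter
from itertools import cycle


def make_producer(num_partitions):
    """Return a send() function that rotates through partitions."""
    rotation = cycle(range(num_partitions))

    def send(message):
        return next(rotation)  # partition chosen for this message

    return send


NUM_PARTITIONS = 2
# Two NiFi nodes, each running PublishKafka with two concurrent tasks.
tasks = {
    (node, task): make_producer(NUM_PARTITIONS)
    for node in ("node-1", "node-2")
    for task in (1, 2)
}

load = Counter()
for send in tasks.values():
    for i in range(5):  # each task publishes five messages
        load[send(f"message {i}")] += 1

print(dict(load))
```

Because every task rotates independently, the spread across partitions is roughly even rather than exact: here partition 0 receives 12 of the 20 messages and partition 1 receives 8.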


Page 27: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

NiFi ConsumeKafka – Nodes = Partitions

[Diagram: two NiFi nodes (Node 1, Node 2), each running ConsumeKafka (consumer group 1) with its concurrent tasks, consume from Topic 1 (Partition 1, Partition 2) in Apache Kafka.]

• Each NiFi node runs an instance of ConsumeKafka

• Each instance has one or more concurrent tasks (threads)

• Each concurrent task is a consumer assigned to a single partition

• Kafka Client ensures a given partition can only have one consumer/thread in a consumer group


Page 28: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

NiFi ConsumeKafka – Nodes > Partitions

[Diagram: three NiFi nodes (Node 1, Node 2, Node 3), each running ConsumeKafka (consumer group 1) with its concurrent tasks, and Topic 1 (Partition 1, Partition 2) in Apache Kafka.]

• Remember… each partition can only have one consumer from the same group

• When there are more NiFi nodes than partitions, some nodes won’t consume anything


Page 29: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

NiFi ConsumeKafka – Nodes < Partitions

[Diagram: two NiFi nodes (Node 1, Node 2), each running ConsumeKafka (consumer group 1) with its concurrent tasks, consume from Topic 1 (Partitions 1 to 4) in Apache Kafka.]

• When there are fewer NiFi nodes/tasks than partitions, multiple partitions will be assigned to each node/task

Page 30: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

NiFi ConsumeKafka – Tasks = Partitions

[Diagram: two NiFi nodes (Node 1, Node 2), each running ConsumeKafka (consumer group 1) with its concurrent tasks, consume from Topic 1 (Partitions 1 to 4) in Apache Kafka.]

• When there are fewer NiFi nodes than partitions, we can increase the concurrent tasks on each node

• Kafka Client will automatically rebalance partition assignment

• Improves throughput

Page 31: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

NiFi ConsumeKafka – Tasks > Partitions

[Diagram: two NiFi nodes (Node 1, Node 2), each running ConsumeKafka (consumer group 1) with its concurrent tasks, and Topic 1 (Partition 1, Partition 2) in Apache Kafka.]

• Increasing concurrent tasks only makes sense when the number of partitions is greater than the number of nodes

• Otherwise we end up with some tasks not consuming anything
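The node, task, and partition cases from the preceding slides can be explored with a short simulation. It is a sketch under stated assumptions: `assign` is a hypothetical helper that deals partitions out in contiguous ranges and numbers them from 0; the real assignment is made by the Kafka client and its configured assignor, and may differ in detail.

```python
def assign(partitions, nodes, tasks_per_node):
    """Deal partitions to consumer tasks; each partition gets exactly one."""
    consumers = [f"node{n}-task{t}"
                 for t in range(1, tasks_per_node + 1)
                 for n in range(1, nodes + 1)]
    base, extra = divmod(partitions, len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        size = base + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + size))
        start += size
    return assignment


def idle(assignment):
    """Consumers that were given no partition."""
    return [c for c, parts in assignment.items() if not parts]


scenarios = {
    "nodes = partitions": assign(partitions=2, nodes=2, tasks_per_node=1),
    "nodes > partitions": assign(partitions=2, nodes=3, tasks_per_node=1),
    "nodes < partitions": assign(partitions=4, nodes=2, tasks_per_node=1),
    "tasks = partitions": assign(partitions=4, nodes=2, tasks_per_node=2),
    "tasks > partitions": assign(partitions=2, nodes=2, tasks_per_node=2),
}

for name, assignment in scenarios.items():
    print(f"{name}: {assignment} idle={idle(assignment)}")
```

The idle lists show the cost of over-provisioning: a third node or a second task per node consumes nothing when the topic has only two partitions.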


Page 32: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Kafka Processors & Batching Messages

• PublishKafka - ‘Message Demarcator’
  – If not specified, flow file content is sent as a single message
  – If specified, flow file content is separated into multiple messages based on the demarcator
  – Ex: Sending 1 million messages to Kafka – significantly better performance with 1 flow file containing 1 million demarcated messages vs. 1 million flow files with a single message each

• ConsumeKafka - ‘Message Demarcator’
  – If not specified, a flow file is produced for each message consumed
  – If specified, multiple messages are written to a single flow file, separated by the demarcator
  – Maximum # of messages written to a single flow file equals ‘Max Poll Records’
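The effect of the demarcator on both sides can be sketched in plain Python. The `publish` and `consume` functions are hypothetical stand-ins for the two processors, not NiFi code; dropping empty segments after a trailing demarcator is an assumption of this sketch, and the `max_poll_records` value of 2 is chosen only to keep the example small.

```python
def publish(flowfile_content, demarcator=None):
    """Split one flow file's content into the messages to send."""
    if demarcator is None:
        return [flowfile_content]  # whole content is one message
    # Assumption: empty segments (e.g. after a trailing newline) are skipped.
    return [m for m in flowfile_content.split(demarcator) if m]


def consume(messages, max_poll_records, demarcator=None):
    """Group consumed messages into flow file contents."""
    if demarcator is None:
        return list(messages)  # one flow file per message
    batches = []
    for i in range(0, len(messages), max_poll_records):
        batches.append(demarcator.join(messages[i:i + max_poll_records]))
    return batches


content = b"speed=70\nspeed=68\nspeed=72\n"
sent = publish(content, demarcator=b"\n")
flowfiles = consume(sent, max_poll_records=2, demarcator=b"\n")
print(sent)
print(flowfiles)
```

Three messages arrive as two flow files instead of three, which is where the performance gain comes from: fewer flow files to track for the same number of Kafka messages.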

Page 33: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Best Practice Summary

• PublishKafka
  – Each concurrent task is an independent producer
  – Scale the number of concurrent tasks according to data flow

• ConsumeKafka
  – Kafka client assigns one thread per partition within a consumer group
  – Create optimal alignment between # of partitions and # of consumer tasks
  – Avoid having more tasks than partitions

• Batching
  – Message Demarcator property on PublishKafka and ConsumeKafka
  – Can achieve significantly better performance

Page 34: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Demo!

Page 35: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Summary of the Demo Scenario

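[Diagram: truck sensors feed MiNiFi, which feeds NiFi; NiFi's PublishKafka writes Speed Events to Kafka; Storm reads them and writes Average Speed back to Kafka; NiFi's ConsumeKafka reads the results and sends the windowed average speed to a dashboard.]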

• MiNiFi – Collects data from truck sensors
• NiFi – Filters/enriches truck data, delivers it to Kafka, consumes results
• Kafka – Central messaging bus that Storm consumes from and publishes to
• Storm – Computes average speed over a time window per driver & route
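The Storm step, an average speed per driver and route over a time window, can be sketched without Storm. The deck does not state the window length or type, so the 10-second tumbling window below is an assumption, as are the `windowed_average` function name and the sample speeds.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def windowed_average(events, window=timedelta(seconds=10)):
    """Average speed per (window start, driver, route), tumbling windows."""
    epoch = datetime(1970, 1, 1)
    sums = defaultdict(lambda: [0, 0])  # key -> [total speed, event count]
    for timestamp, driver, route, speed in events:
        # Round the timestamp down to the start of its window.
        elapsed = timestamp - epoch
        window_start = timestamp - (elapsed % window)
        bucket = sums[(window_start, driver, route)]
        bucket[0] += speed
        bucket[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


t0 = datetime(2016, 11, 7, 10, 34, 50)
driver, route = "George Vetticaden", "Saint Louis to Tulsa"
events = [
    (t0 + timedelta(seconds=2), driver, route, 70),
    (t0 + timedelta(seconds=5), driver, route, 80),
    (t0 + timedelta(seconds=12), driver, route, 90),
]

averages = windowed_average(events)
for (start, who, where), avg in sorted(averages.items()):
    print(start, who, where, avg)
```

The first two events share the 10:34:50 window and average to 75.0; the third falls in the 10:35:00 window on its own.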


Page 36: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Demo – Data Generator

Geo Event

2016-11-07 10:34:52.922|truck_geo_event|73|10|George Vetticaden|1390372503|Saint Louis to Tulsa|Normal|38.14|-91.3|1|

Speed Event

2016-11-07 10:34:52.922|truck_speed_event|73|10|George Vetticaden|1390372503|Saint Louis to Tulsa|70|
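A minimal parser for these pipe-delimited lines might look like the following. The deck does not name the fields, so every field name here (including which of the two numeric columns is the truck and which is the driver) is a guess inferred from the sample values, and `parse_event` is a hypothetical helper.

```python
from datetime import datetime

# Field names are guesses inferred from the sample values, not taken
# from the original deck.
COMMON = ["timestamp", "event_source", "truck_id", "driver_id",
          "driver_name", "route_id", "route_name"]
EXTRA = {
    "truck_geo_event": ["event_type", "latitude", "longitude", "correlation_id"],
    "truck_speed_event": ["speed"],
}


def parse_event(line):
    """Split one pipe-delimited line into a dict of named fields."""
    values = line.strip().rstrip("|").split("|")
    names = COMMON + EXTRA[values[1]]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    event = dict(zip(names, values))
    event["timestamp"] = datetime.strptime(event["timestamp"],
                                           "%Y-%m-%d %H:%M:%S.%f")
    if "speed" in event:
        event["speed"] = int(event["speed"])
    return event


geo = parse_event("2016-11-07 10:34:52.922|truck_geo_event|73|10|George Vetticaden"
                  "|1390372503|Saint Louis to Tulsa|Normal|38.14|-91.3|1|")
speed = parse_event("2016-11-07 10:34:52.922|truck_speed_event|73|10|George Vetticaden"
                    "|1390372503|Saint Louis to Tulsa|70|")
print(geo["driver_name"], geo["latitude"], geo["longitude"])
print(speed["route_name"], speed["speed"])
```

Both event types share the first seven columns, so a flow can route on the second column before deciding how to interpret the rest.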

Page 37: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Demo – MiNiFi

Processors:
- name: TailFile
  class: org.apache.nifi.processors.standard.TailFile
  ...
  Properties:
    File Location: Local
    File to Tail: /tmp/truck-sensor-data/truck-1.txt
    ...
Connections:
- name: TailFile/success/2042214b-0158-1000-353d-654ef72c7307
  source name: TailFile
  ...
Remote Processing Groups:
- name: http://localhost:9090/nifi
  url: http://localhost:9090/nifi
  ...
  Input Ports:
  - id: 2042214b-0158-1000-353d-654ef72c7307
    name: Truck Events
    ...

Page 38: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Demo - NiFi

Page 39: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Demo - Storm

Page 40: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Demo - Dashboard

Page 41: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

Questions?

Hortonworks Community Connection: Data Ingestion and Streaming

https://community.hortonworks.com/

Page 42: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

HDF Kafka Processor Compatibility

Kerberized interaction w/ Kafka        GetKafka         PutKafka
Kafka broker 0.8 (HDP 2.3.2)           Supported        Supported
Kafka broker 0.9 (HDP 2.3.4 +)         Supported        Supported
Kafka broker 0.8 (Apache)              N/A              N/A
Kafka broker 0.9 (Apache)              Not Supported    Not Supported

Non-Kerberized interaction w/ Kafka    GetKafka         PutKafka
Kafka broker 0.8 (HDP 2.3.2)           Supported        Supported
Kafka broker 0.9 (HDP 2.3.4 +)         Supported        Supported
Kafka broker 0.8 (Apache)              Supported        Supported
Kafka broker 0.9 (Apache)              Supported        Supported

SSL interaction w/ Kafka               GetKafka         PutKafka
Kafka broker 0.8 (HDP 2.3.2)           N/A              N/A
Kafka broker 0.9 (HDP 2.3.4 +)         Not Supported    Not Supported
Kafka broker 0.8 (Apache)              N/A              N/A
Kafka broker 0.9 (Apache)              Not Supported    Not Supported

Page 43: Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

HDF Kafka Processor Compatibility

Kerberized interaction w/ Kafka         ConsumeKafka (2 sets)    PublishKafka (2 sets)
Kafka broker 0.8 (HDP 2.3.2)            Not Supported            Not Supported
Kafka broker 0.9/0.10 (HDP 2.3.4 +)     Supported                Supported
Kafka broker 0.8 (Apache)               N/A                      N/A
Kafka broker 0.9/0.10 (Apache)          Supported                Supported

Non-Kerberized interaction w/ Kafka     ConsumeKafka (2 sets)    PublishKafka (2 sets)
Kafka broker 0.8 (HDP 2.3.2)            Not Supported            Not Supported
Kafka broker 0.9/0.10 (HDP 2.3.4 +)     Supported                Supported
Kafka broker 0.8 (Apache)               Not Supported            Not Supported
Kafka broker 0.9/0.10 (Apache)          Supported                Supported

SSL interaction w/ Kafka                ConsumeKafka (2 sets)    PublishKafka (2 sets)
Kafka broker 0.8 (HDP 2.3.2)            N/A                      N/A
Kafka broker 0.9/0.10 (HDP 2.3.4 +)     Supported                Supported
Kafka broker 0.8 (Apache)               N/A                      N/A
Kafka broker 0.9/0.10 (Apache)          Supported                Supported

