Streaming Data and Stream Processing with Apache Kafka

Post on 22-Jan-2018

Streaming Data and Stream Processing with Apache Kafka™

David Tucker, Director of Partner Engineering, Confluent

Sid Goel, Partner and Solution Architect, KPI Partners

The opportunity: The shift to streams & digital transformation

By 2020, 70% of organizations will adopt data streaming to enable real-time analytics.

- Gartner | Nov 2016

Streaming ingestion and analytics will become a must-have for digital winners.

- Forrester | Nov. 2015

More Facts & Figures

90% of CEOs believe the digital economy will have a major impact on their industry.

- MIT Sloan / Capgemini (2013)

#1 most important capability executives hope to improve via digital transformation: Ability to support real-time transactions.

- The Economist (2015)

Digital disruptors will displace 40% of incumbent companies over the next 5 years.

- Center for Digital Transformation (2015)

Vision of a Streaming Enterprise

Diagram: a streaming platform at the center, connecting Search, NewSQL / NoSQL, RDBMS, Monitoring, Document Store, Real-time Analytics, Data Warehouse, Mobile Apps, Legacy Apps, and Hadoop.

What Can You Do with a Streaming Platform?

• Publish and Subscribe to streams of data

• Analogous to traditional messaging systems

• Store streams of data

• Consumers can look back in time

• Process streams of data

• Analyze and correlate events in real time
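The three capabilities above can be illustrated with a toy in-memory log. This is only a sketch in plain Python, not Kafka's API: the class name TinyLog, the topic name "payments", and the 500 threshold are all made up for illustration.

```python
from collections import defaultdict


class TinyLog:
    """Toy append-only log: one list of records per topic (not Kafka itself)."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, record):
        """Append a record and return its offset (its position in the log)."""
        log = self._topics[topic]
        log.append(record)
        return len(log) - 1

    def read(self, topic, from_offset=0):
        """Return every stored record at or after from_offset."""
        return list(self._topics[topic][from_offset:])


log = TinyLog()
for amount in (20, 950, 35):
    log.publish("payments", {"amount": amount})

# Store: a consumer that starts late can still look back to offset 0.
history = log.read("payments", from_offset=0)

# Process: flag large payments as they are read (hypothetical threshold).
flagged = [r for r in log.read("payments") if r["amount"] > 500]
```

Because records are kept by offset rather than deleted on delivery, any number of consumers can read the same stream independently, each from its own position.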

The typical architecture

Diagram: data sources (User Tracking, Operational Logs, Operational Metrics) connect point to point to Search, Security, Fraud Detection, Application, Data Warehouse, and Monitoring; each app has its own databases, storage, and interfaces.

Challenges abound

Diagram: the same point-to-point architecture, now including Hadoop, annotated with these challenges:

• Diverse data sets, arriving at an increasing rate
• Many complex data pipelines
• Require a separate cluster for real-time
• Difficult & time consuming to change
• Require mission critical availability into most recent/relevant data
• Difficult to handle massive amounts of data

Modernized architecture using Apache Kafka

Diagram: User Tracking, Operational Logs, and Operational Metrics flow through Apache Kafka to Search, Security, Fraud Detection, Application, Monitoring, and Data Warehouse, with apps built on the Streams API.

• Pub/sub to data streams, alleviate back pressure
• Lightweight, easy to modify with minimal disruption
• Decoupled from upstream apps creating agility
• Real-time, context specific data in the moment
• Handle any volume of data with ease
• Scale to meet demands of diverse streams

Our vision: from big data to stream data

• Big Data was: The More the Better (chart: value of data vs. volume of data)
• Stream Data is: The Faster the Better (chart: value of data vs. age of data)
• Stream Data can be: Big or Fast (Lambda)
• Stream Data will be: Big AND Fast (Kappa)

Diagram labels: Streams, Hadoop, Job 1, Job 2, Speed Table, Batch Table, Table 1, Table 2, DB.

Apache Kafka is the Enabling Technology of this Transition

Kafka Adoption in Large Enterprises Growing Rapidly

• Travel: 6 of top 10
• Global Banks: 7 of top 10
• Insurance: 8 of top 10
• Telecom: 9 of top 10

Over 35% of the Fortune 500 are using Apache Kafka™

Industries & Use Cases

Universal Use Cases: IoT, Data Pipelines, Microservices, Monitoring

Industry Use Cases

• Financial Services: Fraud Detection, Trade Data Capture, Customer 360
• Retail: Inventory Management, Product Catalog, A/B Testing, Proactive Alerts
• Automotive: Connected Car, Manufacturing Data Processing
• Enterprise Tech: Analytics, Security Operations, Collect Performance Data
• Telecom: Personalized Ad Placement, Customer 360, Network Integrity Systems
• Entertainment/Media: Log Delivery, Increase Ad Delivery Operations, Cross-Device Insights
• Travel/Leisure: Visitor Segmentation, Fraud Detection
• Consumer Tech: Streaming Video, Personalized Customer Experience, Device Telemetry and Analytics
• Healthcare: Patient Monitoring, Pharma Substance control, Patient Relapse, Lab Results Alerts

Kafka Adoption Across Key Companies

Company logos grouped by industry: Financial Services, Enterprise Tech, Consumer Tech, Entertainment & Media, Telecom, Retail, Travel & Leisure.

Confluent Enterprise

The only enterprise streaming platform based entirely on Apache Kafka™

Confluent Platform: Enterprise Streaming based on Apache Kafka™

Diagram: the Confluent Platform sits between data sources (database changes, log events, IoT data, web events, …) and systems such as CRM, data warehouse, database, Hadoop, data integration, monitoring, analytics, custom apps, transformations, and real-time applications.

Platform components: Apache Kafka™, Clients, Connectors, Data Compatibility, Monitoring & Administration, Operations.

Legend: Apache Open Source, Confluent Open Source, Confluent Enterprise.

Complete, Open, Trusted, Enterprise Grade

Confluent Completes Kafka

Feature: Benefit (compared across Apache Kafka, Confluent Open Source, and Confluent Enterprise)

• Apache Kafka: High throughput, low latency, high availability, secure distributed streaming platform
• Kafka Connect API: Advanced API for connecting external sources/destinations into Kafka
• Kafka Streams API: Simple library that enables streaming application development within the Kafka framework
• Additional Clients: Supports non-Java clients: C, C++, Python, etc.
• REST Proxy: Provides universal access to Kafka from any network connected device via HTTP
• Schema Registry: Central registry for the format of Kafka data – guarantees all data is always consumable
• Pre-Built Connectors: HDFS, JDBC, Elasticsearch and other connectors fully certified and fully supported by Confluent
• Confluent Control Center: Enables easy connector management and stream monitoring
• Auto Data Balancing: Rebalancing data across cluster to remove bottlenecks
• Replication: Multi-datacenter replication simplifies and automates MDC Kafka clusters
• Support: Enterprise class support to keep your Kafka environment running at top performance (Apache Kafka: Community; Confluent Open Source: Community; Confluent Enterprise: 24x7x365)

How do I get streams of data into and out of my apps?

Connect, Clients, REST

Apache Kafka™ Connect – Streaming Data Capture

Diagram: sources and sinks (JDBC, IRC / Twitter, CDC, Elastic, NoSQL, HDFS) each attach through a connector to the Kafka Connect API, which links them to the Kafka pipeline.

• Fault tolerant
• Manage hundreds of data sources and sinks
• Preserves data schema
• Part of Apache Kafka project
• Integrated within Confluent Platform’s Control Center

Kafka Connect API, Part of the Apache Kafka™ Project

Connect any source to any target system with Apache Kafka

Integrated

• 100% compatible with Kafka v0.9 and higher

• Integrated with Confluent’s Schema Registry

• Easy to manage with Confluent Control Center

Flexible

• 40+ open source connectors available

• Easy to develop additional connectors

• Flexible support for data types and formats

Compatible

• Maintains critical metadata

• Preserves schema information

• Supports schema evolution

Reliable

• Automated failover

• At-least-once guaranteed

• Balances workload between nodes

Kafka Connect API Library of Connectors

* Denotes Connectors developed at Confluent and distributed by Confluent. Extensive validation and testing have been performed.

Connector categories: Databases, Analytics, Applications / Other, Datastore/File Store.

New in Kafka 0.10.2: Single Message Transforms for Kafka Connect

Modify events before storing in Kafka:

• Mask sensitive information

• Add identifiers

• Tag events

• Store lineage

• Remove unnecessary columns

Modify events going out of Kafka:

• Route high priority events to faster data stores

• Direct events to different Elasticsearch indexes

• Cast data types to match destination

• Remove unnecessary columns
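The idea behind single message transforms (small functions applied in order to one record at a time) can be sketched in plain Python. This is not the Kafka Connect API: the helper names, field names, and the "orders-db" tag below are hypothetical and chosen only to mirror the bullets above.

```python
def mask_field(field):
    """Return a transform that blanks out one sensitive field."""
    def transform(record):
        masked = dict(record)
        if field in masked:
            masked[field] = "****"
        return masked
    return transform


def add_field(name, value):
    """Return a transform that tags every record with a static field."""
    def transform(record):
        return {**record, name: value}
    return transform


def drop_fields(*names):
    """Return a transform that removes unnecessary columns."""
    def transform(record):
        return {k: v for k, v in record.items() if k not in names}
    return transform


def apply_chain(record, transforms):
    """Apply each transform in order to a single message."""
    for transform in transforms:
        record = transform(record)
    return record


chain = [mask_field("card_number"), add_field("source", "orders-db"), drop_fields("debug")]
event = {"order_id": 7, "card_number": "4111111111111111", "debug": "x"}
result = apply_chain(event, chain)
```

Each transform returns a new record instead of mutating its input, so the original event is left untouched and the chain can be reordered freely.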

Kafka Clients

Ruby Proxy http/REST

Stdin/stdout

Apache Kafka Native Clients

Confluent Native Clients

Community Supported Clients

REST Proxy: Talking to Non-native Kafka Apps and Outside the Firewall

Diagram: non-Java applications reach Kafka through the REST Proxy over REST / HTTP, alongside native Kafka Java applications and the Schema Registry.

• Provides a RESTful interface to a Kafka cluster
• Simplifies message creation and consumption
• Simplifies administrative actions
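As a rough illustration of producing a message over HTTP, the sketch below builds (but does not send) a request using only Python's standard library. It assumes the REST Proxy's v2 JSON embedded format; the proxy address http://localhost:8082 and the topic name "page-views" are placeholders, not values from the slides.

```python
import json
import urllib.request


def build_produce_request(base_url, topic, values):
    """Build (but do not send) a POST that produces JSON records to a topic."""
    body = json.dumps({"records": [{"value": v} for v in values]}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/topics/{topic}",
        data=body,
        # Assumed content type: the proxy's v2 JSON embedded format.
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


# Hypothetical proxy address and topic name.
request = build_produce_request("http://localhost:8082", "page-views", [{"user": "alice"}])
# To send it against a running proxy: urllib.request.urlopen(request)
```

Since the request is ordinary HTTP with a JSON body, any language with an HTTP client can produce to Kafka this way without a native Kafka library.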

How do I maintain my data formats and ensure compatibility?

The Challenge of Data Compatibility at Scale

Diagram: App 1, App 2, and App 3 publish into a shared pipeline, and one incompatibly formatted message is highlighted.

• Many sources without a policy cause mayhem in a centralized data pipeline
• Ensuring downstream systems can use the data is key to an operational stream pipeline
• Example: date formats. Even within a single application, different formats can be presented

Schema Registry

Diagram: App 1 and App 2 each publish through a serializer to a Kafka topic, with the Schema Registry alongside; warning markers (!) flag a problem message. Example consumers: Elastic, Cassandra, HDFS.

• Define the expected fields for each Kafka topic
• Automatically handle schema changes (e.g. new fields)
• Prevent backwards incompatible changes
• Supports multi-datacenter environments
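To make "prevent backwards incompatible changes" concrete, here is a deliberately simplified check in plain Python. It is a sketch of one common rule (a newly added field needs a default so that old data stays readable), not the Schema Registry's actual algorithm; the schema layout and field names are assumptions for illustration.

```python
def is_backward_compatible(old_schema, new_schema):
    """True if a reader using new_schema can read data written with old_schema.

    Schemas are plain dicts: field name -> {"type": ..., optional "default": ...}.
    """
    for name, spec in new_schema.items():
        if name not in old_schema:
            # A field that old data lacks must come with a default value.
            if "default" not in spec:
                return False
        elif old_schema[name]["type"] != spec["type"]:
            # Changing the type of an existing field is rejected in this sketch.
            return False
    return True


v1 = {"user_id": {"type": "long"}, "created": {"type": "string"}}
v2_ok = {**v1, "country": {"type": "string", "default": "unknown"}}
v2_bad = {**v1, "country": {"type": "string"}}

accepted = is_backward_compatible(v1, v2_ok)
rejected = not is_backward_compatible(v1, v2_bad)
```

A registry that runs a check like this before accepting a new schema version is what lets new fields be added without breaking existing consumers.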

How do I build stream processing apps?

Kafka Streams API: the Easiest Way to Process Data in Apache Kafka™

Example Use Cases

• Microservices

• Large-scale continuous queries and transformations

• Event-triggered processes

• Reactive applications

• Customer 360-degree view, fraud detection, location-based marketing, smart electrical grids, fleet management, …

Key Benefits of Apache Kafka’s Streams API

• Build Apps, Not Clusters: no additional cluster required

• Elastic, highly-performant, distributed, fault-tolerant, secure

• Equally viable for small, medium, and large-scale use cases

• “Run Everywhere”: integrates with your existing deployment strategies such as containers, automation, cloud

Diagram: your app with the Kafka Streams API embedded as a library.

Architecture Example

Before: Complexity for development and operations, heavy footprint

1. Capture business events in Kafka
2. Must process events with separate, special-purpose clusters (your processing job)
3. Write results back to Kafka

Architecture Example

With Kafka Streams: App-centric architecture that blends well into your existing infrastructure

1. Capture business events in Kafka
2. Process events fast, reliably, securely with standard Java applications (your app, using the Kafka Streams API)
3a. Write results back to Kafka
3b. Query latest results directly from external apps

New in Kafka 0.10.2: Session windows in Kafka Streams API

Group events in a stream based on session windows

• Sessions are periods of activity terminated by a gap of inactivity

• Purely time-based windows are incorrect for session-based data analysis

Diagram: input events for three users (Alice, Bob, Dave; colors represent different users' events), plotted by processing-time and event-time. After session windowing, the results are user sessions grouped by event-time session windows.
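The session rule (a session ends after a gap of inactivity, measured in event time) can be sketched in a few lines of plain Python. This is an illustration of the windowing idea only, not the Kafka Streams API; the function name, the integer timestamps, and the gap of 5 are assumptions.

```python
from collections import defaultdict


def session_windows(events, gap):
    """Group (user, event_time) pairs into per-user sessions.

    A new session starts whenever the time since the user's previous
    event exceeds `gap`. Returns {user: [(start, end, count), ...]}.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, times in by_user.items():
        times.sort()  # event-time order, even if events arrived out of order
        result = []
        start = end = times[0]
        count = 1
        for ts in times[1:]:
            if ts - end > gap:
                # Inactivity gap exceeded: close the current session.
                result.append((start, end, count))
                start, count = ts, 0
            end = ts
            count += 1
        result.append((start, end, count))
        sessions[user] = result
    return sessions


# Alice's event at time 2 arrives late (after her event at time 20).
events = [("alice", 1), ("bob", 2), ("alice", 3), ("alice", 20), ("bob", 4), ("alice", 2)]
sessions = session_windows(events, gap=5)
```

Here Alice ends up with two sessions (times 1 to 3, then 20) and Bob with one, regardless of arrival order. A fixed time-based window would instead cut sessions at arbitrary boundaries.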

How do I synchronize and migrate data to and from the cloud?

Before: Hybrid Cloud Environments Today

Diagram: point-to-point links between the datacenter (DC1) and AWS. Labels: DB1, DB2, DB3, KV, KV2, KV3, DWH (shown twice), App1, App2, App3, App4, App5, App7, App8, App1-v2, App2-v2.

Challenges

• Each team/department must execute its own cloud migration
• May be moving the same data multiple times
• Each box represented here requires development, testing, deployment, monitoring and maintenance

After: Cloud Synchronization and Migrations with Confluent Platform

Diagram: DC1 and AWS, now connected through Kafka. Labels: DB1, DB2, DB3, KV, KV2, KV3, DWH (shown twice), App1, App2, App3, App4, App5, App7, App8, App1-v2, App2-v2.

Benefits

• Continuous low-latency synchronization
• Centralized manageability and monitoring
  – Track at event level data produced in all data centers
• Security and governance
  – Track and control where data comes from and who is accessing it
• Cost savings
  – Move data once

How do I manage and monitor my streaming platform at scale?

What Does End-to-End Mean?

“Clocks and Cables” Monitoring

How fast is the throughput?

How many CPU cycles are we using?

End-to-End Monitoring

Did you leave?

Did you arrive?
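The "did you leave, did you arrive" question amounts to comparing what producers sent with what consumers received. A minimal sketch in plain Python follows; it is not how Confluent Control Center is implemented, and the function name, timestamps, and bucket size are all hypothetical.

```python
from collections import Counter


def delivery_report(produced, consumed, bucket_size):
    """Compare produced vs. consumed message counts per time bucket.

    `produced` and `consumed` are lists of message timestamps (the time
    each message was produced). Returns {bucket_start: (sent, received)}
    for every bucket where the two counts differ.
    """
    sent = Counter(ts // bucket_size * bucket_size for ts in produced)
    received = Counter(ts // bucket_size * bucket_size for ts in consumed)
    return {
        bucket: (sent[bucket], received[bucket])
        for bucket in sorted(set(sent) | set(received))
        if sent[bucket] != received[bucket]
    }


produced = [1, 4, 12, 15, 18, 27]
consumed = [1, 4, 12, 18, 27]  # the message stamped 15 never arrived
gaps = delivery_report(produced, consumed, bucket_size=10)
```

Throughput and CPU metrics would look healthy in this scenario; only the per-bucket comparison reveals that one message in the 10 to 19 bucket went missing.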

Confluent Control Center: Cluster Health & Administration

Cluster health dashboard

• Monitor the health of your Kafka clusters and get alerts if any problems occur

• Measure system load, performance, and operations

• View aggregate statistics or drill down by broker or topic

Cluster administration

• Monitor topic configurations

Confluent Control Center: End-to-end Monitoring

See exactly where your messages are going in your Kafka cluster

Confluent Control Center: Connector Management

Confluent Control Center: Alerting

Alerts

• Configure alerts on incomplete data delivery, high latency, Kafka connector status, and more
• Manage alerts for different users and applications from a web UI

User authentication

• Control access to Confluent Control Center
• Integrates with existing enterprise authentication systems

Auto Data Balancing

Dynamically move partitions to optimize resource utilization and reliability

• Easily add and remove nodes from your Kafka cluster

• Rack-aware algorithm rebalances partitions across a cluster

• Traffic from balancer is throttled when data transfer occurs

Diagram: cluster before and after a rebalance.
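The core idea of moving partitions to even out load can be sketched with a simple greedy loop in plain Python. This is not Confluent's balancing algorithm: it only balances partition counts and ignores rack awareness and throttling, and the broker and partition names are made up.

```python
def rebalance(assignment):
    """Even out partition counts across brokers.

    `assignment` maps broker -> list of partitions. Returns a new mapping
    plus the list of (partition, from_broker, to_broker) moves.
    """
    result = {broker: list(parts) for broker, parts in assignment.items()}
    moves = []
    while True:
        busiest = max(result, key=lambda b: len(result[b]))
        idlest = min(result, key=lambda b: len(result[b]))
        if len(result[busiest]) - len(result[idlest]) <= 1:
            return result, moves
        # Move one partition from the fullest broker to the emptiest one.
        partition = result[busiest].pop()
        result[idlest].append(partition)
        moves.append((partition, busiest, idlest))


# broker-3 was just added to the cluster and holds nothing yet.
before = {"broker-1": ["p0", "p1", "p2"], "broker-2": ["p3", "p4", "p5"], "broker-3": []}
after, moves = rebalance(before)
```

Returning the list of moves separately reflects that each move is a data transfer, which is the traffic a real balancer would need to throttle.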

Multi-Datacenter Replication

An easy, reliable way to run Kafka across datacenters

Improve reliability

• Easily configure & maintain cross-cluster replication

Simplify management

• Centralized configuration and monitoring

• Replicate entire cluster or a subset of topics

• Automatic replication of topic configuration

• Use Kafka’s SASL for Kerberos, Active Directory

• SSL encryption between datacenters

Get Started with Apache Kafka Today!

https://www.confluent.io/downloads/

THE place to start with Apache Kafka!

• Thoroughly tested and quality assured
• More extensible developer experience
• Easy upgrade path to Confluent Enterprise

Thank You