Date posted: 22-Jan-2018
Category: Data & Analytics
Uploaded by: confluent
Streaming Data and Stream Processing with Apache Kafka™
David Tucker, Director of Partner Engineering, Confluent
Sid Goel, Partner and Solution Architect, KPI Partners
The opportunity: The shift to streams & digital transformation
By 2020, 70% of organizations will adopt data streaming to enable real-time analytics.
- Gartner | Nov 2016
Streaming ingestion and analytics will become a must-have for digital winners.
- Forrester | Nov 2015
More Facts & Figures
90% of CEOs believe the digital economy will have a major impact on their industry.
- MIT Sloan / Capgemini (2013)
#1 most important capability executives hope to improve via digital transformation: Ability to support real-time transactions.
- The Economist (2015)
Digital disruptors will displace 40% of incumbent companies over the next 5 years.
- Center for Digital Transformation (2015)
Vision of a Streaming Enterprise
[Diagram: a Streaming Platform at the center, connected to Search, NewSQL / NoSQL, RDBMS, Monitoring, Document Store, Real-time Analytics, Data Warehouse, Mobile Apps, Legacy Apps, and Hadoop]
What Can You Do with a Streaming Platform?
• Publish and subscribe to streams of data
  – Analogous to traditional messaging systems
• Store streams of data
  – Consumers can look back in time
• Process streams of data
  – Analyze and correlate events in real time
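The three capabilities above can be illustrated with a toy in-memory log. This is only a sketch in plain Python, not Kafka code: the `TopicLog` class, its method names, and the sample payment events are all hypothetical and stand in for a single-partition topic.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log standing in for a single-partition topic (hypothetical)."""

    def __init__(self):
        self._records = []                # stored stream: reading never deletes data
        self._offsets = defaultdict(int)  # next offset to read, tracked per subscriber

    def publish(self, event):
        """Append an event and return the offset it was stored at."""
        self._records.append(event)
        return len(self._records) - 1

    def poll(self, subscriber):
        """Return the events this subscriber has not seen yet and advance its offset."""
        start = self._offsets[subscriber]
        batch = self._records[start:]
        self._offsets[subscriber] = len(self._records)
        return batch

    def seek(self, subscriber, offset):
        """Move a subscriber to a given offset, for example back to the beginning."""
        self._offsets[subscriber] = offset


log = TopicLog()
for amount in (20, 950, 35):
    log.publish({"user": "alice", "amount": amount})

# Publish and subscribe: two independent subscribers each receive every event.
billing = log.poll("billing")
audit = log.poll("audit")

# Store: a subscriber can look back in time by rewinding its offset.
log.seek("audit", 0)
replayed = log.poll("audit")

# Process: flag large payments as they are read (the 500 threshold is arbitrary).
flagged = [event for event in replayed if event["amount"] > 500]
print(flagged)
```

The point of the sketch is that reading is non-destructive: each subscriber keeps its own position, so rewinding one subscriber does not affect the others.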
The typical architecture
[Diagram: User Tracking, Operational Logs, Operational Metrics, and Apps with their Databases, Storage, and Interfaces feed Search, Security, Fraud Detection, Application, Data Warehouse, and Monitoring through separate pipelines]
Challenges abound
[Diagram: the same architecture, with Hadoop added alongside the Data Warehouse]
• Diverse data sets, arriving at an increasing rate
• Many complex data pipelines
• Require a separate cluster for real-time
• Difficult & time consuming to change
• Require mission critical availability into most recent/relevant data
• Difficult to handle massive amounts of data
Modernized architecture using Apache Kafka
[Diagram: User Tracking, Operational Logs, and Operational Metrics flow into Apache Kafka, which feeds Search, Security, Fraud Detection, Application, Monitoring, and the Data Warehouse; Apps connect through the Streams API]
• Pub/sub to data streams, alleviate back pressure
• Lightweight, easy to modify with minimal disruption
• Decoupled from upstream apps creating agility
• Real-time, context specific data in the moment
• Handle any volume of data with ease
• Scale to meet demands of diverse streams
Our vision: from big data to stream data
• Big Data was: The More the Better [Chart: Value of Data vs. Volume of Data]
• Stream Data is: The Faster the Better [Chart: Value of Data vs. Age of Data]
• Stream Data can be: Big or Fast (Lambda) [Diagram labels: Streams, Hadoop, Speed Table, Batch Table, DB]
• Stream Data will be: Big AND Fast (Kappa) [Diagram labels: Streams, Job 1, Job 2, Table 1, Table 2, DB]
Apache Kafka is the Enabling Technology of this Transition
Kafka Adoption in Large Enterprises Growing Rapidly
Travel: 6 of top 10
Global Banks: 7 of top 10
Insurance: 8 of top 10
Telecom: 9 of top 10
Over 35% of the Fortune 500 are using Apache Kafka™
Industries & Use Cases
Universal Use Cases: IoT, Data Pipelines, Microservices, Monitoring
Industry Use Cases
Financial Services: Fraud Detection, Trade Data Capture, Customer 360
Retail: Inventory Management, Product Catalog, A/B Testing, Proactive Alerts
Automotive: Connected Car, Manufacturing Data Processing
Enterprise Tech: Analytics, Security Operations, Collect Performance Data
Telecom: Personalized Ad Placement, Customer 360, Network Integrity Systems
Entertainment/Media: Log Delivery, Increase Ad Delivery Operations, Cross-Device Insights
Travel/Leisure: Visitor Segmentation, Fraud Detection
Consumer Tech: Streaming Video, Personalized Customer Experience, Device Telemetry and Analytics
Healthcare: Patient Monitoring, Pharma Substance Control, Patient Relapse, Lab Results Alerts
Kafka Adoption Across Key Companies
[Company logos grouped by industry: Financial Services, Enterprise Tech, Consumer Tech, Entertainment & Media, Telecom, Retail, Travel & Leisure]
Confluent Enterprise
The only enterprise streaming platform based entirely on Apache Kafka™
Confluent Platform: Enterprise Streaming based on Apache Kafka™
[Diagram: data sources (Database Changes, Log Events, IoT Data, Web Events, …) flow through Confluent Platform to Data Integration targets (CRM, Data Warehouse, Database, Hadoop, …) and Real-time Applications (Monitoring, Analytics, Custom Apps, Transformations, …)]
Confluent Platform components: Apache Kafka™, Clients, Connectors, Data Compatibility, Monitoring & Administration, Operations
Editions: Apache Open Source, Confluent Open Source, Confluent Enterprise
Complete, Open, Trusted, Enterprise Grade
Confluent Completes Kafka
Feature | Benefit
Apache Kafka | High throughput, low latency, high availability, secure distributed streaming platform
Kafka Connect API | Advanced API for connecting external sources/destinations into Kafka
Kafka Streams API | Simple library that enables streaming application development within the Kafka framework
Additional Clients | Supports non-Java clients; C, C++, Python, etc.
REST Proxy | Provides universal access to Kafka from any network connected device via HTTP
Schema Registry | Central registry for the format of Kafka data – guarantees all data is always consumable
Pre-Built Connectors | HDFS, JDBC, Elasticsearch and other connectors fully certified and fully supported by Confluent
Confluent Control Center | Enables easy connector management and stream monitoring
Auto Data Balancing | Rebalancing data across cluster to remove bottlenecks
Replication | Multi-datacenter replication simplifies and automates MDC Kafka clusters
Support | Enterprise class support to keep your Kafka environment running at top performance
Support by edition: Apache Kafka: Community | Confluent Open Source: Community | Confluent Enterprise: 24x7x365
How do I get streams of data into and out of my apps?
Connect, Clients, REST
Apache Kafka™ Connect – Streaming Data Capture
[Diagram: Sources (JDBC, IRC / Twitter, CDC) feed the Kafka Pipeline through Connectors built on the Kafka Connect API; Connectors deliver to Sinks (Elastic, NoSQL, HDFS)]
• Fault tolerant
• Manage hundreds of data sources and sinks
• Preserves data schema
• Part of Apache Kafka project
• Integrated within Confluent Platform’s Control Center
Kafka Connect API, Part of the Apache Kafka™ Project
Connect any source to any target system with Apache Kafka
Integrated
• 100% compatible with Kafka v0.9 and higher
• Integrated with Confluent’s Schema Registry
• Easy to manage with Confluent Control Center
Flexible
• 40+ open source connectors available
• Easy to develop additional connectors
• Flexible support for data types and formats
Compatible
• Maintains critical metadata
• Preserves schema information
• Supports schema evolution
Reliable
• Automated failover
• At-least-once guaranteed
• Balances workload between nodes
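As an illustration of how a source is typically wired in, connectors are defined by configuration rather than code. The sketch below builds a JSON definition for Confluent's JDBC source connector of the kind that could be submitted to a Kafka Connect worker's REST API. The connector name, database URL, column name, and topic prefix are made-up placeholders, and the exact set of properties should be checked against the connector documentation for the version in use.

```python
import json

# Hypothetical connector definition: every value below is a placeholder.
connector = {
    "name": "orders-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "tasks.max": "1",
        # Assumed database location, for illustration only.
        "connection.url": "jdbc:postgresql://db.example.com:5432/shop",
        # Pick up new rows by watching a strictly increasing column.
        "mode": "incrementing",
        "incrementing.column.name": "id",
        # Each table is written to a topic named <prefix><table>.
        "topic.prefix": "shop-",
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

No custom producer code is involved: the Connect worker runs the connector, tracks how far it has read, and restarts its tasks on another node after a failure.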
Kafka Connect API Library of Connectors
[Connector logos grouped by category: Databases, Analytics, Applications / Other, Datastore/File Store]
* Denotes connectors developed and distributed by Confluent. Extensive validation and testing have been performed.
New in Kafka 0.10.2: Single Message Transforms for Kafka Connect
Modify events before storing in Kafka:
• Mask sensitive information
• Add identifiers
• Tag events
• Store lineage
• Remove unnecessary columns
Modify events going out of Kafka:
• Route high priority events to faster data stores
• Direct events to different Elasticsearch indexes
• Cast data types to match destination
• Remove unnecessary columns
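The idea behind single message transforms is a chain of small functions applied to one record at a time. The following is a plain-Python sketch of that idea only; it does not use the Kafka Connect transform API, and the helper names (`mask_field`, `insert_field`, `drop_fields`, `apply_chain`) and the sample event are hypothetical.

```python
def mask_field(field):
    """Return a transform that hides the value of one field."""
    def transform(record):
        out = dict(record)  # copy so the input record is left untouched
        if field in out:
            out[field] = "****"
        return out
    return transform


def insert_field(field, value):
    """Return a transform that tags every record with a static field."""
    def transform(record):
        out = dict(record)
        out[field] = value
        return out
    return transform


def drop_fields(*fields):
    """Return a transform that removes unnecessary columns."""
    def transform(record):
        return {k: v for k, v in record.items() if k not in fields}
    return transform


def apply_chain(record, transforms):
    """Apply each transform in order, one message at a time."""
    for transform in transforms:
        record = transform(record)
    return record


chain = [
    mask_field("card_number"),           # mask sensitive information
    insert_field("source", "orders-db"), # store lineage
    drop_fields("internal_notes"),       # remove unnecessary columns
]

event = {"order_id": 17, "card_number": "4111111111111111", "internal_notes": "call back"}
result = apply_chain(event, chain)
print(result)
```

Each transform sees a single message and keeps no state between messages, which is what makes this kind of step cheap to run inside a connector.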
Kafka Clients
[Client logos grouped as Apache Kafka Native Clients, Confluent Native Clients, and Community Supported Clients; other labels shown: Ruby, Proxy (http/REST), stdin/stdout]
REST Proxy: Talking to Non-native Kafka Apps and Outside the Firewall
[Diagram: Non-Java Applications reach Kafka over REST / HTTP through the REST Proxy, which works with the Schema Registry; Native Kafka Java Applications connect directly]
• Provides a RESTful interface to a Kafka cluster
• Simplifies message creation and consumption
• Simplifies administrative actions
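To make the RESTful interface concrete, here is a minimal sketch of what producing a JSON message through the REST Proxy could look like from Python's standard library. It assumes the proxy's v2 API and a proxy listening at `http://localhost:8082`; the topic name and payload are invented, and the request is only built, not sent.

```python
import json
import urllib.request


def build_produce_request(base_url, topic, values):
    """Build (but do not send) a POST that would produce JSON records to a topic."""
    body = json.dumps({"records": [{"value": value} for value in values]})
    return urllib.request.Request(
        url=f"{base_url}/topics/{topic}",
        data=body.encode("utf-8"),
        headers={
            # Content types used by the REST Proxy v2 API for JSON-encoded records.
            "Content-Type": "application/vnd.kafka.json.v2+json",
            "Accept": "application/vnd.kafka.v2+json",
        },
        method="POST",
    )


# Assumed proxy address and hypothetical topic, for illustration only.
request = build_produce_request(
    "http://localhost:8082",
    "page-views",
    [{"user": "alice", "page": "/home"}],
)
print(request.get_method(), request.full_url)

# With a proxy actually running, urllib.request.urlopen(request) would send it.
```

Because the exchange is plain HTTP plus JSON, any language with an HTTP client can publish this way without a native Kafka library.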
How do I maintain my data formats and ensure compatibility?
The Challenge of Data Compatibility at Scale
[Diagram: App 1, App 2, and App 3 publish into one pipeline; one message is flagged as incompatibly formatted]
Many sources without a policy cause mayhem in a centralized data pipeline
Ensuring downstream systems can use the data is key to an operational stream pipeline
Example: Date formats
Even within a single application, different formats can be presented
Schema Registry
[Diagram: App 1 and App 2 each write through a Serializer to a Kafka Topic, with schemas checked against the Schema Registry and incompatible messages flagged with "!"; example consumers: Elastic, Cassandra, HDFS]
Define the expected fields for each Kafka topic
Automatically handle schema changes (e.g. new fields)
Prevent backward-incompatible changes
Supports multi-datacenter environments
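The kind of check that blocks an incompatible change can be sketched in a few lines. This is a deliberately simplified model, not the Schema Registry's implementation: schemas are plain dictionaries of field name to type and optional default, any type change is treated as incompatible (real Avro resolution allows some promotions), and the function name and sample schemas are hypothetical.

```python
def backward_compatibility_problems(old_fields, new_fields):
    """List reasons why a reader using new_fields could not read data written with old_fields.

    Each schema is a dict: field name -> {"type": str, optional "default": value}.
    An empty list means the change is backward compatible under this simplified model.
    """
    problems = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old records lack this field, so the reader needs a default to fill in.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif old_fields[name]["type"] != spec["type"]:
            problems.append(f"field '{name}' changed type")
    # Fields removed in the new schema are simply ignored by the reader.
    return problems


v1 = {"id": {"type": "long"}, "created": {"type": "string"}}

# Adding a field with a default: old data stays readable.
v2 = dict(v1, region={"type": "string", "default": "unknown"})

# Adding a required field: old data has nothing to put there.
v3 = dict(v1, region={"type": "string"})

print(backward_compatibility_problems(v1, v2))
print(backward_compatibility_problems(v1, v3))
```

Running a check like this at registration time, before any producer can use the new schema, is what keeps a bad change from reaching consumers.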
How do I build stream processing apps?
Kafka Streams API: the Easiest Way to Process Data in Apache Kafka™
Example Use Cases
• Microservices
• Large-scale continuous queries and transformations
• Event-triggered processes
• Reactive applications
• Customer 360-degree view, fraud detection, location-based marketing, smart electrical grids, fleet management, …
Key Benefits of Apache Kafka’s Streams API
• Build Apps, Not Clusters: no additional cluster required
• Elastic, highly-performant, distributed, fault-tolerant, secure
• Equally viable for small, medium, and large-scale use cases
• “Run Everywhere”: integrates with your existing deployment strategies such as containers, automation, cloud
[Diagram: Your App containing the Kafka Streams API]
Architecture Example
Before: Complexity for development and operations, heavy footprint
1. Capture business events in Kafka
2. Must process events with separate, special-purpose clusters (Your Processing Job)
3. Write results back to Kafka
Architecture Example
With Kafka Streams: App-centric architecture that blends well into your existing infrastructure
1. Capture business events in Kafka
2. Process events fast, reliably, securely with standard Java applications (Your App with the Kafka Streams API)
3a. Write results back to Kafka
3b. Query latest results directly from external apps
New in Kafka 0.10.2: Session windows in Kafka Streams API
Group events in a stream based on session windows
• Sessions are periods of activity terminated by a gap of inactivity
• Purely time-based windows are incorrect for session-based data analysis
[Figure: input data plotted by processing-time and event-time, with colors representing different users' events; results show user sessions for Alice, Bob, and Dave, grouped by event-time session windows]
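The session-window rule (activity ends when a gap of inactivity exceeds a threshold) can be sketched outside Kafka in a few lines. This is a batch-style illustration in plain Python, not the Kafka Streams API: the function name, the integer timestamps, and the gap of 5 are assumptions, and a real streaming implementation would merge sessions incrementally as late events arrive rather than sorting everything up front.

```python
from collections import defaultdict


def session_windows(events, gap):
    """Group (user, event_time) pairs into per-user sessions.

    A new session starts whenever the time since the user's previous event
    exceeds `gap`. Returns a dict: user -> list of (start, end) tuples.
    """
    times_by_user = defaultdict(list)
    for user, event_time in events:
        times_by_user[user].append(event_time)

    sessions = {}
    for user, times in times_by_user.items():
        times.sort()  # order by event-time, not by arrival order
        windows = [[times[0], times[0]]]
        for event_time in times[1:]:
            if event_time - windows[-1][1] <= gap:
                windows[-1][1] = event_time              # still the same session: extend it
            else:
                windows.append([event_time, event_time])  # inactivity gap: start a new session
        sessions[user] = [tuple(window) for window in windows]
    return sessions


# Events listed in arrival (processing-time) order; alice's event at t=1 arrives late.
events = [("alice", 0), ("bob", 2), ("alice", 3), ("alice", 20), ("bob", 4), ("alice", 1)]
result = session_windows(events, gap=5)
print(result)
```

Note that the window boundaries come from the data itself, which is why a fixed-size time window would cut alice's first session at an arbitrary point.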
How do I synchronize and migrate data to and from the cloud?
Before: Hybrid Cloud Environments Today
[Diagram: systems in DC1 and AWS connected directly to one another; boxes shown: DB1, DB2, DB3, KV, KV2, KV3, DWH, App1, App2, App3, App4, App5, App7, App8, App1-v2, App2-v2]
Challenges
• Each team/department must execute their own cloud migration
• May be moving the same data multiple times
• Each box represented here requires development, testing, deployment, monitoring and maintenance
After: Cloud Synchronization and Migrations with Confluent Platform
[Diagram: the same systems in DC1 and AWS, now connected through two Kafka clusters]
Benefits
• Continuous low-latency synchronization
• Centralized manageability and monitoring
  – Track at event level data produced in all data centers
• Security and governance
  – Track and control where data comes from and who is accessing it
• Cost savings
  – Move data once
How do I manage and monitor my streaming platform at scale?
What Does End-to-End Mean?
“Clocks and Cables” Monitoring
• How fast is the throughput?
• How many CPU cycles are we using?
End-to-End Monitoring
• Did you leave?
• Did you arrive?
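The "did you leave, did you arrive" question amounts to reconciling what producers sent against what consumers received. The sketch below shows one simple way to express that in plain Python. It is not how Confluent Control Center is implemented; the function name, message IDs, timestamps, and one-minute buckets are all assumptions made for illustration.

```python
from collections import Counter


def delivery_report(produced, consumed_ids, bucket_seconds=60):
    """Compare produced messages with consumed ones, per time bucket.

    produced:     iterable of (message_id, produce_time_in_seconds)
    consumed_ids: set of message IDs that consumers reported receiving
    Returns a dict: bucket start -> {"produced": n, "missing": m}.
    """
    produced_count = Counter()
    missing_count = Counter()
    for message_id, produce_time in produced:
        bucket = produce_time // bucket_seconds * bucket_seconds
        produced_count[bucket] += 1          # "Did you leave?"
        if message_id not in consumed_ids:
            missing_count[bucket] += 1       # "Did you arrive?"
    return {
        bucket: {"produced": produced_count[bucket], "missing": missing_count[bucket]}
        for bucket in sorted(produced_count)
    }


produced = [("m1", 5), ("m2", 30), ("m3", 70), ("m4", 110)]
consumed_ids = {"m1", "m2", "m4"}
report = delivery_report(produced, consumed_ids)
print(report)
```

Throughput and CPU graphs could look healthy in this scenario while message m3 is never delivered, which is the gap end-to-end monitoring is meant to close.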
Confluent Control Center: Cluster Health & Administration
Cluster health dashboard
• Monitor the health of your Kafka clusters and get alerts if any problems occur
• Measure system load, performance, and operations
• View aggregate statistics or drill down by broker or topic
Cluster administration
• Monitor topic configurations
Confluent Control Center: End-to-end Monitoring
See exactly where your messages are going in your Kafka cluster
Confluent Control Center: Connector Management
Confluent Control Center: Alerting
Alerts
• Configure alerts on incomplete data delivery, high latency, Kafka connector status, and more
• Manage alerts for different users and applications from a web UI
User authentication
• Control access to Confluent Control Center
• Integrates with existing enterprise authentication systems
Auto Data Balancing
Dynamically move partitions to optimize resource utilization and reliability
• Easily add and remove nodes from your Kafka cluster
• Rack aware algorithm rebalances partitions across a cluster
• Traffic from balancer is throttled when data transfer occurs
[Diagram: partition placement before and after a rebalance]
Multi-Datacenter Replication
An easy, reliable way to run Kafka across datacenters
Improve reliability
• Easily configure & maintain cross-cluster replication
Simplify management
• Centralized configuration and monitoring
• Replicate entire cluster or a subset of topics
• Automatic replication of topic configuration
• Use Kafka’s SASL for Kerberos, Active Directory
• SSL encryption between datacenters
Get Started with Apache Kafka Today!
https://www.confluent.io/downloads/
THE place to start with Apache Kafka!
• Thoroughly tested and quality assured
• More extensible developer experience
• Easy upgrade path to Confluent Enterprise
Thank You