© 2016 Couchbase Inc.
Big Data with NoSQL, Hadoop, Spark and Kafka
Will Gardella, Director of Product Management
@WillGardella
Agenda
• The Big Data Big Picture
• Spark & Hadoop
• Kafka
• Couchbase Analytics (Sneak Peek)
Where does “big” data come from?
• Mobile
• Web / Cloud
• Internet of Things
Couchbase is addressing the requirements of Digital Economy businesses
NoSQL versus Hadoop
• Overlap: NoSQL or Hadoop?
• Complement: NoSQL and Hadoop.
Big Data at a Glance
| | Couchbase | Spark | Hadoop |
|---|---|---|---|
| Use cases | Operational; Web / Mobile | Analytics; Machine Learning | Analytics; Machine Learning |
| Processing mode | Online; Ad Hoc | Ad Hoc; Batch; Streaming (+/-) | Batch; Ad Hoc (+/-) |
| Low latency = | < 1 ms ops | Seconds | Seconds |
| Performance | Highly predictable | Variable | Variable |
| Users are typically… | Millions of customers | 100's of analysts or data scientists | 100's of analysts or data scientists |
| | Memory-centric | Memory-centric | Disk-centric |
| Big data = | 10s of Terabytes | Petabytes | Petabytes |

OPERATIONAL: Couchbase. ANALYTICAL: Spark, Hadoop.
Couchbase + Spark use cases
Operations
§ Catalog
§ Customer 360 + IoT
§ Personalization
§ Mobile applications
Analysis
§ Recommendations
§ Next gen data warehousing
§ Predictive analytics
§ Fraud detection
Use Case 1: Operationalize Analytics / ML
Examples: recommend content and products, spot fraud or spam
• Data scientists train machine learning models
• Load results into Couchbase so end users can interact with them online
[Diagram labels: Hadoop, Data Warehouse, Historical Data, Machine Learning Models]
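The train-then-load pattern on this slide can be sketched in a few lines. This is a toy illustration under stated assumptions: a plain Python dict stands in for a Couchbase bucket, a simple co-purchase count stands in for a real machine learning model, and the names `train_recommendations`, `load_into_store` and the `rec::` key prefix are hypothetical, not part of any Couchbase or Hadoop API.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train_recommendations(purchase_history, top_n=2):
    """Offline step: count which items are bought together (stand-in for a real model)."""
    co_counts = defaultdict(Counter)
    for basket in purchase_history:
        # Count every ordered pair of distinct items in the same basket.
        for item, other in permutations(set(basket), 2):
            co_counts[item][other] += 1
    # Keep the top-N co-purchased items per item; ties are broken alphabetically.
    return {
        item: [other for other, _ in
               sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for item, counts in co_counts.items()
    }


def load_into_store(store, recommendations):
    """Load step: one document per item, so the online app needs a single key lookup."""
    for item, recs in recommendations.items():
        store[f"rec::{item}"] = {"item": item, "recommended": recs}


history = [["laptop", "mouse"], ["laptop", "mouse", "bag"], ["phone", "case"]]
bucket = {}  # hypothetical stand-in for a Couchbase bucket
load_into_store(bucket, train_recommendations(history))
print(bucket["rec::laptop"]["recommended"])
```

The point of the split is that the expensive computation runs offline, while the online path is a single low-latency key lookup.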
Use Case 2: Spark connects to everything
DCP | KV | N1QL | Views
Adapted from: Databricks – Not Your Father’s Database https://www.brighttalk.com/webcast/12891/196891
Lambda Architecture
1 DATA: New Data Stream
2 BATCH (Batch Layer): All Data, Precompute Views (Map Reduce), Batch Recompute
3 SPEED (Speed Layer): Process Stream, Incremental Views, Real-Time Increment
4 SERVE (Serving Layer)
5 QUERY: Analysis
Database Change Protocol (DCP)
• Innovative protocol for data sync in Couchbase Server
  • Efficient data sync, memory to memory
  • Removes slower disk I/O from the data sync path
  • Improves replication latency for data durability
• Powers data replication & XDCR for HA / DR, maintains indexes, and more
• Big data connectors use this as a fast sync mechanism
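The idea of a memory-to-memory change feed can be sketched with a toy model. This is not the DCP wire protocol or any Couchbase API: the `ChangeEvent`, `InMemoryChangeLog` and `sync` names are hypothetical, and the single ordered sequence number per event is a simplifying assumption that lets a consumer resume from its last checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    seqno: int          # position of the change in the ordered log
    kind: str           # "mutation", "deletion" or "expiration"
    key: str
    value: object = None


@dataclass
class InMemoryChangeLog:
    events: list = field(default_factory=list)

    def append(self, kind, key, value=None):
        """Record a change in memory and assign it the next sequence number."""
        event = ChangeEvent(len(self.events) + 1, kind, key, value)
        self.events.append(event)
        return event

    def stream_from(self, last_seen_seqno=0):
        """Yield every change after the given sequence number, in order."""
        for event in self.events:
            if event.seqno > last_seen_seqno:
                yield event


def sync(replica, log, last_seen_seqno=0):
    """Apply new changes to a replica and return the new checkpoint."""
    for event in log.stream_from(last_seen_seqno):
        if event.kind == "mutation":
            replica[event.key] = event.value
        else:  # deletions and expirations both remove the document
            replica.pop(event.key, None)
        last_seen_seqno = event.seqno
    return last_seen_seqno


log = InMemoryChangeLog()
log.append("mutation", "user::1", {"name": "Ada"})
log.append("mutation", "session::9", {"token": "abc"})

replica = {}
checkpoint = sync(replica, log)              # first sync picks up both mutations
log.append("expiration", "session::9")
checkpoint = sync(replica, log, checkpoint)  # second sync only sees the expiration
print(replica, checkpoint)
```

A consumer that keeps its checkpoint only ever receives the changes it has not yet seen, which is the property replicas, indexes and connectors would rely on in this simplified picture.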
Couchbase Spark Connector 2.0
• Spark 2 support
  • Structured streaming
  • New Databricks cloud analytics support
• Efficiency
  • Improved DCP handling: memory allocation creates less garbage
• Easier management
  • Tolerates Couchbase cluster topology changes (e.g. add nodes & rebalance)
  • …except rollbacks
Couchbase Spark 2.0 Connector
Features
• Automatic cluster & resource management
• Create RDDs from KV, N1QL, Views
• Create DStreams from DCP feeds
• Persist RDDs and DStreams
• Support SparkSQL, Datasets, DataFrames, and Structured Streaming
You might need Kafka if…
Photo Credit: Cory Doctorow https://www.flickr.com/photos/doctorow/14638938602
Kafka as an industrial data sharing “backbone”
• Before Kafka
• After Kafka
Couchbase & Kafka Use Cases
• Couchbase as the Master Database
  • Changes in the bucket update data elsewhere
• Triggers / Event Handling
  • Handle events like deletions / expirations externally
  • E.g. expiration & replicated session tokens
• Real-time Data Integration
  • Extract from Couchbase, transform and load data into another system
• Real-time Data Processing
  • Extract from a bucket, process in real-time and load back to another Couchbase bucket
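The trigger and integration use cases above come down to filtering change events and transforming them into messages. A minimal sketch, assuming events arrive as plain dicts and using a Python list in place of a Kafka topic; `is_session_expiration`, `to_message` and `publish` are hypothetical names, not part of the Couchbase Kafka connector or the Kafka client API.

```python
import json


def is_session_expiration(event):
    """Filter: keep only expirations of session documents."""
    return event["type"] == "expiration" and event["key"].startswith("session::")


def to_message(event):
    """Transform: turn a change event into a (key, value) pair of bytes for a topic."""
    payload = {"event": event["type"], "key": event["key"]}
    return (event["key"].encode("utf-8"),
            json.dumps(payload, sort_keys=True).encode("utf-8"))


def publish(events, topic, keep=is_session_expiration, transform=to_message):
    """Append the transformed events that pass the filter; return how many were sent."""
    sent = 0
    for event in events:
        if keep(event):
            topic.append(transform(event))
            sent += 1
    return sent


changes = [
    {"type": "mutation", "key": "user::1"},
    {"type": "expiration", "key": "session::9"},
    {"type": "deletion", "key": "session::4"},
]
session_expired_topic = []  # a list standing in for a Kafka topic
sent_count = publish(changes, session_expired_topic)
print(sent_count, session_expired_topic)
```

A downstream consumer of the topic could then invalidate the replicated session token without the application polling the bucket.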
Couchbase Kafka Connector 3.0 (DP now – GA Q4 2016)
Available Now: 2.0 GA
• Kafka Producer or Consumer
• Stream events
• Filters
• Transform events
• Sample Producer & Consumer
• Improved DCP – less garbage collection, more memory efficient
Code: https://github.com/couchbase/couchbase-kafka-connector/
3.0 (DP now - GA Q4 2016)
§ Adopts Kafka Connect (Apache Kafka 0.9+)
§ Dynamic topology support / rebalance
Future
§ Rollback handling
New in Apache Kafka 0.9
• One service to manage
• Unified connector config, control, monitoring, metrics
• Easy to set up as a self-service system for developers, ETL team
• Confluent dashboards visualize the complete data pipeline
Lambda + Hadoop + Spark + Storm + Kafka
• New Data Stream
• Batch Layer: All Data, Precompute Views (Map Reduce), Batch Recompute
• Speed Layer: Process Stream, Incremental Views, Real-Time Increment
• Serving Layer: Merge, Merged View
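The three layers and the merge step can be sketched as follows. This is a toy page-view counter, assuming events are small dicts and views are in-memory counters; the function names are hypothetical and no Hadoop, Spark, Storm or Kafka API is used.

```python
from collections import Counter


def batch_recompute(all_data):
    """Batch layer: recompute the view from scratch over all stored data."""
    return Counter(event["page"] for event in all_data)


def realtime_increment(speed_view, event):
    """Speed layer: update the incremental view with one new event."""
    speed_view[event["page"]] += 1


def merged_view(batch_view, speed_view):
    """Serving layer: merge both views to answer a query."""
    return batch_view + speed_view


all_data = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]
batch_view = batch_recompute(all_data)

speed_view = Counter()
for event in [{"page": "/home"}, {"page": "/docs"}]:  # arrived after the last batch run
    realtime_increment(speed_view, event)

result = merged_view(batch_view, speed_view)
print(result["/home"], result["/docs"])
```

When the next batch run has absorbed the recent events, the speed view can be reset, so the incremental state stays small.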
Sneak Peek: Couchbase Analytics (DP1)
One stop shopping for both operations and analytics
Couchbase Query: optimized for operational (narrow) queries
• Many queries
• Each touches a little data
Couchbase Analytics: optimized for analytical (big) queries
• Fewer queries
• Each touches a lot of data
What is Couchbase Analytics?
• Extend Couchbase Platform to power real-time analytics
• Ad-hoc queries (“Ask me anything!”)
• Workload isolation
• Independent scaling
• Common programming model & data model
• Unified management
• Fast data synchronization
Data | Query | Index | Search | Analytics | Transport
Unified Administration
Unified Declarative Programming Interface
Couchbase Analytics and friends
Operations → Analytics
Online (“Hurry! The user is waiting!”) → Batch (“Better cache this in Couchbase…”)
Key Value → CB Query → CB Analytics → Spark → Hadoop
Latency: 𝜇s → ms → 30s → Minutes+
Data touched: 1 record → Trillions of records
Other diagram labels: Start up overhead, Job-based, Parallel query