Couchbase and Hadoop
Perry Krug
Sr. Solutions Architect
Agenda
• View basics
• Lifecycle of a view
• Index definition, build, and query phase
• Indexing details
• Replica indexes, failover and compaction
• Primary and Secondary indexes
• View best practices
• Couchbase and Elastic Search
• Couchbase and Hadoop
pol·y·glot /ˈpäliˌglät/
Adjective: Knowing or using several languages.
Noun: A person who knows several languages.
Synonyms: multilingual
per·sist·ence /pərˈsistəns/
Noun: The continued or prolonged existence of something.
Synonyms: perseverance - tenacity - pertinacity - stubbornness
Couchbase Views – The basics
• Define materialized views on JSON documents and then query across the data set
• Using views you can define:
• Primary indexes
• Simple secondary indexes (most common use case)
• Complex secondary, tertiary and composite indexes
• Aggregations (reduction)
• Documents are eventually indexed
• Queries are eventually consistent with respect to documents
• Built using Map/Reduce technology
• Map and Reduce functions are written in JavaScript
View Lifecycle
Define -> Build -> Query
Buckets & Design docs & Views
• Create design documents on a bucket
• Create views within a design document
BUCKET 1
Design document 1
View 1
View 2
View 3
Design document 2
View 4
View 5
Design document 3
View 6
View 7
BUCKET 2
Distributed Indexing and Querying
• Indexing work is distributed amongst nodes
• Parallelize the effort
• Each node has index for data stored on it
• Queries combine the results from required nodes
[Diagram: Couchbase Server Cluster, user configured replica count = 1. Servers 1, 2 and 3 each hold active and replica documents. App Server 1 and App Server 2 each run the Couchbase client library with a cluster map and issue "Create Index / View" and "Query" requests.]
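The way a query combines results from the required nodes can be sketched as a scatter/gather step. This is a minimal illustration in plain JavaScript, not Couchbase's implementation: the `queryCluster` helper, the in-memory per-node row arrays and the sample keys are all hypothetical, and plain string comparison stands in for the server's collation.

```javascript
// Hypothetical per-node index partitions: each node indexes only its own active documents.
const nodeIndexes = [
  [{ key: "Anchorage", id: "doc5" }, { key: "Juneau", id: "doc2" }], // Server 1
  [{ key: "Boston", id: "doc3" }, { key: "Seattle", id: "doc1" }],   // Server 2
  [{ key: "Denver", id: "doc4" }, { key: "Portland", id: "doc6" }],  // Server 3
];

// Compare keys by code unit order (an assumption; the server uses its own collation).
function compareKeys(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Scatter: ask every node for its rows in the key range.
// Gather: combine the partial results into one sorted result set.
function queryCluster(indexes, startkey, endkey) {
  const partials = indexes.map((rows) =>
    rows.filter(
      (r) => compareKeys(r.key, startkey) >= 0 && compareKeys(r.key, endkey) <= 0
    )
  );
  return partials.flat().sort((x, y) => compareKeys(x.key, y.key));
}

const mergedRows = queryCluster(nodeIndexes, "A", "K");
console.log(mergedRows.map((r) => r.key));
```

Each node filters only its own partition, so the per-node work shrinks as nodes are added; the final merge is the only step that sees rows from more than one node.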
Eventually indexed Views – Data flow
[Diagram: an App Server writes Doc 1 to a Couchbase Server Node. Components shown: managed cache, replication queue (to other node), disk queue, disk, view engine.]
DEFINE Index / View Definition in JavaScript
CREATE INDEX City ON Brewery.City;
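The `CREATE INDEX` line above appears to be shorthand for the idea; a view definition itself is a JavaScript map function that receives `(doc, meta)` and calls `emit(key, value)`. A minimal sketch of what such a definition might look like follows. The `emit` stub and the loop that drives it are local stand-ins so the example runs outside the server, and the `type` and `city` field names and sample documents are assumptions.

```javascript
// Rows collected by the stand-in emit(); inside Couchbase, emit() is provided by the view engine.
const cityRows = [];
function emit(key, value) {
  cityRows.push({ key, value });
}

// Map function: index brewery documents by their city attribute.
// The "type" and "city" field names are assumptions about the document shape.
function map(doc, meta) {
  if (doc.type === "brewery" && doc.city) {
    emit(doc.city, null);
  }
}

// Hypothetical sample documents, keyed by document ID.
const sampleDocs = {
  brewery_alaskan: { type: "brewery", name: "Alaskan Brewing", city: "Juneau" },
  beer_amber: { type: "beer", name: "Alaskan Amber" },
};

// Stand-in for the build phase: run the map function once per document.
for (const [id, doc] of Object.entries(sampleDocs)) {
  map(doc, { id });
}
console.log(cityRows);
```

Guarding on `doc.type` and `doc.city` keeps documents without the attribute out of the index instead of emitting undefined keys.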
BUILD Distributed Index Build Phase
• Optimized for lookups, in-order access and aggregations
• View reads are from disk (different performance profile than GET/SET)
• Views are built against every document on every node
• Group them in a design document
• Views are automatically kept up to date
QUERY Dynamic Queries with Optional Aggregation
• Eventually consistent with respect to document updates
• Efficiently fetch a document or group of similar documents
• Queries will use cached values from B-tree inner nodes when possible
• Take advantage of in-order tree traversal with group_level queries
Query: ?startkey="J"&endkey="K"
{"rows":[{"key":"Juneau","value":null}]}
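To make the range semantics of that query concrete, here is a small sketch over an in-memory sorted list. The `rangeQuery` helper and the extra sample keys are hypothetical, both ends of the range are treated as inclusive, and plain string comparison is an assumption standing in for the server's collation rules.

```javascript
// Hypothetical index rows, already sorted by key as a view index would be.
const cityIndex = [
  { key: "Anchorage", value: null },
  { key: "Juneau", value: null },
  { key: "Kodiak", value: null },
];

// Return rows whose key falls in [startkey, endkey], both ends inclusive.
function rangeQuery(sortedRows, startkey, endkey) {
  return {
    rows: sortedRows.filter((r) => r.key >= startkey && r.key <= endkey),
  };
}

const rangeResult = rangeQuery(cityIndex, "J", "K");
console.log(JSON.stringify(rangeResult));
```

"Kodiak" sorts after the bare string "K", so it falls outside the range even though it starts with the end key's letter.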
Simple Primary and Secondary Indexing
Example document and its document ID
Define a primary index on the bucket
• Look up the document ID / key by key, range, prefix, suffix
Index definition
Define a secondary index on the bucket
• Look up an attribute by value, range, prefix, suffix
Index definition
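As a rough sketch of the two definitions, a primary index can emit the document ID as the key and a secondary index can emit an attribute. The `buildIndex` and `prefixLookup` helpers below are hypothetical local stand-ins for the view engine and the query API; in Couchbase, `emit` is a global rather than a third argument, and the `name` field and sample documents are assumptions.

```javascript
// Stand-in for the view engine: run a map function over all documents, return rows sorted by key.
function buildIndex(mapFn, docs) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  for (const [id, doc] of Object.entries(docs)) {
    mapFn(doc, { id }, emit);
  }
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Primary index: key every row by the document ID.
const primaryMap = (doc, meta, emit) => emit(meta.id, null);

// Secondary index: key rows by an attribute (here "name", an assumed field).
const secondaryMap = (doc, meta, emit) => {
  if (doc.name) emit(doc.name, null);
};

const beerDocs = {
  beer_1554: { name: "1554 Enlightened Black Ale" },
  beer_abbey: { name: "Abbey Belgian Style Ale" },
  brewery_new_belgium: { name: "New Belgium Brewing" },
};

// Prefix lookup expressed as a key range: from the prefix up to prefix + a high code unit.
function prefixLookup(rows, prefix) {
  return rows.filter((r) => r.key >= prefix && r.key <= prefix + "\uffff");
}

const beerIds = prefixLookup(buildIndex(primaryMap, beerDocs), "beer_").map((r) => r.key);
const abbeyNames = prefixLookup(buildIndex(secondaryMap, beerDocs), "Abbey").map((r) => r.key);
console.log(beerIds, abbeyNames);
```

Expressing a prefix lookup as a start/end key range is what lets it take advantage of the sorted index rather than scanning every row.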
Find documents by a specific attribute
• Let's find beers by brewery_id!
The index definition
Key, Value
The result set: beers keyed by brewery_id
Query Pattern: Basic Aggregations
Use a built-in reduce function with a group query
• Let's find average abv for each brewery!
Group reduce (reduce by unique key)
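A grouped reduce produces one summary per unique key, and an average can be derived from a sum and a count. The sketch below simulates that locally; `groupStats` is a hypothetical helper rather than a Couchbase function, and the rows are sample data shaped as if a map function had emitted `(brewery_id, abv)` pairs.

```javascript
// Rows as a map function emitting (brewery_id, abv) might produce them (sample data).
const abvRows = [
  { key: "alaskan", value: 5.3 },
  { key: "new_belgium", value: 5.5 },
  { key: "new_belgium", value: 7.0 },
  { key: "new_belgium", value: 5.5 },
];

// Local stand-in for a grouped reduce: accumulate sum and count per unique key.
function groupStats(rows) {
  const groups = new Map();
  for (const { key, value } of rows) {
    const g = groups.get(key) || { sum: 0, count: 0 };
    g.sum += value;
    g.count += 1;
    groups.set(key, g);
  }
  // Derive the average from the accumulated sum and count.
  return [...groups].map(([key, g]) => ({
    key,
    value: { sum: g.sum, count: g.count, avg: g.sum / g.count },
  }));
}

const breweryStats = groupStats(abvRows);
console.log(breweryStats);
```

Keeping sum and count separate, and dividing only at the end, is what allows partial results from different parts of the index to be combined safely; averaging averages would give the wrong answer.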
Query Pattern: Time-based Rollups
Find patterns in beer comments by time
{ "type": "comment", "about_id": "beer_Enlightened_Black_Ale", "user_id": 525, "text": "tastes like college!", "updated": "2010-07-22 20:00:20" }
{ "id": "f1e62" }
timestamp
Query with group_level=2 to get monthly rollups
group_level=3 - daily results - great for graphing
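The rollup relies on emitting the timestamp as an array key, so that group_level can truncate it to year/month or year/month/day. A minimal local simulation follows; `toDateArray` and `countByGroupLevel` are hypothetical helpers standing in for the map and reduce steps, and the comment documents are sample data.

```javascript
// Turn "2010-07-22 20:00:20" into [2010, 7, 22, 20, 0, 20], the shape used as a compound key.
function toDateArray(timestamp) {
  return timestamp.split(/[- :]/).map(Number);
}

// Hypothetical comment documents with an "updated" timestamp.
const comments = [
  { type: "comment", updated: "2010-07-22 20:00:20" },
  { type: "comment", updated: "2010-07-23 09:15:00" },
  { type: "comment", updated: "2010-08-01 12:30:45" },
];

// Map step: one row per comment, keyed by the date array.
const commentRows = comments
  .filter((doc) => doc.type === "comment")
  .map((doc) => ({ key: toDateArray(doc.updated), value: 1 }));

// Reduce step at a given group level: truncate each key to that many elements and count.
function countByGroupLevel(rows, groupLevel) {
  const counts = new Map();
  for (const { key } of rows) {
    const prefix = JSON.stringify(key.slice(0, groupLevel));
    counts.set(prefix, (counts.get(prefix) || 0) + 1);
  }
  return [...counts].map(([prefix, value]) => ({ key: JSON.parse(prefix), value }));
}

const monthly = countByGroupLevel(commentRows, 2); // [year, month]
const daily = countByGroupLevel(commentRows, 3);   // [year, month, day]
console.log(JSON.stringify(monthly));
```

Because the same index serves every group level, switching between monthly and daily rollups is a query-time choice and needs no new view.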
Query Pattern: Leaderboard
Aggregate value stored in a document
• Let's find the top-rated beers!
{ "brewery": "New Belgium Brewing", "name": "1554 Enlightened Black Ale", "abv": 5.5, "description": "Born of a flood...", "category": "Belgian and French Ale", "style": "Other Belgian-Style Ales", "updated": "2010-07-22 20:00:20", "ratings" : { "jchris" : 5, "scalabl3" : 4, "damienkatz" : 1 }, "comments" : [ "f1e62", "6ad8c" ]}
ratings
Sort each beer by its average rating
• Let's find the top-rated beers!
average
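One way to picture the leaderboard is to key each beer by the average of its ratings object and then read the index in descending order. The sketch below does this in memory; `averageRating` and `topRated` are hypothetical helpers, and the second beer's ratings are invented sample data.

```javascript
// Sample beer documents; "ratings" maps a user name to a score.
// The ratings for beer_abbey are made up for illustration.
const ratedBeers = {
  beer_1554: {
    name: "1554 Enlightened Black Ale",
    ratings: { jchris: 5, scalabl3: 4, damienkatz: 1 },
  },
  beer_abbey: {
    name: "Abbey Belgian Style Ale",
    ratings: { jchris: 4, scalabl3: 5 },
  },
};

// Average of all scores in a ratings object.
function averageRating(ratings) {
  const scores = Object.values(ratings);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Map step: key each rated beer by its average so the index is ordered by score.
const leaderboardRows = Object.entries(ratedBeers)
  .filter(([, doc]) => doc.ratings && Object.keys(doc.ratings).length > 0)
  .map(([id, doc]) => ({ id, key: averageRating(doc.ratings), value: doc.name }));

// Query step: descending order with a limit returns the top-rated beers first.
function topRated(rows, limit) {
  return [...rows].sort((a, b) => b.key - a.key).slice(0, limit);
}

const topBeers = topRated(leaderboardRows, 1);
console.log(topBeers);
```

Filtering out documents with no ratings avoids a division by zero and keeps unrated beers off the leaderboard.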
Couchbase and Elastic Search
Full Text Search
{ "name": "Abbey Belgian Style Ale", "description": "Winner of four World Beer Cup medals and eight medals at the Great American Beer Fest, Abbey Belgian Ale is the Mark Spitz of New Belgium’s lineup – but it didn’t start out that way."}
Search Across Full JSON Body
Search term: abbey
Faceted Search
Categories
Items with Counts
Range Facets
Learning Portal – Proof of Concept
Couchbase and Hadoop
Cloudera, etc.
Operational vs. Analytic Databases
Couchbase
Analytic Databases
Get insights from data
Real-time, Interactive Databases
Fast access to data
NoSQL
What is Sqoop?
Sqoop is a tool designed to transfer data between Hadoop and [OLTP] databases. You can use Sqoop to import data from [an OLTP] database management system (RDBMS) such as MySQL or Oracle [or Couchbase] into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back.
sqoop.apache.org
Traditional ETL
[Diagram: Application, Data, T (transform), Data]
A different paradigm
[Diagram: Data, Application, Data]
A very scalable different paradigm
[Diagram: several Application and Data pairs]
Where did the Transform go?
[Diagram: Application and Data, with many T (transform) steps]
Couchbase Import and Export
$ sqoop import --connect http://localhost:8091/pools --table DUMP
$ sqoop import --connect http://localhost:8091/pools --table BACKFILL_5
$ sqoop export --connect http://localhost:8091/pools \
    --table DUMP --export-dir DUMP
• For imports, table must be:
– DUMP: All keys currently in Couchbase
– BACKFILL_n: All key mutations for n minutes
• Specified --username maps to bucket
– By default set to "default" bucket
Hadoop and Couchbase – Ad Targeting
[Diagram with three numbered flows, labelled: click stream events; profiles, campaigns; profiles, real time campaign statistics]
• 40 milliseconds to respond with the decision.
Moving Parts
Content & Recommendation Targeting
Moving Parts
Thank you
Couchbase NoSQL Document Database