USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION
Tools to build your big data application
Ameya Kanitkar
Ameya Kanitkar – That’s me!
• Big Data Infrastructure Engineer @ Groupon, Palo Alto, USA (working on Deal Relevance & Personalization Systems)
[email protected] | http://www.linkedin.com/in/ameyakanitkar | @aktwits
Agenda
• Basics of Hadoop & HBase
• How you can use Hadoop & HBase for big data applications
• Case Study: Deal Relevance and Personalization Systems at Groupon with Hadoop & HBase
Big Data Application Examples
• Recommendation Systems
• Ad Targeting
• Personalization Systems
• BI/DW
• Log Analysis
• Natural Language Processing
So what is Hadoop?
• General-purpose framework for processing huge amounts of data
• Open source
• Batch / offline oriented
Hadoop - HDFS
• Open-source distributed file system
• Stores large files, which can easily be accessed by applications built on top of HDFS
• Data is distributed and replicated over multiple machines
• Linux-style commands, e.g. ls, cp, mv, touchz, etc.
Hadoop – HDFS
Example:
hadoop fs -dus /data/
185453399927478 bytes =~ 168 TB
(One of the folders on one of our Hadoop clusters)
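Beyond the shell, applications talk to HDFS through the Java FileSystem API. Below is a minimal sketch, assuming the cluster configuration (core-site.xml) is on the classpath; the /data/ path simply mirrors the shell example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // List the same /data/ folder queried with "hadoop fs -dus" above
    for (FileStatus status : fs.listStatus(new Path("/data/"))) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }
    fs.close();
  }
}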
Hadoop – Map Reduce
• Application framework built on top of HDFS to process your big data
• Operates on key-value pairs
• Mappers filter and transform input data
• Reducers aggregate mapper output
Example
• Given web logs, calculate the landing page conversion rate for each product
• So basically we need to count how many impressions (and purchases) each product received, and then calculate the conversion rate for each product
Map Reduce Example
Map 1: Process a log file. Output: key = product ID, value = event counts (impressions, purchases)
Map 2: Process a log file. Output: key = product ID, value = event counts (impressions, purchases)
Map N: Process a log file. Output: key = product ID, value = event counts (impressions, purchases)
Reducer: Here we receive all the data for a given product. Just run a simple for loop to calculate the conversion rate.
Output: (product ID, conversion rate)
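A minimal sketch of such a job in Java follows, assuming hypothetical tab-separated log lines of the form productId<TAB>eventType, where eventType is IMPRESSION or PURCHASE (the input format and field names are illustrative assumptions, not Groupon's actual logs).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConversionRate {

  // Mapper: filter/transform each log line into (productId, eventType)
  public static class LogMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length >= 2) {
        context.write(new Text(fields[0]), new Text(fields[1]));
      }
    }
  }

  // Reducer: receives all events for one product; the "simple for loop" from the slide
  public static class RateReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text productId, Iterable<Text> events, Context context)
        throws IOException, InterruptedException {
      long impressions = 0;
      long purchases = 0;
      for (Text event : events) {
        if ("IMPRESSION".equals(event.toString())) {
          impressions++;
        } else if ("PURCHASE".equals(event.toString())) {
          purchases++;
        }
      }
      double rate = impressions == 0 ? 0.0 : (double) purchases / impressions;
      context.write(productId, new Text(Double.toString(rate)));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "conversion-rate");
    job.setJarByClass(ConversionRate.class);
    job.setMapperClass(LogMapper.class);
    job.setReducerClass(RateReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input logs on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output rates on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}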
Recap
We just processed terabytes of data and calculated conversion rates across millions of products.
Note: This is a batch process only. It takes time. You cannot start this process after someone visits your website.
How about we generate recommendations in a batch process and serve them in real time?
HBase
• Provides real-time random read/write access over HDFS
• Based on Google's Bigtable design
• Open source
• This is not an RDBMS, so there are no joins. Access patterns are generally simple, like get(key), put(key, value), etc.
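A minimal sketch of those two access patterns with the HBase Java client, assuming a hypothetical table named user_profiles with a column family cf1 (the table, family, and row key are illustrative only):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGetPutExample {
  public static void main(String[] args) throws Exception {
    // Connection settings come from hbase-site.xml on the classpath
    try (Connection connection =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("user_profiles"))) {

      // put(key, value): write one cell into the user's row
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("click_history"),
          Bytes.toBytes("{\"clicks\": []}"));
      table.put(put);

      // get(key): read the row back
      Result result = table.get(new Get(Bytes.toBytes("user1")));
      byte[] value = result.getValue(Bytes.toBytes("cf1"),
          Bytes.toBytes("click_history"));
      System.out.println(Bytes.toString(value));
    }
  }
}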
Row    | Cf:<qual>  | Cf:<qual>   | .... | Cf:<qual>
Row 1  | Cf1:qual1  | Cf1:qual2   |      |
Row 11 | Cf1:qual2  | Cf1:qual22  | Cf1:qual3 |
Row 2  | Cf2:qual1  |             |      |
Row N  |            |             |      |
Dynamic column names: no need to define columns upfront.
Both rows and columns are sorted (lexicographically).
Row    | Cf:<qual> ....
user1  | Cf1:click_history:{actual_clicks_data}, Cf1:purchases:{actual_purchases}
user11 | Cf1:purchases:{actual_purchases}
user20 | Cf1:mobile_impressions:{actual_mobile_impressions}, Cf1:purchases:{actual_purchases}
Note: Each row has different columns, so think of this as a hash map rather than a table with rows and columns.
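Because rows behave like hash maps, a reader typically iterates whatever columns a row happens to have rather than selecting fixed columns. A minimal sketch, reusing the hypothetical user_profiles table from the earlier example:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumnsExample {
  public static void main(String[] args) throws Exception {
    try (Connection connection =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("user_profiles"))) {

      // Fetch the whole row for user20; we do not know its columns in advance
      Result row = table.get(new Get(Bytes.toBytes("user20")));

      // Iterate whatever qualifiers this particular row happens to have,
      // exactly like iterating the entries of a hash map
      for (Cell cell : row.rawCells()) {
        String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
        String value = Bytes.toString(CellUtil.cloneValue(cell));
        System.out.println(qualifier + " -> " + value);
      }
    }
  }
}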
Putting it all together
Store data in HDFS → Analyze data (MapReduce) → Generate recommendations (MapReduce) → Serve real-time requests (HBase) to web and mobile clients

Do offline analysis in Hadoop, and serve real-time requests with HBase.
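The handoff between the two halves is a MapReduce job whose output goes to an HBase table instead of HDFS. Below is a minimal sketch, assuming a hypothetical input of tab-separated userId<TAB>dealId recommendation pairs and a hypothetical table named recommendations with column family cf1:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RecommendationLoader {

  // Parses "userId<TAB>dealId" lines produced by the batch recommendation job
  public static class RecMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length >= 2) {
        context.write(new Text(fields[0]), new Text(fields[1]));
      }
    }
  }

  // Writes one HBase row per user, one column per recommended deal
  public static class HBaseWriter
      extends TableReducer<Text, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text userId, Iterable<Text> deals, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(Bytes.toBytes(userId.toString()));
      int rank = 0;
      for (Text deal : deals) {
        put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("rec_" + rank++),
            Bytes.toBytes(deal.toString()));
      }
      context.write(null, put); // the key is unused by TableOutputFormat
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "load-recommendations");
    job.setJarByClass(RecommendationLoader.class);
    job.setMapperClass(RecMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Routes reducer output to the "recommendations" table instead of an HDFS file
    TableMapReduceUtil.initTableReducerJob("recommendations", HBaseWriter.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}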
Use Case: Deal Relevance & Personalization @ Groupon
What are Groupon Deals?
Our Relevance Scenario
How do we surface relevant deals?
• Deals are perishable (deals expire or sell out)
• No direct user intent (as in traditional search advertising)
• Relatively limited user information
• Deals are highly local
Two Sides to the Relevance Problem
Algorithmic issues: how to find relevant deals for individual users given a set of optimization criteria
Scaling issues: how to handle relevance for all users across multiple delivery platforms
Developing Deal Ranking Algorithms
• Exploring data: understanding signals, finding patterns
• Building models/heuristics: employ both classical machine learning techniques and heuristic adjustments to estimate user purchasing behavior
• Conducting experiments: try out ideas on real users and evaluate their effect
Data Infrastructure
Growing deals: 20+ (2011), 400+ (2012), 2000+ (2013)
Growing users: 100 million+ subscribers
We need to store data such as user click history, email records, service logs, etc. This amounts to billions of data points and terabytes of data.
Deal Personalization Infrastructure Use Cases
• Deliver personalized emails (offline system): personalize billions of emails for hundreds of millions of users
• Deliver a personalized website & mobile experience (online system): personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users and page views
Architecture
[Diagram: a data pipeline feeds relevance MapReduce jobs on the offline HBase cluster, which replicates to the online HBase cluster serving real-time relevance requests]
• We can now maintain different SLAs on the online and offline systems
• We can tune the HBase clusters differently for the online and offline systems
HBase Schema Design

Row key: user ID (unique identifier for users)
Column family 1: user history and profile information (overwrite user history and profile info on each update)
Column family 2: email history for users (append each day's email history as a separate column; on average each row has over 200 columns)

• Most of our data access patterns are via the user key
• This makes it easy to design the HBase schema
• The actual data is kept in JSON
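A minimal sketch of the write path this schema implies, with hypothetical table and column family names (user_profiles, profile, email) and placeholder JSON payloads:

import java.time.LocalDate;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UserProfileWriter {
  public static void main(String[] args) throws Exception {
    try (Connection connection =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("user_profiles"))) {

      Put put = new Put(Bytes.toBytes("user1")); // row key = user ID

      // Column family 1: fixed qualifier, so each write overwrites the profile
      put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("history"),
          Bytes.toBytes("{\"clicks\": [], \"purchases\": []}"));

      // Column family 2: date-named qualifier, so each day appends a new
      // column and rows accumulate 200+ columns over time
      String today = LocalDate.now().toString(); // e.g. "2013-06-25"
      put.addColumn(Bytes.toBytes("email"), Bytes.toBytes(today),
          Bytes.toBytes("{\"emails_sent\": []}"));

      table.put(put);
    }
  }
}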
Cluster Sizing
Hadoop + HBase Cluster
• 100+ machine Hadoop cluster; this runs heavy MapReduce jobs
• The same cluster also hosts a 15-node HBase cluster

Online HBase Cluster
• 10-machine dedicated HBase cluster to serve the real-time SLA, fed via HBase replication

• Machine profile: 96 GB RAM (25 GB for HBase), 24 virtual CPU cores, 8 x 2 TB disks
• Data profile: 100 million+ records, 2 TB+ of data, over 4.2 billion data points
Questions?
Thank You!
(We are hiring!) www.groupon.com/techjobs