Matchmaking in the Cloud: Amazon Web Services and Apache Hadoop at eHarmony
Ben Hardy, Senior Software Engineer

Why You're Here
You'll learn how eHarmony:
– Used EC2 and Hadoop to develop a scalable solution for our large, real-world data problem
– Overcame the limitations of our existing infrastructure
– Reaped significant cost savings with this choice
Also find out about new opportunities and challenges

About eHarmony
Online subscription-based matchmaking service
Launched in 2000
Available in the United States, Canada, Australia and the United Kingdom
On average, 236 members in the US marry every day*
More than 20 million registered users
* Based on a survey conducted by Harris Interactive in 2007.

The Science of Matching
We match couples using detailed compatibility models
Models are based on decades of research and clinical experience in psychology
Variety of user attributes
– Demographic
– Psychographic
– Behavioral
New models constantly being tested and developed
Model evaluation is the gorilla in the room

Computational Requirements
Tens of GB of matches, scores and constantly changing user features are archived daily
TBs of data currently archived and growing
Want to support 10x our current user base
All possible matches = O(n²) problem
Support a growing set of models that may be
– arbitrarily complex
– computationally and I/O expensive

Scaling Challenges
Current architecture is multi-tiered with a relational back-end
Scoring is DB join intensive
Data needs constant archiving
– Matches, match scores, user attributes at time of match creation
– Model validation is done at a later time across many days
Need a non-DB solution better suited to big data crunching

Hadoop Addresses Scaling Needs
Good fit for our problem
– Need to process entire match pool (n²)
– Data easily partitioned
Hadoop provides
– Horizontally scalable parallel processing
– Work distribution
– Distributed storage
– Fault tolerance
– Job monitoring
Hadoop is an Apache project

Computing on AWS
Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand
Elastic MapReduce
– Hosted Hadoop framework on top of EC2 and S3
– Simplifies end-to-end processing in the cloud
– Pricing is in addition to EC2
Simple Storage Service (S3)
– Provides cheap, unlimited storage
– Highly configurable security using ACLs

AWS Pricing Model
Pay-per-use elastic model
Choice of server type
Lets you get up and running quickly and cheaply
Highly cost-effective alternative to doing it in house
Allows rapid horizontal scaling on demand

Architecture
[Diagram: the data warehouse produces a user data dump that is uploaded to S3; in the Amazon cloud, Elastic MapReduce runs the Hadoop jobs with that S3 data as input; the output is downloaded and used to update the result key store and the data warehouse.]

MapReduce Overview
Applications are modeled as a series of maps and reductions
In the map phase, values are assigned to keys
Shuffle and sort
In the reduce phase, values are combined for each key
Example – Word Count
– Counts the frequency of words
– Modeled as one Map and one Reduce
– Data as key -> values
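
As a concrete illustration (not from the original deck), here is a minimal Word Count sketch in the Hadoop Java API. It uses the newer org.apache.hadoop.mapreduce classes; the Hadoop 0.18.3 used on Elastic MapReduce at the time exposed the older org.apache.hadoop.mapred API, but the map/shuffle/reduce shape is the same.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in an input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // partial sums before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input in HDFS or S3
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}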

Model Validation with MapReduce
Complex application uses a series of 3 MapReduce jobs
Match scoring procedure for pairs of users:
– Join match data with left-side user attributes into one line (sketched below)
– Join the above with right-side user attributes, calculate the resulting match score
– Group match scores by user
Temporary files in HDFS hold results between jobs
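
The deck does not show code for these jobs; the following is only a rough sketch of how the first step, a reduce-side join of match records with left-side user attributes, might look. The input layouts, paths and class names are assumptions: match lines are taken to be "leftUserId<TAB>rest" and user-attribute lines "userId<TAB>attributes", with both inputs fed to the same job.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class JoinLeftUserSketch {

  // Map: key every record by the left-side user id and tag it with its source,
  // so the reducer can tell user-attribute records from match records.
  public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outVal = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 2);
      String path = ((FileSplit) context.getInputSplit()).getPath().toString();
      String tag = path.contains("users") ? "U" : "M";   // assumed path convention
      outKey.set(fields[0]);
      outVal.set(tag + "\t" + fields[1]);
      context.write(outKey, outVal);
    }
  }

  // Reduce: attach the user's attributes to each of their match records,
  // producing one joined line per match.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text userId, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String userAttrs = null;
      List<String> matches = new ArrayList<String>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if ("U".equals(parts[0])) {
          userAttrs = parts[1];
        } else {
          matches.add(parts[1]);
        }
      }
      if (userAttrs == null) {
        return;   // no attributes found for this user; drop its matches
      }
      for (String match : matches) {
        context.write(userId, new Text(match + "\t" + userAttrs));
      }
    }
  }
}

The second step would repeat the same join pattern keyed by the right-side user id and compute the score in its reducer; the third is a straightforward group-by-user.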

Data Flow
[Diagram: 3 MapReduce jobs – match info and users (left side) feed a Join, then a Join & Score with users (right side), then a Group by User producing the results, with temp files between jobs.]

AWS Elastic MapReduce
Only need to think in terms of an Elastic MapReduce job flow
EC2 cluster is managed for you behind the scenes
Each job flow has one or more steps
Each step is a Hadoop MapReduce process (see the sketch below)
Each step can read and write data directly from and to S3 or HDFS
Based on Hadoop 0.18.3
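
The deck drives job flows through Amazon's Ruby script (shown two slides on). Purely as an illustration of the job-flow/step model, here is a hedged sketch using the later AWS SDK for Java; the bucket names, class names and instance settings are made up, and this SDK postdates the original presentation.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class LaunchJobFlowSketch {
  public static void main(String[] args) {
    AmazonElasticMapReduceClient emr =
        new AmazonElasticMapReduceClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

    // One step per MapReduce job; each step runs a jar stored in S3.
    StepConfig join = new StepConfig()
        .withName("join")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://example-bucket/jobs/matching.jar")               // hypothetical
            .withMainClass("com.example.JoinJob")                           // hypothetical
            .withArgs("-xconf", "s3://example-bucket/conf/join-config.xml"));

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("model-validation")
        .withLogUri("s3://example-bucket/logs/")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(10)
            .withMasterInstanceType("m1.large")
            .withSlaveInstanceType("m1.large"))
        .withSteps(join /* , score, group */);

    // EMR provisions the EC2 cluster, runs each step in order, and tears it down.
    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}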

Elastic MapReduce for eHarmony
Vastly simplified our Hadoop processing
– No need to explicitly allocate, start and shut down EC2 instances
– No need to explicitly manipulate the master node
Status of a job flow and all its steps is accessible via a REST service

Simple Job Control
Cluster control and job management reduced to a single local command
Uses Amazon's EMR Ruby script
Uses jar and conf files stored on S3

elastic_mapreduce.rb --create --name #{JOB_NAME} \
  --num-instances #{NODES} --instance-type #{INST_TYPE} \
  --key_pair #{KEY} --log-uri #{LOGDIR} \
  --jar #{JAR} --main-class #{JOIN_CLASS} --arg -xconf --arg #{CONF}/join-config.xml \
  --jar #{JAR} --main-class #{SCORER_CLASS} --arg -xconf --arg #{CONF}/scorer-config.xml \
  --jar #{JAR} --main-class #{COMBINER_CLASS} --arg -xconf --arg #{CONF}/combiner-config.xml

Development & Test Environments
Cheap to set up and experiment on Amazon
Quick setup
– Number of servers is controlled by a config variable
Can test an identical setup to production
Performance testing is easy with a big cluster
Integration testing is easy with a small cluster and a subset of the input data
Separate development and test accounts recommended

Performance by Instance Type
[Chart: run time in minutes by EC2 instance type.]

Total Execution Time
[Chart]

Administration Tools
AWS Console
ElasticFox Firefox plugin for EC2
Hadoop status web pages
aws/s3 Ruby gem with irb shell
Tim Kay's aws command-line tool for S3
S3Fox Firefox plugin for S3

AWS Management Console
Useful for Elastic MapReduce
– Start or terminate a job flow
– Track execution of jobs in a job flow
Useful for vanilla EC2 as well
– Start and stop clusters and nodes
– Get machine addresses to view Hadoop status

AWS Management Console
[Screenshot: EC2 Console Dashboard]

Hadoop DFS – Monitor Disk Usage
[Screenshot]

Challenges
The overall process depends on the success of each stage
Assume every stage is unreliable
Need to build retry/abort logic to handle failures
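
The deck does not show this logic; the following is only a minimal sketch of the kind of retry/abort wrapper meant here, in plain Java. The attempt count and delay are illustrative.

import java.util.concurrent.Callable;

// Run one pipeline stage, retry it a few times with a pause between attempts,
// and abort (by rethrowing) if it still fails so the caller can stop the pipeline.
public final class RetryRunner {

  public static <T> T runWithRetry(Callable<T> stage, int maxAttempts, long delayMillis)
      throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return stage.call();
      } catch (Exception e) {
        last = e;
        System.err.println("Stage failed (attempt " + attempt + "/" + maxAttempts + "): " + e);
        if (attempt < maxAttempts) {
          Thread.sleep(delayMillis);
        }
      }
    }
    throw last;
  }
}

A stage such as "upload input to S3" or "start the job flow" would then be wrapped in a Callable and passed to runWithRetry.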

Challenges – Elastic MapReduce
Hard to debug – produces hundreds of log files in an S3 bucket
A hung node can be stopped with the AWS Console
Probably better to debug using a normal EC2 cluster

Challenges – S3 (Simple Storage Service)
S3 web service calls can time out
Extra logic required to validate that a file is correctly uploaded to and downloaded from S3
We retry once on failure
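
As a hedged sketch of the upload-then-validate idea: upload the file, read the stored object's metadata back, compare sizes, and retry once. This uses the AWS SDK for Java, which postdates the deck (the original code likely used a different S3 client), and the bucket and key names are made up.

import java.io.File;
import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class S3UploadWithCheck {

  // Upload a file and confirm the stored object has the expected size; retry once on failure.
  public static void uploadChecked(AmazonS3 s3, String bucket, String key, File file) {
    for (int attempt = 1; attempt <= 2; attempt++) {
      try {
        s3.putObject(bucket, key, file);
        ObjectMetadata meta = s3.getObjectMetadata(bucket, key);
        if (meta.getContentLength() == file.length()) {
          return;   // stored size matches the local file; treat the upload as good
        }
        System.err.println("Size mismatch for " + key + " on attempt " + attempt);
      } catch (AmazonClientException e) {
        System.err.println("S3 call failed for " + key + " on attempt " + attempt + ": " + e);
      }
    }
    throw new RuntimeException("Upload of " + key + " failed after one retry");
  }

  public static void main(String[] args) {
    AmazonS3 s3 = new AmazonS3Client();   // credentials from the default provider chain
    uploadChecked(s3, "example-bucket", "input/user-data.gz", new File(args[0]));
  }
}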

Challenges – Data Shuffling
We currently spend as much time moving data around as actually running Hadoop
Network bandwidth does not scale the way Hadoop and EC2 do
The new scaling challenge is to reduce data shuffle time and improve error recovery
Try to do your processing near the data

Future Directions: Hadoop Streaming
Great for rapid prototyping
Develop using Unix text processing tools and pipes
Can use any language – Perl, Ruby, etc.
Recommended to wrap scripts in a container
Tests are easily run outside of Hadoop
Has hastened our internal adoption of Hadoop

Future Directions: Data Analysis in the Cloud
Daily reporting: use Hadoop instead of depending on the data warehouse
Statistical analyses:
– Big aggregations, stratifications, distribution discovery
– Median/mean score per user (see the sketch after this list)
– Analyze users by location
– Preparing data for analysis in packages like R
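
For example, mean score per user maps naturally onto a single MapReduce job. The sketch below assumes tab-separated "userId<TAB>score" input lines, which is an assumption for illustration rather than the deck's actual layout.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanScorePerUser {

  // Map: parse "userId<TAB>score" and emit (userId, score).
  public static class ScoreMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
    }
  }

  // Reduce: average all scores seen for one user.
  public static class MeanReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text userId, Iterable<DoubleWritable> scores, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;
      long count = 0;
      for (DoubleWritable s : scores) {
        sum += s.get();
        count++;
      }
      context.write(userId, new DoubleWritable(sum / count));
    }
  }
}

Unlike Word Count, the reducer here cannot double as a combiner, because a mean of partial means is not the overall mean unless the counts are carried along as well.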

Data Analysis with Hive
Language very similar to SQL
Once set up by devs, analysts can quickly become proficient
Errors are rare, usually from bad input data
Flexible enough to handle complex tasks
– Loading data into key/value maps
– User-defined functions usually not required
Hive community is very active and supportive
Running on EC2 using Amazon-supported Hive on Elastic MapReduce
Hive can read and write data in S3 buckets

Data Analysis with Pig
Apache Hadoop subproject
High-level language on top of Hadoop
Procedural language for describing data flow and filtering
Extremely flexible
Faster to write than Java, but slower to run
Hard to debug

Lessons Learned
EC2/S3/EMR are cost effective
Easy to write unit tests for MapReduce
Hadoop community support is great
Easier to control the process using Ruby than Bash
Dev tools are really easy to work with and just work right out of the box
Ensuring end-to-end reliability poses the biggest challenge

Any questions?
Ask away