Matchmaking in the Cloud: Amazon Web Services and Apache Hadoop at eHarmony
Ben Hardy, Senior Software Engineer

Why You're Here
You'll learn how eHarmony:
– Used EC2 and Hadoop to develop a scalable solution for our large, real-world data problem
– Overcame the limitations of our existing infrastructure
– Reaped significant cost savings with this choice
Also find out about new opportunities and challenges

About eHarmony
Online subscription-based matchmaking service
Launched in 2000
Available in the United States, Canada, Australia and the United Kingdom
On average, 236 members in the US marry every day*
More than 20 million registered users
* Based on a survey conducted by Harris Interactive in 2007.

The Science of Matching
We match couples using detailed compatibility models
Models are based on decades of research and clinical experience in psychology
Variety of user attributes
– Demographic
– Psychographic
– Behavioral
New models constantly being tested and developed
Model evaluation is the gorilla in the room

Computational Requirements
Tens of GB of matches, scores and constantly changing user features are archived daily
TBs of data currently archived and growing
Want to support 10x our current user base
All possible matches = O(n²) problem
Support a growing set of models that may be
– arbitrarily complex
– computationally and I/O expensive

Scaling Challenges
Current architecture is multi-tiered with a relational back-end
Scoring is DB join intensive
Data needs constant archiving
– Matches, match scores, user attributes at time of match creation
– Model validation is done at a later time across many days
Need a non-DB solution better suited to big data crunching

Hadoop Addresses Scaling Needs
Good fit for our problem
– Need to process entire match pool (n²)
– Data easily partitioned
Hadoop provides
– Horizontally scalable parallel processing
– Work distribution
– Distributed storage
– Fault tolerance
– Job monitoring
Hadoop is an Apache project

Computing on AWS
Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand
Elastic MapReduce
– Hosted Hadoop framework on top of EC2 and S3
– Simplifies end-to-end processing in the cloud
– Pricing is in addition to EC2
Simple Storage Service (S3)
– Provides cheap, unlimited storage
– Highly configurable security using ACLs

AWS Pricing Model
Pay-per-use elastic model
Choice of server type
Lets you get up and running quickly and cheaply
Highly cost-effective alternative to doing it in house
Allows rapid horizontal scaling on demand

Architecture
[Diagram: the data warehouse produces a user data dump that is uploaded to S3; in the Amazon cloud, Elastic MapReduce runs the Hadoop jobs with that S3 data as input; the output is downloaded and used to update the result key store and the data warehouse.]

MapReduce Overview
Applications are modeled as a series of maps and reductions
In the map phase, values are assigned to keys
Shuffle and sort
In the reduce phase, values are combined for each key
Example – Word Count
– Counts the frequency of words
– Modeled as one Map and one Reduce
– Data as key -> values
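
As a concrete illustration (not from the original deck), here is a minimal Word Count sketch in the Hadoop Java API. It uses the newer org.apache.hadoop.mapreduce classes; the Hadoop 0.18.3 used on Elastic MapReduce at the time exposed the older org.apache.hadoop.mapred API, but the map/shuffle/reduce shape is the same.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in an input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // partial sums before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input in HDFS or S3
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}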

Model Validation with MapReduce
Complex application uses a series of 3 MapReduce jobs
Match scoring procedure for pairs of users:
– Join match data with left-side user attributes into one line (sketched below)
– Join the above with right-side user attributes, calculate the resulting match score
– Group match scores by user
Temporary files in HDFS hold results between jobs
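
The deck does not show code for these jobs; the following is only a rough sketch of how the first step, a reduce-side join of match records with left-side user attributes, might look. The input layouts, paths and class names are assumptions: match lines are taken to be "leftUserId<TAB>rest" and user-attribute lines "userId<TAB>attributes", with both inputs fed to the same job.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class JoinLeftUserSketch {

  // Map: key every record by the left-side user id and tag it with its source,
  // so the reducer can tell user-attribute records from match records.
  public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outVal = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 2);
      String path = ((FileSplit) context.getInputSplit()).getPath().toString();
      String tag = path.contains("users") ? "U" : "M";   // assumed path convention
      outKey.set(fields[0]);
      outVal.set(tag + "\t" + fields[1]);
      context.write(outKey, outVal);
    }
  }

  // Reduce: attach the user's attributes to each of their match records,
  // producing one joined line per match.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text userId, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String userAttrs = null;
      List<String> matches = new ArrayList<String>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if ("U".equals(parts[0])) {
          userAttrs = parts[1];
        } else {
          matches.add(parts[1]);
        }
      }
      if (userAttrs == null) {
        return;   // no attributes found for this user; drop its matches
      }
      for (String match : matches) {
        context.write(userId, new Text(match + "\t" + userAttrs));
      }
    }
  }
}

The second step would repeat the same join pattern keyed by the right-side user id and compute the score in its reducer; the third is a straightforward group-by-user.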

Data Flow
[Diagram: 3 MapReduce jobs – match info and users (left side) feed a Join, then a Join & Score with users (right side), then a Group by User producing the results, with temp files between jobs.]

AWS Elastic MapReduce
Only need to think in terms of an Elastic MapReduce job flow
EC2 cluster is managed for you behind the scenes
Each job flow has one or more steps
Each step is a Hadoop MapReduce process (see the sketch below)
Each step can read and write data directly from and to S3 or HDFS
Based on Hadoop 0.18.3
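
The deck drives job flows through Amazon's Ruby script (shown two slides on). Purely as an illustration of the job-flow/step model, here is a hedged sketch using the later AWS SDK for Java; the bucket names, class names and instance settings are made up, and this SDK postdates the original presentation.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class LaunchJobFlowSketch {
  public static void main(String[] args) {
    AmazonElasticMapReduceClient emr =
        new AmazonElasticMapReduceClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

    // One step per MapReduce job; each step runs a jar stored in S3.
    StepConfig join = new StepConfig()
        .withName("join")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://example-bucket/jobs/matching.jar")               // hypothetical
            .withMainClass("com.example.JoinJob")                           // hypothetical
            .withArgs("-xconf", "s3://example-bucket/conf/join-config.xml"));

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("model-validation")
        .withLogUri("s3://example-bucket/logs/")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(10)
            .withMasterInstanceType("m1.large")
            .withSlaveInstanceType("m1.large"))
        .withSteps(join /* , score, group */);

    // EMR provisions the EC2 cluster, runs each step in order, and tears it down.
    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}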

Elastic MapReduce for eHarmony
Vastly simplified our Hadoop processing
– No need to explicitly allocate, start and shut down EC2 instances
– No need to explicitly manipulate the master node
Status of a job flow and all its steps is accessible via a REST service

Simple Job Control
Cluster control and job management reduced to a single local command
Uses Amazon's EMR Ruby script
Uses jar and conf files stored on S3

elastic_mapreduce.rb --create --name #{JOB_NAME} \
  --num-instances #{NODES} --instance-type #{INST_TYPE} \
  --key_pair #{KEY} --log-uri #{LOGDIR} \
  --jar #{JAR} --main-class #{JOIN_CLASS} --arg -xconf --arg #{CONF}/join-config.xml \
  --jar #{JAR} --main-class #{SCORER_CLASS} --arg -xconf --arg #{CONF}/scorer-config.xml \
  --jar #{JAR} --main-class #{COMBINER_CLASS} --arg -xconf --arg #{CONF}/combiner-config.xml

Development & Test Environments
Cheap to set up and experiment on Amazon
Quick setup
– Number of servers is controlled by a config variable
Can test an identical setup to production
Performance testing is easy with a big cluster
Integration testing is easy with a small cluster and a subset of the input data
Separate development and test accounts recommended

Performance by Instance Type
[Chart: run time in minutes by EC2 instance type.]

Total Execution Time
[Chart]

Administration Tools
AWS Console
ElasticFox Firefox plugin for EC2
Hadoop status web pages
aws/s3 Ruby gem with irb shell
Tim Kay's aws command-line tool for S3
S3Fox Firefox plugin for S3

AWS Management Console
Useful for Elastic MapReduce
– Start or terminate a job flow
– Track execution of jobs in a job flow
Useful for vanilla EC2 as well
– Start and stop clusters and nodes
– Get machine addresses to view Hadoop status

AWS Management Console
[Screenshot: EC2 Console Dashboard]

Hadoop DFS – Monitor Disk Usage
[Screenshot]

Challenges
The overall process depends on the success of each stage
Assume every stage is unreliable
Need to build retry/abort logic to handle failures
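
The deck does not show this logic; the following is only a minimal sketch of the kind of retry/abort wrapper meant here, in plain Java. The attempt count and delay are illustrative.

import java.util.concurrent.Callable;

// Run one pipeline stage, retry it a few times with a pause between attempts,
// and abort (by rethrowing) if it still fails so the caller can stop the pipeline.
public final class RetryRunner {

  public static <T> T runWithRetry(Callable<T> stage, int maxAttempts, long delayMillis)
      throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return stage.call();
      } catch (Exception e) {
        last = e;
        System.err.println("Stage failed (attempt " + attempt + "/" + maxAttempts + "): " + e);
        if (attempt < maxAttempts) {
          Thread.sleep(delayMillis);
        }
      }
    }
    throw last;
  }
}

A stage such as "upload input to S3" or "start the job flow" would then be wrapped in a Callable and passed to runWithRetry.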

Challenges – Elastic MapReduce
Hard to debug – produces hundreds of log files in an S3 bucket
A hung node can be stopped with the AWS Console
Probably better to debug using a normal EC2 cluster

Challenges – S3 (Simple Storage Service)
S3 web service calls can time out
Extra logic required to validate that a file is correctly uploaded to and downloaded from S3
We retry once on failure
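
As a hedged sketch of the upload-then-validate idea: upload the file, read the stored object's metadata back, compare sizes, and retry once. This uses the AWS SDK for Java, which postdates the deck (the original code likely used a different S3 client), and the bucket and key names are made up.

import java.io.File;
import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class S3UploadWithCheck {

  // Upload a file and confirm the stored object has the expected size; retry once on failure.
  public static void uploadChecked(AmazonS3 s3, String bucket, String key, File file) {
    for (int attempt = 1; attempt <= 2; attempt++) {
      try {
        s3.putObject(bucket, key, file);
        ObjectMetadata meta = s3.getObjectMetadata(bucket, key);
        if (meta.getContentLength() == file.length()) {
          return;   // stored size matches the local file; treat the upload as good
        }
        System.err.println("Size mismatch for " + key + " on attempt " + attempt);
      } catch (AmazonClientException e) {
        System.err.println("S3 call failed for " + key + " on attempt " + attempt + ": " + e);
      }
    }
    throw new RuntimeException("Upload of " + key + " failed after one retry");
  }

  public static void main(String[] args) {
    AmazonS3 s3 = new AmazonS3Client();   // credentials from the default provider chain
    uploadChecked(s3, "example-bucket", "input/user-data.gz", new File(args[0]));
  }
}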

Challenges – Data Shuffling
We currently spend as much time moving data around as actually running Hadoop
Network bandwidth does not scale the way Hadoop and EC2 do
The new scaling challenge is to reduce data shuffle time and improve error recovery
Try to do your processing near the data

Future Directions: Hadoop Streaming
Great for rapid prototyping
Develop using Unix text processing tools and pipes
Can use any language – Perl, Ruby, etc.
Recommended to wrap scripts in a container
Tests are easily run outside of Hadoop
Has hastened our internal adoption of Hadoop

Future Directions: Data Analysis in the Cloud
Daily reporting: use Hadoop instead of depending on the data warehouse
Statistical analyses:
– Big aggregations, stratifications, distribution discovery
– Median/mean score per user (see the sketch after this list)
– Analyze users by location
– Preparing data for analysis in packages like R
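
For example, mean score per user maps naturally onto a single MapReduce job. The sketch below assumes tab-separated "userId<TAB>score" input lines, which is an assumption for illustration rather than the deck's actual layout.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanScorePerUser {

  // Map: parse "userId<TAB>score" and emit (userId, score).
  public static class ScoreMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
    }
  }

  // Reduce: average all scores seen for one user.
  public static class MeanReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text userId, Iterable<DoubleWritable> scores, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;
      long count = 0;
      for (DoubleWritable s : scores) {
        sum += s.get();
        count++;
      }
      context.write(userId, new DoubleWritable(sum / count));
    }
  }
}

Unlike Word Count, the reducer here cannot double as a combiner, because a mean of partial means is not the overall mean unless the counts are carried along as well.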

Data Analysis with Hive
Language very similar to SQL
Once set up by devs, analysts can quickly become proficient
Errors are rare, usually from bad input data
Flexible enough to handle complex tasks
– Loading data into key/value maps
– User-defined functions usually not required
Hive community is very active and supportive
Running on EC2 using Amazon-supported Hive on Elastic MapReduce
Hive can read and write data in S3 buckets

Data Analysis with Pig
Apache Hadoop subproject
High-level language on top of Hadoop
Procedural language for describing data flow and filtering
Extremely flexible
Faster to write than Java, but slower to run
Hard to debug

Lessons Learned
EC2/S3/EMR are cost effective
Easy to write unit tests for MapReduce
Hadoop community support is great
Easier to control the process using Ruby than Bash
Dev tools are really easy to work with and just work right out of the box
Ensuring end-to-end reliability poses the biggest challenge

Any questions?
Ask away