Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoop" by Shail Aditya

Date post: 26-Jun-2015
Upload: yahoo-developer-network
Transcript
Page 1: Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoop" by Shail Aditya

Yahoo!

Online Content Optimization using Hadoop

Shail Aditya [email protected]

Hadoop Summit 2011

Page 2

What do we do?

“Deliver the right CONTENT to the right USER at the right TIME”

o Effectively and “pro-actively” learn from user interactions with the content that is displayed, to maximize our objectives

A new scientific discipline at the interface of
o Large-scale Machine Learning and Statistics
o Multi-objective optimization in the presence of uncertainty
o User understanding
o Content understanding

Page 3

Content Relevance at Yahoo!

Important

Editors

Popular

Personal / Social

Editorial: 10s of Items

Science: Millions of Items

Page 4

Content Ranking Problems

Most Popular: Most engaging overall based on objective metrics

Related Items and Context-Sensitive Models
o Behavioral Affinity: People who did X, did Y
o Most engaging in this page/section/property/device/referral context?

Deep Personalization: Most relevant to me based on my deep interests (entities, sources, categories, keywords)

Real-time Dashboard

Voice and Business Rules

Revenue Optimization

Light Personalization: More relevant to me based on my age, gender, location, and property usage

Most Popular + Per User History: Rotate stories I’ve already seen

Layout Optimization: Which modules/ad units should be shown to this user in this context?

Page 5

Yahoo Frontpage

Today Module (Light Personalization)

Personal Assistant (Light Personalization)

Trending Now (Most Popular)

National News (Most Popular + User History bucket)

Deals (Most Popular)

Page 6

Recommendation: A Match-making Problem

Opportunity: Users, queries, pages, …

Item Inventory: Articles, web page, ads, …

Use an automated algorithm to select item(s) to show

Get feedback (click, time spent, …)

Refine the models

Repeat (large number of times)

Measure metric(s) of interest (Total clicks, Total revenue, …)

• Recommendation problems
• Search: Web, Vertical
• Online advertising
• …

Page 7

Problem Characteristics: Today Module

Traffic obtained from a controlled randomized experiment

Things to note:
a) Short lifetimes
b) Temporal effects
c) Often breaking news stories

Page 8

Scale: Why use Hadoop?

• Million events per second (user view/click, content update)

• Hundreds of GB of data collected and modeled per run

• Millions of items in pool

• Millions of user profiles

• Tens of thousands of Features (Content and/or User)

Page 9

Data Flow

Optimization Engine

Content feed with biz rules

Explore: ~1%

Exploit: ~99%

Near Real-time Feedback

Content Metadata

Dashboard Optimized Module

Real-time Insights

Rules Engine
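
The explore/exploit split on this slide can be illustrated with a small epsilon-greedy sketch. This is only an assumption about how a ~1% / ~99% split might be realized; the slides do not say which algorithm the Optimization Engine uses. The names `choose_item`, `update_stats`, and `explore_rate` are hypothetical.

```python
import random


def choose_item(ctr_estimates, explore_rate=0.01, rng=random):
    """Pick an item id from a dict of {item_id: estimated click-through rate}.

    With probability explore_rate (assumed ~1%, as on the slide) a uniformly
    random item is served so that new items collect feedback; otherwise the
    item with the highest current estimate is served.
    """
    if rng.random() < explore_rate:
        return rng.choice(sorted(ctr_estimates))
    return max(ctr_estimates, key=ctr_estimates.get)


def update_stats(stats, item_id, clicked):
    """Fold one feedback event into per-item (views, clicks) counters."""
    views, clicks = stats.get(item_id, (0, 0))
    stats[item_id] = (views + 1, clicks + (1 if clicked else 0))
    return stats[item_id]
```

In a loop, the serving side would call `choose_item` per request and the feedback side would call `update_stats` per view or click event, with the estimates refreshed from the counters on each modeling run.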

Page 10

How it happens?

At time ‘t’, User ‘u’ (user attr: age, gen, loc) interacted with Content ‘id’ at
o Position ‘o’
o Property/Site ‘p’
o Section ‘s’
o Module ‘m’
o International ‘i’

User Events

Item Metadata

Modeling

ITEM Model

USER Model

Content ‘id’ has associated metadata ‘meta’, where meta = {entity, keyword, geo, topic, category}

Feature Generation

Additional Content & User Feature Generation

Item   BASE   M      F      ATTR   CAT_Sports
id1    0.8    +1.2   -1.5   -0.9   1.0
id2    -0.9   -0.9   +2.6   +0.3   1.0

Item BASE M F ATTR CAT_Sports

u1 0.8 1 1 0.2

u2 -0.9 1 -1.2
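
One way to read the item and user tables on this slide is as two sides of a linear score: each item row holds per-feature weights, each user row holds feature values, and ranking sorts items by their dot product. The sketch below assumes that reading; the slides do not spell out the scoring function. The item weights are taken from the item table, while the user vector and the names `score` and `rank` are hypothetical.

```python
def score(item_weights, user_features):
    """Dot product of an item's feature weights with a user's feature values.

    Features missing from the user vector contribute nothing.
    """
    return sum(w * user_features.get(f, 0.0) for f, w in item_weights.items())


def rank(item_model, user_features):
    """Return item ids ordered from highest to lowest score for this user."""
    return sorted(
        item_model,
        key=lambda item_id: score(item_model[item_id], user_features),
        reverse=True,
    )


# Item weights as shown in the item table on the slide.
ITEM_MODEL = {
    "id1": {"BASE": 0.8, "M": 1.2, "F": -1.5, "ATTR": -0.9, "CAT_Sports": 1.0},
    "id2": {"BASE": -0.9, "M": -0.9, "F": 2.6, "ATTR": 0.3, "CAT_Sports": 1.0},
}

# Hypothetical user vector (not from the slide): a male user with a mild
# interest in sports.
EXAMPLE_USER = {"BASE": 1.0, "M": 1.0, "CAT_Sports": 0.2}

ranked = rank(ITEM_MODEL, EXAMPLE_USER)
```

Keeping both sides as sparse feature maps matches the point-lookup pattern the talk describes: a request needs only one user row and the candidate item rows.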

STORE: PNUTS

5 min latency

Ranking

B-Rules

Request

5 – 30 min latency

SLA: 50 ms – 200 ms

STORE: HBASE

Page 11

Technology Stack

Analytics and Debugging

Ingest

Page 12

Modeling Framework

Global state provided by HBase

Hadoop processing via a collection of PIG UDFs

Different flows for modeling or stages assembled in PIG

o OLR, Clustering, Affinity, Regression Models, Decompositions (Cholesky…)

o Timeseries models (generally trends – extract of user activity on content)

Configuration-based behavior for various stages of modeling

o Type of Features to be generated

o Type of joins to perform – User / Item / Feature

Input : DFS and/or HBase

Output: DFS and/or HBase
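
As an illustration of the kind of model listed on this slide, the sketch below shows a single-example update for logistic regression trained online. It assumes that "OLR" stands for online logistic regression, which the slides do not confirm, and it is plain Python rather than a PIG UDF. The names `predict`, `update`, and `learning_rate` are hypothetical.

```python
import math


def predict(weights, features):
    """Estimated click probability for a sparse feature dict."""
    z = sum(weights.get(f, 0.0) * v for f, v in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp in a safe range
    return 1.0 / (1.0 + math.exp(-z))


def update(weights, features, clicked, learning_rate=0.1):
    """Apply one stochastic gradient step for a single view/click event.

    weights is modified in place and returned; features not seen before
    start at weight 0.
    """
    error = (1.0 if clicked else 0.0) - predict(weights, features)
    for f, v in features.items():
        weights[f] = weights.get(f, 0.0) + learning_rate * error * v
    return weights
```

Because each event touches only the weights of the features it contains, an update of this shape fits the incremental, every-few-minutes modeling runs described in the talk.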

Page 13

HBase

ITEM Model
• Stores item-related features
• Stores ITEM x USER FEATURES model
• Stores parameters about the item, such as view count, click count, unique user count
• 10s of Millions of Items
• Updated every 5 minutes

USER Model
• Stores USER x CONTENT FEATURES model for each individual user by either a Unique ID
• Stores summarized user history – essential for modeling in terms of item decay
• Millions of profiles
• Updated every 5 to 30 minutes

TERM Model
• Inverts the Item Table and stores statistics for the terms
• Used to find the trending features and provide baselines for user features
• Millions of terms and hundreds of parameters tracked
• Updated every 5 minutes
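
The TERM Model's "inverts the Item Table" step can be sketched as building an inverted index from items to terms and aggregating per-term statistics. This is an in-memory illustration under assumed row shapes, not the actual HBase schema; the field names `terms`, `views`, and `clicks` and the function `invert_item_table` are hypothetical.

```python
from collections import defaultdict


def invert_item_table(item_table):
    """Aggregate per-item counters into per-term statistics.

    item_table maps item_id -> {"terms": [...], "views": int, "clicks": int}.
    The result maps term -> {"items", "views", "clicks"} totals over every
    item that carries that term.
    """
    terms = defaultdict(lambda: {"items": 0, "views": 0, "clicks": 0})
    for row in item_table.values():
        for term in set(row["terms"]):  # count each term once per item
            stats = terms[term]
            stats["items"] += 1
            stats["views"] += row["views"]
            stats["clicks"] += row["clicks"]
    return dict(terms)


example_items = {
    "id1": {"terms": ["sports", "cricket"], "views": 100, "clicks": 10},
    "id2": {"terms": ["sports"], "views": 50, "clicks": 1},
}
term_stats = invert_item_table(example_items)
```

Per-term click and view totals of this kind are what a trending-feature query or a baseline for user features would read from.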

Page 14

Grid Edge Services

Keeps MR jobs lean and mean

Provides the ability to deploy and control non-gridifiable solutions easily

Have different scaling characteristics (e.g. memory, CPU)

Provide a gateway for accessing external data sources in M/R

Map and/or Reduce steps interact with Edge Services using a standard client

Examples

Categorization

Geo Tagging

Feature Transformation

Page 15

Analytics and Debugging

Provides the ability to debug modeling issues in near real time

Run complex queries for analysis

Easy-to-use interface

PMs, engineers, and researchers use this cluster to get near-real-time insights

10s of modeling, monitoring, and reporting queries every 5 minutes

We use HIVE

Page 16

Learnings

PIG & HBase have been the best combination so far

Made it simple to build different kinds of science models

Point lookup using HBase has proven to be very useful

Modeling = Matrices

HBase provides a natural way to represent and access them

Edge Services

Have provided simplicity to the whole stack

Management (upgrades, outages) has been easy

HIVE has provided us a great way to analyze the results

PIG was also considered

Page 17

Thank you

Shail Aditya [email protected]

“Deliver the right CONTENT to the right USER at the right TIME”
