Using a Hadoop Data Pipeline to Build a Graph of Users and Content Hadoop Summit - June 29, 2011
Bill Graham
About me
• Principal Software Engineer
• Technology, Business & News BU (TBN)
• TBN Platform Infrastructure Team
• Background in software systems engineering and integration architecture
• Contributor: Pig, Hive, HBase
• Committer: Chukwa
About CBSi – who are we?
Brands: Games & Movies • Tech, Biz & News • Sports • Entertainment • Music
About CBSi - scale
• Top 10 global web property
• 235M worldwide monthly uniques1
• Hadoop ecosystem: CDH3, Pig, Hive, HBase, Chukwa, Oozie, Sqoop, Cascading
• Cluster size:
  – Current workers: 35 DW + 6 TBN (150 TB)
  – Next quarter: 100 nodes (500 TB)
• DW peak processing: 400M events/day globally
1 - Source: comScore, March 2011
Abstract
At CBSi we’re developing a scalable, flexible platform to aggregate large volumes of data, mine it for meaningful relationships, and produce a graph of connected users and content. This will enable us to better understand the connections between our users, our assets, and our authors.
The Problem
• Users are always voting on what they find interesting:
  – Got-it, want-it, like, share, follow, comment, rate, review, helpful vote, etc.
• Users have multiple identities:
  – Anonymous
  – Registered (logged in)
  – Social
  – Multiple devices
• Connections between entities live in siloed sub-graphs
• A wealth of valuable user connectedness goes unrealized
The Goal
• Create a back-end platform that enables us to assemble a holistic graph of our users and their connections to:
  – Content
  – Authors
  – Each other
  – Themselves
• Better understand how our users connect to our content
• Improved content recommendations
• Improved user segmentation and content/ad targeting
Requirements
• Integrate with existing DW/BI Hadoop Infrastructure
• Aggregate data from across CBSi and beyond
• Connect disjointed user identities
• Flexible data model
• Assemble graph of relationships
• Enable rapid experimentation, data mining and hypothesis testing
• Power new site features and advertising optimizations
The Approach
• Mirror data into HBase
• Use MapReduce to process data
• Export RDF data into a triple store
Data Flow

[Diagram: CMS publishing/CMS systems, Social/UGC systems, Content Tagging systems, and DW systems bulk-load into HDFS; the Site Activity Stream, a.k.a. Firehose (JMS), performs atomic writes into HBase; MapReduce (Pig, ImportTsv) transforms & loads HDFS data into HBase; RDF is exported from HBase into the Triple Store, which the site queries via SPARQL.]
NOSQL Data Models

[Diagram: key-value stores, column-family stores, document databases, and graph databases, plotted by data size vs. data complexity. Credit: Emil Eifrem, Neo Technology]
Conceptual Graph

[Diagram: an anonId node "is also" a regId node; users "follow" authors and "like" Asset nodes (Activity firehose, real-time); Asset nodes are also Products, Authors, and Brands ("is also" edges); a Story is "authored by" an Author (CMS, batch + incremental); assets are "tagged with" tags (Tags, batch); a user "had session" a SessionId, which "contains" PageEvents (DW, daily).]
HBase Schema

user_info table (row key: user id; column families ALIAS: and EVENT:)

Row ANON-<id1>:
  ALIAS:URS-<id1>   = <ts>
  EVENT:LIKE-<ts>   = <json>
  EVENT:SHARE-<ts>  = <json>
Row URS-<id1>:
  ALIAS:ANON-<id1>  = <ts>
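The ALIAS: family is what ties identities together. Below is a minimal Pig sketch, not from the original deck, of reading it back to recover anonymous-to-registered pairs; it assumes the same mapToBag UDF used in the RDF script later in the deck:

-- load row keys plus all ALIAS: columns as a map
RAW = LOAD 'hbase://user_info'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('alias:*', '-loadKey true')
      AS (id:bytearray, alias_map:map[]);

-- each ALIAS: qualifier is an aliased user id; the cell value is a timestamp
PAIRS = FOREACH RAW GENERATE id, FLATTEN(mapToBag(alias_map)) AS (alias_id, ts);

-- keep only the ANON -> URS direction ("is also" edges)
IS_ALSO = FILTER PAIRS BY ((chararray) id MATCHES 'ANON-.*')
                     AND ((chararray) alias_id MATCHES 'URS-.*');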
HBase Loading
• Incremental
  – Consuming from a JMS queue == real-time
• Batch
  – Pig’s HBaseStorage == quick to develop & iterate
  – HBase’s ImportTsv == more efficient
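For the Pig batch path, a minimal sketch, assuming an identity-match TSV on HDFS and a Pig version (0.9+) whose TOMAP builtin and HBaseStorage support storing a map into a whole column family ('alias:*'); paths and field names are illustrative:

-- (anon_id, urs_id, ts) matches produced by an upstream identity-resolution job
MATCHES = LOAD '/dw/identity/matches.tsv' USING PigStorage('\t')
          AS (anon_id:chararray, urs_id:chararray, ts:chararray);

-- first field becomes the HBase row key; the map's keys become ALIAS: qualifiers
ROWS = FOREACH MATCHES GENERATE
       CONCAT('ANON-', anon_id) AS row_key,
       TOMAP(CONCAT('URS-', urs_id), ts) AS alias_map;

STORE ROWS INTO 'hbase://user_info'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('alias:*');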
Generating RDF with Pig
• RDF¹ is a W3C standard for representing subject-predicate-object relationships (commonly serialized as XML)
• Philosophy: store large amounts of data in Hadoop; be selective about what goes into the triple store. For example:
  – “first class” graph citizens we plan to query on
  – Implicit-to-explicit (i.e., derived) connections:
    − Content recommendations
    − User segments
    − Related users
    − Content tags
• Easily join data to create new triples with Pig
• Run SPARQL² queries, examine, refine, reload

1 - http://www.w3.org/RDF, 2 - http://www.w3.org/TR/rdf-sparql-query
Example Pig RDF Script
Create RDF triples of users to social events:
RAW = LOAD 'hbase://user_info'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:*', '-loadKey true')
      AS (id:bytearray, event_map:map[]);

-- Convert our maps to bags so we can flatten them out
A = FOREACH RAW GENERATE id, FLATTEN(mapToBag(event_map)) AS (social_k, social_v);

-- Convert the JSON events into maps
B = FOREACH A GENERATE id, social_k, jsonToMap(social_v) AS social_map:map[];

-- Pull values from map
C = FOREACH B GENERATE id,
    social_map#'levt.asid'   AS asid,
    social_map#'levt.xastid' AS astid,
    social_map#'levt.event'  AS event,
    social_map#'levt.eventt' AS eventt,
    social_map#'levt.ssite'  AS ssite,
    social_map#'levt.ts'     AS eventtimestamp;

EVENT_TRIPLE = FOREACH C GENERATE GenerateRDFTriple(
    'USER-EVENT', id, astid, asid, event, eventt, ssite, eventtimestamp);

STORE EVENT_TRIPLE INTO 'trident/rdf/out/user_event' USING PigStorage();
Example SPARQL query
Recommend content based on Facebook “liked” items:
SELECT ?asset1 ?tagname ?asset2 ?title2 ?pubdt2 WHERE {
  # anon-user who Like'd a content asset (news item, blog post) on Facebook
  <urn:com.cbs.dwh:ANON-Cg8JIU14kobSAAAAWyQ> <urn:com.cbs.trident:event:LIKE> ?x .
  ?x <urn:com.cbs.trident:eventt> "SOCIAL_SITE" .
  ?x <urn:com.cbs.trident:ssite> "www.facebook.com" .
  ?x <urn:com.cbs.trident:tasset> ?asset1 .
  ?asset1 a <urn:com.cbs.rb.contentdb:content_asset> .

  # a tag associated with the content asset
  ?asset1 <urn:com.cbs.cnb.bttrax:tag> ?tag1 .
  ?tag1 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .

  # other content assets with the same tag and their title
  ?asset2 <urn:com.cbs.cnb.bttrax:tag> ?tag2 .
  FILTER (?asset2 != ?asset1)
  ?tag2 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .
  ?asset2 <http://www.w3.org/2005/Atom#title> ?title2 .
  ?asset2 <http://www.w3.org/2005/Atom#published> ?pubdt2 .
  FILTER (?pubdt2 >= "2011-01-01T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
}
ORDER BY DESC(?pubdt2) LIMIT 10
Conclusions I - Power and Flexibility
• Architecture is flexible with respect to:
  – Data modeling
  – Integration patterns
  – Data processing and querying techniques
• Multiple approaches for graph traversal (see the sketch below):
  – SPARQL
  – Traverse HBase
  – MapReduce
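As one illustration of the MapReduce option, a hedged Pig sketch over the generated triples: the user_event path matches the STORE above, while the asset_tag path and the simplified direct user-LIKE-asset edges are assumptions for illustration.

-- (subject, predicate, object) triples as written by PigStorage above
EVENTS = LOAD 'trident/rdf/out/user_event' USING PigStorage()
         AS (s:chararray, p:chararray, o:chararray);
TAGS   = LOAD 'trident/rdf/out/asset_tag' USING PigStorage()
         AS (s:chararray, p:chararray, o:chararray);

-- user -> asset "like" edges (simplified to direct edges for this sketch)
LIKES = FILTER EVENTS BY p == 'urn:com.cbs.trident:event:LIKE';

-- one-hop traversal: user -> liked asset -> tag, as a reduce-side join
HOP = JOIN LIKES BY o, TAGS BY s;
USER_TAGS = FOREACH HOP GENERATE LIKES::s AS user, TAGS::o AS tag;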
Conclusions II – Match Tool with the Job
• Hadoop - scale and computing horsepower
• HBase – atomic r/w access, speed, flexibility
• RDF Triple Store – complex graph querying
• Pig – rapid MR prototyping and ad-hoc analysis
• Future:
  – HCatalog – schema & table management
  – Oozie or Azkaban – workflow engine
  – Mahout – machine learning
  – Hama – graph processing
Conclusions III – OSS, woot!
If it doesn’t do what you want, submit a patch.
Questions?