Post on 05-Dec-2014
Data Management Platform on Hadoop
Srikanth Sundarrajan, Venkatesh Seetharam
(Incubating)
whoami
Srikanth Sundarrajan
Principal Architect, InMobi
Apache Hadoop Contributor
Hadoop Team @ Yahoo!
Venkatesh Seetharam
Architect/Developer, Hortonworks
Apache Hadoop Contributor
Data Management @ Yahoo!
Agenda
1 Motivation
2 Falcon Overview
3 Case Studies
4 Questions & Answers
MOTIVATION
Data Processing Landscape
External data source
Acquire (Import)
Data Processing (Transform/Pipeline)
Eviction Archive
Replicate(Copy)
Export
Core Services
Process
• Late data management
• Relays
Data management
• Acquisition
• Replication
• Retention
Operability
• SLA
• Lineage
Process Management – Relays
Picture courtesy: http://istockphoto.com/
Late Data Management
Picture courtesy: http://iwebask.com
Data Retention As Service
Picture courtesy: http://vimeo.com/
Data Replication As Service
Picture courtesy: http://boylesmedia.com
Data Acquisition As Service
Picture courtesy: http://wmpu.org
Operability – Dashboard
Picture courtesy: http://www.opentrack.ch/
FALCON OVERVIEW
Holistic Declaration of Intent
Picture courtesy: http://bigboxdetox.com
Entity Dependency Graph
[Diagram: Hadoop / HBase … cluster, external data source, feed, and process entities, linked by "depends" edges]
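The dependency graph above can be illustrated with a small sketch. This is not Falcon's implementation. It is a minimal Python model that assumes feeds depend on clusters and processes depend on both feeds and clusters. The class name `EntityGraph` and the entity names are hypothetical.

```python
from collections import defaultdict


class EntityGraph:
    """Tracks 'depends' edges between cluster, feed, and process entities."""

    def __init__(self):
        # entity -> entities it depends on
        self._depends_on = defaultdict(set)
        # entity -> entities that depend on it
        self._dependents = defaultdict(set)

    def add_dependency(self, entity, depends_on):
        """Record that `entity` depends on `depends_on`."""
        self._depends_on[entity].add(depends_on)
        self._dependents[depends_on].add(entity)

    def dependents(self, entity):
        """Return the entities that directly depend on `entity`, sorted by name."""
        return sorted(self._dependents.get(entity, set()))

    def can_remove(self, entity):
        """An entity is safe to remove only if nothing depends on it."""
        return not self._dependents.get(entity)


# Hypothetical entities, for illustration only.
graph = EntityGraph()
graph.add_dependency("feed:clicks", "cluster:primary")
graph.add_dependency("process:summarizer", "cluster:primary")
graph.add_dependency("process:summarizer", "feed:clicks")

print(graph.dependents("cluster:primary"))
print(graph.can_remove("feed:clicks"))
```

Keeping the reverse index (`_dependents`) alongside the forward edges makes the "who would break if this entity were removed" question a single lookup.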
High Level Architecture
[Diagram components: Apache Falcon (with config store), Oozie, Messaging, HCatalog, Hadoop. Edge labels: Entity, Entity status, Process status / notification, CLI/REST, JMS]
Feed Schedule
[Diagram labels: Cluster xml, Feed xml, Falcon, Falcon config store / Graph, Retention / Replication workflow, Oozie Scheduler, HDFS, JMS notification per action, Catalog service, Instance Management]
Process Schedule
[Diagram labels: Cluster/feed xml, Process xml, Falcon, Falcon config store / Graph, Process workflow, Oozie Scheduler, HDFS, JMS notification per available feed, Catalog service, Instance Management]
Physical Architecture
[Diagram: Falcon Colo 1, Falcon Colo 2, and Falcon Colo 3, each with a Scheduler; Falcon Prism provides the global view]
CASE STUDY Multi Cluster Failover
Multi Cluster – Failover
> Falcon manages workflow, replication or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.
[Diagram: Primary Hadoop Cluster with Staged Data, Cleansed Data, Conformed Data, and Presented Data; Failover Hadoop Cluster with Staged Data and Presented Data; Replication between the two clusters; BI and Analytics]
Retention Policies
Staged Data: Retain 5 Years
Cleansed Data: Retain 3 Years
Conformed Data: Retain 3 Years
Presented Data: Retain Last Copy Only
> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or for data re-processing.
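As a rough illustration of retention policies "expressed in one place", the sketch below keeps the four policies from the slide in a single table and derives which data instances are eligible for eviction. It is not how Falcon implements retention. The names `RETENTION` and `instances_to_evict` are hypothetical, and a year is approximated as 365 days.

```python
from datetime import datetime, timedelta

# Retention periods taken from the slide.
# Assumption: one year is approximated as 365 days.
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": None,  # "Retain Last Copy Only"
}


def instances_to_evict(dataset, instance_times, now):
    """Return the instance timestamps of `dataset` that fall outside its policy."""
    limit = RETENTION[dataset]
    ordered = sorted(instance_times)
    if limit is None:
        # Keep only the most recent instance, evict everything older.
        return ordered[:-1]
    cutoff = now - limit
    return [t for t in ordered if t < cutoff]


# Hypothetical instances, for illustration only.
now = datetime(2014, 12, 5)
staged = [datetime(2009, 1, 1), datetime(2012, 1, 1)]
presented = [datetime(2014, 12, 1), datetime(2014, 12, 2), datetime(2014, 12, 3)]

print(instances_to_evict("staged", staged, now))
print(instances_to_evict("presented", presented, now))
```

Because every policy lives in one table, changing a retention period for audit or compliance reasons is a one-line edit rather than a change to each pipeline.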
CASE STUDY Distributed Processing
Example: Digital Advertising @ InMobi
Hadoop @ InMobi
About InMobi
World's leading independent mobile advertising company
Hadoop usage at InMobi
~ 6 Clusters
> 1PB of storage
> 5TB new data ingested each day
> 20TB data crunched each day
> 200 nodes in HDFS/MR clusters & > 40 nodes in HBase
> 175K Hadoop jobs / day
> 60K Oozie workflows / day
300+ Falcon feed definitions
100+ Falcon process definitions
Processing – Single Data Center
[Diagram: Ad Request data, Impression render event, Click event, and Conversion event; Continuous Streaming (minutely); Enrichment (minutely/5 minutely); Summarizer; Hourly summary]
Global Aggregation
[Diagram: Data Center 1 … Data Center N, each with Ad Request data, Impression render event, Click event, and Conversion event; Continuous Streaming (minutely); Enrichment (minutely/5 minutely); Summarizer; Hourly summary. Output: Consumable global aggregate]
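The global aggregation step can be pictured as merging the hourly summaries from each data center into one result. The sketch below is a simplified assumption about what such a merge might look like, not InMobi's actual job. The function name `global_aggregate`, the metric names, and the numbers are all hypothetical.

```python
from collections import Counter


def global_aggregate(per_datacenter_summaries):
    """Sum per-data-center hourly counters into a single global summary."""
    total = Counter()
    for summary in per_datacenter_summaries.values():
        # Counter.update adds counts key by key.
        total.update(summary)
    return dict(total)


# Hypothetical hourly summaries, for illustration only.
hourly = {
    "dc1": {"clicks": 10, "impressions": 100},
    "dc2": {"clicks": 5, "impressions": 40, "conversions": 1},
}

print(global_aggregate(hourly))
```

A metric missing from one data center, such as `conversions` in `dc1` here, is treated as zero for that data center.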
HIGHLIGHTS
Future
1 Security
2 Embed Pig/Hive scripts
3 Data Acquisition – file-based
4 Monitoring/Management Dashboard
Summary
Questions?
Apache Falcon
http://falcon.incubator.apache.org
mailto: dev@falcon.incubator.apache.org
Srikanth Sundarrajan
sriksun@apache.org
#sriksun
Venkatesh Seetharam
venkatesh@apache.org
#innerzeal