Date posted: 10-Aug-2015
Category: Data & Analytics
Uploaded by: krishna-gade
Who am I?
• Data Engineering at Pinterest
• Search and Data platforms at Twitter and Bing
• Follow @krishnagade
Why do we care about data?
How is Hadoop helping us harness the power of data?
What are some of the tools we built on top of the Hadoop platform?
Data at Pinterest
• 50 billion Pins
• 1 billion boards
• 40 PB of data on S3
• 3 PB processed every day
• 2,000-node Hadoop cluster
• 200 engineers
Pinterest Data Architecture
[Diagram: app events are shipped by Singer into Kafka; Secor/Skyline persist them for Qubole (Hadoop) to process, with workflows orchestrated by Pinball; outputs feed Redshift, Pinalytics, and feature pipelines.]
Hadoop Platform Requirements
• Isolated multi-tenancy
• Elasticity
• Support for multiple clusters
• Ephemeral clusters
• Access control layer
• Shared data store
• Easy deployment
Decoupling compute & storage
[Diagram: Hadoop clusters 1 and 2 each run their own transient HDFS; S3 serves as the shared persistent store.]
Multi-layered Packaging
• Runtime staging (on S3): MapReduce jobs, Hadoop jars/libs, job/user-level configs
• Automated configuration (masterless Puppet): software packages/libs, OS/Hadoop configs, misc sysadmin
• Baked AMI: OS, bootstrap script, core SW
Executor Abstraction Layer
[Diagram: Pinball and a dev server submit work through an executor abstraction backed by Qubole, managed Hadoop, or EMR, sharing a Hive metastore and HDFS/S3.]
Why Qubole?
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling
• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
Confidential
Scale of Processing
● Scale:
  o 50 billion Pins
  o Hundreds of workflows
  o Thousands of jobs
  o 500+ jobs in a workflow
  o 3 petabytes processed daily
● Support: Hadoop, Cascading, Hive, Spark …
Why Pinball?
● Requirements
  o Simple abstractions
  o Extensible in future
  o Reliable stateless computing
  o Easy to debug
  o Scales horizontally
  o Can be upgraded w/o aborting workflows
  o Rich features like auto-retries, per-job emails, overrun policies…
● Options considered: Apache Oozie, Azkaban, Luigi
Workflow Model
● Workflow: a directed graph of nodes called jobs
● Node: each job is a node
● Edge: a run-after dependence
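The run-after model above can be sketched in a few lines of Python: jobs form a DAG, and a job becomes runnable once every job it runs after has finished. This is an illustrative sketch, not Pinball's actual scheduler; the job names are hypothetical.

```python
from collections import deque

def run_order(jobs, run_after):
    """Topologically sort jobs. `run_after` maps each job to the
    jobs it must wait for (its run-after dependences)."""
    # Count unmet dependences per job.
    pending = {j: len(run_after.get(j, [])) for j in jobs}
    # Reverse the edges: parent -> children that wait on it.
    children = {j: [] for j in jobs}
    for job, parents in run_after.items():
        for parent in parents:
            children[parent].append(job)
    ready = deque(j for j in jobs if pending[j] == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in children[job]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    if len(order) != len(jobs):
        raise ValueError("cycle detected: not a valid workflow")
    return order

# Hypothetical workflow: extract -> {transform_a, transform_b} -> load
deps = {'transform_a': ['extract'], 'transform_b': ['extract'],
        'load': ['transform_a', 'transform_b']}
print(run_order(['extract', 'transform_a', 'transform_b', 'load'], deps))
```

Any order the sketch emits respects every run-after edge, which is exactly the guarantee a workflow manager needs before dispatching a job.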
Job State
● Job state is captured in a token
● Tokens are named hierarchically

Job token (held by the master):
  version: 123
  name: /workflow/w1/job
  owner: worker_0
  expiration: 1234567
  data: JobTemplate(....)
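The token fields on the slide translate naturally into a small data structure, and hierarchical names make it cheap to select every token under a workflow by prefix. A minimal sketch, assuming slash-delimited names; the helper `tokens_under` is hypothetical, not part of Pinball's API.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    """Sketch of a job token with the fields from the slide."""
    version: int          # bumped on every state change
    name: str             # hierarchical, e.g. /workflow/<wf>/<job>
    owner: str = ''       # worker currently holding the token
    expiration: int = 0   # ownership lease expiry (epoch seconds)
    data: dict = field(default_factory=dict)  # job template payload

def tokens_under(tokens, prefix):
    """Hierarchical naming: select a whole subtree by name prefix."""
    return [t for t in tokens if t.name.startswith(prefix)]

tokens = [
    Token(1, '/workflow/w1/job_a'),
    Token(1, '/workflow/w1/job_b'),
    Token(1, '/workflow/w2/job_a'),
]
print([t.name for t in tokens_under(tokens, '/workflow/w1/')])
```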
Master Worker Interaction
● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable

[Diagram: a worker (1) requests work from the master; the master (2) updates the persistent store and (3) acks the worker.]
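The request/update/ack cycle can be sketched as a claim with a lease: a worker may take a token only if it is unclaimed or its previous owner's lease has expired, and the master records the update before acking. This is a toy model under those assumptions, not Pinball's implementation; the in-memory log stands in for the persistent store.

```python
import time

class Master:
    """Toy master: state in memory, every update appended to a
    log (stand-in for the persistent store) before the ack."""
    def __init__(self):
        self.tokens = {}   # name -> (owner, lease expiration)
        self.log = []      # stand-in for the persistent store

    def claim(self, name, worker, lease=60):
        owner, exp = self.tokens.get(name, (None, 0.0))
        now = time.time()
        if owner is None or exp <= now:        # unclaimed or lease expired
            self.log.append(('claim', name, worker))  # 2: update store
            self.tokens[name] = (worker, now + lease)
            return True                               # 3: ack the claim
        return False                                  # someone else owns it

m = Master()
print(m.claim('/workflow/w1/job', 'worker_0'))  # first claim succeeds
print(m.claim('/workflow/w1/job', 'worker_1'))  # second claim is refused
```

Because workers hold leases rather than permanent ownership, a crashed worker's job becomes claimable again once its lease expires, which is what makes the workers stateless and horizontally scalable.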
Master
● Entire state is kept in memory
● Each state update is synchronously persisted before the master replies to the client
● Master runs on a single thread – no concurrency issues
Open Source
Git repo: https://github.com/pinterest/pinball
Mailing list: https://groups.google.com/forum/#!forum/pinball-users
Why do we care about insights?
How is Hadoop helping us harness the power of data?
What are some of the tools we built on top of the Hadoop platform?
Architecture – Main Components
1. Backend: Thrift services and HBase databases
2. Webapp: rich UI components
3. Reporter: generates formatted data
4. Metrics: customized optimizations
User Interface
Visualizations
• Highcharts
• Time-series charts updated automatically (daily)
Customizability
• Dashboards
• Built-in or user-defined reports
User Interface
Pinomaly
• Anomalous metric tracking
• Email alerts
Reporting
• Formatted dashboards
• PDF printing
• Duplicated weekly
Metric Manipulation
• Metric Composer
• Global operations (segmentation, rollup/aggregation, etc.)
Data Model
Date, seg1, seg2, ... => value
• Store the value for every possible segmentation
• On-the-fly aggregation

E.g.
• 2015-01-01, US, Male => 1
• 2015-01-01, US, Female => 2
• 2015-01-01, UK, Male => 3
• 2015-01-01, UK, Female => 4
• 2015-01-01, UK, * => 7
• 2015-01-01, *, Male => 4
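The on-the-fly aggregation above can be sketched directly: store a value only for fully specified segment combinations, and treat `*` in a query as "sum over that segment". A minimal in-memory sketch using the slide's own numbers; Pinalytics does this inside HBase rather than in a Python dict.

```python
def aggregate(table, date, country, gender):
    """Resolve a (date, country, gender) query; '*' in the query
    means 'sum over that segment on the fly'."""
    total = 0
    for (d, c, g), value in table.items():
        if d != date:
            continue
        if country != '*' and c != country:
            continue
        if gender != '*' and g != gender:
            continue
        total += value
    return total

# Values stored for every fully specified segmentation (from the slide).
table = {
    ('2015-01-01', 'US', 'Male'): 1,
    ('2015-01-01', 'US', 'Female'): 2,
    ('2015-01-01', 'UK', 'Male'): 3,
    ('2015-01-01', 'UK', 'Female'): 4,
}
print(aggregate(table, '2015-01-01', 'UK', '*'))    # -> 7
print(aggregate(table, '2015-01-01', '*', 'Male'))  # -> 4
```

Note the two wildcard queries reproduce the slide's derived rows (UK, * => 7 and *, Male => 4) without those rows ever being stored.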
Backend Architecture
[Diagram: (1) the webapp server sends a request to the Pinalytics Thrift service, which (2) issues readMetrics() against HBase; (3) region servers 1..N scan and aggregate the metric table via per-region coprocessors (CP); (4) region aggregation results return to the service; (5) the aggregated metrics are returned to the webapp.]
HBase
Horizontal Scalability
• No app-level sharding
Flexibility in Aggregation
• FuzzyRowFilter
• Coprocessor
Tables
• Report metadata
• Reports
Fuzzy Row Filter
Composite row key
• METRIC|TIME|SEG1|SEG2|...
Filters rows given a row key and a fuzzy mask
• 0: the byte must match; 1: the byte can be anything (don't care)

E.g. MAU of male users on 2015-01-01
• Start row: MAU|2015-01-01|
• End row: MAU|2015-01-01||
• Row key: MAU|2015-01-01|--|M-
• Fuzzy mask: 000|0000000000|11|00
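The mask semantics can be sketched with a byte-by-byte check: a 0 in the mask pins that position to the pattern byte, a 1 makes it a wildcard. A simplified sketch of the idea, not HBase's FuzzyRowFilter itself; it treats the '|' separators in the slide's mask as must-match positions, and the row keys are hypothetical.

```python
def fuzzy_match(row_key, pattern, mask):
    """FuzzyRowFilter-style check: mask byte '1' means the position
    is a wildcard; anything else means the pattern byte must match."""
    if not (len(row_key) == len(pattern) == len(mask)):
        return False
    return all(m == '1' or r == p
               for r, p, m in zip(row_key, pattern, mask))

# MAU of male users on 2015-01-01, country wildcarded ('--').
pattern = 'MAU|2015-01-01|--|M-'
mask    = '000|0000000000|11|00'
print(fuzzy_match('MAU|2015-01-01|US|M-', pattern, mask))  # matches
print(fuzzy_match('MAU|2015-01-01|UK|M-', pattern, mask))  # matches
print(fuzzy_match('MAU|2015-01-01|US|F-', pattern, mask))  # rejected
```

Combined with the start/end rows on the slide, this lets one scan pick out male users across all countries without enumerating every country key.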
HBase Coprocessor
• Region-local aggregation with coprocessor
• Final aggregation at the Thrift service
• Reduces network I/O
• Low latency
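The two-level aggregation can be sketched as a partial sum per region followed by a final sum at the service, so only one number per region crosses the network instead of every matching row. A toy sketch with hypothetical row keys and values, not the actual coprocessor code.

```python
def region_aggregate(region_rows):
    """Coprocessor-style partial sum computed inside one region."""
    return sum(region_rows.values())

def read_metrics(regions):
    """Final aggregation at the Thrift service over per-region sums."""
    return sum(region_aggregate(region) for region in regions)

# Metric rows split across three regions (hypothetical values).
regions = [
    {'MAU|2015-01-01|US|M': 1, 'MAU|2015-01-01|US|F': 2},
    {'MAU|2015-01-01|UK|M': 3},
    {'MAU|2015-01-01|UK|F': 4},
]
print(read_metrics(regions))  # -> 10
```

With N regions and R matching rows, the service receives N partial sums instead of R rows, which is where the network I/O and latency savings come from.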
Reporter
Flexible Python client library for generating reports
• Arbitrary metrics and segments
Easy-to-access data
• Data is automatically copied to S3
• A Hive external table is generated
Reporter Example
WAU, WARC and MAU segmented by gender and country:

class DemoWAUReport(PinalyticsWideReport):
    _METRIC_NAMES = ['wau', 'warc', 'mau']
    _SEGKEY_NAMES = ['gender', 'country']
    _QUERY_TEMPLATE = """
        SELECT dt, gender, country, wau, warc, mau
        FROM activity_metrics
        WHERE dt >= '2015-01-01';
    """

• Sample query output:
['2015-01-01', 'male', 'US', 102, 53, 110]
Core Metrics
• Pre-compute a lot of core metrics
  - Activity
  - Event counts
  - Retention
  - Signups
• Standard segmentation
  - Gender, Country, App
  - Spam filtering
Internal Tools Matter
Solving problems inside of our company
• 400 unique users
• 800 page views per day
• 1,500 custom charts created and updated daily