DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Date post: 21-Jan-2017
Upload: hakka-labs
Scalable and Reliable Logging at Pinterest. Krishna Gade, Yu Yang
Transcript
Page 1: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Scalable and Reliable Logging at Pinterest

Krishna Gade

Yu Yang

Page 2: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Agenda

• What is Pinterest?

• Logging Infrastructure Deep-dive

• Managing Log Quality

• Summary & Questions

Page 3: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

What is Pinterest?

Page 4: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

What is Pinterest?

Pinterest is a discovery engine

Page 5: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

What is the weather in SF today?

Page 6: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
Page 7: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

What is central limit theorem?

Page 8: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
Page 9: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
Page 10: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

What do I cook for dinner today?

Page 11: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

What’s my style?

Page 12: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Where shall we travel this summer?

Page 13: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Pinterest is solving this

discovery problem

Page 14: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
Page 15: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
Page 16: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Humans +

Algorithms

Page 17: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
Page 18: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
Page 19: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
Page 20: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Data Architecture

[Diagram; labeled components: App, Singer, Kafka, Storm, Stingray, Merced, Qubole (Hadoop, Spark), Pinball, Skyline, Redshift, Pinalytics, Product, A/B Testing]

Page 21: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Logging Infrastructure

Page 22: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Logging Infrastructure Requirements

• High availability

• Minimum data loss

• Horizontal scalability

• Low latency

• Minimum operation overhead

Page 23: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Pinterest Logging Infrastructure

• thousands of hosts

• >120 billion messages, tens of terabytes per day

• Kafka as the central message transportation hub

• >500 Kafka brokers

• home-grown technologies for logging to Kafka and moving data from Kafka to cloud storage

[Diagram: AppServers -> events -> Kafka -> Cloud storage]

Page 24: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Logging Infrastructure v1

[Diagram; labels: Host (app, app, app), events, Kafka 0.7, data uploader, Real-time consumers]

Page 25: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Problems with Kafka 0.7 pipelines

• Data loss

• Kafka 0.7 broker failure -> data loss

• high back pressure -> data loss

• Operability

• broker replacement -> reboot all dependent services to pick up the latest broker list

Page 26: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Challenges with Kafka that supports replication

• Multiple copies of messages among brokers

• Cannot copy messages directly to S3 and still guarantee exactly-once persistence

• Cannot randomly pick Kafka brokers to write to

• Need to find the leader of each topic partition

• Handle various corner cases

Page 27: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Logging Infrastructure v2

[Diagram; labels: Host (app, log files, Singer), events, Kafka 0.8, Secor/Merced, Sanitizer, Real-time consumers]

Page 28: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Logging Agent Requirements

• reliability

• high throughput, low latency

• minimum computation resource usage

• support various log file formats (text, thrift, etc.)

• fair scheduling

Page 29: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Singer Logging Agent

• Simple logging mechanism

• applications log to disk

• Singer monitors file system events and uploads logs to Kafka

• Isolate applications from Singer agent failures

• Isolate applications from Kafka failures

• >100MB/second for log files in thrift

• Production Environment Support

• dynamic configuration detection

• adjustable log uploading latency

• auditing

• heartbeat mechanism

[Diagram: on each host, the app writes log files that Singer reads]
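The "applications log to disk, an agent uploads" pattern above can be sketched roughly as follows. This is a minimal illustration, not Singer's implementation: `LogTailer` and `upload_once` are hypothetical names, polling stands in for file system event notifications, and `send_batch` is a placeholder for a Kafka writer.

```python
class LogTailer:
    """Reads newly appended lines from one log file, remembering its position."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte offset of the next unread line

    def read_new_lines(self, max_lines=100):
        """Return (lines, new_offset) for complete lines after the committed offset."""
        lines = []
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            pos = self.offset
            while len(lines) < max_lines:
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # end of file or a partially written line; retry later
                pos += len(line)
                lines.append(line.rstrip(b"\n"))
        return lines, pos

    def commit(self, pos):
        """Advance the offset only after the batch was accepted downstream."""
        self.offset = pos


def upload_once(tailer, send_batch):
    """Ship one batch; keep the old offset if the (hypothetical) writer fails."""
    lines, pos = tailer.read_new_lines()
    if lines and send_batch(lines):
        tailer.commit(pos)
    return len(lines)
```

Because the application only appends to a file and the agent only reads it, a failed or slow writer leaves the offset unchanged and the batch is retried later, which is one way to get the isolation from agent and Kafka failures that the slide lists.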

Page 30: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Singer Internals

Singer Architecture

[Diagram; labels: LogStream monitor, Configuration watcher, Log repository, Log configuration, LogStream processors (A-1, A-2, B-1, C-1), each with a Reader and a Writer]

Staged Event Driven Architecture

Page 31: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Running Kafka in the Cloud

• Challenges

• brokers can die unexpectedly

• EBS I/O performance can degrade significantly due to resource contention

• Avoid virtual hosts co-location on the same physical host

• faster recovery

Page 32: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Running Kafka in the Cloud

• Initial settings

• c3.2xlarge + EBS

• Current settings

• d2.xlarge

• local disks help to avoid EBS contention problem

• minimize data on each broker for faster recovery

• availability zone aware topic partition allocation

• multiple small clusters (20-40 brokers) for topic isolation
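The slides do not say how the availability-zone-aware allocation works. One plausible reading, sketched here under that assumption, is that the replicas of each partition are placed in distinct zones so that losing a zone never removes every copy. The function name and the zone and broker ids below are hypothetical.

```python
from itertools import cycle


def assign_replicas(brokers_by_az, num_partitions, replication_factor):
    """Place each partition's replicas in distinct availability zones.

    brokers_by_az: dict mapping zone name -> list of broker ids.
    Returns a dict mapping partition number -> list of broker ids.
    """
    zones = sorted(brokers_by_az)
    if replication_factor > len(zones):
        raise ValueError("need at least one zone per replica")
    # One round-robin iterator per zone spreads load over that zone's brokers.
    next_broker = {z: cycle(brokers_by_az[z]) for z in zones}
    assignment = {}
    for p in range(num_partitions):
        # Rotate the starting zone so first replicas are spread across zones.
        chosen = [zones[(p + i) % len(zones)] for i in range(replication_factor)]
        assignment[p] = [next(next_broker[z]) for z in chosen]
    return assignment
```

For example, with zones `a`, `b`, `c` holding brokers `[1, 2]`, `[3, 4]`, `[5, 6]` and a replication factor of 2, partition 0 lands on brokers 1 and 3, which sit in different zones.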

Page 33: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Scalable Data Persistence

• Strong consistency: each message is saved exactly once

• Fault tolerance: any worker node is allowed to crash

• Load distribution

• Horizontal scalability

• Configurable upload policies

[Diagram: events -> Kafka 0.8 -> Secor/Merced]

Page 34: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Secor

• Uses the Kafka high-level consumer

• Strong consistency: each message is saved exactly once

• Fault tolerance: any worker node is allowed to crash

• Load distribution

• Configurable upload policies

[Diagram: events -> Kafka 0.8 -> Secor]
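The slides state the exactly-once goal but not the mechanism. A common way to get it on top of at-least-once consumption, sketched here as an assumption rather than as Secor's actual code, is to derive the object name deterministically from topic, partition, and starting offset, and to commit the offset only after the upload. A replay after a crash then overwrites the same object instead of creating a duplicate. The dict `store` stands in for cloud storage, and all names are hypothetical.

```python
def object_key(topic, partition, first_offset):
    """Deterministic name: the same batch always maps to the same object."""
    return f"{topic}/{partition}/{first_offset:020d}"


class Uploader:
    def __init__(self, store, batch_size):
        self.store = store            # dict standing in for cloud storage
        self.batch_size = batch_size
        self.committed = 0            # next offset that still needs persisting

    def run_once(self, topic, partition, log, crash_before_commit=False):
        """Persist one full batch starting at the committed offset."""
        start = self.committed
        batch = log[start:start + self.batch_size]
        if len(batch) < self.batch_size:
            return False  # wait for a full batch so replays cut the same boundaries
        self.store[object_key(topic, partition, start)] = list(batch)
        if crash_before_commit:
            return False  # simulated crash: data uploaded, offset not committed
        self.committed = start + len(batch)
        return True
```

If a worker crashes between the upload and the commit, the next run re-reads from the old offset and rewrites the identical key, so each message still ends up in exactly one stored object.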

Page 35: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Challenges with consensus-based workload distribution

• Kafka consumer group rebalancing can prevent consumers from making progress

• It is difficult to recover when the high-level consumer lags behind on some topic partitions

• Manual tuning is required for workload distribution of multiple topics

• Inconvenient to add new topics

• Efficiency

Page 36: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Merced

• central workload distribution

• master creates tasks

• master and workers communicate through ZooKeeper
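As a rough illustration of central workload distribution, the sketch below has a master cut each partition's backlog into bounded tasks that workers then pick up. The task shape (topic partition plus an offset range) is an assumption, and an in-memory queue stands in for the ZooKeeper-based communication; none of these names come from Merced itself.

```python
from collections import deque


def create_tasks(latest_offsets, persisted_offsets, max_messages):
    """Master side: cut each partition's backlog into bounded tasks."""
    tasks = deque()
    for tp in sorted(latest_offsets):
        start = persisted_offsets.get(tp, 0)
        end = latest_offsets[tp]
        while start < end:
            stop = min(start + max_messages, end)
            tasks.append((tp, start, stop))  # persist offsets [start, stop)
            start = stop
    return tasks


def run_workers(tasks, worker_names):
    """Workers take tasks from the shared queue in turn until it is empty."""
    done = {name: [] for name in worker_names}
    i = 0
    while tasks:
        name = worker_names[i % len(worker_names)]
        done[name].append(tasks.popleft())
        i += 1
    return done
```

With a central master, a partition that falls behind simply produces more tasks, and adding a topic only means adding entries to the offset maps, which speaks to the rebalancing and tuning problems listed for the consumer-group approach.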

Page 37: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Merced

Page 38: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Log Quality

Page 39: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Log Quality

Log quality can be broken down into two areas:

• log reliability - Reliability is fairly easy to measure: did we lose any data?

• log correctness - Correctness, on the other hand, is much more difficult as it requires the interpretation of data.
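The reliability question "did we lose any data?" can be checked by comparing counts at both ends of the pipeline. The sketch below assumes, hypothetically, that message counts are available per (topic, hour) bucket on the sending side and on the storage side; the slides do not describe how the actual audit is done.

```python
def find_loss(sent_counts, stored_counts):
    """Return {(topic, hour): missing_count} for buckets where stored < sent."""
    loss = {}
    for key, sent in sent_counts.items():
        missing = sent - stored_counts.get(key, 0)
        if missing > 0:
            loss[key] = missing
    return loss
```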

Page 40: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Challenges

• Instrumentation is an afterthought for most feature developers

• Features can ship with existing logging broken or with no logging at all

• Once an iOS or Android release is out, it will keep generating bad data for weeks

• Data quality bugs are harder to find and fix than code quality bugs

Page 41: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Tooling

Page 42: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Anomaly Detection

• Started with a simple model based on the assumption that daily changes are normally distributed.

• Revised that model until it produced only a few alerts, mostly real and important.

• Hooked it up to a daily email to our metrics avengers.
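A first model of the kind described, assuming normally distributed daily changes, might look like the sketch below. It flags today's value when its day-over-day relative change is more than `threshold` standard deviations from the historical mean change. The threshold of 3 and the use of relative changes are assumptions, not details from the talk.

```python
from statistics import mean, stdev


def daily_changes(values):
    """Relative day-over-day changes of a metric series."""
    return [(b - a) / a for a, b in zip(values, values[1:])]


def is_anomalous(history, today, threshold=3.0):
    """Flag today's value if its change is an outlier under a normal model.

    history needs at least three values so that two changes exist.
    """
    changes = daily_changes(history)
    mu, sigma = mean(changes), stdev(changes)
    change = (today - history[-1]) / history[-1]
    if sigma == 0:
        return change != mu
    return abs(change - mu) / sigma > threshold
```

Raising the threshold or lengthening the history are the obvious knobs for cutting the alert volume down to the few that matter.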

Page 43: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

How did we do?

• Interest follows went up after we started emailing recommended interests to follow

• Push notifications about board follows broke

• Signups from Google changed as we ran experiments

• Our tracking broke when we released a new repin experience

• Our tracking of mobile web signups changed

Page 44: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Auditing Logs

Manual audits have their limitations, especially with regard to coverage, but will catch critical bugs.

• However, we need two things:
  • Repeatable process that can scale
  • Tooling required to support the process

• Regression Audit
  • Maintain a playbook of "core logging actions"
  • Use tooling to verify the output of the actions

• New Feature Audit
  • Gather requirements for analysis and produce a list of events that need to be captured with the feature
  • Instrument the application
  • Test the logging output using existing tooling
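A regression audit of the kind outlined above could be automated along these lines. The playbook entries, event names, and field names are invented for illustration; the talk does not show the real playbook or tooling.

```python
# Hypothetical playbook: core action -> expected event type and required fields.
PLAYBOOK = {
    "repin": {"event": "PIN_REPIN", "required": {"user_id", "pin_id"}},
    "follow_board": {"event": "BOARD_FOLLOW", "required": {"user_id", "board_id"}},
}


def audit(playbook, captured):
    """Return a list of human-readable failures; empty means the audit passed."""
    failures = []
    for action, spec in playbook.items():
        matches = [e for e in captured if e.get("event") == spec["event"]]
        if not matches:
            failures.append(f"{action}: no {spec['event']} event logged")
            continue
        for e in matches:
            missing = spec["required"] - set(e)
            if missing:
                failures.append(f"{action}: missing fields {sorted(missing)}")
    return failures
```

A tester performs each playbook action against a build, captures the emitted events, and runs `audit` on them. The same check can cover a new feature by adding its events to the playbook.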

Page 45: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Summary

• Invest in your logging infra pretty early on.

• Kafka has matured a lot and with some tuning works well in the Cloud.

• Data quality is not free; you need to proactively ensure it.

• Invest in automated tools to detect quality issues both pre- and post-release.

• Culture building and education go a long way.

Page 46: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Thank you!

Btw, we’re hiring :)

Page 47: DataEngConf SF16 - Scalable and Reliable Logging at Pinterest

Questions?

