Reliable and Scalable Data Ingestion at Airbnb

Transcript
Page 1: Reliable and Scalable Data Ingestion at Airbnb

Reliable and Scalable Data Ingestion at Airbnb

KRISHNA PUTTASWAMY & JASON ZHANG

1

Page 2: Reliable and Scalable Data Ingestion at Airbnb

Best travel experiences powered by data products

Inform decision making based on data and the insights derived from it

2

Page 3: Reliable and Scalable Data Ingestion at Airbnb

• ML applications
  - Fraud detection, Search ranking, etc.
• User activity
  - Growth, matching, etc.
• Experimentation, monitoring, etc.

Events Lead to Insights

3

[Diagram: events flow from production systems through the data warehouse to produce insights.]

Page 4: Reliable and Scalable Data Ingestion at Airbnb

• JSON events without schemas
• Over 800 event types
• Easy to break events during evolution/code changes
• Lack of monitoring

Led to:
• Too many data outages and data loss incidents
• Lack of trust in data systems

Challenges
1.5 Years Ago

Page 5: Reliable and Scalable Data Ingestion at Airbnb

Data Quality Failure

CEO dashboard and Bookings dashboards were regularly broken.

1.5 Years Ago

Page 6: Reliable and Scalable Data Ingestion at Airbnb

Data Quality Failure

ERF was unstable, and the experimentation culture was weak.

Hi team,

This is partly a PSA to let you know ERF dashboard data hasn't been up to date/accurate for several weeks now. Do not rely on the ERF dashboard for information about your experiment.

1.5 Years Ago

Page 7: Reliable and Scalable Data Ingestion at Airbnb

Events Data Ingestion Must be Reliable

7

Page 8: Reliable and Scalable Data Ingestion at Airbnb

• Timeliness
  - Land on time; be predictable
• Completeness
  - All data should land in the warehouse
• Data Quality
  - Identify anomalous behavior

Targeted Reliability Guarantees

8

Page 9: Reliable and Scalable Data Ingestion at Airbnb

[Architecture diagram: events from Ruby, Java, Javascript, and mobile clients flow through Kafka clients and a REST proxy into Kafka, then via Camus and EZSplit into HDFS, feeding data pipelines and data products. Failure modes are annotated across the pipeline: invalid data, stuck processes, buffer overflows, node failures, host network issues, and broker errors. Caption: Distributed Systems.]

9

Page 10: Reliable and Scalable Data Ingestion at Airbnb

• More users, activity, bookings, etc.

• Need lightweight techniques that are themselves not bottlenecks

Rapid Growth in Events Data

[Chart: event volume by month from 1/8/14 through 9/8/15, showing rapid growth.]

10

Page 11: Reliable and Scalable Data Ingestion at Airbnb

• How many events were actually emitted?
• How many must have been emitted?
• What should be in the correct data?
• How to catch subtle anomalies in data?

No Ground Truth

11

Page 12: Reliable and Scalable Data Ingestion at Airbnb

E2E Audit

Schema Enforcement

Anomaly Detection

Component Level Audit

Realtime Ingestion

Phases of Rebuilding Data Ingestion

Page 13: Reliable and Scalable Data Ingestion at Airbnb

Phase 1: Audit each component

13

Page 14: Reliable and Scalable Data Ingestion at Airbnb

Instrumentation, monitoring, alerting on each component

• Process health
• Count of input/output events
• Week-over-week comparison

Guarding Against Component Failures

14
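A minimal sketch of the week-over-week count check described above (not Airbnb's production code; the 25% threshold and the pluggable count source are assumptions for illustration):

from datetime import date, timedelta
from typing import Callable

# Week-over-week check on one component's event counts, as a rough sketch of
# the "count of input/output events" audit above. The count source is passed
# in as a callable so the check stays independent of any particular store.

WOW_THRESHOLD = 0.25  # assumed: flag swings larger than 25% week-over-week

def check_week_over_week(component: str, today: date,
                         get_count: Callable[[str, date], int]) -> None:
    this_week = today - timedelta(days=today.weekday())  # Monday of this week
    last_week = this_week - timedelta(days=7)
    current = get_count(component, this_week)
    previous = get_count(component, last_week)
    if previous == 0:
        print(f"ALERT {component}: no events recorded for week of {last_week}")
        return
    change = abs(current - previous) / previous
    if change > WOW_THRESHOLD:
        print(f"ALERT {component}: events changed {change:.0%} week-over-week "
              f"({previous} -> {current})")

# Example with dummy counts:
counts = {("camus", date(2016, 6, 6)): 5_000_000,
          ("camus", date(2016, 6, 13)): 3_400_000}
check_week_over_week("camus", date(2016, 6, 15),
                     lambda c, week: counts.get((c, week), 0))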

Page 15: Reliable and Scalable Data Ingestion at Airbnb

[Architecture diagram repeated from slide 9, now with component-level failure modes highlighted at each hop: stuck processes, buffer overflows, node failures, host network issues, broker errors, and pipeline bugs.]

15

Page 16: Reliable and Scalable Data Ingestion at Airbnb

Phase 2: Audit E2E system

16

Page 17: Reliable and Scalable Data Ingestion at Airbnb

Hardening each component is not sufficient:
• Account for new failure modes
• Quantify aggregate event loss
• Narrow down the source of loss
• Need end-to-end and out-of-band checks on the full pipeline

E2E System Auditing

17

Page 18: Reliable and Scalable Data Ingestion at Airbnb

Canary Service
• A standalone service that sends events at a known rate
• Compare events landed in the warehouse and alert on loss
• Simple, reliable, and accurate

18
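A rough sketch of the canary idea in Python; emit_event stands in for the real producer, and the send rate and loss threshold are illustrative, not the deck's actual values:

import time
import uuid

# Canary sketch: send events at a fixed, known rate, then compare how many
# landed in the warehouse against how many were sent.

CANARY_RATE_PER_MIN = 60
LOSS_ALERT_THRESHOLD = 0.0001  # e.g. alert when loss exceeds 0.01%

def emit_canary_events(emit_event, duration_min: int) -> int:
    """Send canary events at a fixed rate; return how many were sent."""
    sent = 0
    for _ in range(duration_min * CANARY_RATE_PER_MIN):
        emit_event({"event_type": "canary", "uuid": str(uuid.uuid4()),
                    "sent_at": time.time()})
        sent += 1
        time.sleep(60 / CANARY_RATE_PER_MIN)
    return sent

def check_canary_loss(sent: int, landed: int) -> None:
    """Compare the warehouse count with the known send count; alert on loss."""
    loss = (sent - landed) / sent if sent else 0.0
    if loss > LOSS_ALERT_THRESHOLD:
        print(f"ALERT: canary loss {loss:.4%} ({sent} sent, {landed} landed)")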

Page 19: Reliable and Scalable Data Ingestion at Airbnb

DB as Proxy for Ground Truth

• Compare DB mutations with the corresponding events emitted
• The DB serves as ground truth for events with a 1:1 mapping

19
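A small sketch of the DB-as-ground-truth comparison for an event type with a 1:1 mapping to a database mutation; the per-day counts would come from warehouse queries, and the 0.1% tolerance is an assumption:

from datetime import date
from typing import Dict

def compare_against_db(db_counts: Dict[date, int],
                       event_counts: Dict[date, int],
                       tolerance: float = 0.001) -> None:
    """Flag days where the events landed fall short of the DB mutation count."""
    for day, expected in sorted(db_counts.items()):
        landed = event_counts.get(day, 0)
        missing = expected - landed
        if expected and missing / expected > tolerance:
            print(f"ALERT {day}: {missing} of {expected} expected events missing "
                  f"({missing / expected:.2%})")

# Example with dummy data:
compare_against_db(db_counts={date(2016, 6, 1): 10_000},
                   event_counts={date(2016, 6, 1): 9_950})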

Page 20: Reliable and Scalable Data Ingestion at Airbnb

Audit Pipeline Overview

• Need to quantify event loss and ensure the SLA is not violated
• Attach a header to each event when it enters the pipeline: REST proxy, Java, and Ruby
• Header contains host, process, sequence, and uuid
• Group sequence by (host, process) in the warehouse: quantify event loss, and attribute loss to hosts
• Extend to multi-hop sequences: easy to attribute loss to an internal component in the pipeline

20
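A compressed sketch of the sequence-based audit: group the audit headers by (host, process) in the warehouse and use the gap between the highest sequence seen and the number of distinct sequences seen as the loss estimate (assuming each emitter numbers its events 1, 2, 3, ... without resets):

from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Header = Tuple[str, str, int]  # (host, process, sequence) from the event header

def loss_by_emitter(headers: Iterable[Header]) -> Dict[Tuple[str, str], int]:
    """Approximate events lost per (host, process) from sequence gaps."""
    seen: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    for host, process, sequence in headers:
        seen[(host, process)].add(sequence)
    return {emitter: max(seqs) - len(seqs) for emitter, seqs in seen.items()}

# Example: "web-01" emitted sequences 1..5 but number 3 never landed.
print(loss_by_emitter([("web-01", "rails", 1), ("web-01", "rails", 2),
                       ("web-01", "rails", 4), ("web-01", "rails", 5)]))
# {('web-01', 'rails'): 1}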

Page 21: Reliable and Scalable Data Ingestion at Airbnb

Event Schema for Audit Metadata
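The schema image on this slide did not survive the transcript. As a stand-in, the audit metadata described on the previous slide (attached as a header when an event enters the pipeline) would carry at least these fields; in the deck it is a Thrift struct, sketched here as a Python dataclass:

import uuid
from dataclasses import dataclass, field

@dataclass
class AuditMetadata:
    host: str        # machine that emitted the event
    process: str     # emitting process/service on that host
    sequence: int    # per-(host, process) monotonically increasing counter
    event_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))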

Page 22: Reliable and Scalable Data Ingestion at Airbnb

[Diagram: the ingestion pipeline from earlier slides, with site-facing services as producers, annotated with per-hop audit counts, plus the canary service and a database snapshot feeding the end-to-end audit.]

22

Page 23: Reliable and Scalable Data Ingestion at Airbnb

Phase 3: Schema enforcement

23

Page 24: Reliable and Scalable Data Ingestion at Airbnb

• JSON events without schemas
• Easy to break events during evolution/code changes
• Over 800 event types
• Lack of monitoring

Led to:
• Too many data outages and data loss incidents
• Lack of trust in data systems

Challenges
1.5 Years Ago

Page 25: Reliable and Scalable Data Ingestion at Airbnb

25

Data Incidents

Page 26: Reliable and Scalable Data Ingestion at Airbnb

Schema Enforcement

• Schema tech stack: Thrift
• Libraries for sending Thrift objects from different clients: Java, Ruby, JS, and Mobile
• Who should define schemas: data scientists or product engineers
• Development workflow: schema evolution, and bridging producer and consumer schemas
• Self-serve

26

Page 27: Reliable and Scalable Data Ingestion at Airbnb

Thrift Schema Repository

Why Thrift?
• Easy syntax
• Good performance in Ruby
• Ubiquitous

Advantages of a schema repo?
• A great catalyst for communication, documentation, etc.
• It ships jars and gems

• Will developers hate you for this? No.

Page 28: Reliable and Scalable Data Ingestion at Airbnb

• Standard field in the event schema
• Managed explicitly
• Use semantic versioning:

1.0.0 = MODEL . REVISION . ADDITION

MODEL is a change which breaks the rules of backward compatibility.

Example: changing the type of a field.

REVISION is a change which is backward compatible but not forward compatible.

Example: adding a new field to a union type.

ADDITION is a change which is both backward compatible and forward compatible.

Example: adding a new optional field.

Schema Evolution

Page 29: Reliable and Scalable Data Ingestion at Airbnb

Example of Thrift Event

because the event is your API

Page 30: Reliable and Scalable Data Ingestion at Airbnb

30

Example

Page 31: Reliable and Scalable Data Ingestion at Airbnb

31

Example Schema Mapping in the Warehouse

Page 32: Reliable and Scalable Data Ingestion at Airbnb

Phase 4: Anomaly detection

32

Page 33: Reliable and Scalable Data Ingestion at Airbnb

A Bad Date Picker

33

• On 9/22/2015, we launched a new Datepicker experiment on P1

• Half of the users received the new_datepicker treatment; the other half were the control

• It was shut off by 9/29/2015, and metrics recovered

Page 34: Reliable and Scalable Data Ingestion at Airbnb

Diagnosis

34

• We realized a 14% drop in “searches with dates” after about 7 days
• The scope of the impact was unclear; we just knew that a subset of locales was affected
• The root-cause analysis depended heavily on vigilance and a bit of guesswork / luck
• Drilling down by country revealed an interesting pattern

Page 35: Reliable and Scalable Data Ingestion at Airbnb

Diagnosis

35

• Drilling down into source = P1, we see a stronger pattern
• Something qualitatively worse is happening in IT, GB, and CA
• “Affected locales: en-GB, it, en-AU, en-CA, en-NZ, da, zh-tw, ms-my and probably some more”
• How did we know to try P1?
• How to know which countries to slice by?

Page 36: Reliable and Scalable Data Ingestion at Airbnb

Curiosity

36

• Let’s automate this process!
• It’s hard to know which dimension combinations matter…
  ...so try as many of them as we reasonably can, in an intelligent way
• Drill down into dimension combinations that are
  - specific enough to be informative,
  - yet still contribute meaningfully to the top-level aggregate

Page 37: Reliable and Scalable Data Ingestion at Airbnb

Method

37

• Retrieve time series data from a source (GROUP BY time, dimension)
• Detect any anomalies in each dimension value’s time series
• Explore across the dimension space to compare values against each other
• Prune the set of dimension values using anomalies / exploration
• Drill down into the remaining dimensions for the pruned values
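The loop above, sketched in Python; fetch_series and is_anomalous are stand-ins for the real time-series source and anomaly detector, and only the prune-then-drill-down control flow follows the method on this slide:

from typing import Callable, Dict, List, Sequence, Tuple

Series = List[float]
Combo = Tuple[Tuple[str, str], ...]  # ((dimension, value), ...)

def drill_down(fetch_series: Callable[[Combo, str], Dict[str, Series]],
               is_anomalous: Callable[[Series], bool],
               dimensions: Sequence[str],
               prefix: Combo = (),
               min_share: float = 0.01) -> List[Combo]:
    """Return dimension-value combinations whose time series look anomalous."""
    if not dimensions:
        return []
    dim, rest = dimensions[0], dimensions[1:]
    # fetch_series(prefix, dim) ~ "GROUP BY dim" under the prefix constraints.
    series_by_value = fetch_series(prefix, dim)
    total = sum(sum(s) for s in series_by_value.values()) or 1.0
    hits: List[Combo] = []
    for value, series in series_by_value.items():
        # Prune values that contribute too little or show no anomaly.
        if sum(series) / total < min_share or not is_anomalous(series):
            continue
        combo = prefix + ((dim, value),)
        hits.append(combo)
        hits.extend(drill_down(fetch_series, is_anomalous, rest, combo, min_share))
    return hits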

Page 38: Reliable and Scalable Data Ingestion at Airbnb

Phase 5: Realtime ingestion

38

Page 39: Reliable and Scalable Data Ingestion at Airbnb

Streaming Ingestion Pipeline, end to end

Page 40: Reliable and Scalable Data Ingestion at Airbnb

HBase Row Key

• Event key = event_type.event_name.event_uuid
  Ex: air_event.canaryevent.016230ae-a3d8-434e
• Shard id = Hash(Event key) % Shard_num
• Shard key = Region_start_keys[Shard_id]
  Ex: 0000000
• Row key = Shard_key.Event_key
  Ex: 0000000.air_event.canaryevent.016230ae-a3d8-434e

40
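The row-key construction on this slide, sketched in Python; the hash function, shard count, and region start keys below are illustrative rather than the exact production values:

import hashlib

SHARD_NUM = 8
# Pre-split region start keys, one per shard (assumed zero-padded prefixes).
REGION_START_KEYS = [f"{i:07d}" for i in range(SHARD_NUM)]

def make_row_key(event_type: str, event_name: str, event_uuid: str) -> str:
    event_key = f"{event_type}.{event_name}.{event_uuid}"
    shard_id = int(hashlib.md5(event_key.encode()).hexdigest(), 16) % SHARD_NUM
    shard_key = REGION_START_KEYS[shard_id]
    return f"{shard_key}.{event_key}"

print(make_row_key("air_event", "canaryevent", "016230ae-a3d8-434e"))
# e.g. "0000003.air_event.canaryevent.016230ae-a3d8-434e" (prefix depends on the hash)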

Page 41: Reliable and Scalable Data Ingestion at Airbnb

Dedup and Repartition

41

[Diagram: Spark executors (1..N) writing deduplicated, repartitioned events into HBase regions (1..M).]
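A minimal PySpark-flavored sketch of the dedup-and-repartition step named in the slide title, assuming events arrive as (row_key, event_bytes) pairs and that the leading shard prefix of the row key (previous slide) picks the target region; the shard count and input shape are assumptions:

from pyspark.sql import SparkSession

SHARD_NUM = 8  # must match the number of pre-split HBase regions

def shard_of(row_key: str) -> int:
    return int(row_key.split(".", 1)[0])  # leading zero-padded shard prefix

def dedup_and_repartition(events_rdd):
    return (
        events_rdd
        .reduceByKey(lambda a, b: a)              # dedup by row key (uuid inside)
        .partitionBy(SHARD_NUM, shard_of)         # align partitions with regions
    )

if __name__ == "__main__":
    spark = SparkSession.builder.appName("dedup-repartition-sketch").getOrCreate()
    rdd = spark.sparkContext.parallelize([
        ("0000003.air_event.canaryevent.016230ae-a3d8-434e", b"payload"),
        ("0000003.air_event.canaryevent.016230ae-a3d8-434e", b"payload"),  # duplicate
    ])
    print(dedup_and_repartition(rdd).countByKey())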

Page 42: Reliable and Scalable Data Ingestion at Airbnb

Hive HBase Connector

42

CREATE EXTERNAL TABLE `search_event_table` (
  `rowkey` string COMMENT 'from deserializer',
  `event_bytes` binary COMMENT 'from deserializer')
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.airbnb.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.airbnb.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.timerange.hourly.boundary'='true',        -- for the current hour
  'hbase.columns.mapping'=':key, b:event_bytes',
  'hbase.key.pushdown'='jitney_event.search_event',
  'hbase.timestamp.min'='…',                       -- arbitrary time range start
  'hbase.timestamp.max'='…')                       -- arbitrary time range end

Page 43: Reliable and Scalable Data Ingestion at Airbnb

• Ingest over 5B events with less than 100 events/day of loss
• We can alert on data loss in real time (loss > 0.01%)
• We can quantify which machine/service led to how much loss
• We can identify even subtle anomalies in the data

Conclusions

43

