DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data

Transcript
Page 1

Parquet at Datadog
How we use Parquet for tons of metrics data

Doug Daniels, Director of Engineering

Page 2

Outline
• Monitor everything
• Our data / why we chose Parquet
• A bit about Parquet
• Our pipeline
• What we see in production

Page 3

Datadog is a monitoring service for large scale cloud applications

Page 4

Collect Everything
Integrations for 100+ components

Page 5

Monitor Everything

Page 6

Alert on Critical Issues
Collaborate to Fix Them Together

Monitor Everything

Page 7

We collect a lot of data

Page 8

We collect a lot of data…

the biggest and most important of which is

Page 9

Metric timeseries data

timestamp: 1447020511
metric: system.cpu.idle
value: 98.16687
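
For illustration only, here is a minimal pyarrow sketch (not Datadog's actual code) that writes a handful of such points to a Parquet file; the extra timestamps and values are invented:

import pyarrow as pa
import pyarrow.parquet as pq

# A small batch of metric points shaped like the example above
points = {
    "timestamp": [1447020511, 1447020521, 1447020531],
    "metric": ["system.cpu.idle", "system.cpu.idle", "system.cpu.idle"],
    "value": [98.16687, 97.90112, 98.40010],
}

table = pa.table(points)
pq.write_table(table, "metrics.parquet")  # columnar Parquet file, one row group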

Page 10

We collect hundreds of billions of these per day… and growing every week

Page 11

And we do massive computation on them

Page 12

• Statistical analysis
• Machine learning
• Ad-hoc queries
• Reporting and aggregation
• Metering and billing

Page 13

One size does not fit all.

Page 14

ETL and aggregation: Pig / Hive
ML and iterative algorithms: Spark
Interactive SQL: Presto

We want the best framework for each job

Page 15

How do we do that without…
• Duplicating data storage
• Writing redundant glue code
• Copying data definitions and schema

Page 16

1. Separate Compute and Storage
• Amazon S3 as data system-of-record
• Ephemeral, job-specific clusters
• Write storage once, read everywhere

Page 17

2. Standard Data Format
• Supported by major frameworks
• Schema-aware
• Fast to read
• Strong community

Page 18

Page 19

Parquet is a column-oriented data storage format

Page 20

What we love about Parquet
• Interoperable!
• Stores our data super efficiently
• Proven at scale on S3
• Strong community

Page 21

Quick Parquet primer

[Diagram: Parquet file layout]
Row Group 0
  Column A: Page 0, Page 1, Page 2
  Column B: Page 0, Page 1
Footer
  File Meta Data
  Row Group 0 Metadata: Column A Metadata, Column B Metadata, …
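
To make the layout above concrete, a minimal sketch (assuming the hypothetical metrics.parquet file from the earlier example) that reads the footer metadata with pyarrow and walks the row groups and column chunks:

import pyarrow.parquet as pq

pf = pq.ParquetFile("metrics.parquet")
meta = pf.metadata                      # file metadata, stored in the footer
print(meta.num_row_groups, meta.num_rows)

rg = meta.row_group(0)                  # Row Group 0 metadata
for i in range(rg.num_columns):
    col = rg.column(i)                  # per-column-chunk metadata
    print(col.path_in_schema, col.compression, col.total_compressed_size)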

Page 22

Efficient storage and fast reads
• Space efficiencies (per page)
  • Type-specific encodings: run-length, delta, …
  • Compression
• Query efficiencies (support varies by framework)
  • Projection pushdown (skip columns)
  • Predicate pushdown (skip row groups)
  • Vectorized read (many rows at a time)
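
A minimal sketch of projection and predicate pushdown using pyarrow (file and column names are the hypothetical ones from the earlier example; framework support for pushdown varies, as noted above):

import pyarrow.parquet as pq

table = pq.read_table(
    "metrics.parquet",
    columns=["timestamp", "value"],                 # projection pushdown: unread columns are skipped
    filters=[("metric", "=", "system.cpu.idle")],   # predicate pushdown: non-matching row groups are skipped
)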

Page 23

Broad ecosystem support

Page 24

Our Parquet pipeline

[Diagram: our Parquet pipeline]
Kafka → Go service: buffer, sort, dedupe, upload csv-gz to Amazon S3
Luigi/Pig jobs: partition, write Parquet back to S3, update metastore
Hive Metastore: holds the schema / metadata
Hadoop, Spark, and Presto read the Parquet on S3 (via EMRFS / PrestoS3FileSystem)
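
A minimal sketch of what the Luigi/Pig step does, using pyarrow instead of Pig purely for illustration (the local paths and the "date" partition column are hypothetical; in production this reads csv-gz from S3 and writes Parquet back to S3):

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

table = pacsv.read_csv("points.csv.gz")   # gzip is detected from the file extension
pq.write_to_dataset(
    table,
    root_path="parquet/metrics",          # an s3:// prefix in production
    partition_cols=["date"],              # hypothetical partition column
    compression="gzip",
)
# A separate step registers the new partitions in the Hive Metastore.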

Page 25

What we see in production

Page 26

Excellent storage efficiency
• For just 5 columns:
  • 3.5X less storage than gz-compressed CSV
  • 2.5X less than our internal query-optimized columnar format

Page 27

…a little too efficient
• One 80MB Parquet file with 160M rows per row group
• Creates long-running map tasks
• Added PARQUET-344 to limit rows per row group
• Want to switch this to limit by uncompressed size
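
A minimal sketch of capping rows per row group at write time, so one file does not end up as a single 160M-row row group as described above (the row count is an arbitrary illustration, not Datadog's setting; table is a pyarrow Table as in the earlier sketches):

import pyarrow.parquet as pq

pq.write_table(table, "metrics.parquet", row_group_size=5_000_000)  # max rows per row group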

Page 28

Slower read performance with AvroParquet

[Chart: runtime for our test job in minutes (0 to 40 min), comparing CSV + gz, AvroParquet + gz, AvroParquet + snappy, and Parquet + gz]

• Tried reading schema with AvroReader
• Saw 3x slower reads with AvroParquet (YMMV) on jobs
• Using HCatalog reader + Hive metastore for schema in production

Page 29

Our Parquet configuration
• Parquet block size (and dfs block size): 128 MB
• Page size: 1 MB
• Compression: gzip
• Schema metadata: pig (we actually use the Hive metastore)
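
A minimal sketch mapping these settings onto writer options. In the Hadoop writers they correspond to parquet.block.size, parquet.page.size, and parquet.compression; the pyarrow equivalents below are shown only for illustration (pyarrow has no direct byte-based block-size knob, so that setting is omitted):

import pyarrow.parquet as pq

pq.write_table(
    table,                            # a pyarrow Table, as in the earlier sketches
    "metrics.parquet",
    compression="gzip",               # Compression: gzip
    data_page_size=1 * 1024 * 1024,   # Page size: 1 MB
)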

Page 30

Thanks! Want to work with us on Spark, Hadoop, Kafka, Parquet, Presto, and more?

DM me @ddaniels888 or [email protected]

