Parquet at Datadog
How we use Parquet for tons of metrics data
Doug Daniels, Director of Engineering
Outline
• Monitor everything
• Our data / why we chose Parquet
• A bit about Parquet
• Our pipeline
• What we see in production
Datadog is a monitoring service for large scale cloud applications
Collect Everything
Integrations for 100+ components
Monitor Everything
Alert on Critical Issues. Collaborate to Fix Them Together
We collect a lot of data…
the biggest and most important of which is
Metric timeseries data
timestamp: 1447020511
metric: system.cpu.idle
value: 98.16687
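As a rough sketch, a single point like the one above maps naturally onto a three-column Parquet schema. The pyarrow snippet below is illustrative only; field names and types are assumptions, not our production schema:

```python
import pyarrow as pa

# One metric point as a columnar record (illustrative schema, not Datadog's actual one).
schema = pa.schema([
    ("timestamp", pa.int64()),    # unix seconds, e.g. 1447020511
    ("metric",    pa.string()),   # metric name, e.g. "system.cpu.idle"
    ("value",     pa.float64()),  # sampled value, e.g. 98.16687
])

table = pa.table(
    {"timestamp": [1447020511], "metric": ["system.cpu.idle"], "value": [98.16687]},
    schema=schema,
)
print(table.schema)
```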
We collect hundreds of billions of these per day… and growing every week
And we do massive computation on them
• Statistical analysis
• Machine learning
• Ad-hoc queries
• Reporting and aggregation
• Metering and billing
One size does not fit all.
ETL and aggregation: Pig / Hive
ML and iterative algorithms: Spark
Interactive SQL: Presto
We want the best frameworkfor each job
How do we do that without…
• Duplicating data storage
• Writing redundant glue code
• Copying data definitions and schema
1. Separate Compute and Storage
• Amazon S3 as data system-of-record
• Ephemeral, job-specific clusters
• Write storage once, read everywhere
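Because S3 is the system of record, any cluster (or a one-off script) can read the same files without copying data around. A hedged sketch with pyarrow, assuming a recent version that resolves s3:// URIs and a hypothetical bucket name:

```python
import pyarrow.parquet as pq

# Read the shared Parquet data straight from S3; no cluster-local copy needed.
# Bucket and prefix are hypothetical; credentials come from the environment.
table = pq.read_table("s3://example-metrics-bucket/parquet/dt=2016-04-01/")
print(table.num_rows, "rows")
```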
2. Standard Data Format
• Supported by major frameworks
• Schema-aware
• Fast to read
• Strong community
Parquet is a column-oriented data storage format
What we love about Parquet
• Interoperable!
• Stores our data super efficiently
• Proven at scale on S3
• Strong community
Quick Parquet primer
[Diagram: Parquet file layout. A row group (Row Group 0) holds one column chunk per column (Column A, Column B), and each column chunk is split into pages (Page 0, Page 1, …). The footer holds the file metadata, including metadata for each row group and each column chunk within it.]
Efficient storage and fast reads
• Space efficiencies (per page)
  • Type-specific encodings: run-length, delta, …
  • Compression
• Query efficiencies (support varies by framework)
  • Projection pushdown (skip columns)
  • Predicate pushdown (skip row groups)
  • Vectorized read (many rows at a time)
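A minimal sketch of what projection and predicate pushdown look like from a reader's point of view, using pyarrow; the file name and filter value are made up for illustration, and our jobs actually run on Pig/Hive, Spark, and Presto:

```python
import pyarrow.parquet as pq

# Projection pushdown: only the requested columns are decoded.
# Predicate pushdown: row groups whose min/max statistics cannot match
# the filter are skipped entirely.
table = pq.read_table(
    "metrics.parquet",                              # hypothetical file
    columns=["timestamp", "value"],                 # skip every other column
    filters=[("metric", "=", "system.cpu.idle")],   # skip non-matching row groups
)
print(table.num_rows)
```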
Broad ecosystem support
Our Parquet pipeline
[Diagram: our Parquet pipeline. Go workers consume from Kafka, then buffer, sort, dedupe, and upload csv-gz files to Amazon S3. Luigi/Pig jobs partition the data, write Parquet back to S3, and update the Hive Metastore with metadata. Hadoop, Spark, and Presto read the Parquet data from S3 via EMRFS / PrestoS3FileSystem, using the Hive Metastore for schema.]
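The partition-and-write step in the diagram is Luigi/Pig in our pipeline; the snippet below is only a pyarrow sketch of the same idea (partition the data, write Parquet under a common root) with made-up paths. Registering the new partitions in the Hive Metastore would be a separate step:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny illustrative batch; the real data arrives as csv-gz from the Go workers.
table = pa.table({
    "dt":        ["2016-04-01", "2016-04-01"],
    "metric":    ["system.cpu.idle", "system.cpu.idle"],
    "timestamp": [1447020511, 1447020521],
    "value":     [98.16687, 97.5],
})

# Write one directory per partition value (dt=2016-04-01/...); in production
# the root would be an S3 prefix rather than a local path.
pq.write_to_dataset(table, root_path="metrics-parquet", partition_cols=["dt"])
```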
What we see in production
Excellent storage efficiency
• For just 5 columns:
  • 3.5X less storage than gz-compressed CSV
  • 2.5X less than internal query-optimized columnar format
…a little too efficient
• One 80MB Parquet file with 160M rows / row group
• Creates long-running map tasks
• Added PARQUET-344 to limit rows per row group
• Want to switch this to limit by uncompressed size
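A quick way to see whether a file has fallen into this trap is to inspect its row-group metadata; a pyarrow sketch with a hypothetical file name:

```python
import pyarrow.parquet as pq

# A row group is the unit of parallelism for readers, so one file with a single
# enormous row group turns into one long-running map task.
md = pq.ParquetFile("metrics.parquet").metadata
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes uncompressed")
```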
Slower read performance with AvroParquet
[Chart: runtime for our test job in minutes, on a 0 to 40 min scale, comparing CSV + gz, AvroParquet + gz, AvroParquet + snappy, and Parquet + gz.]
• Tried reading schema w/ AvroReader
• Saw 3x slower reads with AvroParquet (YMMV) on jobs
• Using HCatalog reader + Hive Metastore for schema in production
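For comparison, here is a hedged PySpark sketch of the production-style approach: let the Hive Metastore supply the schema instead of embedding an Avro schema in the reader. The database and table names are hypothetical:

```python
from pyspark.sql import SparkSession

# Schema comes from the Hive Metastore, not from an Avro schema baked into the job.
spark = (SparkSession.builder
         .appName("read-metrics")
         .enableHiveSupport()
         .getOrCreate())

df = spark.table("metrics_db.metric_points")   # hypothetical metastore table
df.select("timestamp", "metric", "value").show(5)
```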
Our Parquet configuration
• Parquet block size (and dfs block size): 128 MB
• Page size: 1 MB
• Compression: gzip
• Schema metadata: pig (we actually use the Hive Metastore)
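In our pipeline these values are set as Hadoop/Pig job properties; the pyarrow call below is only a rough equivalent of the same knobs (page size, gzip compression, and a cap on rows per row group), with illustrative numbers:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"timestamp": [1447020511],
                  "metric": ["system.cpu.idle"],
                  "value": [98.16687]})

pq.write_table(
    table,
    "metrics.parquet",
    compression="gzip",          # gzip-compressed pages
    data_page_size=1 << 20,      # ~1 MB pages
    row_group_size=5_000_000,    # illustrative cap on rows per row group
)
```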
Thanks!
Want to work with us on Spark, Hadoop, Kafka, Parquet, Presto, and more?
DM me @ddaniels888 or [email protected]