Presto at Twitter: From Alpha to Production
Bill Graham - @billgraham, Sailesh Mittal - @saileshmittal
March 22, 2016 - Facebook Presto Meetup
● Scheduled jobs: Scalding
● Ad-hoc queries for engineers: Scalding REPL
● Ad-hoc queries for non-engineers: ?
● Low-latency queries: ?
Then
● Qualitative comparison early 2015
● Considered: Presto, SparkSQL, Impala, Drill, and Hive-on-Tez
● Selected Presto
○ Maturity: high
○ Customer feedback: high
○ Ease of deploy: high
○ Community: strong, open
○ Nested data: yes
○ Language: Java
Evaluation
● Cloudera
● Hortonworks
● Yahoo
● MapR
● Rocana
● Stripe
● Playtika
Evaluation
● Dropbox
● Nielsen
● TellApart
● Netflix
● JD.com
Thanks to those we consulted with
● Deployment
● Integration
● Monitoring/Alerting
● Log Collection
● Authorization
● Stability
Alpha to Beta to Production
● Building a dedicated Mesos cluster
○ 200 nodes
○ 128 GB RAM
○ 56 cores
○ 10 GbE
● One worker per container per host
● Consistent support model within Twitter
Mesos/Aurora
● Internal system called viz
● Plugin on each node
● Curls JMX stats and sends them to viz
● Load is spiky by nature, which makes alerting hard
Monitoring & Alerting
● Internal system called loglens
● Tried Java LogHandler adapters (sketch after this slide)
● Hit Airlift integration challenges
● Using a Python log-tailing adapter instead
Log Collection
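A minimal sketch of the java.util.logging Handler adapter approach mentioned above. The LoglensClient interface and its send() method are assumptions for illustration; the actual adapter and the loglens transport may differ.

import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.SimpleFormatter;

// Hypothetical transport to the internal loglens service
interface LoglensClient {
    void send(String line);
}

// Adapter that forwards java.util.logging records to loglens
class LoglensLogHandler extends Handler {
    private final LoglensClient client;

    LoglensLogHandler(LoglensClient client) {
        this.client = client;
        setFormatter(new SimpleFormatter());
    }

    @Override
    public void publish(LogRecord record) {
        if (isLoggable(record)) {
            // Format the record and ship it to the collection service
            client.send(getFormatter().format(record));
        }
    }

    @Override
    public void flush() {}

    @Override
    public void close() {}
}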
● User-level auth required
● UGI.proxyUser when accessing HDFS (PR #4382 and Teradata PR #105); see the sketch after this slide
● Manage access via per-cluster LDAP groups that Presto can proxy as
● Hive Metastore (HMS) cache complicates things
● HMS file-based auth applies to writes only
Authorization
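A sketch of the proxy-user pattern referenced above, using the standard Hadoop UserGroupInformation API: the principal Presto runs as impersonates the end user when opening an HDFS FileSystem. Class and method names here are illustrative, not the actual patch.

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

// Opens an HDFS FileSystem as the end user by proxying through the login
// user that Presto runs as. Requires hadoop.proxyuser.* settings on the
// NameNode to allow the impersonation.
public final class ProxyUserFileSystems {
    private ProxyUserFileSystems() {}

    public static FileSystem openAs(String endUser, Configuration conf) throws Exception {
        UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
                endUser, UserGroupInformation.getLoginUser());
        return proxyUgi.doAs((PrivilegedExceptionAction<FileSystem>) () -> FileSystem.get(conf));
    }
}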
● Hadoop client memory leaks (user x query x FileSystems)
● GC pressure on the coordinator
● Implemented a per-user FileSystem cache (user x FileSystems); sketch after this slide
Authorization Challenges
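A sketch of the per-user FileSystem cache idea: one cached FileSystem per user rather than one per user per query, so repeated queries stop leaking Hadoop clients. This illustrates the approach (reusing the hypothetical ProxyUserFileSystems helper from the previous sketch), not the actual implementation.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Caches one FileSystem per user so repeated queries by the same user share
// a single Hadoop client instead of creating a new one per query.
public final class PerUserFileSystemCache {
    private final ConcurrentMap<String, FileSystem> cache = new ConcurrentHashMap<>();
    private final Configuration conf;

    public PerUserFileSystemCache(Configuration conf) {
        this.conf = conf;
    }

    public FileSystem forUser(String user) {
        return cache.computeIfAbsent(user, u -> {
            try {
                return ProxyUserFileSystems.openAs(u, conf);
            }
            catch (Exception e) {
                throw new RuntimeException("Failed to create FileSystem for " + u, e);
            }
        });
    }
}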
java.lang.OutOfMemoryError: unable to create new native thread
● Queries failing on the coordinator
● Coordinator is thread-hungry, up to 1500 threads
● Default user process limit is 1024
$ ulimit -u
1024
● Increase the ulimit (example after this slide)
Stability #1
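One way to raise the per-user process limit persistently, assuming the coordinator runs as a dedicated system user; the user name and value below are illustrative.

# /etc/security/limits.conf
presto  soft  nproc  8192
presto  hard  nproc  8192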
Encountered too many errors talking to a worker node
● Outbound network spikes hitting caps (300 Mb/s)
● Coordinator sending the plan was costly (fixed in PR #4538)
● Tuned timeouts
● Increased Tx cap
Stability #2
Encountered too many errors talking to a worker node
● Timeouts still being hit
● Correlated GC pauses with errors
● Tuned GC
● Changed to the G1 GC collector - BAM! (jvm.config sketch after this slide)
● Pauses went from tens of seconds to hundreds of milliseconds
Stability #3
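A sketch of the kind of etc/jvm.config change that switches Presto's JVMs to G1; the heap and region sizes below are illustrative and should be tuned to the container.

# etc/jvm.config (illustrative values)
-server
-Xmx100G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError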
No worker nodes available
● Happens sporadically
● Network, HTTP responses, GC all look good
● Problem: workers saturating NICs (2 Gb/sec)
● Solution #1: reduce task.max-worker-threads (config sketch after this slide)
● Solution #2: Larger NICs
Stability #4
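A sketch of the worker-side config.properties change for Solution #1; the value is illustrative (the default scales with the number of cores), and lowering it trades some per-node parallelism for lower peak outbound throughput.

# etc/config.properties on each worker
task.max-worker-threads=64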
● Distributed log collection
● Metrics tracking
● Measure and Tune JVM pauses
● G1 Garbage Collector
● Measure network/NIC throughput vs capacity
Lessons Learned
● MySQL connector with per-user auth
● Support for LZO/Thrift
● Improvements for Parquet nested data structures
Future Work