Presto at Twitter: From Alpha to Production
Bill Graham - @billgraham, Sailesh Mittal - @saileshmittal
March 22, 2016 - Facebook Presto Meetup
● Scheduled jobs: Scalding
● Ad-hoc queries for engineers: Scalding REPL
● Ad-hoc queries for non-engineers: ?
● Low-latency queries: ?
Then
● Qualitative comparison early 2015
● Considered: Presto, SparkSQL, Impala, Drill, and Hive-on-Tez
● Selected Presto
○ Maturity: high
○ Customer feedback: high
○ Ease of deploy: high
○ Community: strong, open
○ Nested data: yes
○ Language: Java
Evaluation
● Cloudera
● Hortonworks
● Yahoo
● MapR
● Rocana
● Stripe
● Playtika
Evaluation
● Dropbox
● Nielsen
● TellApart
● Netflix
● JD.com
Thanks to those we consulted with
● Deployment
● Integration
● Monitoring/Alerting
● Log Collection
● Authorization
● Stability
Alpha to Beta to Production
● Building a dedicated Mesos cluster
○ 200 nodes
○ 128 GB RAM
○ 56 cores
○ 10 GbE
● One worker per container per host
● Consistent support model within Twitter
Mesos/Aurora
● Internal system called viz
● Plugin on each node
● Curls JMX stats and sends them to viz
● Load is spiky by nature, which makes alerting hard
Monitoring & Alerting
● Internal system called loglens
● Tried Java LogHandler adapters (sketch after this slide)
● Hit Airlift integration challenges
● Using a Python log-tailing adapter instead
Log Collection
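A minimal sketch of the java.util.logging Handler adapter approach mentioned above. The LoglensClient interface and its send() method are assumptions for illustration; the actual adapter and the loglens transport may differ.

import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.SimpleFormatter;

// Hypothetical transport to the internal loglens service
interface LoglensClient {
    void send(String line);
}

// Adapter that forwards java.util.logging records to loglens
class LoglensLogHandler extends Handler {
    private final LoglensClient client;

    LoglensLogHandler(LoglensClient client) {
        this.client = client;
        setFormatter(new SimpleFormatter());
    }

    @Override
    public void publish(LogRecord record) {
        if (isLoggable(record)) {
            // Format the record and ship it to the collection service
            client.send(getFormatter().format(record));
        }
    }

    @Override
    public void flush() {}

    @Override
    public void close() {}
}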
● User-level auth required
● UGI.proxyUser when accessing HDFS (PR #4382 and Teradata PR #105); see the sketch after this slide
● Manage access via per-cluster LDAP groups that Presto can proxy as
● Hive Metastore (HMS) cache complicates things
● HMS file-based auth applies to writes only
Authorization
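A sketch of the proxy-user pattern referenced above, using the standard Hadoop UserGroupInformation API: the principal Presto runs as impersonates the end user when opening an HDFS FileSystem. Class and method names here are illustrative, not the actual patch.

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

// Opens an HDFS FileSystem as the end user by proxying through the login
// user that Presto runs as. Requires hadoop.proxyuser.* settings on the
// NameNode to allow the impersonation.
public final class ProxyUserFileSystems {
    private ProxyUserFileSystems() {}

    public static FileSystem openAs(String endUser, Configuration conf) throws Exception {
        UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
                endUser, UserGroupInformation.getLoginUser());
        return proxyUgi.doAs((PrivilegedExceptionAction<FileSystem>) () -> FileSystem.get(conf));
    }
}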
● Hadoop client memory leaks (user x query x FileSystems)
● GC pressure on the coordinator
● Implemented a per-user FileSystem cache (user x FileSystems); sketch after this slide
Authorization Challenges
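A sketch of the per-user FileSystem cache idea: one cached FileSystem per user rather than one per user per query, so repeated queries stop leaking Hadoop clients. This illustrates the approach (reusing the hypothetical ProxyUserFileSystems helper from the previous sketch), not the actual implementation.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Caches one FileSystem per user so repeated queries by the same user share
// a single Hadoop client instead of creating a new one per query.
public final class PerUserFileSystemCache {
    private final ConcurrentMap<String, FileSystem> cache = new ConcurrentHashMap<>();
    private final Configuration conf;

    public PerUserFileSystemCache(Configuration conf) {
        this.conf = conf;
    }

    public FileSystem forUser(String user) {
        return cache.computeIfAbsent(user, u -> {
            try {
                return ProxyUserFileSystems.openAs(u, conf);
            }
            catch (Exception e) {
                throw new RuntimeException("Failed to create FileSystem for " + u, e);
            }
        });
    }
}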
java.lang.OutOfMemoryError: unable to create new native thread
● Queries failing on the coordinator
● Coordinator is thread-hungry, up to 1500 threads
● Default user process limit is 1024
$ ulimit -u
1024
● Increase the ulimit (example after this slide)
Stability #1
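One way to raise the per-user process limit persistently, assuming the coordinator runs as a dedicated system user; the user name and value below are illustrative.

# /etc/security/limits.conf
presto  soft  nproc  8192
presto  hard  nproc  8192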
Encountered too many errors talking to a worker node
● Outbound network spikes hitting caps (300 Mb/s)
● Coordinator sending the plan was costly (fixed in PR #4538)
● Tuned timeouts
● Increased Tx cap
Stability #2
Encountered too many errors talking to a worker node
● Timeouts still being hit
● Correlated GC pauses with errors
● Tuned GC
● Changed to the G1 GC collector - BAM! (jvm.config sketch after this slide)
● Pauses went from tens of seconds to hundreds of milliseconds
Stability #3
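A sketch of the kind of etc/jvm.config change that switches Presto's JVMs to G1; the heap and region sizes below are illustrative and should be tuned to the container.

# etc/jvm.config (illustrative values)
-server
-Xmx100G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError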
No worker nodes available
● Happens sporadically
● Network, HTTP responses, GC all look good
● Problem: workers saturating NICs (2 Gb/sec)
● Solution #1: reduce task.max-worker-threads (config sketch after this slide)
● Solution #2: Larger NICs
Stability #4
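A sketch of the worker-side config.properties change for Solution #1; the value is illustrative (the default scales with the number of cores), and lowering it trades some per-node parallelism for lower peak outbound throughput.

# etc/config.properties on each worker
task.max-worker-threads=64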
● Distributed log collection
● Metrics tracking
● Measure and Tune JVM pauses
● G1 Garbage Collector
● Measure network/NIC throughput vs capacity
Lessons Learned
● MySQL connector with per-user auth
● Support for LZO/Thrift
● Improvements for Parquet nested data structures
Future Work