Benchmarking Hive at Yahoo Scale
PRESENTED BY Mithun Radhakrishnan | June 4, 2014
2014 Hadoop Summit, San Jose, California
About myself
HCatalog Committer, Hive contributor
› Metastore, Notifications, HCatalog APIs
› Integration with Oozie, Data Ingestion
Other odds and ends
› DistCp
mithun@apache.org
About this talk
Introduction to “Yahoo Scale”
The use-case in Yahoo
The Benchmark
The Setup
The Observations (and, possibly, lessons)
Fisticuffs
The Y!Grid
16 Hadoop Clusters in the Y!Grid
› 32,500 Nodes
› 750K jobs a day
Hadoop 0.23.10.x, 2.4.x
Large Datasets
› Daily, hourly, minute-level frequencies
› Terabytes of data, 1000s of files, per dataset instance
Pig 0.11
Hive 0.10 / HCatalog 0.5
› => Hive 0.12
Data Processing Use Cases
Pig for Data Pipelines
› Imperative paradigm
› ~45% Hadoop Jobs on Production Clusters
• M/R + Oozie = 41%
Hive for Ad hoc Queries
› SQL
› Relatively smaller number of jobs
• *Major* Uptick
Use HCatalog for Inter-op
Hive is Currently the Fastest Growing Product on the Grid
[Chart: All Grid Jobs (in Millions) and Hive Jobs (% of All Jobs), Mar-13 through May-14]
2.4 million Hive jobs
Business Intelligence Tools
{Tableau, MicroStrategy, Excel, …}
Challenges:
› Security
  • ACLs, Authentication, Encryption over the wire, Full-disk Encryption
› Bandwidth
  • Transporting results over ODBC
› Query Latency
  • Query execution time
  • Cost of query “optimizations”
  • “Bad” queries
The Benchmark
TPC-H
› Industry standard (tpc.org/tpch)
› 22 queries
› dbgen -s 1000 -S 3
  • Parallelizable
Reynold Xin’s excellent work:
› https://github.com/rxin
› Transliterated queries to suit Hive 0.9
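dbgen's chunking flags make generation embarrassingly parallel: each worker emits one chunk of a table. A minimal local sketch (assuming a compiled dbgen binary on the PATH; the deck actually ran this on MapReduce):

```shell
# Generate the scale-factor-1000 (~1 TB) LINEITEM table in 3 chunks, in parallel.
# -s: scale factor, -C: total chunk count, -S: this worker's chunk (1-based), -T L: lineitem only
for chunk in 1 2 3; do
  dbgen -s 1000 -C 3 -S "$chunk" -T L &
done
wait
```

Each invocation writes its own `lineitem.tbl.N` fragment, so the fragments can be loaded into HDFS independently.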
Relational Diagram
[Diagram: the standard TPC-H schema, with row counts per table at scale factor SF]
PART (P_): SF × 200,000 rows
PARTSUPP (PS_): SF × 800,000 rows
LINEITEM (L_): SF × 6,000,000 rows
ORDERS (O_): SF × 1,500,000 rows
CUSTOMER (C_): SF × 150,000 rows
SUPPLIER (S_): SF × 10,000 rows
NATION (N_): 25 rows
REGION (R_): 5 rows
The Setup
› 350 Node cluster
  • Xeon boxen: 2 sockets with E5530s => 16 CPUs
  • 24GB memory
    – NUMA enabled
  • 6 SATA drives, 2TB, 7200 RPM Seagates
  • RHEL 6.4
  • JRE 1.7 (-d64)
  • Hadoop 0.23.7+/2.3+, Security turned off
  • Tez 0.3.x
  • 128MB HDFS block-size
› Downscale tests: 100 Node cluster
  • hdfs-balancer.sh
The Prep
Data generation:
› Text data: dbgen on MapReduce
› Transcode to RCFile and ORC: Hive on MR
  • insert overwrite table orc_table partition( … ) select * from text_table;
› Partitioning:
  • Only for 1TB, 10TB cases
  • Perils of dynamic partitioning
› ORC File:
  • 64MB stripes, ZLIB Compression
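The transcode step above can be sketched as follows. Table names, columns, and the partition column are illustrative, not the deck's actual DDL; the dynamic-partition settings are the standard Hive switches that the "perils" bullet alludes to:

```sql
-- Illustrative sketch: transcode a text table into a partitioned ORC table.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE lineitem_orc (l_orderkey bigint, l_quantity double /* ... */)
PARTITIONED BY (ship_year string)
STORED AS orc
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- Dynamic partitioning: ship_year comes last in the SELECT list.
INSERT OVERWRITE TABLE lineitem_orc PARTITION (ship_year)
SELECT l_orderkey, l_quantity /* ... */, ship_year FROM lineitem_text;
```

At 10 TB, each dynamic partition multiplies open writers and small files, which is why partitioning was limited to the 1TB/10TB runs.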
Observations
100 GB
› 18x speedup over Hive 0.10 (Textfile)
  • 6-50x
› 11.8x speedup over Hive 0.10 (RCFile)
  • 5-30x
› Average query time: 28 seconds
  • Down from 530 (Hive 0.10 Text)
› 85% of queries completed in under a minute
1 TB
› 6.2x speedup over Hive 0.10 (RCFile)
  • Between 2.5-17x
› Average query time: 172 seconds
  • Between 5-947 seconds
  • Down from 729 seconds (Hive 0.10 RCFile)
› 61% of queries completed in under 2 minutes
› 81% of queries completed in under 4 minutes
10 TB
› 6.2x speedup over Hive 0.10 (RCFile)
  • Between 1.6-10x
› Average query time: 908 seconds (426 seconds excluding outliers)
  • Down from 2129 seconds with Hive 0.10 RCFile
    – (1712 seconds excluding outliers)
› 61% of queries completed in under 5 minutes
› 71% of queries completed in under 10 minutes
› Q6 still completes in 12 seconds!
Explaining the speed-ups
Hadoop 2.x, et al.
Tez
› (Arbitrary DAG)-based Execution Engine
› “Playing the gaps” between M&R
  • Temporary data and the HDFS
› Feedback loop
› Smart scheduling
› Container re-use
› Pipelined job start-up
Hive
› Statistics
› “Vector-ized” Execution
ORC
› PPD (Predicate Push-Down)
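As a sketch, the Hive-side switches corresponding to the items above look roughly like this (property names from Hive 0.13-era configuration; defaults vary by version, so treat these as assumptions to verify against your release):

```sql
-- Hedged sketch: session settings enabling the speedup features above.
SET hive.execution.engine=tez;                -- run DAGs on Tez instead of MapReduce
SET hive.vectorized.execution.enabled=true;   -- vectorized operator pipeline
SET hive.compute.query.using.stats=true;      -- answer simple aggregates from statistics
SET hive.optimize.ppd=true;                   -- push predicates down toward the readers
SET hive.optimize.index.filter=true;          -- use ORC row-group indexes to skip strides
```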
ORC File Layout
› Data is composed of multiple streams per column
› Index allows for skipping rows (defaults to every 10,000 rows), keeping position in each stream, and min/max for each column
› Footer contains a directory of stream locations, and the encoding for each column
› Integer columns are serialized using run-length encoding
› String columns are serialized using a dictionary for column values, and the same run-length encoding
› Stripe footer is used to find the requested column’s data streams; adjacent stream reads are merged
ORC Usage
CREATE TABLE addresses (
  name   string,
  street string,
  city   string,
  state  string,
  zip    int
)
STORED AS orc
LOCATION '/path/to/addresses'
TBLPROPERTIES ("orc.compress" = "ZLIB");

ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT orc;

SET hive.default.fileformat = orc;
SET hive.exec.orc.memory.pool = 0.50;  -- ORC writer is allowed 50% of JVM heap size by default

ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
Key                   Default              Comments
orc.compress          ZLIB                 High-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size     262,144 (256 KB)     Number of bytes in each compression chunk
orc.stripe.size       67,108,864 (64 MB)   Number of bytes in each stripe. Each ORC stripe is processed in one map task (try 32 MB to cut down on disk I/O)
orc.row.index.stride  10,000               Number of rows between index entries (must be >= 1,000). A larger stride increases the probability of not being able to skip a stride for a given predicate
orc.create.index      true                 Whether to create row indexes, used for predicate push-down. If data is frequently accessed/filtered on a certain column, sorting on that column and using index filters makes column filters work faster
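Pulling the knobs above into one statement, a table tuned for smaller stripes might look like this. The table and column names are illustrative, not from the deck:

```sql
-- Illustrative only: a fact table tuned per the property table above.
CREATE TABLE lineitem_orc (
  l_orderkey bigint,
  l_quantity double,
  l_shipdate string
)
STORED AS orc
TBLPROPERTIES (
  "orc.compress"         = "ZLIB",
  "orc.stripe.size"      = "33554432",  -- 32 MB, down from the 64 MB default, to cut disk I/O
  "orc.row.index.stride" = "10000",     -- one index entry (with min/max) per 10,000 rows
  "orc.create.index"     = "true"       -- enable row indexes for predicate push-down
);
```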
Configuring ORC
set hive.merge.mapredfiles=true
set hive.merge.mapfiles=true
set orc.stripe.size=67108864
› Half the HDFS block-size
  • Prevent cross-block stripe-reads
  • Tangent: DistCp
set orc.compress=???
› Depends on size and distribution
› Snappy compression hasn’t been explored
YMMV
› Experiment
Conclusions
Y!Grid sticking with Hive
Familiarity
› Existing ecosystem
Community
Scale
Multi-tenancy
Coming down the pike
› CBO (Cost-Based Optimization)
› In-memory caching solutions atop HDFS
• RAMfs a la Tachyon?
We’re not done yet
SQL compliance
Scaling up the metastore performance
Better BI Tool integration
Faster transport
› HiveServer2 result-sets
References
The YDN blog post:
› http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
Code:
› https://github.com/mythrocks/hivebench (TPC-H scripts, datagen, transcode utils)
› https://github.com/t3rmin4t0r/tpch-gen (Parallel TPC-H generation)
› https://github.com/rxin/TPC-H-Hive (TPC-H scripts for Hive)
› https://issues.apache.org/jira/browse/HIVE-600 (Yuntao’s initial TPC-H JIRA)
Thank You
@mithunrk
mithun@apache.org
We are hiring!
Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.
I’m glad you asked.
Sharky comments
Testing with Shark 0.7.x and Shark 0.8
› Compatible with Hive Metastore 0.9
› 100GB datasets: Admirable performance
› 1TB/10TB: Tests did not run completely
  • Failures, especially in 10TB cases
  • Hangs while shuffling data
  • Scaled back to 100 nodes -> More tests ran through, but not completely
› nReducers: Not inferred
Miscellany
› Security
› Multi-tenancy
› Compatibility