Apache Hive on ACID
Alan Gates
Hive PMC Member
Co-founder Hortonworks
May 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
History
Hive only updated partitions
– INSERT...OVERWRITE rewrote an entire partition
– Forced daily or even hourly partitions
– Could add files to partition directory, file compaction was manual
What about concurrent readers?
– Ok for inserts, but overwrite caused races
– There is a zookeeper lock manager, but…
No way to delete or update rows
No INSERT INTO T VALUES…
– Breaks some tools
Why Do You Need ACID?
Hadoop and Hive have always…
– Just said no to ACID
– Perceived as tradeoff for performance
But your data isn’t static
– It changes daily, hourly, or faster
– Sometimes it needs to be restated (late arriving data) or facts change (e.g. a user’s physical address)
– Loading data into Hive every hour is so 2010; data should be available in Hive as soon as it arrives
We saw users implementing ad hoc solutions
– This is a lot of work and hard to get right
– Hive should support this as a first class feature
When Should You Use Hive’s ACID?
NOT OLTP!!!
Updating a Dimension Table
– Changing a customer’s address
Delete Old Records
– Remove records for compliance
Update/Restate Large Fact Tables
– Fix problems after they are in the warehouse
Streaming Data Ingest
– A continual stream of data coming in
– Typically from Flume or Storm
NOT OLTP!!!
SQL Changes for ACID
Since Hive 0.14
New DML
– INSERT INTO T VALUES(1, 'fred', ...);
– UPDATE T SET x = 5[, ...] [WHERE ...]
– DELETE FROM T [WHERE ...]
– Supports partitioned and non-partitioned tables; the WHERE clause can specify a partition, but this is not required
Restrictions
– Table must have format that extends AcidInputFormat
• currently ORC
• work started on Parquet (HIVE-8123)
– Table must be bucketed and not sorted
• can use 1 bucket but this will restrict write parallelism
– Table must be marked transactional
• create table T(...) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
• Existing ORC tables that are bucketed can be marked transactional via ALTER
Ingesting Data Into Hive From a Stream
Data is flowing in from generators in a stream
Without this, you have to add it to Hive in batches, often every hour
– Thus your users have to wait an hour before they can see their data
New interface in hive.hcatalog.streaming lets applications write small batches of records and commit them
– Users can now see data within a few seconds of it arriving from the data generators
Available for Apache Flume and Apache Storm
Design
HDFS does not allow arbitrary writes
– Store changes as delta files
– Stitched together by client on read
Writes get a transaction ID
– Sequentially assigned by metastore
Reads get highest committed transaction & list of open/aborted transactions
– Provides snapshot consistency
– No exclusive locks required
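The read rule above can be sketched in a few lines of Python. This is a minimal illustration, not Hive code: the names Snapshot, high_water_mark, excluded and is_visible are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """What a reader is handed at query start (illustrative names, not a Hive API)."""
    high_water_mark: int   # highest committed transaction id
    excluded: frozenset    # ids at or below the mark that are still open or aborted


def is_visible(txn_id: int, snap: Snapshot) -> bool:
    """A write is visible if it is at or below the mark and not open/aborted."""
    return txn_id <= snap.high_water_mark and txn_id not in snap.excluded


# Transactions 5 and 6 are open or aborted; 8 and 9 started after the snapshot.
snap = Snapshot(high_water_mark=7, excluded=frozenset({5, 6}))
visible = [t for t in range(1, 10) if is_visible(t, snap)]
```

Because the snapshot is fixed when the query starts, later commits never change what the reader sees, which is why no exclusive lock is needed.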
Why Not HBase
Good
– Handles compactions for us
– Already has similar data model with LSM
Bad
– When we started this there were no transaction managers for HBase, and this requires transactions
– HFile is column family based rather than columnar
– HBase focused on point lookups and range scans
• Warehousing requires full scans
Stitching Buckets Together
HDFS Layout
Partition locations remain unchanged
– Still warehouse/$db/$tbl/$part
Bucket Files Structured By Transactions
– Base files $part/base_$tid/bucket_*
– Delta files $part/delta_$tid_$tid/bucket_*
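As a rough sketch of how a reader might use this layout, the Python below picks the newest base directory and the deltas that come after it. It only assumes the base_$tid and delta_$tid_$tid naming shown above; the function name dirs_to_read is hypothetical, and the sketch ignores snapshot filtering and any overlap between compacted and uncompacted deltas.

```python
import re

BASE = re.compile(r"^base_(\d+)$")
DELTA = re.compile(r"^delta_(\d+)_(\d+)$")


def dirs_to_read(names):
    """Return the newest base directory followed by the deltas that start after it."""
    base_name, base_tid = None, 0
    for n in names:
        m = BASE.match(n)
        if m and int(m.group(1)) >= base_tid:
            base_name, base_tid = n, int(m.group(1))
    deltas = []
    for n in names:
        m = DELTA.match(n)
        # Deltas at or below the base's transaction id are already folded into it.
        if m and int(m.group(1)) > base_tid:
            deltas.append((int(m.group(1)), int(m.group(2)), n))
    deltas.sort()  # apply changes in transaction order
    return ([base_name] if base_name else []) + [n for _, _, n in deltas]


listing = ["base_10", "delta_8_8", "delta_11_11", "delta_12_15", "base_4"]
plan = dirs_to_read(listing)
```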
Input and Output Formats
Created new AcidInput/OutputFormat
– Unique key is original transaction id, bucket, row id
Reader returns correct version of row based on transaction state
Also added raw API for compactor
– Provides previous events as well
ORC implements new API
– Extends records with change metadata
• Add operation (d, u, i), latest transaction id, and key
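To make "returns correct version of row" concrete, here is a small Python sketch of resolving change events by key. It is an illustration under assumptions, not the ORC reader: the tuple layout of an event and the function name latest_rows are invented for the example, and visibility is passed in as a plain predicate.

```python
def latest_rows(events, is_visible):
    """Resolve change events to the current rows.

    events: iterable of (key, op, txn_id, row) where
      key    = (original_txn_id, bucket, row_id)
      op     = "i" (insert), "u" (update) or "d" (delete)
      txn_id = transaction that wrote this event
    """
    newest = {}
    for key, op, txn_id, row in events:
        if not is_visible(txn_id):
            continue  # written by an open, aborted, or too-new transaction
        if key not in newest or txn_id > newest[key][1]:
            newest[key] = (op, txn_id, row)
    # A key whose newest visible event is a delete produces no row.
    return {k: row for k, (op, _, row) in sorted(newest.items()) if op != "d"}


events = [
    ((1, 0, 0), "i", 1, "fred"),
    ((1, 0, 1), "i", 1, "wilma"),
    ((1, 0, 0), "u", 3, "frederick"),
    ((1, 0, 1), "d", 4, None),
    ((1, 0, 0), "u", 6, "freddy"),  # transaction 6 is outside the snapshot below
]
rows = latest_rows(events, is_visible=lambda txn_id: txn_id <= 5)
```

A raw reader for the compactor would skip the final filtering step and hand back every event, including the older ones.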
Transaction Manager
Existing lock managers
– In memory - not durable
– ZooKeeper - requires additional components to install, administer, etc.
Locks need to be integrated with transactions
– commit/rollback must atomically release locks
We sort of have this database lying around which has ACID characteristics (metastore)
Transactions and locks stored in metastore
Uses metastore DB to provide unique, ascending ids for transactions and locks
Transaction & Locking Model
DML statements are auto-commit
Snapshot isolation
– Reader will see consistent data for the duration of a query
Current transactions can be displayed using SHOW TRANSACTIONS
Three types of locks
– shared read
– shared write (can co-exist with shared read, but not other shared write)
– exclusive
Operations require different locks
– SELECT, INSERT – shared read (inserts cannot conflict because there is no primary key)
– UPDATE, DELETE – shared write
– DROP, INSERT OVERWRITE – exclusive
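The lock rules above fit in a small compatibility check. The Python below is a sketch of those rules only; the constant and function names (LOCK_FOR, compatible, can_run) are assumptions for illustration and do not correspond to Hive's lock manager classes.

```python
SHARED_READ, SHARED_WRITE, EXCLUSIVE = "shared_read", "shared_write", "exclusive"

# Lock type each operation asks for, as listed on the slide.
LOCK_FOR = {
    "SELECT": SHARED_READ,
    "INSERT": SHARED_READ,
    "UPDATE": SHARED_WRITE,
    "DELETE": SHARED_WRITE,
    "DROP": EXCLUSIVE,
    "INSERT OVERWRITE": EXCLUSIVE,
}


def compatible(held, requested):
    """True if a lock of type `requested` can be granted while `held` is held."""
    if EXCLUSIVE in (held, requested):
        return False  # exclusive co-exists with nothing
    if held == SHARED_WRITE and requested == SHARED_WRITE:
        return False  # only one shared write at a time
    return True       # shared read co-exists with shared read and shared write


def can_run(running_ops, new_op):
    """True if `new_op` can start while every op in `running_ops` holds its lock."""
    return all(compatible(LOCK_FOR[op], LOCK_FOR[new_op]) for op in running_ops)
```

For example, can_run(["SELECT", "UPDATE"], "INSERT") holds, while a second UPDATE or a DROP would have to wait.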
Compaction
Each transaction (or batch of transactions in streaming) creates a new delta directory
Too many files = NameNode pressure and poor read performance due to fan-in on merge
Need to automatically compact files
– Initiated by metastore server, run as MR jobs in the cluster
– Can be manually initiated by user via ALTER TABLE COMPACT
Minor compaction merges many deltas into one
– Run when there are more than 10 delta directories (configurable)
Major compaction merges deltas with base and rewrites base
– Run when size of the deltas > 10% of the size of the base (configurable)
Old files kept around until all readers are done with their snapshots, then cleaned up
– Compaction and data read/writes can be done in parallel with no need to pause the world
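The two triggers can be expressed as a small decision function. This Python sketch uses the default thresholds quoted above; the function and parameter names are hypothetical, and checking the major condition before the minor one is an assumption of this example rather than something the slides state.

```python
def compaction_needed(base_bytes, delta_bytes,
                      delta_dir_threshold=10, delta_pct_threshold=0.10):
    """Decide which compaction, if any, a partition needs.

    base_bytes:  size of the current base (0 if there is none)
    delta_bytes: list with the size of each delta directory
    Returns "major", "minor", or None.
    """
    total_delta = sum(delta_bytes)
    # Major: deltas have grown past the configured fraction of the base.
    # Assumption: with no base yet, this sketch only considers minor compaction.
    if base_bytes > 0 and total_delta > delta_pct_threshold * base_bytes:
        return "major"
    # Minor: too many delta directories, regardless of their size.
    if len(delta_bytes) > delta_dir_threshold:
        return "minor"
    return None
```

With a 1000-byte base, eleven 5-byte deltas trigger a minor compaction, while two 60-byte deltas trigger a major one.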
Issues Found and (Some) Fixed
Not GA ready in Hive 1.2 or 2.0, hope to have GA ready by 1.3 and 2.1
Deadlocks in the RDBMS
– The way the Hive metastore used the RDBMS caused a lot of deadlocks – greatly improved
Usability
– SHOW COMPACTIONS and SHOW LOCKS did not give users/admins enough information to successfully determine who was blocking whom or what was getting compacted – improved, some work still to do here
Resilience
– System was easy to knock over when clients did silly things (like open 1M+ transactions) – improved, though I am sure there are still some ways to kill it
– Initially compactor threads only ran in 1 metastore instance – resolved, now can run in multiple instances
Correctness
– Streaming ingest did not enforce proper bucket spraying – resolved
– Initial versions of the compactor had a race condition that resulted in record loss – resolved
– Adding a column to a table or changing a column’s type caused read time errors – resolved
– Updates can get lost when overlapping transactions update the same partition – HIVE-13395
Performance
– Some work done here (e.g. making predicate push down work, efficient split combinations)
– Much still to be done
Next: MERGE
Standard SQL, added in SQL 2003
Problem: today each UPDATE requires a scan of the partition or table
– There is no way to apply separate updates in a batch
Allows upserts
Use case:
– Bring in batch from transactional/front end systems
– Apply as inserts or updates (as appropriate) in one read/write pass
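The upsert use case can be pictured with a short Python sketch: one pass over the target rows, updating matches and appending the rest as inserts. This only illustrates the semantics; it is not how Hive would implement MERGE, and merge_batch and its arguments are hypothetical names.

```python
def merge_batch(table, batch, key):
    """Apply a batch of rows to a table as updates or inserts in one pass.

    table, batch: lists of dicts; key: name of the column rows are matched on.
    Returns (new_table, updated_count, inserted_count).
    """
    incoming = {row[key]: row for row in batch}
    merged, updated = [], 0
    for row in table:  # the single scan of the target
        match = incoming.pop(row[key], None)
        if match is not None:
            merged.append(match)  # matched: take the incoming version
            updated += 1
        else:
            merged.append(row)    # untouched row
    inserted = len(incoming)
    merged.extend(incoming.values())  # unmatched batch rows become inserts
    return merged, updated, inserted


customers = [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Boston"}]
changes = [{"id": 2, "city": "Denver"}, {"id": 3, "city": "Eugene"}]
result, n_updated, n_inserted = merge_batch(customers, changes, key="id")
```

Doing the same with separate UPDATE statements would need one scan per statement, which is the problem the slide describes.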
Future Work
Multi-statement transactions (BEGIN, COMMIT, ROLLBACK)
Integration with LLAP
– Figure out how MVCC works with LLAP’s caching
– Build a write path through LLAP
Lower the user burden
– Make the bucketing automatic so the user does not have to be aware of it
– Allow user to determine sort order of the table
– Eventually remove the transactional/non-transactional distinction in tables
Improve monitoring and alerting facilities
– Make it easier for an admin to determine when the system is in trouble, e.g. the compactor is not running or is failing on every run, there are too many open transactions, etc.
Thank You